Unit 5 XML

XML is a standard for encoding structured data in a hierarchical, human-readable format. It allows data from diverse sources to be integrated using a common format. XML documents can be constrained using DTDs or XML Schema, which define elements, attributes, and relationships. Namespaces allow XML elements and attributes from different sources to be distinguished. XML provides a flexible yet standardized way to represent and exchange structured data.

Uploaded by

sathiyab.csbs

CHAPTER 11: XML

PRINCIPLES OF
DATA INTEGRATION
ANHAI DOAN ALON HALEVY ZACHARY IVES
Gaining Access to Diverse Data
We have focused on data integration in the relational model
 Simplest model to understand

Example relation:
A    B
a1   b1
a2   b2

Real-world data is often not in relational form
 e.g., Excel spreadsheets, Web tables, Java objects, RDF, …
 One approach: convert using custom wrappers (Ch. 9)
 But suppose tools would adopt a standard export (and import) mechanism?
 … This is the role of XML, the eXtensible Markup Language
What Is XML?

Hierarchical, human-readable format
 A “sibling” to HTML, always parsable
 “Lingua franca” of data: encodes documents and structured data
 Blends data and schema (structure)

Core of a broader ecosystem
 Data – XML (also RDF, Ch. 12)
 Schema – DTD and XML Schema
 Programmatic access – DOM and SAX
 Query – XPath, XSLT, XQuery
 Distributed programs – Web services

[Figure: the XML ecosystem. XML sits at the center, surrounded by DTD/Schema, SAX/DOM, XPath, XQuery, procedural languages (Java, JavaScript, C++, …), and REST/SOAP+WSDL over HTTP; it connects databases, documents, and Web services]
XML Anatomy
<?xml version="1.0" encoding="ISO-8859-1" ?>
<dblp>
<mastersthesis mdate="2002-01-03" key="ms/Brown92">
<author>Kurt P. Brown</author>
<title>PRPL: A Database Workload Specification Language</title>
<year>1992</year>
<school>Univ. of Wisconsin-Madison</school>
</mastersthesis>
<article mdate="2002-01-03" key="tr/dec/SRC1997-018">
<editor>Paul R. McJones</editor>
<title>The 1995 SQL Reunion</title>
<journal>Digital System Research Center Report</journal>
<volume>SRC1997-018</volume>
<year>1997</year>
<ee>db/labs/dec/SRC1997-018.html</ee>
<ee>http://www.mcjones.org/System_R/SQL_Reunion_95/</ee>
</article>

Callouts in the figure:
 Processing instr. – <?xml version="1.0" encoding="ISO-8859-1" ?>
 Open-tag – e.g., <dblp>
 Element – e.g., <mastersthesis …> … </mastersthesis>
 Attribute – e.g., mdate="2002-01-03"
 Close-tag – e.g., </volume>
XML Data Components
XML includes two kinds of data items:
Elements, e.g.
<article mdate="2002-01-03" …>
<editor>Paul R. McJones</editor> …
</article>
 Hierarchical structure with open tag-close tag pairs
 May include nested elements
 May include attributes within the element’s open-tag
 Multiple elements may have same name
 Order matters

Attributes, e.g. mdate="2002-01-03"
 Named values – not hierarchical
 Only one attribute with a given name per element
 Order does NOT matter
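
The difference between elements and attributes can be seen from a program's point of view. The sketch below uses Python's standard `xml.etree.ElementTree` on a shortened copy of the sample article; the variable names are illustrative only.

```python
import xml.etree.ElementTree as ET

ARTICLE_XML = """<article mdate="2002-01-03" key="tr/dec/SRC1997-018">
  <editor>Paul R. McJones</editor>
  <title>The 1995 SQL Reunion</title>
  <ee>db/labs/dec/SRC1997-018.html</ee>
  <ee>http://www.mcjones.org/System_R/SQL_Reunion_95/</ee>
</article>"""

article = ET.fromstring(ARTICLE_XML)

# Attributes: a name -> value mapping, at most one value per name.
attributes = dict(article.attrib)

# Child elements: an ordered list in which names may repeat (two <ee> children).
child_names = [child.tag for child in article]
ee_values = [ee.text for ee in article.findall("ee")]
```

Comparing `attributes` as a dictionary reflects that attribute order carries no meaning, while `child_names` preserves document order.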

Well-Formed XML: Always Parsable
Any legal XML document is always parsable by an XML
parser, without knowledge of tag meaning
 The start – preamble – tells the XML parser the character encoding:
<?xml version="1.0" encoding="utf-8"?>
 There’s a single root element
 All open-tags have matching close-tags (unlike many HTML documents!), or use the special
<tag/> shortcut for empty tags (equivalent to <tag></tag>)
 Attributes only appear once in an element
 XML is case-sensitive
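
These rules can be checked without any knowledge of what the tags mean. As a minimal sketch, the hypothetical helper `is_well_formed` below asks Python's standard `xml.etree.ElementTree` parser to parse a string and reports whether it succeeded.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text: str) -> bool:
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

checks = {
    "matched tags": is_well_formed("<a><b/></a>"),
    "unclosed tag": is_well_formed("<a><b></a>"),
    "two roots": is_well_formed("<a/><b/>"),
    "duplicate attribute": is_well_formed('<a x="1" x="2"/>'),
    "case mismatch": is_well_formed("<a></A>"),
}
```

Only the first input is accepted; each of the others violates one of the rules listed above.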

Outline
 XML data model
 Node types
 Encoding relations and semi-structured data
 Namespaces
 XML schema languages
 XML querying
 XML query processing
 XML schema mapping
XML as a Data Model
XML “information set” includes 7 types of nodes:
 Document (root)
 Element
 Attribute
 Processing instruction
 Text (content)
 Namespace
 Comment
XML data model includes this, plus typing info, plus
order info and a few other things

XML Data Model Visualized (and simplified!)
[Figure: tree view of the sample document. Legend: root, processing instruction (p-i), element, attribute, text]
Root
  ?xml (p-i)
  dblp
    mastersthesis [mdate = 2002…, key = ms/Brown92]
      author: Kurt P….
      title: PRPL…
      year: 1992
      school: Univ….
    article [mdate = 2002…, key = tr/dec/…]
      editor: Paul R.
      title: The…
      journal: Digital…
      volume: SRC…
      year: 1997
      ee: db/labs/dec
      ee: http://www.
XML Easily Encodes Relations

Student-course-grade
sid   cid      exp-grade
1     570103   B
23    550103   A

<student-course-grade>
<tuple><sid>1</sid><cid>570103</cid>
<exp-grade>B</exp-grade></tuple>
<tuple><sid>23</sid><cid>550103</cid>
<exp-grade>A</exp-grade></tuple>
</student-course-grade>

OR

<student-course-grade>
<tuple sid="1" cid="570103" exp-grade="B"/>
<tuple sid="23" cid="550103" exp-grade="A"/>
</student-course-grade>
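
Both encodings of the student-course-grade relation can be produced mechanically from the same rows. The following is a rough sketch using Python's standard `xml.etree.ElementTree`; the function names `as_elements` and `as_attributes` are made up for illustration.

```python
import xml.etree.ElementTree as ET

rows = [
    {"sid": "1", "cid": "570103", "exp-grade": "B"},
    {"sid": "23", "cid": "550103", "exp-grade": "A"},
]

def as_elements(rows):
    """One <tuple> per row, one child element per column."""
    root = ET.Element("student-course-grade")
    for row in rows:
        tup = ET.SubElement(root, "tuple")
        for column, value in row.items():
            ET.SubElement(tup, column).text = value
    return root

def as_attributes(rows):
    """One empty <tuple> per row, one attribute per column."""
    root = ET.Element("student-course-grade")
    for row in rows:
        ET.SubElement(root, "tuple", dict(row))
    return root

element_form = ET.tostring(as_elements(rows), encoding="unicode")
attribute_form = ET.tostring(as_attributes(rows), encoding="unicode")
```

The attribute form is more compact, but it only works because every column holds a single scalar value per row.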
XML is “Semi-Structured”
<parents>
<parent name="Jean">
<son>John</son>
<daughter>Joan</daughter>
<daughter>Jill</daughter>
</parent>
<parent name="Feng">
<daughter>Ella</daughter>
</parent>
…
Combining XML from Multiple Sources with the Same Tags: Namespaces
 Namespaces allow us to specify a context for different tags
 Two parts:
 Binding of namespace to URI
 Qualified names

<root xmlns="http://www.first.com/aspace" xmlns:otherns="…">
<myns:tag xmlns:myns="http://www.fictitious.com/mypath">
<thistag>is in the default namespace (www.first.com/aspace)</thistag>
<myns:thistag>is in myns</myns:thistag>
<otherns:thistag>is a different tag in otherns</otherns:thistag>
</myns:tag>
</root>

Callouts in the figure:
 xmlns="…" sets the default namespace for non-qualified names
 xmlns:otherns="…" defines the “otherns” qualifier
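
To see that the three `thistag` elements really are different names, one can parse the example and look at how a namespace-aware parser reports them. This sketch uses Python's standard `xml.etree.ElementTree`; because the slide elides the URI bound to `otherns`, the URI `http://www.example.org/other` below is a placeholder assumption.

```python
import xml.etree.ElementTree as ET

# The otherns URI is a hypothetical stand-in for the one elided on the slide.
DOC = """<root xmlns="http://www.first.com/aspace"
      xmlns:otherns="http://www.example.org/other">
  <myns:tag xmlns:myns="http://www.fictitious.com/mypath">
    <thistag>is in the default namespace</thistag>
    <myns:thistag>is in myns</myns:thistag>
    <otherns:thistag>is a different tag in otherns</otherns:thistag>
  </myns:tag>
</root>"""

root = ET.fromstring(DOC)

# ElementTree reports each name as "{namespace-URI}local-name".
qualified = [el.tag for el in root.iter() if el.tag.endswith("}thistag")]
```

The prefix itself (`myns`, `otherns`) is only a local abbreviation; the namespace URI is what distinguishes the names.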
Outline
 XML data model
 XML schema languages
 DTDs
 XML Schema (XSD)
 XML querying
 XML query processing
 XML schema mapping
XML Isn’t Enough on Its Own
It’s too unconstrained for many cases!
 How will we know when we’re getting garbage?
 How will we know what to query for?
 How will we understand what we received?

We also need:
 An idea of (at least part of) the structure
 Some knowledge of how to interpret the tags…

Structural Constraints:
Document Type Definitions (DTDs)
The DTD is an EBNF grammar defining XML structure
 The XML document specifies an associated DTD, plus the
root element of the document
 DTD specifies children of the root (and so on)

DTD also defines special attribute types:

 IDs – special attributes that are analogous to keys for
elements
 IDREFs – references to IDs
 IDREFS – a list of IDREFs, space-delimited (!)
 All other attributes are essentially treated as strings
An Example DTD and How to Reference It from XML
Example DTD:
<!ELEMENT dblp ((mastersthesis | article)*)>
<!ELEMENT mastersthesis (author,title,year,school,committeemember*)>
<!ATTLIST mastersthesis mdate CDATA #REQUIRED
key ID #REQUIRED
advisor CDATA #IMPLIED>
<!ELEMENT author (#PCDATA)>
…

Example use of DTD in XML file:

<?xml version="1.0" encoding="ISO-8859-1" ?>
<!DOCTYPE dblp SYSTEM "my.dtd">
<dblp>…
Links in XML: Restricted Foreign Keys
<?xml version="1.0" encoding="ISO-8859-1" ?>
<!DOCTYPE graph SYSTEM "special.dtd">
<graph>
<author id="author1">
<name>John Smith</name>
</author>
<article>
<author ref="author1" /> <title>Paper1</title>
</article>
<article>
<author ref="author1" /> <title>Paper2</title>
</article>
…

Callouts in the figure:
 id="author1": suppose we have defined this to be of type ID
 ref="author1": suppose we have defined this to be of type IDREF
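
An application can follow such links by indexing the ID values and looking up each reference. The sketch below does this by hand in Python; it assumes, as on the slide, that `id` plays the role of an ID and `ref` the role of an IDREF, and it does not validate against a DTD (the standard `xml.etree.ElementTree` parser is non-validating).

```python
import xml.etree.ElementTree as ET

GRAPH = """<graph>
  <author id="author1"><name>John Smith</name></author>
  <article><author ref="author1"/><title>Paper1</title></article>
  <article><author ref="author1"/><title>Paper2</title></article>
</graph>"""

graph = ET.fromstring(GRAPH)

# Index every element carrying an "id" attribute (the ID side of the link).
by_id = {el.get("id"): el for el in graph.iter() if el.get("id") is not None}

# Follow each article's "ref" attribute (the IDREF side) to the author element.
papers_by_author = {}
for article in graph.findall("article"):
    target = by_id[article.find("author").get("ref")]
    name = target.findtext("name")
    papers_by_author.setdefault(name, []).append(article.findtext("title"))
```

Note that the index is document-wide: an ID value identifies an element in the whole document, regardless of the element's type.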
The Limitations of DTDs
DTDs capture grammatical structure, but have some
drawbacks:
 Don’t capture database datatypes’ domains
 IDs aren’t a good implementation of keys
 Why not?
 No way of defining OO-like inheritance
 “Almost XML” syntax – inconvenient to build tools for them
XML Schema (XSD)
Aims to address the shortcomings of DTDs
 XML syntax
 Can define keys using XPaths (we’ll discuss later)
 Type subclassing that also includes restrictions on ranges
 “By extension” (adds new data) and “by restriction” (adds
constraints)
 … And, of course, domains and built-in datatypes

(Note there are other XML schema formats like RELAX NG)

Basics of XML Schema
Need to use the XML Schema namespace (generally named xsd)
 simpleTypes are a way of restricting domains on scalars
 Can define a simpleType based on integer, with values within a
particular range
 complexTypes are a way of defining element/attribute
structures
 Basically equivalent to !ELEMENT, but more powerful
 Specify sequence, choice between child elements
 Specify minOccurs and maxOccurs (default 1)
 Must associate an element/attribute with a simpleType, or an
element with a complexType

Simple XML Schema Example
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
<xsd:element name="mastersthesis" type="ThesisType"/>
<xsd:complexType name="ThesisType">
<xsd:sequence>
<xsd:element name="author" type="xsd:string"/>
<xsd:element name="title" type="xsd:string"/>
<xsd:element name="year" type="xsd:integer"/>
<xsd:element name="school" type="xsd:string"/>
<xsd:element name="committeemember" type="CommitteeType"
minOccurs="0"/>
</xsd:sequence>
<xsd:attribute name="mdate" type="xsd:date"/>
<xsd:attribute name="key" type="xsd:string"/>
<xsd:attribute name="advisor" type="xsd:string"/>
</xsd:complexType>
</xsd:schema>

Callouts in the figure:
 xmlns:xsd="…" associates the “xsd” namespace with XML Schema
 <xsd:element name="mastersthesis" …> is the root element, with type specified below
 Inside a complexType, attribute declarations follow the sequence of child elements
Designing an XML Schema/DTD
Not as formalized as relational data design
 Typically based on an existing underlying design, e.g., relational DBMS
or spreadsheet

We generally orient the XML tree around the “central” objects

Big decision: element vs. attribute

 Element if it has its own properties, or if you might have more than
one of them
 Attribute if it is a single property – though element is OK here too!

Outline
 XML data model
 XML schema languages
 XML querying
 DOM and SAX
 XPath
 XQuery
 XML query processing
 XML schema mapping
XML to Your Program: Document Object Model
(DOM) and Simple API for XML (SAX)
 A huge benefit of XML – standard parsers and standard (cross-
language) APIs for processing it

 DOM: an object-oriented representation of the XML parse tree (roughly like the Data Model graph)
 DOM objects have methods like “getFirstChild()”, “getNextSibling()”
 Common way of traversing the tree
 Can also modify the DOM tree – alter the XML – via insertBefore(), appendChild(), etc.

 Sometimes we don’t want all of the data: SAX

 Parser interface that calls a function each time it parses a processing-
instruction, element, etc.
 Your code can determine what to do, e.g., build a data structure, or
discard a particular portion of the data
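
The contrast between the two APIs can be sketched with Python's standard library, which ships a DOM implementation (`xml.dom.minidom`) and a SAX parser (`xml.sax`). In Python's DOM binding the navigation accessors are properties (`firstChild`, `nextSibling`) rather than the Java-style getter methods named above; the `TitleCollector` class is a made-up example handler.

```python
import xml.dom.minidom
import xml.sax

DOC = ("<dblp><article><title>The 1995 SQL Reunion</title>"
       "<year>1997</year></article></dblp>")

# DOM: the whole tree is materialized first, then navigated node by node.
dom = xml.dom.minidom.parseString(DOC)
article = dom.documentElement.firstChild
first_child_name = article.firstChild.tagName
next_sibling_name = article.firstChild.nextSibling.tagName

# SAX: the parser calls back once per event; we keep only the titles.
class TitleCollector(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.titles = []

    def startElement(self, name, attrs):
        if name == "title":
            self.in_title = True
            self.titles.append("")

    def characters(self, content):
        if self.in_title:
            self.titles[-1] += content

    def endElement(self, name):
        if name == "title":
            self.in_title = False

collector = TitleCollector()
xml.sax.parseString(DOC.encode("utf-8"), collector)
```

The SAX handler never holds more than the collected titles in memory, which is the property that matters for large or streamed documents.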

Querying XML
Alternate approach to processing the data: a query language
 Define some sort of a template describing traversals from the root of
the directed graph

 Potential benefits in parallelism, views, schema mappings, and so on

 In XML, the basis of this template is called an XPath

 Can also declare some constraints on the values you want
 The XPath returns a node set of matches

XPaths
In its simplest form, an XPath looks like a path in a file system:
/mypath/subpath/*/morepath

 But XPath returns a node set representing the XML nodes (and their subtrees) at the end of the path
 XPaths can have node tests at the end, keeping only nodes of the given type
 text(), processing-instruction(), comment(), element(), attribute()
 XPath is fundamentally an ordered language: it can query in order-aware fashion, and it returns nodes in order
Recall Our Sample XML
<?xml version="1.0" encoding="ISO-8859-1" ?>
<dblp>
<mastersthesis mdate="2002-01-03" key="ms/Brown92">
<author>Kurt P. Brown</author>
<title>PRPL: A Database Workload Specification
Language</title>
<year>1992</year>
<school>Univ. of Wisconsin-Madison</school>
</mastersthesis>
<article mdate="2002-01-03" key="tr/dec/SRC1997-018">
<editor>Paul R. McJones</editor>
<title>The 1995 SQL Reunion</title>
<journal>Digital System Research Center Report</journal>
<volume>SRC1997-018</volume>
<year>1997</year>
<ee>db/labs/dec/SRC1997-018.html</ee>
<ee>http://www.mcjones.org/System_R/SQL_Reunion_95/</ee>
</article>
Recall Our XML Tree
[Figure: the tree of the sample document. Root has children ?xml (p-i) and dblp; dblp has children mastersthesis and article. Legend: root, p-i, element, attribute, text]
Some Example XPath Queries
 /dblp/mastersthesis/title
 /dblp/*/editor
 //title
 //title/text()
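
These queries can be tried out with Python's standard `xml.etree.ElementTree`, which supports a limited subset of XPath. In this sketch the paths are written relative to the `<dblp>` root element (so `/dblp/…` becomes `./…`), and since that subset has no `text()` step, the text is read from the `.text` property instead. The sample document is a shortened copy of the one above.

```python
import xml.etree.ElementTree as ET

DBLP = """<dblp>
  <mastersthesis mdate="2002-01-03" key="ms/Brown92">
    <author>Kurt P. Brown</author>
    <title>PRPL: A Database Workload Specification Language</title>
    <year>1992</year>
    <school>Univ. of Wisconsin-Madison</school>
  </mastersthesis>
  <article mdate="2002-01-03" key="tr/dec/SRC1997-018">
    <editor>Paul R. McJones</editor>
    <title>The 1995 SQL Reunion</title>
    <year>1997</year>
  </article>
</dblp>"""

dblp = ET.fromstring(DBLP)  # context node: the <dblp> root element

thesis_titles = dblp.findall("./mastersthesis/title")  # /dblp/mastersthesis/title
editors = dblp.findall("./*/editor")                   # /dblp/*/editor
all_titles = dblp.findall(".//title")                  # //title
title_text = [t.text for t in all_titles]              # //title/text()
```

Results come back in document order, matching the ordered semantics of XPath.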

Context Nodes and Relative Paths
XPath has a notion of a context node: it’s analogous to
a current directory
 “.” represents this context node
 “..” represents the parent node
 We can express relative paths:
subpath/sub-subpath/../.. gets us back to the context node

 By default, the document root is the context node

Predicates – Selection Operations
A predicate allows us to filter the node set based on
selection-like conditions over sub-XPaths:

/dblp/article[title = "Paper1"]

which is equivalent to:

/dblp/article[./title/text() = "Paper1"]
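
A predicate of this kind can be tried with the XPath subset in Python's standard `xml.etree.ElementTree`, which supports the `[tag='text']` form. The two-article document below is invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data, not taken from the slides.
DOC = """<dblp>
  <article><title>Paper1</title><year>1997</year></article>
  <article><title>Paper2</title><year>1998</year></article>
</dblp>"""

dblp = ET.fromstring(DOC)

# Keep only the articles that have a <title> child whose text is "Paper1".
matches = dblp.findall("./article[title='Paper1']")
years = [article.findtext("year") for article in matches]
```

The predicate filters the `article` nodes; the result is still a set of `article` elements, not of titles.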
Axes: More Complex Traversals
Thus far, we’ve seen XPath expressions that go down
the tree (and up one step)
 But we might want to go up, left, right, etc. via axes:
 self::path-step
 child::path-step parent::path-step
 descendant::path-step ancestor::path-step
 descendant-or-self::path-step ancestor-or-self::path-step
 preceding-sibling::path-step following-sibling::path-step
 preceding::path-step following::path-step
 The previous XPaths we saw were in “abbreviated form”; written out with explicit axes:
/dblp/mastersthesis/title is /child::dblp/child::mastersthesis/child::title
//title is /descendant-or-self::node()/child::title
Querying Order
 We saw in the previous slide that we could query for
preceding or following siblings or nodes
 We can also query a node’s position according to some index:
 fn:position() – 1-based position of the current node among the nodes matching the last step
 fn:last() – position of the last such node (there is no first() function; the first node is at position 1)

child::article[fn:position() = fn:last()]
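
Positional predicates are also available in the XPath subset of Python's standard `xml.etree.ElementTree`, in the abbreviated forms `[1]`, `[last()]`, and `[last()-1]`. The three-article document is invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data, not taken from the slides.
DOC = """<dblp>
  <article><title>Paper1</title></article>
  <article><title>Paper2</title></article>
  <article><title>Paper3</title></article>
</dblp>"""

dblp = ET.fromstring(DOC)

# article[last()] abbreviates article[position() = last()].
last_title = dblp.find("./article[last()]").findtext("title")
first_title = dblp.find("./article[1]").findtext("title")
second_to_last = dblp.find("./article[last()-1]").findtext("title")
```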
XPath Is Used within Many Standards
 XML Schema uses simple XPaths in defining keys and
uniqueness constraints
 XQuery
 XSLT
 XLink and XPointer – hyperlinks for XML
XPath Is Used to Express XML Schema Keys & Foreign Keys
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
<xsd:complexType name="ThesisType">
<xsd:sequence>
<xsd:element name="author" type="xsd:string"/> …
<xsd:element name="school" type="xsd:string"/> …
</xsd:sequence>
<xsd:attribute name="key" type="xsd:string"/>
</xsd:complexType>
<xsd:element name="dblp"> <xsd:sequence>
<xsd:element name="mastersthesis" type="ThesisType">
<xsd:keyref name="schoolRef" refer="schoolId">
<xsd:selector xpath="./school"/> <xsd:field xpath="./text()"/>
</xsd:keyref> </xsd:element>
<xsd:element name="university" type="SchoolType">…</xsd:element>
</xsd:sequence>
<xsd:key name="schoolId">
<xsd:selector xpath="university"/><xsd:field xpath="@key"/>
</xsd:key> </xsd:element> </xsd:schema>

Callouts in the figure:
 refer="schoolId": foreign key refers to key by its ID
 xsd:selector: item w/key = selector
 xsd:field: field is its key
Beyond XPath: XQuery
A strongly-typed, Turing-complete XML manipulation language
 Attempts to do static typechecking against XML Schema
 Based on an object model derived from Schema

Unlike SQL, fully compositional, highly orthogonal:

 Inputs & outputs collections (sequences or bags) of XML nodes
 Anywhere a particular type of object may be used, may use the
results of a query of the same type
 Designed mostly by DB and functional language people

Can be used to define queries, views, and (using a subset) schema mappings
XQuery’s Basic Form
 Has an analogous form to SQL’s
SELECT..FROM..WHERE..GROUP BY..ORDER BY
 The model: bind nodes (or node sets) to variables; operate
over each legal combination of bindings; produce a set of
nodes
 “FLWOR” statement [note case sensitivity!]:
for {iterators that bind variables}
let {collections}
where {conditions}
order by {order-paths}
return {output constructor}
 Mixes XML + XQuery syntax; use {} as “escapes”
Recall Our XML Tree
[Figure: the tree of the sample document. Root has children ?xml (p-i) and dblp; dblp has children mastersthesis and article. Legend: root, p-i, element, attribute, text]
“Iterations” in XQuery
A series of (possibly nested) FOR statements assigning the results of XPaths
to variables

for $root in doc("http://my.org/my.xml")

for $sub in $root/rootElement,
$sub2 in $sub/subElement, …

 Something like a template that pattern-matches, produces a “binding tuple”
 For each of these, we evaluate the WHERE and possibly output the RETURN template
 document() or doc() function specifies an input file as a URI
 Early versions used “document”; modern versions use “doc”
Two XQuery Examples
<root-tag> {
for $p in doc("dblp.xml")/dblp/article,
$yr in $p/year
where $yr = "1997"
return <paper> { $p/title } </paper>
} </root-tag>

for $i in doc("dblp.xml")/dblp/article[author/text() = "John Smith"]
return <smith-paper>
<title>{ $i/title/text() }</title>
<key>{ $i/@key }</key>
{ $i/crossref }
</smith-paper>
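
For readers more used to imperative code, the first query can be read as nested loops over bindings followed by a filter and an output constructor. The sketch below mimics that reading in Python with the standard `xml.etree.ElementTree`; it is an analogy rather than an XQuery engine, it uses the `year` element name from the sample document, and the second article in the data is made up.

```python
import xml.etree.ElementTree as ET

# The second <article> is hypothetical, added so the filter has something to reject.
DBLP = """<dblp>
  <article><title>The 1995 SQL Reunion</title><year>1997</year></article>
  <article><title>Another Report</title><year>1998</year></article>
</dblp>"""

dblp = ET.fromstring(DBLP)

result = ET.Element("root-tag")
for p in dblp.findall("./article"):              # for $p in .../dblp/article,
    for yr in p.findall("./year"):               #     $yr in $p/year
        if yr.text == "1997":                    # where $yr = "1997"
            paper = ET.SubElement(result, "paper")
            paper.extend(p.findall("./title"))   # return <paper>{ $p/title }</paper>

output = ET.tostring(result, encoding="unicode")
```

Each pass through the inner loop corresponds to one binding tuple ($p, $yr); the `where` clause decides whether the `return` template fires for it.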
Restructuring Data in XQuery
Nesting XML trees is perhaps the most common operation
In XQuery, it’s easy – put a subquery in the return clause where you want
things to repeat!

for $u in doc("dblp.xml")/dblp/university
where $u/country = "USA"
return <ms-theses-99>
{ $u/name } {
for $mt in doc("dblp.xml")/dblp/mastersthesis
where $mt/year/text() = "1999" and $mt/school = $u/name
return $mt/title }
</ms-theses-99>
Collections & Aggregation in XQuery
In XQuery, many operations return collections
 XPaths, sub-XQueries, functions over these, …
 The let clause assigns the results to a variable
Aggregation simply applies a function over a collection, where
the function returns a value (very elegant!)

let $allpapers := doc("dblp.xml")/dblp/article

return <article-authors>
<count> { fn:count(fn:distinct-values($allpapers/author)) } </count>
{ for $paper in doc("dblp.xml")/dblp/article
let $pauth := $paper/author
return <paper> {$paper/title}
<count> { fn:count($pauth) } </count>
</paper>
} </article-authors>
Collections, Ctd.
Unlike in SQL, we can compose aggregations and
create new collections from old:
<result> {
let $avgItemsSold := fn:avg(
for $order in doc("my.xml")/orders/order
let $totalSold := fn:sum($order/item/quantity)
return $totalSold)
return $avgItemsSold
} </result>
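
The composition in this query (an average taken over per-order sums) can be mirrored in a few lines of Python. This is only an analogy to the XQuery above, using the standard `xml.etree.ElementTree` and `statistics` modules on a small invented `orders` document with the element names the query assumes.

```python
import xml.etree.ElementTree as ET
from statistics import mean

# Hypothetical contents for my.xml, shaped as the query expects.
ORDERS = """<orders>
  <order>
    <item><quantity>2</quantity></item>
    <item><quantity>3</quantity></item>
  </order>
  <order>
    <item><quantity>1</quantity></item>
  </order>
</orders>"""

orders = ET.fromstring(ORDERS)

# Inner aggregation: fn:sum($order/item/quantity), once per order.
totals = [
    sum(int(q.text) for q in order.findall("./item/quantity"))
    for order in orders.findall("./order")
]

# Outer aggregation: fn:avg(...) over the collection of per-order sums.
avg_items_sold = mean(totals)
```

The inner comprehension yields a new collection, and the outer function consumes it, which is the composability the slide contrasts with SQL.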
Distinct-ness
In XQuery, DISTINCT-ness happens as a function over a
collection
 But since we have nodes, we can do duplicate removal
according to value or node
 Can do fn:distinct-values(collection) to remove duplicate values; early XQuery drafts also offered fn:distinct-nodes(collection) to remove duplicate nodes, which is not part of the final XQuery 1.0 function library

for $years in fn:distinct-values(doc("dblp.xml")//year/text())
return $years
Sorting in XQuery
 In XQuery, what we order is the sequence of “result
tuples” output by the return clause:

for $x in doc("dblp.xml")/proceedings
order by $x/title/text()
return $x
Querying & Defining Metadata
Can get a node’s name by querying name():
for $x in doc("dblp.xml")/dblp/*
return name($x)

Can construct elements and attributes using computed names:

for $x in doc("dblp.xml")/dblp/*,
$year in $x/year,
$title in $x/title/text()
return
element { name($x) } {
attribute { concat("year-", $year) } { $title }
}
Views in XQuery
 A view is a named query
 We use the name of the view to invoke the query
(treating it as if it were the relation it returns)

XQuery:
declare function V() as element(content)* {
for $r in doc("R")/root/tree,
$a in $r/a, $b in $r/b, $c in $r/c
where $a = "123"
return <content>{$a, $b, $c}</content>
}

Using the view:
for $v in V()/content,
$r in doc("r")/root/tree
where $v/b = $r/b
return $v
Outline
 XML data model
 XML schema languages
 XML querying
 XML query processing
 XML schema mapping
Streaming Query Evaluation
 In data integration scenarios, the query processor must fetch remote data, parse the XML, and process it
 Ideally: we can pipeline processing of the data as it is “streaming” to the system

“Streaming XPath evaluation”
… which is also a building block to pipelined XQuery evaluation…
Main Observations
 XML is sent (serialized) in a form that corresponds to a left-to-right depth-first traversal of the parse tree
 The “core” part of XPath (child, descendant axes) essentially corresponds to regular expressions over edge labels
The First Enabler:
SAX (Simple API for XML)
 If we are to match XPaths in streaming fashion, we need a stream of XML nodes
 SAX provides a series of event notifications
 Events include open-tag, close-tag, character data
 Events will be fired in depth-first, left-to-right traversal order of the XML tree
The Second Key: Finite Automata
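 Convert each XPath to an equivalent regular expression
 Build a finite automaton (NFA or DFA) for the regexp

[Figure: two automata. For /dblp/article: a chain of states with a transition labeled dblp followed by a transition labeled article. For //year: a start state with a self-loop on any symbol (∑) and a transition labeled year]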
Matching an XPath
 Assume a “cursor” on the active state in the automaton
 On matching open-tag: push the current state, then advance the active state
 On close-tag: pop to restore the previous active state

Automaton for /dblp/article: state 1 –dblp→ state 2 –article→ state 3 (accepting); an open-tag with no matching transition leads to a dead state

<?xml version="1.0" encoding="ISO-8859-1" ?>
<dblp>
<mastersthesis>
…
</mastersthesis>
<article mdate="2002-01-03" key="tr/dec/SRC1997-018">
<editor>Paul R. McJones</editor>
<title>The 1995 SQL Reunion</title>
<journal>Digital System Research Center Report</journal>
<volume>SRC1997-018</volume>
<year>1997</year>
<ee>db/labs/dec/SRC1997-018.html</ee>
<ee>http://www.mcjones.org/System_R/SQL_Reunion_95/</ee>
</article>

Events, in the order the slides animate them:
 start-element “dblp”: advance from state 1 to state 2
 start-element “mastersthesis”: no matching transition, so the cursor moves to the dead state
 end-element “mastersthesis”: pop, returning the cursor to state 2
 start-element “article”: advance to state 3 – match!
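
The push/advance/pop discipline described for matching an XPath over SAX events can be sketched as a SAX handler. This is a minimal illustration under stated assumptions, not one of the published streaming algorithms: it handles only child-axis paths such as /dblp/article, the class name `PathMatcher` is made up, and states are numbered from 0 (steps matched so far) rather than from 1 as on the slides.

```python
import xml.sax

class PathMatcher(xml.sax.ContentHandler):
    """Matches a child-axis-only path such as /dblp/article on a SAX stream."""

    DEAD = -1  # state entered when an open-tag has no matching transition

    def __init__(self, path):
        super().__init__()
        self.steps = path.strip("/").split("/")
        self.state = 0      # number of path steps matched so far
        self.stack = []     # states saved at each open-tag
        self.matches = []   # "key" attributes of the matched elements

    def startElement(self, name, attrs):
        self.stack.append(self.state)                    # push
        if (self.state != self.DEAD
                and self.state < len(self.steps)
                and self.steps[self.state] == name):
            self.state += 1                              # advance
            if self.state == len(self.steps):            # accepting state
                self.matches.append(attrs.get("key"))
        else:
            self.state = self.DEAD

    def endElement(self, name):
        self.state = self.stack.pop()                    # pop

DOC = b"""<dblp>
<mastersthesis key="ms/Brown92"><title>PRPL</title></mastersthesis>
<article key="tr/dec/SRC1997-018"><editor>Paul R. McJones</editor></article>
</dblp>"""

matcher = PathMatcher("/dblp/article")
xml.sax.parseString(DOC, matcher)
```

Because the stack depth equals the current element depth, memory use is bounded by the depth of the document rather than its size, which is what makes the approach suitable for streaming.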
Different Options
 Many different “streaming XPath” algorithms
 What kind of automaton to use
 DFA, NFA, lazy DFA, PDA, proprietary format
 Expressiveness of the path language
 Full regular path expressions, XPath, …
 Axes

 Which operations can be pushed into the operator

 XPath predicates, joins, position predicates, etc.

From XPaths to XQueries
 An XQuery takes multiple XPaths in the FOR/LET clauses,
and iterates over the elements of each XPath (binding the
variable to each)
FOR $rootElement in doc("dblp.xml")/dblp,
$rootChild in $rootElement/article[author="Bob"],
$textContent in $rootChild/text()
 We can think of an XQuery as doing tree matching, which returns tuples ($i, $j) for each tree matching $i and $j in a document

 Streaming XML path evaluator that supports a hierarchy of matches over an XML document
XQuery Path Evaluation
FOR $rootElement in doc("dblp.xml")/dblp,
$rootChild in $rootElement/article[author="Bob"],
$textContent in $rootChild/text()

 Multiple, dependent state machines outputting binding tuples ($rootElement, $rootChild, $textContent)
 Only activate $rootChild + $textContent on a match to $rootElement
 Evaluate a pushed-down selection predicate (author = "Bob")

[Figure: three chained state machines. dblp binds $rootElement; article binds $rootChild, with a test on author = "Bob"; text() binds the set $textContent]
Beyond the Initial FOR Paths
 The streaming XML evaluator operator returns tuples of bindings to nodes: ($rootElement, $rootChild, $textContent)
 We can now use standard relational operators to join, sort, group, etc.
 Also in some cases we may want to do further XPath evaluation against one of the XML trees bound to a variable
Creating XML
 To return XML, we need to be able to take streams
of binding tuples and:
 Add tags around certain columns
 Group tuples together and nest them under tags

 Thus XQuery evaluators have new operators for performing these operations
An Example XQuery Plan
[Figure: operator tree for an example XQuery plan, read bottom-up]
 Input: dblp.xml, streamed as <dblp>...
 Streaming XPath evaluation: dblp binds $rootElement; article with author = “Bob” binds $rootChild; text() binds the set $textContent
 Output tuples: (<dblp>…</dblp>, <article>…</article>, [“Paul R. McJones”, “The 1995…”, …])
 Π (projection): (<article>…</article>, [“Paul R. McJones”, “The 1995…”, …])
 XPath matcher over $textContent binds $txt: (“Paul R. McJones”), (“The 1995…”)
 XML tagging: (<text>Paul R. McJones</text>), (<text>The 1995…</text>)
 XML grouping: (<text>Paul R. McJones</text><text>The 1995…</text>)
 Relational-style query operators (outerjoin) combine each article tuple with its grouped <text> elements
 XPath evaluation against a binding: an XPath matcher over $rootChild binds editor to $editor and title to the set $title
 Π (projection): (<editor>Paul R. McJones</editor>, <title>The 1995…</title>, <text>Paul R. McJones</text><text>The 1995…</text>)
 XML tagging operator and XML output: (<BobResult><editor>Paul R. McJones</editor><title>The 1995…</title><text>Paul R. McJones</text><text>The 1995…</text></BobResult>)
Optimizing XQueries
 An entire field in and of itself
 A major challenge versus relational query optimization: estimating the “fan-out” of path evaluation
 A second major challenge: full XQuery supports arbitrary recursion and is Turing-complete
Outline
 XML data model
 XML schema languages
 XML querying
 XML query processing
 XML schema mapping
Schema Mappings for XML
 In Chapter 3 we saw how schema mappings were
described for relational data
 As a set of constraints between source and target
databases

 In the XML realm, we want a similar constraint language, but must address:
 Nesting – XML is hierarchical
 Identity – how do we merge multiple partial results into a single XML tree?
One Approach: Piazza XML Mappings
Derived from a subset of XQuery extended with node identity
 The latter is used to merge results with the same node ID
Directional mapping language based on annotations to XML
templates
<output>
{: $var IN document("doc")/path WHERE condition :}
<tag>$var</tag>
</output>

Callouts in the figure:
 <output>: an output element in the template, ~ XQuery RETURN
 {: … :}: create the element for each match to this set of XPaths & conditions
 <tag>$var</tag>: populate with the value of a binding

 Translates between parts of data instances
 Supports special annotations and object fusion
Mapping Example between Two XML Schemas
Target: Publications by book
Source: Publications by author

Has an entity-relationship model representation like:
[Figure: ER diagram. publication (attributes title, pub-type) is related via writtenBy to author (attribute name)]
Example Piazza-XML Mapping
<pubs>
<book>
{: $a IN document("…")/authors/author,
$an IN $a/full-name,
$t IN $a/publication/title,
$typ IN $a/publication/pub-type
WHERE $typ = "book" :}
<title>{$t}</title>
<author><name>{$an}</name></author>
</book>
</pubs>

Callouts in the figure:
 <book> with its {: … :} annotation: output one book per match to author
 <title> and <author>: insert title and author name subelements
Example Piazza-XML Mapping
<pubs>
<book piazza:id={$t}>
{: $a IN document("…")/authors/author,
$an IN $a/full-name,
$t IN $a/publication/title,
$typ IN $a/publication/pub-type
WHERE $typ = "book" :}
<title piazza:id={$t}>{$t}</title>
<author><name>{$an}</name></author>
</book>
</pubs>

Callouts in the figure:
 piazza:id={$t}: merge elements if they are for the same value of $t
 <book> with its {: … :} annotation: output one book per match to author
 <title> and <author>: insert title and author name subelements
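
The effect of merging on a shared identifier can be imitated in ordinary code. The sketch below is a hand-written Python analogy, not the Piazza system: it keeps one `<book>` per distinct title, which is the role `piazza:id={$t}` plays in the mapping, and it assumes that a title and its pub-type are read from the same `<publication>` element. The author names and titles are invented.

```python
import xml.etree.ElementTree as ET

# Hypothetical source data in the "publications by author" shape.
AUTHORS = """<authors>
  <author><full-name>Ann</full-name>
    <publication><title>T1</title><pub-type>book</pub-type></publication>
    <publication><title>T2</title><pub-type>article</pub-type></publication>
  </author>
  <author><full-name>Bob</full-name>
    <publication><title>T1</title><pub-type>book</pub-type></publication>
  </author>
</authors>"""

authors = ET.fromstring(AUTHORS)
pubs = ET.Element("pubs")
books_by_id = {}  # one <book> per distinct title, like piazza:id={$t}

for a in authors.findall("./author"):
    an = a.findtext("full-name")
    for pub in a.findall("./publication"):
        if pub.findtext("pub-type") != "book":   # WHERE $typ = "book"
            continue
        t = pub.findtext("title")
        book = books_by_id.get(t)
        if book is None:                         # first match for this title
            book = ET.SubElement(pubs, "book")
            ET.SubElement(book, "title").text = t
            books_by_id[t] = book
        author = ET.SubElement(book, "author")   # every match adds an author
        ET.SubElement(author, "name").text = an

merged = ET.tostring(pubs, encoding="unicode")
```

Without the identifier, each match would produce its own `<book>`, so the book written by two authors would appear twice.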
A More Formal Model:
Nested TGDs
The underpinnings of the Piazza-XML mapping language
can be captured using nested tuple-generating
dependencies (nested TGDs)
 Recall relational TGDs from Chapter 3

$$\forall X, Y, S \,\bigl(\phi(X, Y) \wedge \Phi(S) \rightarrow \exists Z, T \,(\psi(X, Z) \wedge \Psi(T))\bigr)$$

 $\phi$: formulas over source
 $\Phi$: formulas over set-valued source variables
 $\psi$: formulas over target
 $\Psi$: formulas over set-valued target variables, with grouping keys
 As before, we’ll typically omit the ∀ quantifiers…
Example Piazza-XML Mapping
as a Nested TGD
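<pubs>
<book piazza:id={$t}>
{: $a IN document("…")/authors/author,
$an IN $a/full-name,
$t IN $a/publication/title,
$typ IN $a/publication/pub-type
WHERE $typ = "book" :}

$$\mathit{authors}(\mathit{author}) \wedge \mathit{author}(f, \mathit{publication}) \wedge \mathit{publication}(t, \mathit{book}) \rightarrow \exists p \,\bigl(\mathit{pubs}(\mathit{book}) \wedge \mathit{book}_{t}(t, \mathit{author}', \mathit{publisher}) \wedge \mathit{author}'_{t,f}(f) \wedge \mathit{publisher}_{t}(p)\bigr)$$

Grouping keys in target: the subscripts $t$ and $t,f$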
Query Reformulation for XML
 Two main versions:
 Global-as-view-style:
 Query is posed over the target of a nested TGD, or a Piazza-XML
mapping
 Can answer the query through standard XQuery view unfolding

 Bidirectional mappings, more like GLAV mappings in the relational world:
 An advanced topic – see the bibliographic notes
XML Wrap-up
 XML forms an important part of the data integration
picture – it’s a “bridge” enabling rapid connection to
external sources

 It introduces new complexities in:

 Query processing – need streaming XPath / XQuery evaluation
 Mapping languages – must support identity and nesting
 Query reformulation
 It also is a bridge to RDF and the Semantic Web (Chapter
12)

