0432 XML DTD and XML Schema

Learn XML

Uploaded by

Dhrubo Mazumder
XML, DTD, and XML Schema
Introduction to Databases
CompSci 316 Fall 2014

Announcements (Tue. Oct. 21)

• Midterm scores and sample solution posted
  • You may pick up graded exams outside my office
  • Mean: 83.9
  • Stdev: 11.0
  • Max: 100+5
• PHP and Django example website code posted; more to come
• Homework #3 to be assigned on Thursday
• Project milestone #1 feedback to be returned this weekend

Structured vs. unstructured data

• Relational databases are highly structured
  • All data resides in tables
  • You must define schema before entering any data
  • Every row conforms to the table schema
  • Changing the schema is hard and may break many things
• Texts are highly unstructured
  • Data is free-form
  • There is no pre-defined schema, and it’s hard to define any schema
  • Readers need to infer structures and meanings

What’s in between these two extremes?

Semi-structured data
• Observation: most data have some structure, e.g.:
  • Book: chapters, sections, titles, paragraphs, references, index, etc.
  • Item for sale: name, picture, price (range), ratings, promotions, etc.
  • Web page: HTML
• Ideas:
  • Ensure data is “well-formatted”
  • If needed, ensure data is also “well-structured”
    • But make it easy to define and extend this structure
  • Make data “self-describing”

HTML: language of the Web

  <h1>Bibliography</h1>
  <p><i>Foundations of Databases</i>,
  Abiteboul, Hull, Vianu
  <br>Addison Wesley, 1995
  <p>…

• It’s mostly a “formatting” language
• It mixes presentation and content
  • Hard to change presentation (say, for different displays)
  • Hard to extract content

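XML: eXtensible Markup Language

  <bibliography>
    <book>
      <title>Foundations of Databases</title>
      <author>Abiteboul</author>
      <author>Hull</author>
      <author>Vianu</author>
      <publisher>Addison Wesley</publisher>
      <year>1995</year>
    </book>
    <book>…</book>
  </bibliography>

• Text-based
• Capture data (content), not presentation
• Data self-describes its structure
  • Names and nesting of tags have meanings!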
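
To make "self-describing" concrete, here is a minimal Python sketch that recovers the structure of a document purely from its tag names and nesting, using the standard-library `xml.etree.ElementTree` parser. The sample document and the helper name `outline` are illustrative assumptions, not part of the lecture material.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; the element names are illustrative assumptions.
DOC = """<bibliography>
  <book>
    <title>Foundations of Databases</title>
    <author>Abiteboul</author>
    <author>Hull</author>
    <publisher>Addison Wesley</publisher>
    <year>1995</year>
  </book>
</bibliography>"""

def outline(element, depth=0):
    """Return (depth, tag, text) triples in document order."""
    text = (element.text or "").strip()
    rows = [(depth, element.tag, text)]
    for child in element:
        rows.extend(outline(child, depth + 1))
    return rows

rows = outline(ET.fromstring(DOC))
for depth, tag, text in rows:
    # Indentation mirrors nesting; no external schema was consulted.
    print("  " * depth + tag + (": " + text if text else ""))
```

No schema is supplied anywhere: the program learns that a book has a title and authors only because the data says so.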

Other nice features of XML

• Portability: Just like HTML, you can ship XML data across platforms
  • Relational data requires heavy-weight APIs
• Flexibility: You can represent any information (structured, semi-structured, documents, …)
  • Relational data is best suited for structured data
• Extensibility: Since data describes itself, you can change the schema easily
  • Relational schema is rigid and difficult to change

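XML terminology

  <book ISBN="ISBN-10" price="80.00">
    <title>Foundations of Databases</title>
    <author>Abiteboul</author>
    <author>Hull</author>
    <author>Vianu</author>
    <publisher>Addison Wesley</publisher>
    <year>1995</year>
  </book>

• Tag names: book, title, …
• Start tags: <book>, <title>, …
• End tags: </book>, </title>, …
• An element is enclosed by a pair of start and end tags: <book>…</book>
• Elements can be nested: <book>…<title>…</title>…</book>
• Empty elements: <is_textbook></is_textbook>
  • Can be abbreviated: <is_textbook/>
• Elements can also have attributes: <book ISBN="ISBN-10" price="80.00">

Ordering generally matters, except for attributes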
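
The ordering rule can be checked with a small Python sketch using the standard-library `xml.etree.ElementTree` module. The two sample elements are made up for illustration: they differ only in attribute order and in child order.

```python
import xml.etree.ElementTree as ET

# Two hypothetical elements: same attributes in a different order,
# same children in a different order.
a = ET.fromstring('<book ISBN="ISBN-10" price="80.00"><title/><author/></book>')
b = ET.fromstring('<book price="80.00" ISBN="ISBN-10"><author/><title/></book>')

# Attributes form an unordered name-to-value mapping.
same_attributes = a.attrib == b.attrib
# Child elements form an ordered list.
same_child_order = [c.tag for c in a] == [c.tag for c in b]

print(same_attributes, same_child_order)
```

The parser exposes attributes as a dictionary and children as a sequence, which is exactly the distinction the slide draws.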



Well-formed XML documents

A well-formed XML document
• Follows XML lexical conventions
  • Wrong: <section>We show that x < 5…</section>
  • Right: <section>We show that x &lt; 5…</section>
  • Other special entities: > becomes &gt; and & becomes &amp;
• Contains a single root element
• Has properly matched tags and properly nested elements
  • Right: <section>…<subsection>…</subsection>…</section>
  • Wrong: <section>…<subsection>…</section>…</subsection>
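
A quick way to see these rules in action is to hand candidate documents to a parser. The sketch below uses Python's standard library: `xml.etree.ElementTree` rejects text that is not well-formed, and `xml.sax.saxutils.escape` replaces `&`, `<`, and `>` with their entities. The helper name `is_well_formed` and the sample strings are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

def is_well_formed(text):
    """True if the parser accepts `text` as a complete XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

raw = "We show that x < 5 & y > 2"
escaped = escape(raw)  # "&" -> "&amp;", "<" -> "&lt;", ">" -> "&gt;"

print(is_well_formed("<section>" + raw + "</section>"))      # unescaped "<" and "&"
print(is_well_formed("<section>" + escaped + "</section>"))  # entities used instead
print(is_well_formed("<a><b></a></b>"))                      # improperly nested tags
print(is_well_formed("<a/><b/>"))                            # more than one root element
```

Note that this checks well-formedness only; it says nothing about whether the document follows any particular DTD or schema.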

A tree representation

[Figure: the example bibliography document drawn as a tree of element, attribute, and text nodes]

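More XML features

• Processing instructions for apps: <? … ?>
  • An XML file typically starts with a version declaration using this syntax: <?xml version="1.0"?>
• Comments: <!-- Comments here -->
• CDATA section: <![CDATA[Tags: <book>,…]]>
• ID’s and references
    <person id="o12"><name>Homer</name>…</person>
    <person id="o34"><name>Marge</name>…</person>
    <person id="o56" father="o12" mother="o34">
      <name>Bart</name>…</person>
    …
• Namespaces allow external schemas and qualified names
    <book xmlns:myCitationStyle="http://…/mySchema">
      <myCitationStyle:title>…</myCitationStyle:title>
      <myCitationStyle:author>…</myCitationStyle:author>…
    </book>
• And more…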
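
As a rough illustration of qualified names, the following Python sketch parses a document that mixes namespaced and plain elements. The prefix `cite` and the URI `http://example.org/citation-style` are invented for this example; what matters is that the parser identifies elements by namespace URI, not by the prefix the document author happened to choose.

```python
import xml.etree.ElementTree as ET

# The prefix and URI below are made up for illustration.
DOC = """<book xmlns:cite="http://example.org/citation-style">
  <cite:title>Foundations of Databases</cite:title>
  <cite:author>Abiteboul</cite:author>
  <note>unqualified element</note>
</book>"""

root = ET.fromstring(DOC)

# ElementTree expands each qualified name to "{namespace-URI}local-name".
tags = [child.tag for child in root]

# Queries may use any prefix, as long as it maps to the same URI.
NS = {"c": "http://example.org/citation-style"}
titles = [e.text for e in root.findall("c:title", NS)]

print(tags)
print(titles)
```

The query uses the prefix `c` while the document uses `cite`; both resolve to the same URI, so the lookup still succeeds.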

Valid XML documents

• A valid XML document conforms to a Document Type Definition (DTD)
  • A DTD is optional
  • A DTD specifies a grammar for the document
    • Constraints on structures and values of elements, attributes, etc.
• Example
    <!DOCTYPE bibliography [
      <!ELEMENT bibliography (book+)>
      <!ELEMENT book (title, author*, publisher?, year?, section*)>
      <!ATTLIST book ISBN ID #REQUIRED>
      <!ATTLIST book price CDATA #IMPLIED>
      <!ELEMENT title (#PCDATA)>
      <!ELEMENT author (#PCDATA)>
      <!ELEMENT publisher (#PCDATA)>
      <!ELEMENT year (#PCDATA)>
      <!ELEMENT i (#PCDATA)>
      <!ELEMENT content (#PCDATA|i)*>
      <!ELEMENT section (title, content?, section*)>
    ]>

DTD explained

<!DOCTYPE bibliography [
  • bibliography is the root element of the document
<!ELEMENT bibliography (book+)>
  • bibliography consists of a sequence of one or more book elements (“+” means one or more)
<!ELEMENT book (title, author*, publisher?, year?, section*)>
  • book consists of a title, zero or more author’s, an optional publisher, an optional year, and zero or more section’s, in sequence (“?” means zero or one; “*” means zero or more)
<!ATTLIST book ISBN ID #REQUIRED>
  • book has a required ISBN attribute which is a unique identifier
<!ATTLIST book price CDATA #IMPLIED>
  • book has an optional (#IMPLIED) price attribute which contains character data
  • Example: <book ISBN="ISBN-10" price="80.00">…</book>

• Other attribute types include IDREF (reference to an ID), IDREFS (space-separated list of references), enumerated list, etc.
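
One way to build intuition for a DTD content model is to read it as a regular expression over the sequence of child tag names. The Python sketch below does this by hand for an assumed model `(title, author*, publisher?, year?, section*)`. It is only an illustration of the idea: the standard library's `xml.etree.ElementTree` does not validate against DTDs, and the names `BOOK_MODEL` and `children_match` are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Assumed content model: (title, author*, publisher?, year?, section*)
# Each child tag is written as "name " so the model becomes a plain regex.
BOOK_MODEL = re.compile(
    r"title (?:author )*(?:publisher )?(?:year )?(?:section )*"
)

def children_match(element, model):
    """Check the sequence of child tags of `element` against `model`."""
    sequence = "".join(child.tag + " " for child in element)
    return model.fullmatch(sequence) is not None

good = ET.fromstring("<book><title/><author/><author/><year/></book>")
bad_order = ET.fromstring("<book><author/><title/></book>")
no_title = ET.fromstring("<book/>")

print(children_match(good, BOOK_MODEL))
print(children_match(bad_order, BOOK_MODEL))
print(children_match(no_title, BOOK_MODEL))
```

The translation is mechanical: `,` becomes concatenation, and `*`, `?`, `+` keep their usual regular-expression meaning.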

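DTD explained (cont’d)

<!ELEMENT title (#PCDATA)>
<!ELEMENT author (#PCDATA)>
<!ELEMENT publisher (#PCDATA)>
<!ELEMENT year (#PCDATA)>
<!ELEMENT i (#PCDATA)>
  • title, author, publisher, year, and i contain parsed character data
  • PCDATA is text that will be parsed
    • &lt; etc. will be parsed as entities
    • Use a CDATA section to include text verbatim
<!ELEMENT content (#PCDATA|i)*>
  • content contains mixed content: text optionally interspersed with i elements
<!ELEMENT section (title, content?, section*)>
  • Recursive declaration: each section begins with a title, followed by an optional content, and then zero or more (sub)section’s
]>

Example of nested sections:
  <section><title>Introduction</title>
    <content>In this section we introduce the notion of <i>semi-structured data</i>…</content>
    <section><title>XML</title>
      <content>XML stands for…</content>
    </section>
    <section><title>DTD</title>
      <section><title>Definition</title>
        <content>DTD stands for…</content>
      </section>
      <section><title>Usage</title>
        <content>You can use DTD to…</content>
      </section>
    </section>
  </section>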

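Using DTD
• DTD can be included in the XML source file
    <?xml version="1.0"?>
    <!DOCTYPE bibliography [
      … …
    ]>
    <bibliography>
      … …
    </bibliography>
• DTD can be external
    <?xml version="1.0"?>
    <!DOCTYPE bibliography SYSTEM "../dtds/bib.dtd">
    <bibliography>
      … …
    </bibliography>

    <?xml version="1.0"?>
    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
      "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
    <html>
      … …
    </html>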

Annoyance: content grammar

• Consider this declaration:
    <!ELEMENT pub-venue
      ( (name, address, month, year) |
        (name, volume, number, year) )>
  • “|” means “or”
• Syntactically legal, but won’t work
  • Because of SGML compatibility issues
  • When looking at name, a parser would not know which way to go without looking further ahead
  • Requirement: content declaration must be “deterministic” (i.e., no look-ahead required)
  • Can we rewrite it into an equivalent, deterministic one?
• Also, you cannot nest mixed content declarations
  • Illegal: <!ELEMENT Section (title, (#PCDATA|i)*, section*)>
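
One possible answer to the rewriting question is to factor out the shared prefix and suffix, giving `(name, ((address, month) | (volume, number)), year)`. The Python sketch below treats both content models as regular expressions over child tag sequences and checks by brute force that they accept the same sequences up to a small length. The element names are taken as an assumed reading of the declaration, and the factored form is offered as one candidate rewrite rather than the official solution.

```python
import re
from itertools import product

# Child-tag sequences are encoded as "name " tokens, so a content model
# maps directly onto a regular expression.
ORIGINAL = re.compile(
    r"(?:name address month year |name volume number year )"
)
# Candidate deterministic form: after "name", the next tag alone
# (address vs. volume) decides which branch applies.
FACTORED = re.compile(
    r"name (?:address month |volume number )year "
)

TAGS = ["name", "address", "month", "year", "volume", "number"]

def matches(model, children):
    return model.fullmatch("".join(tag + " " for tag in children)) is not None

def same_language(max_len=4):
    """Compare both models on every tag sequence up to `max_len` tags."""
    for n in range(max_len + 1):
        for seq in product(TAGS, repeat=n):
            if matches(ORIGINAL, seq) != matches(FACTORED, seq):
                return False
    return True

print(same_language())
```

Both models only accept sequences of exactly four tags, so checking up to length four covers every sequence either of them can accept.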

Annoyance: element name clash

• Suppose we want to represent book titles and section titles differently
  • Book titles are pure text: (#PCDATA)
  • Section titles can have formatting tags: (#PCDATA|i|b|math)*
• But DTD only allows one title declaration!
• Workaround: rename as book-title and section-title?
  • Not nice: why can’t we just infer a title’s context?

Annoyance: lack of type support

• Too few attribute types: string (CDATA), token (e.g., ID, IDREF), enumeration (e.g., (red|green|blue))
  • What about integer, float, date, etc.?
• ID not typed
  • No two elements can have the same id, even if they have different types (e.g., book vs. section)
• Difficult to reuse complex structure definitions
  • E.g.: already defined element E1 as (blah, bleh, foo?, bar*, …); want to define E2 to have the same structure
  • Parameter entities in DTD provide a workaround
    • <!ENTITY % E.struct '(blah, bleh, foo?, bar*, …)'>
    • <!ELEMENT E1 %E.struct;>
    • <!ELEMENT E2 %E.struct;>
  • Something less “hacky”?

XML Schema
• A more powerful way of defining the structure and constraining the contents of XML documents
• An XML Schema definition is itself an XML document
  • Typically stored as a standalone .xsd file
  • XML (data) documents refer to external .xsd files
• W3C recommendation
  • Unlike DTD, XML Schema is separate from the XML specification

XML Schema definition (XSD)

  <?xml version="1.0"?>
  <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
    … …
  </xs:schema>

• The xmlns:xs attribute defines xs to be the namespace described in the URL
• Uses of xs: within the xs:schema element now refer to tags from this namespace

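XSD example

  <!-- We are now defining an element named book -->
  <xs:element name="book">
    <!-- Declares a structure with child elements/attributes (as opposed to just text) -->
    <xs:complexType>
      <!-- Declares a sequence of child elements, like “(…, …, …)” in DTD -->
      <xs:sequence>
        <!-- A leaf element with string content -->
        <xs:element name="title" type="xs:string"/>
        <!-- Like author* in DTD -->
        <xs:element name="author" type="xs:string"
                    minOccurs="0" maxOccurs="unbounded"/>
        <!-- Like publisher? in DTD -->
        <xs:element name="publisher" type="xs:string"
                    minOccurs="0" maxOccurs="1"/>
        <!-- A leaf element with integer content -->
        <xs:element name="year" type="xs:integer"
                    minOccurs="0" maxOccurs="1"/>
        <!-- Like section* in DTD; section is defined elsewhere -->
        <xs:element ref="section"
                    minOccurs="0" maxOccurs="unbounded"/>
      </xs:sequence>
      <!-- Declares an attribute under book… and this attribute is required -->
      <xs:attribute name="ISBN" type="xs:string" use="required"/>
      <!-- This attribute has a decimal value, and it is optional -->
      <xs:attribute name="price" type="xs:decimal" use="optional"/>
    </xs:complexType>
  </xs:element>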

XSD example cont’d

  <xs:element name="section">
    <xs:complexType>
      <xs:sequence>
        <!-- Another title definition; can be different from the one under book -->
        <xs:element name="title" type="xs:string"/>
        <xs:element name="content" minOccurs="0" maxOccurs="1">
          <!-- Declares mixed content (text interspersed with structure below) -->
          <xs:complexType mixed="true">
            <!-- A compositor like xs:sequence; this one declares a list of
                 alternatives, like “(…|…|…)” in DTD.
                 min/maxOccurs can be attached to compositors too -->
            <xs:choice minOccurs="0" maxOccurs="unbounded">
              <xs:element name="i" type="xs:string"/>
              <xs:element name="b" type="xs:string"/>
            </xs:choice>
            <!-- Like (#PCDATA|i|b)* in DTD -->
          </xs:complexType>
        </xs:element>
        <!-- Recursive definition -->
        <xs:element ref="section" minOccurs="0" maxOccurs="unbounded"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>

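XSD example cont’d

• To complete bib.xsd:
    <xs:element name="bibliography">
      <xs:complexType>
        <xs:sequence>
          <xs:element ref="book" minOccurs="1" maxOccurs="unbounded"/>
        </xs:sequence>
      </xs:complexType>
    </xs:element>
• To use bib.xsd in an XML document:
    <?xml version="1.0"?>
    <bibliography xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
                  xsi:noNamespaceSchemaLocation="file:bib.xsd">
      <book>… …</book>
      <book>… …</book>
      … …
    </bibliography>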

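Named types
• Define once:
    <xs:complexType name="formattedTextType" mixed="true">
      <xs:choice minOccurs="0" maxOccurs="unbounded">
        <xs:element name="i" type="xs:string"/>
        <xs:element name="b" type="xs:string"/>
      </xs:choice>
    </xs:complexType>
• Use elsewhere in XSD:
    … …
    <xs:element name="title" type="formattedTextType"/>
    <xs:element name="content" type="formattedTextType"
                minOccurs="0" maxOccurs="1"/>
    … …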

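Restrictions
  <xs:simpleType name="priceType">
    <xs:restriction base="xs:decimal">
      <xs:minInclusive value="0.00"/>
    </xs:restriction>
  </xs:simpleType>

  <xs:simpleType name="statusType">
    <xs:restriction base="xs:string">
      <xs:enumeration value="in stock"/>
      <xs:enumeration value="out of stock"/>
      <xs:enumeration value="out of print"/>
    </xs:restriction>
  </xs:simpleType>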
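
To show what such restrictions accept and reject, here is a hand-written Python approximation of two facets: a decimal with a minimum inclusive value of 0.00, and a string limited to an enumerated set. This is a sketch of the semantics only, not a schema validator; the enumeration values, the function names, and the simplified decimal pattern (no whitespace handling) are assumptions made for illustration.

```python
import re
from decimal import Decimal

# Simplified lexical form of a decimal: optional sign, digits, optional
# fraction, and no exponent.
DECIMAL_LEXICAL = re.compile(r"[+-]?(?:\d+(?:\.\d*)?|\.\d+)")

# Assumed enumeration values for the status type.
ALLOWED_STATUS = ("in stock", "out of stock", "out of print")

def is_valid_price(text):
    """Decimal restricted by a minInclusive facet of 0.00."""
    if DECIMAL_LEXICAL.fullmatch(text) is None:
        return False
    return Decimal(text) >= Decimal("0.00")

def is_valid_status(text):
    """String restricted to an enumerated set of values."""
    return text in ALLOWED_STATUS

for candidate in ["80.00", "-0.01", "1e3", "cheap"]:
    print(candidate, is_valid_price(candidate))
```

A restriction narrows the value space of its base type: every valid price is still a decimal, but not every decimal is a valid price.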

Keys
  <xs:element name="bibliography">
    <xs:complexType>… …</xs:complexType>
    <xs:key name="bookKey">
      <xs:selector xpath="./book"/>
      <xs:field xpath="@ISBN"/>
    </xs:key>
  </xs:element>
• Under any bibliography element, elements reachable by selector “./book” (i.e., book child elements) must have unique values for field “@ISBN” (i.e., ISBN attributes)
  • In general, a key can consist of multiple fields (multiple <xs:field> elements under <xs:key>)
  • More on XPath in next lecture
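
The meaning of a key (a scope element, a selector, and a field) can be mimicked with a few lines of Python over `xml.etree.ElementTree`, which supports simple paths such as `./book`. This is a hand-rolled check written for illustration, not a schema validator; the function name `key_violations` and the sample document are assumptions.

```python
import xml.etree.ElementTree as ET

def key_violations(scope, selector, attribute):
    """Return field values that break the key within `scope`.

    A value is reported if it repeats among the selected elements;
    a missing attribute is reported as None.
    """
    seen, problems = set(), []
    for element in scope.findall(selector):
        value = element.get(attribute)
        if value is None or value in seen:
            problems.append(value)
        seen.add(value)
    return problems

# Hypothetical document: one duplicate ISBN and one book without an ISBN.
DOC = """<bibliography>
  <book ISBN="ISBN-10"/>
  <book ISBN="ISBN-20"/>
  <book ISBN="ISBN-10"/>
  <book/>
</bibliography>"""

violations = key_violations(ET.fromstring(DOC), "./book", "ISBN")
print(violations)
```

The scope matters: the same function applied to a different enclosing element would enforce uniqueness only among that element's selected descendants.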

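Foreign keys
• Suppose content can reference books
    <xs:element name="content">
      <xs:complexType mixed="true">
        <xs:choice minOccurs="0" maxOccurs="unbounded">
          <xs:element name="i" type="xs:string"/>
          <xs:element name="b" type="xs:string"/>
          <xs:element name="book-ref">
            <xs:complexType>
              <xs:attribute name="ISBN" type="xs:string"/>
            </xs:complexType>
          </xs:element>
        </xs:choice>
      </xs:complexType>
    </xs:element>

    <xs:element name="bibliography">
      <xs:complexType>… …</xs:complexType>
      <xs:key name="bookKey">
        <xs:selector xpath="./book"/>
        <xs:field xpath="@ISBN"/>
      </xs:key>
      <xs:keyref name="bookForeignKey" refer="bookKey">
        <xs:selector xpath=".//book-ref"/>
        <xs:field xpath="@ISBN"/>
      </xs:keyref>
    </xs:element>
• Under bibliography, for elements reachable by selector “.//book-ref” (i.e., any book-ref element underneath), values for field “@ISBN” (i.e., ISBN attributes) must appear as values of bookKey, the key referenced
• Make sure the keyref is declared in the same scope as the key it references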
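
A foreign-key constraint of this kind can be approximated in Python by collecting the key values under a scope element and then checking every referencing element against them. The sketch assumes a document in which `book` elements carry an `ISBN` attribute and `book-ref` elements point at them; the function name `dangling_references` and the sample data are hypothetical, and this is not a substitute for real schema validation.

```python
import xml.etree.ElementTree as ET

def dangling_references(scope, key_selector, ref_selector, attribute):
    """Return referenced values that do not appear among the key values."""
    keys = {e.get(attribute) for e in scope.findall(key_selector)}
    return [
        e.get(attribute)
        for e in scope.findall(ref_selector)
        if e.get(attribute) not in keys
    ]

# Hypothetical document: one valid reference and one dangling reference.
DOC = """<bibliography>
  <book ISBN="ISBN-10">
    <section><content>See <book-ref ISBN="ISBN-20"/></content></section>
  </book>
  <book ISBN="ISBN-20"/>
  <book ISBN="ISBN-30">
    <section><content>See <book-ref ISBN="ISBN-99"/></content></section>
  </book>
</bibliography>"""

root = ET.fromstring(DOC)
dangling = dangling_references(root, "./book", ".//book-ref", "ISBN")
print(dangling)
```

Both lookups start from the same scope element, which mirrors the requirement that the key and the reference to it be declared in the same scope.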

Why use DTD or XML Schema?

• Benefits of not using them
  • Unstructured data is easy to represent
  • Overhead of validation is avoided
• Benefits of using them
  • Serve as schema for the XML data
    • Guard against errors
    • Help with processing
  • Facilitate information exchange
    • People can agree to use a common DTD or XML Schema to exchange data (e.g., XHTML)

XML versus relational data

Relational data
• Schema is always fixed in advance and difficult to change
• Simple, flat table structures
• Ordering of rows and columns is unimportant
• Exchange is problematic
• “Native” support in all serious commercial DBMS

XML data
• Well-formed XML does not require predefined, fixed schema
• Nested structure; ID/IDREF(S) permit arbitrary graphs
• Ordering forced by document format; may or may not be important
• Designed for easy exchange
• Often implemented as an “add-on” on top of relations

Case study
• Design an XML document representing cities, counties, and states
  • For states, record name and capital (city)
  • For counties, record name, area, and location (state)
  • For cities, record name, population, and location (county and state)
• Assume the following:
  • Names of states are unique
  • Names of counties are only unique within a state
  • Names of cities are only unique within a county
  • A city is always located in a single county
  • A county is always located in a single state

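A possible design

• Element hierarchy: geo_db contains state elements, each state contains county elements, and each county contains city elements
  • state has attributes name and capital_city_id
  • county has attributes name and area
  • city has attributes id, name, and population
• Declare stateKey in geo_db with
  • Selector ./state
  • Field @name
• Declare countyInStateKey in state with
  • Selector ./county
  • Field @name
• Declare cityInCountyKey in county with
  • Selector ./city
  • Field @name
• Declare cityIdKey in geo_db with
  • Selector ./state/county/city
  • Field @id
• Declare capitalCityIdKeyRef in geo_db referencing cityIdKey, with
  • Selector ./state
  • Field @capital_city_id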
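
The scoping idea behind such a design, where some names are unique globally and others only within their parent, can be checked by hand in Python. The sketch below assumes a nesting of `geo_db`, `state`, `county`, and `city` elements with `name`, `id`, and `capital_city_id` attributes; all of these names, the helper `duplicates`, and the sample data are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical instance with placeholder names.
DOC = """<geo_db>
  <state name="A" capital_city_id="c1">
    <county name="X"><city id="c1" name="P"/></county>
    <county name="Y"><city id="c2" name="P"/></county>
  </state>
  <state name="B" capital_city_id="c9">
    <county name="X"/>
  </state>
</geo_db>"""

def duplicates(scope, selector, attribute):
    """Attribute values that occur more than once among selected elements."""
    values = [e.get(attribute) for e in scope.findall(selector)]
    return sorted({v for v in values if v is not None and values.count(v) > 1})

root = ET.fromstring(DOC)

# State names must be unique in the whole document.
state_dups = duplicates(root, "./state", "name")
# County names only need to be unique within their own state.
county_dups = {
    s.get("name"): duplicates(s, "./county", "name")
    for s in root.findall("./state")
}
# Every capital must refer to an existing city id.
city_ids = {c.get("id") for c in root.findall("./state/county/city")}
bad_capitals = [
    s.get("name")
    for s in root.findall("./state")
    if s.get("capital_city_id") not in city_ids
]

print(state_dups, county_dups, bad_capitals)
```

County "X" appears under both states without violating anything, because that check runs once per state rather than once per document.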
