Chapter2 CEF482

This document provides an overview of XML document rules, highlighting the differences between XML and HTML, and detailing the structure and requirements of XML documents. It explains the concepts of valid, invalid, and well-formed documents, as well as the importance of the root element, case sensitivity, and attribute rules. Additionally, it discusses XML declarations, namespaces, and methods for defining document content through Document Type Definitions (DTDs) and XML Schemas.

School year 2024-2025/ Second semester/ FET / Computer Engineering

XML and Document Content Validation

Chapter 2: XML Document Rules

I. Overview: XML document rules

If you've looked at HTML documents, you're familiar with the basic concepts of using tags to
mark up the text of a document. This section discusses the differences between HTML
documents and XML documents. It goes over the basic rules of XML documents, and
discusses the terminology used to describe them.

One important point about XML documents: The XML specification requires a parser to
reject any XML document that doesn't follow the basic rules. Most HTML parsers will
accept sloppy markup, making a guess as to what the writer of the document intended. To
avoid the loosely structured mess found in the average HTML document, the creators of XML
decided to enforce document structure from the beginning.

(By the way, if you're not familiar with the term, a parser is a piece of code that attempts to
read a document and interpret its contents.)

a) Invalid, valid, and well-formed documents

There are three kinds of XML documents:

• Invalid documents don't follow the syntax rules defined by the XML specification. If
a developer has defined rules for what the document can contain in a DTD or schema,
and the document doesn't follow those rules, that document is invalid as well.
• Valid documents follow both the XML syntax rules and the rules defined in their
DTD or schema.
• Well-formed documents follow the XML syntax rules but don't have a DTD or
schema. (Every valid document is also well-formed; the XML specification describes a
document that breaks the syntax rules as not well-formed.)

b) The root element

An XML document must be contained in a single element. That single element is called the
root element, and it contains all the text and any other elements in the document. In the
following example, the XML document is contained in a single element, the <greeting>
element. Notice that the document has a comment that's outside the root element; that's
perfectly legal.

Proposed by Dr. SOP DEFFO Lionel Landry

<?xml version="1.0"?>
<!-- A well-formed document -->
<greeting>
  Hello, World!
</greeting>

Here's a document that doesn't contain a single root element:

<?xml version="1.0"?>
<!-- An invalid document -->
<greeting>
  Hello, World!
</greeting>
<greeting>
  Hola, el Mundo!
</greeting>

An XML parser is required to reject this document, regardless of the information it might
contain.
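
As a minimal sketch of this behaviour, the following Python snippet feeds both documents to the standard library's `xml.etree.ElementTree` parser. The helper name `is_well_formed` is an illustrative choice, not part of the course material, and any conforming parser could be substituted.

```python
import xml.etree.ElementTree as ET

ONE_ROOT = """<?xml version="1.0"?>
<!-- A well-formed document -->
<greeting>
  Hello, World!
</greeting>"""

TWO_ROOTS = """<?xml version="1.0"?>
<greeting>
  Hello, World!
</greeting>
<greeting>
  Hola, el Mundo!
</greeting>"""


def is_well_formed(text: str) -> bool:
    """Return True if the parser accepts the text, False if it rejects it."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


print(is_well_formed(ONE_ROOT))   # single <greeting> root
print(is_well_formed(TWO_ROOTS))  # second root element after the first
```

The parser stops at the second `<greeting>` element because everything after the end of the root element must be a comment, a processing instruction, or white space.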

Elements can't overlap

XML elements can't overlap. Here's some markup that isn't legal:

<!-- NOT legal XML markup -->
<p>
  <b>I <i>really
  love</b> XML.
  </i>
</p>

If you begin a <i> element inside a <b> element, you have to end it there as well. If you want
the text XML to appear in italics, you need to add a second <i> element to correct the
markup:

<!-- legal XML markup -->
<p>
  <b>I <i>really
  love</i></b>
  <i>XML.</i>
</p>

An XML parser will accept only this markup; the HTML parsers in most Web browsers will
accept both.

End tags are required

You can't leave out any end tags. In the example below, the markup is not legal
because there are no end paragraph (</p>) tags. While this is acceptable in HTML (and, in
some cases, SGML), an XML parser will reject it.

<!-- NOT legal XML markup -->
<p>Yada yada yada...
<p>Yada yada yada...
<p>...

If an element has no content at all, it is called an empty element; the HTML break
(<br>) and image (<img>) elements are two examples. In empty elements in XML
documents, you can put the closing slash in the start tag. The two break elements and the two
image elements below mean the same thing to an XML parser:

<!-- Two equivalent break elements -->
<br></br>
<br />

<!-- Two equivalent image elements -->
<img src="../img/c.gif"></img>
<img src="../img/c.gif" />
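
A quick way to convince yourself that the two forms are equivalent is to parse both and compare what the parser reports. This is a small sketch using Python's `xml.etree.ElementTree`; the `summary` helper is a hypothetical name introduced only for the comparison.

```python
import xml.etree.ElementTree as ET

long_form = ET.fromstring('<img src="../img/c.gif"></img>')
short_form = ET.fromstring('<img src="../img/c.gif" />')


def summary(element):
    """Collect everything the parser reports about an element."""
    return (element.tag, dict(element.attrib), element.text, len(element))


# Both spellings produce the same tag, attributes, (absent) text and children.
same = summary(long_form) == summary(short_form)
print(same)
```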

Elements are case sensitive

XML elements are case sensitive. In HTML, <h1> and <H1> are the same; in XML, they're
not. If you try to end an <h1> element with a </H1> tag, you'll get an error. In the example
below, the heading at the top is illegal, while the one at the bottom is fine.

<!-- NOT legal XML markup -->
<h1>Elements are
case sensitive</H1>

<!-- legal XML markup -->
<h1>Elements are
case sensitive</h1>
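
The three rules covered so far (no overlapping elements, mandatory end tags, case-sensitive names) can be checked with any XML parser. The sketch below uses Python's `xml.etree.ElementTree`; the `<div>` wrapper around the unclosed paragraphs is an assumption added so that the missing end tags are the only error in that sample.

```python
import xml.etree.ElementTree as ET

SAMPLES = {
    "overlapping elements": "<p><b>I <i>really love</b> XML.</i></p>",
    "missing end tags": "<div><p>Yada yada yada...<p>...</div>",
    "case mismatch": "<h1>Elements are case sensitive</H1>",
    "corrected markup": "<p><b>I <i>really love</i></b> <i>XML.</i></p>",
}


def check(markup: str) -> str:
    """Report whether the parser accepts or rejects a piece of markup."""
    try:
        ET.fromstring(markup)
        return "accepted"
    except ET.ParseError as error:
        return f"rejected: {error}"


results = {label: check(markup) for label, markup in SAMPLES.items()}
for label, outcome in results.items():
    print(f"{label}: {outcome}")
```

Only the corrected markup is accepted; the parser reports the other three as errors at the first end tag that does not match the element currently open.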

Attributes must have quoted values

There are two rules for attributes in XML documents:

• Attributes must have values
• Those values must be enclosed within quotation marks

Compare the two examples below. The markup at the top is legal in HTML, but not in XML.
To do the equivalent in XML, you have to give the attribute a value, and you have to enclose
it in quotes.

<!-- NOT legal XML markup -->
<ol compact>

<!-- legal XML markup -->
<ol compact="yes">

You can use either single or double quotes, just as long as you're consistent.

If the value of the attribute contains a single or double quote, you can use the other kind of
quote to surround the value (as in name="Doug's car"), or use the entities &quot; for a double
quote and &apos; for a single quote. An entity is a symbol, such as &quot;, that the XML
parser replaces with other text, such as ".
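
The attribute rules can be tried out with a short Python sketch based on `xml.etree.ElementTree`. The `<item>` element and its `name` attribute are hypothetical names used only for this illustration.

```python
import xml.etree.ElementTree as ET

# Three ways of writing attribute values that contain quotes.
samples = [
    '<item name="Doug\'s car"/>',                 # other kind of quote outside
    "<item name='Doug&apos;s car'/>",             # &apos; entity
    '<item name="a &quot;quoted&quot; word"/>',   # &quot; entity
]
values = [ET.fromstring(sample).get("name") for sample in samples]
for value in values:
    print(value)


def accepted(markup: str) -> bool:
    """Return True if the parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


print(accepted("<ol compact></ol>"))        # attribute without a value
print(accepted("<ol compact=yes></ol>"))    # value without quotes
print(accepted('<ol compact="yes"></ol>'))  # quoted value
```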

II. XML declarations

a) Generalities

Most XML documents start with an XML declaration that provides basic information about
the document to the parser. An XML declaration is recommended, but not required. If there is
one, it must be the first thing in the document.

The declaration can contain up to three name-value pairs (many people call them attributes,
although technically they're not). The version is the version of XML used; this value is
almost always 1.0 (XML 1.1 exists but is rarely used). The encoding is the character set used
in this document. The ISO-8859-1 character set referenced in the declaration below includes
all of the characters used by most Western European languages. If no encoding is specified,
the XML parser assumes that the characters are in UTF-8 (or UTF-16, if the document starts
with a byte-order mark), Unicode encodings that support virtually every character and
ideograph from the world's languages.

<?xml version="1.0" encoding="ISO-8859-1" standalone="no"?>

Finally, standalone, which can be either yes or no, defines whether this document can be
processed without reading any other files. For example, if the XML document doesn't
reference any other files, you would specify standalone="yes". If the XML document
references other files that describe what the document can contain (more about those files in a
minute), you could specify standalone="no". Because standalone="no" is the default, you
rarely see standalone in XML declarations.
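
The effect of the encoding declaration is easiest to see on raw bytes. The following sketch, which assumes Python's `xml.etree.ElementTree` as the parser and uses a made-up `<greeting>` document, parses the same text with and without a declaration.

```python
import xml.etree.ElementTree as ET

# Bytes in ISO-8859-1, with a declaration that says so.
latin1_bytes = (
    '<?xml version="1.0" encoding="ISO-8859-1"?>'
    "<greeting>caf\u00e9</greeting>"
).encode("iso-8859-1")
declared = ET.fromstring(latin1_bytes).text

# Bytes in UTF-8 without any declaration: UTF-8 is assumed.
utf8_bytes = "<greeting>caf\u00e9</greeting>".encode("utf-8")
undeclared = ET.fromstring(utf8_bytes).text

# ISO-8859-1 bytes without a declaration are not valid UTF-8.
try:
    ET.fromstring("<greeting>caf\u00e9</greeting>".encode("iso-8859-1"))
    mislabelled = "accepted"
except ET.ParseError:
    mislabelled = "rejected"

print(declared, undeclared, mislabelled)
```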

b) Other things in XML documents

There are a few other things you might find in an XML document:

• Comments: Comments can appear anywhere in the document; they can even appear
before or after the root element. A comment begins with <!-- and ends with -->. A
comment can't contain a double hyphen ( -- ) except at the end; with that exception, a
comment can contain anything. Most importantly, any markup inside a comment is
ignored; if you want to remove a large section of an XML document, simply wrap that
section in a comment. (To restore the commented-out section, simply remove the
comment tags.) Here's some markup that contains a comment:

<!-- Here's a PI for Cocoon: -->
<?cocoon-process type="sql"?>

• Processing instructions: A processing instruction is markup intended for a particular
piece of code. In the example above, there's a processing instruction (sometimes called
a PI) for Cocoon, an XML processing framework from the Apache Software
Foundation. When Cocoon is processing an XML document, it looks for processing
instructions that begin with cocoon-process, then processes the XML document
accordingly. In this example, the type="sql" attribute tells Cocoon that the XML
document contains a SQL statement.

<!-- Here's an entity: -->
<!ENTITY dw "developerWorks">

• Entities: The example above defines an entity for the document. Anywhere the XML
processor finds the entity reference &dw; it replaces it with the string developerWorks.
The XML spec also defines five entities you can use in place of various special
characters. The entities are:
  • &lt; for the less-than sign
  • &gt; for the greater-than sign
  • &quot; for a double-quote
  • &apos; for a single quote (or apostrophe)
  • &amp; for an ampersand.
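
To see how a parser reports comments, processing instructions, and entities, here is a small sketch using two Python standard-library modules: `xml.dom.minidom`, which keeps comments and processing instructions as nodes, and `xml.etree.ElementTree`, which returns text with entity references already replaced. The `<doc>` root element is a placeholder invented for the example, and the entity is declared in an internal DTD subset so that the document is self-contained.

```python
import xml.etree.ElementTree as ET
from xml.dom import minidom

# A comment and a processing instruction in front of the root element.
document = minidom.parseString(
    "<!-- Here is a PI for Cocoon: -->"
    '<?cocoon-process type="sql"?>'
    "<doc/>"
)
comment, instruction, root = document.childNodes
print(comment.data)          # text between <!-- and -->
print(instruction.target)    # name after <?
print(instruction.data)      # everything after the target

# A user-defined entity plus the five predefined ones.
WITH_ENTITIES = """<!DOCTYPE doc [
  <!ENTITY dw "developerWorks">
]>
<doc>&dw;: &lt;b&gt; &amp; &quot;q&quot; &apos;a&apos;</doc>"""
expanded = ET.fromstring(WITH_ENTITIES).text
print(expanded)
```

Note that the processing instruction's content is handed over as a plain string; it is up to the receiving application (Cocoon, in the document's example) to interpret `type="sql"`.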

c) Namespaces

XML's power comes from its flexibility, the fact that you and I and millions of other people
can define our own tags to describe our data. Remember the sample XML document for a
person's name and address? That document includes the <title> element for a person's
courtesy title, a perfectly reasonable choice for an element name. If you run an online
bookstore, you might create a <title> element for the title of a book. If you run an online
mortgage company, you might create a <title> element for the title to a piece of property. All
of those are reasonable choices, but all of them create elements with the same name. How do
you tell if a given <title> element refers to a person, a book, or a piece of property? With
namespaces.

To use a namespace, you define a namespace prefix and map it to a particular string. Here's
how you might define namespace prefixes for our three <title> elements:

<?xml version="1.0"?>
<customer_summary
  xmlns:addr="http://www.xyz.com/addresses/"
  xmlns:books="http://www.zyx.com/books/"
  xmlns:mortgage="http://www.yyz.com/title/"
>
... <addr:name><addr:title>Mrs.</addr:title> ... </addr:name> ...
... <books:title>Lord of the Rings</books:title> ...
... <mortgage:title>NC2948-388-1983</mortgage:title> ...

In this example, the three namespace prefixes are addr, books, and mortgage. Notice that
a prefix applies only to the element that carries it; child elements do not inherit it. The
first <title> element belongs to the addr namespace because it is written as <addr:title>;
an unprefixed <title> inside <addr:name> would belong to no namespace here. (Unprefixed
child elements are covered only by a default namespace, declared with xmlns="...".)

One final point: The string in a namespace definition is just a string. Yes, these strings
look like URLs, but they're not. You could define xmlns:addr="mike" and that would work
just as well. The only thing that's important about the namespace string is that it's unique;
that's why most namespace definitions look like URLs. The XML parser does not go to
http://www.zyx.com/books/ to search for a DTD or schema; it simply uses that text as a string.
It's confusing, but that's how namespaces work.
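
The sketch below shows how a namespace-aware parser sees such a document. It assumes Python's `xml.etree.ElementTree`, which reports each element name as `{namespace string}local-name`. The document is a hypothetical complete version of the fragment (the elided parts are left out), with an extra unprefixed `<title>` added to show that a prefix on a parent element does not carry over to its children.

```python
import xml.etree.ElementTree as ET

DOCUMENT = """<?xml version="1.0"?>
<customer_summary
    xmlns:addr="http://www.xyz.com/addresses/"
    xmlns:books="http://www.zyx.com/books/"
    xmlns:mortgage="http://www.yyz.com/title/">
  <addr:name><addr:title>Mrs.</addr:title><title>unprefixed</title></addr:name>
  <books:title>Lord of the Rings</books:title>
  <mortgage:title>NC2948-388-1983</mortgage:title>
</customer_summary>"""

root = ET.fromstring(DOCUMENT)

# Every element whose name ends in "title", in document order.
titles = [element.tag for element in root.iter() if element.tag.endswith("title")]
for tag in titles:
    print(tag)

# The prefix is only a local alias: "b" works as well as "books",
# as long as it is mapped to the same namespace string.
book = root.find("b:title", {"b": "http://www.zyx.com/books/"})
print(book.text)
```

The parser never contacts any of the addresses; it only compares the strings character by character.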

III. Defining document content

a) Overview

So far in this course you've learned about the basic rules of XML documents; that's all well
and good, but you need to define the elements you're going to use to represent data. You'll
learn about two ways of doing that in this section.

• One method is to use a Document Type Definition, or DTD. A DTD defines the
elements that can appear in an XML document, the order in which they can appear,
how they can be nested inside each other, and other basic details of XML document
structure. DTDs are part of the original XML specification and are very similar to
SGML DTDs.
• The other method is to use an XML Schema. A schema can define all of the document
structures that you can put in a DTD, and it can also define data types and more
complicated rules than a DTD can. The W3C developed the XML Schema
specification a couple of years after the original XML spec.

b) Document Type Definitions

A DTD allows you to specify the basic structure of an XML document. The next couple of
sections look at fragments of DTDs. First of all, here's a DTD that defines the basic structure
of the address document example:

<!-- address.dtd -->
<!ELEMENT address (name, street, city, state, postal-code)>
<!ELEMENT name (title?, first-name, last-name)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT first-name (#PCDATA)>
<!ELEMENT last-name (#PCDATA)>
<!ELEMENT street (#PCDATA)>
<!ELEMENT city (#PCDATA)>
<!ELEMENT state (#PCDATA)>
<!ELEMENT postal-code (#PCDATA)>

This DTD defines all of the elements used in the sample document. It defines three basic
things:

• An <address> element contains a <name>, a <street>, a <city>, a <state>, and a
<postal-code>. All of those elements must appear, and they must appear in that order.
• A <name> element contains an optional <title> element (the question mark means the
title is optional), followed by a <first-name> and a <last-name> element.
• All of the other elements contain text. ( #PCDATA stands for parsed character data;
you can't include another element in these elements.)

Although the DTD is pretty simple, it makes it clear what combinations of elements are legal.
An address document that has a <postal-code> element before the <state> element isn't legal,
and neither is one that has no <last-name> element.

Also, notice that DTD syntax is different from ordinary XML syntax. (XML Schema
documents, by contrast, are themselves XML, which has some interesting consequences.)
Despite the different syntax for DTDs, you can still put an ordinary comment in the DTD
itself.

c) Symbols in DTDs

There are a few symbols used in DTDs to indicate how often (or whether) something may
appear in an XML document. Here are some examples, along with their meanings:

• <!ELEMENT address (name, city, state)>

The <address> element must contain a <name>, a <city>, and a <state> element, in
that order. All of the elements are required. The comma indicates a list of items.

• <!ELEMENT name (title?, first-name, last-name)>

This means that the <name> element contains an optional <title> element, followed by
a mandatory <first-name> and a <last-name> element. The question mark indicates
that an item is optional; it can appear once or not at all.

• <!ELEMENT addressbook (address+)>

An <addressbook> element contains one or more <address> elements. You can have
as many <address> elements as you need, but there has to be at least one. The plus
sign indicates that an item must appear at least once, but can appear any number
of times.

• <!ELEMENT private-addresses (address*)>

A <private-addresses> element contains zero or more <address> elements. The
asterisk indicates that an item can appear any number of times, including zero.

• <!ELEMENT name (title?, first-name, (middle-initial | middle-name)?, last-name)>

A <name> element contains an optional <title> element, followed by a <first-name>
element, possibly followed by either a <middle-initial> or a <middle-name> element,
followed by a <last-name> element. In other words, both <middle-initial> and
<middle-name> are optional, and you can have only one of the two. Vertical bars
indicate a list of choices; you can choose only one item from the list. Also notice
that this example uses parentheses to group certain elements, and it uses a question
mark against the group.

• <!ELEMENT name ((title?, first-name, last-name) | (surname, mothers-name, given-name))>

The <name> element can contain one of two sequences: An optional <title>, followed
by a <first-name> and a <last-name>; or a <surname>, a <mothers-name>, and a
<given-name>.
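
The DTD symbols behave much like the operators of a regular expression, and that analogy can be made concrete. The sketch below is not a DTD validator (Python's standard library does not include one); it is a hand-rolled illustration that translates an element-content model into a regular expression over the names of an element's children. The function names `content_model_to_regex` and `children_match` are hypothetical, and mixed content such as (#PCDATA) is outside its scope.

```python
import re
import xml.etree.ElementTree as ET

# An element name, or one of the DTD content-model symbols.
TOKEN = re.compile(r"[A-Za-z_][\w.-]*|[()|?*+,]")


def content_model_to_regex(model: str) -> re.Pattern:
    """Translate a DTD element-content model into a regular expression."""
    parts = []
    for token in TOKEN.findall(model):
        if token == ",":
            continue                      # a sequence is plain concatenation
        if token == "(":
            parts.append("(?:")           # grouping, without capturing
        elif token in ")|?*+":
            parts.append(token)           # same meaning in DTDs and regexes
        else:
            # An element name, terminated by a space so names cannot run together.
            parts.append("(?:" + re.escape(token) + " )")
    return re.compile("".join(parts))


def children_match(model: str, xml_text: str) -> bool:
    """Check the child elements of the root element against a content model."""
    element = ET.fromstring(xml_text)
    names = "".join(child.tag + " " for child in element)
    return content_model_to_regex(model).fullmatch(names) is not None


NAME_MODEL = "(title?, first-name, (middle-initial | middle-name)?, last-name)"

print(children_match(NAME_MODEL,
                     "<name><first-name>A</first-name><last-name>B</last-name></name>"))
print(children_match(NAME_MODEL,
                     "<name><first-name>A</first-name><middle-initial>X</middle-initial>"
                     "<middle-name>Y</middle-name><last-name>B</last-name></name>"))
```

The first call succeeds because the title and the middle group are both optional; the second fails because only one of the two middle elements may appear.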

d) A word about flexibility

Before going on, a quick note about designing XML document types for flexibility. Consider
the sample name and address document type; we clearly wrote it with U.S. postal addresses
in mind. If you want a DTD or schema that defines rules for other types of addresses, you
would have to add a lot more complexity to it. Requiring a <state> element might make sense
in Australia, but it wouldn't in the UK. A Canadian address might be handled by the sample
DTD in section b), Document Type Definitions, but adding a <province> element is a better
idea. Finally, be aware that in many parts of the world, concepts like title, first name, and
last name don't make sense.

The bottom line: If you're going to define the structure of an XML document, you should put
as much forethought into your DTD or schema as you would if you were designing a database
schema or a data structure in an application. The more future requirements you can foresee,
the easier and cheaper it will be for you to implement them later.

e) Defining attributes

This introductory part doesn't go into great detail about how DTDs work, but there's one more
basic topic to cover here: defining attributes. You can define attributes for the elements that
will appear in your XML document. Using a DTD, you can also:

• Define which attributes are required
• Define default values for attributes
• List all of the valid values for a given attribute

Suppose that you want to change the DTD to make state an attribute of the <city> element.
Here's how to do that:

<!ELEMENT city (#PCDATA)>
<!ATTLIST city state CDATA #REQUIRED>

This defines the <city> element as before, but the revised example also uses an ATTLIST
declaration to list the attributes of the element. The name city inside the attribute list tells the
parser that these attributes are defined for the <city> element. The name state is the name of
the attribute, and the keywords CDATA and #REQUIRED tell the parser that the state
attribute contains text and is required (if it's optional, CDATA #IMPLIED will do the trick).

To define multiple attributes for an element, write the ATTLIST like this:

<!ELEMENT city (#PCDATA)>
<!ATTLIST city state       CDATA #REQUIRED
               postal-code CDATA #REQUIRED>

This example defines both state and postal-code as attributes of the <city> element.

Finally, DTDs allow you to define default values for attributes and enumerate all of the valid
values for an attribute:

<!ELEMENT city (#PCDATA)>
<!ATTLIST city state (AZ|CA|NV|OR|UT|WA) "CA">

The example here indicates that it only supports addresses from the states of Arizona (AZ),
California (CA), Nevada (NV), Oregon (OR), Utah (UT), and Washington (WA), and that the
default state is California. Thus, you can do a very limited form of data validation. While this
is a useful function, it's a small subset of what you can do with XML schemas (see section IV,
XML schemas).
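
Attribute defaults can be observed even with a non-validating parser, provided the declaration sits in the document's internal DTD subset. The sketch below assumes Python's `xml.etree.ElementTree` (built on the expat parser) and an enumerated declaration written as `state (AZ|CA|NV|OR|UT|WA) "CA"`; the city names are made-up sample data. Because this parser does not validate, it fills in the default but does not enforce the list of allowed values.

```python
import xml.etree.ElementTree as ET

# Internal DTD subset: an enumerated attribute with a default value.
DTD = """<!DOCTYPE city [
  <!ELEMENT city (#PCDATA)>
  <!ATTLIST city state (AZ|CA|NV|OR|UT|WA) "CA">
]>"""

defaulted = ET.fromstring(DTD + "<city>Sacramento</city>")
explicit = ET.fromstring(DTD + '<city state="NV">Reno</city>')
unchecked = ET.fromstring(DTD + '<city state="TX">Austin</city>')

print(defaulted.get("state"))  # supplied by the ATTLIST default
print(explicit.get("state"))   # taken from the document
print(unchecked.get("state"))  # not in the list, but nothing validates it here
```

Rejecting the third document requires a validating parser, which is a separate tool from the one sketched here.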

IV. XML schemas

a) Overview

With XML schemas, you have more power to define what valid XML documents look like.
They have several advantages over DTDs:

• XML schemas use XML syntax. In other words, an XML schema is an XML
document. That means you can process a schema just like any other document. For
example, you can write an XSLT style sheet that converts an XML schema into a Web
form complete with automatically generated JavaScript code that validates the data as
you enter it.
• XML schemas support datatypes. While DTDs do support datatypes, it's clear those
datatypes were developed from a publishing perspective. XML schemas support all of
the original datatypes from DTDs (things like IDs and ID references). They also
support integers, floating point numbers, dates, times, strings, URLs, and other
datatypes useful for data processing and validation.
• XML schemas are extensible. In addition to the datatypes defined in the XML
schema specification, you can also create your own, and you can derive new datatypes
based on other datatypes.
• XML schemas have more expressive power. For example, with XML schemas you
can define that the value of any state attribute can't be longer than 2 characters, or
that the value of any <postal-code> element must match the regular expression
[0-9]{5}(-[0-9]{4})?. You can't do either of those things with DTDs.
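
Because a schema is itself an XML document, ordinary XML tools can read it. As a small illustration of that first advantage, the sketch below loads a cut-down, hypothetical schema fragment with Python's `xml.etree.ElementTree` and lists the elements it declares. It only reads the schema; it does not validate anything against it.

```python
import xml.etree.ElementTree as ET

XSD = "http://www.w3.org/2001/XMLSchema"

# A cut-down schema fragment, used here only as input data.
SCHEMA = f"""<xsd:schema xmlns:xsd="{XSD}">
  <xsd:element name="title" type="xsd:string"/>
  <xsd:element name="street" type="xsd:string"/>
  <xsd:element name="city" type="xsd:string"/>
</xsd:schema>"""

schema = ET.fromstring(SCHEMA)

# Map each declared element name to its declared type.
declared = {
    element.get("name"): element.get("type")
    for element in schema.findall(f"{{{XSD}}}element")
}
for name, datatype in declared.items():
    print(f"{name}: {datatype}")
```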

b) A sample XML schema

Here's an XML schema that matches the original name and address DTD. It adds two
constraints: The value of the <state> element must be exactly two characters long and the
value of the <postal-code> element must match the regular expression [0-9]{5}(-[0-9]{4})?.
Although the schema is much longer than the DTD, it expresses more clearly what a valid
document looks like. Here's the schema:

<?xml version="1.0" encoding="UTF-8"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:element name="address">
    <xsd:complexType>
      <xsd:sequence>
        <xsd:element ref="name"/>
        <xsd:element ref="street"/>
        <xsd:element ref="city"/>
        <xsd:element ref="state"/>
        <xsd:element ref="postal-code"/>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>

  <xsd:element name="name">
    <xsd:complexType>
      <xsd:sequence>
        <xsd:element ref="title" minOccurs="0"/>
        <xsd:element ref="first-name"/>
        <xsd:element ref="last-name"/>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>

  <xsd:element name="title" type="xsd:string"/>
  <xsd:element name="first-name" type="xsd:string"/>
  <xsd:element name="last-name" type="xsd:string"/>
  <xsd:element name="street" type="xsd:string"/>
  <xsd:element name="city" type="xsd:string"/>

  <xsd:element name="state">
    <xsd:simpleType>
      <xsd:restriction base="xsd:string">
        <xsd:length value="2"/>
      </xsd:restriction>
    </xsd:simpleType>
  </xsd:element>

  <xsd:element name="postal-code">
    <xsd:simpleType>
      <xsd:restriction base="xsd:string">
        <xsd:pattern value="[0-9]{5}(-[0-9]{4})?"/>
      </xsd:restriction>
    </xsd:simpleType>
  </xsd:element>
</xsd:schema>

c) Defining elements in schemas

The XML schema in section b), A sample XML schema, defines a number of XML elements with
the <xsd:element> element. The first two elements defined, <address> and <name>, are
composed of other elements. The <xsd:sequence> element defines the sequence of elements
that are contained in each. Here's an example:

<xsd:element name="address">
  <xsd:complexType>
    <xsd:sequence>
      <xsd:element ref="name"/>
      <xsd:element ref="street"/>
      <xsd:element ref="city"/>
      <xsd:element ref="state"/>
      <xsd:element ref="postal-code"/>
    </xsd:sequence>
  </xsd:complexType>
</xsd:element>

As in the DTD version, the XML schema example defines that an <address> contains a
<name>, a <street>, a <city>, a <state>, and a <postal-code> element, in that order. Notice
that the schema actually defines a new datatype with the <xsd:complexType> element.

Most of the elements contain text; defining them is simple. You merely declare the new
element, and give it a datatype of xsd:string:

<xsd:element name="title" type="xsd:string"/>
<xsd:element name="first-name" type="xsd:string"/>
<xsd:element name="last-name" type="xsd:string"/>
<xsd:element name="street" type="xsd:string"/>
<xsd:element name="city" type="xsd:string"/>

d) Defining element content in schemas

The sample schema defines constraints for the content of two elements: The content of a
<state> element must be two characters long, and the content of a <postal-code> element
must match the regular expression [0-9]{5}(-[0-9]{4})?. Here's how to do that:

<xsd:element name="state">
  <xsd:simpleType>
    <xsd:restriction base="xsd:string">
      <xsd:length value="2"/>
    </xsd:restriction>
  </xsd:simpleType>
</xsd:element>

<xsd:element name="postal-code">
  <xsd:simpleType>
    <xsd:restriction base="xsd:string">
      <xsd:pattern value="[0-9]{5}(-[0-9]{4})?"/>
    </xsd:restriction>
  </xsd:simpleType>
</xsd:element>

For the <state> and <postal-code> elements, the schema defines new data types with
restrictions. The first case uses the <xsd:length> element, and the second uses the
<xsd:pattern> element to define a regular expression that this element must match.
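
To get a feel for what these two restrictions accept, here is a hand-rolled Python sketch of the same checks. It is not a schema validator, and the function names `state_ok` and `postal_code_ok` are hypothetical; it simply applies a length test and the same regular expression. The expression is matched against the whole value, which mirrors how a schema pattern must cover the entire content.

```python
import re

# Same expression as in the <xsd:pattern> facet.
POSTAL_CODE = re.compile(r"[0-9]{5}(-[0-9]{4})?")


def state_ok(value: str) -> bool:
    """Mirror <xsd:length value="2"/>: exactly two characters."""
    return len(value) == 2


def postal_code_ok(value: str) -> bool:
    """Mirror the pattern facet: the whole value must match."""
    return POSTAL_CODE.fullmatch(value) is not None


for candidate in ["CA", "California"]:
    print(candidate, state_ok(candidate))

for candidate in ["27709", "27709-1234", "2770", "27709-12"]:
    print(candidate, postal_code_ok(candidate))
```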

This summary only scratches the surface of what XML schemas can do; there are entire books
written on the subject. For the purpose of this introduction, suffice to say that XML schemas
are a very powerful and flexible way to describe what a valid XML document looks like.
