Chapter 4. Java XML Processing
Parsers
• What is a parser?
- A program that analyses the grammatical
structure of an input, with respect to a given
formal grammar
- The parser determines how a sentence can be
constructed from the grammar of the language by
describing the atomic elements of the input and the
relationship among them
• How should an XML parser work?
XML-Parsing Standards
XML Examples
world.xml
<?xml version="1.0"?>
<!DOCTYPE countries SYSTEM "world.dtd">
<countries>
  <country continent="&as;">
    <name>Israel</name>
    <population year="2001">6,199,008</population>
    <city capital="yes"><name>Jerusalem</name></city>
    <city><name>Ashdod</name></city>
  </country>
  <country continent="&eu;">
    <name>France</name>
    <population year="2004">60,424,213</population>
  </country>
</countries>
XML Tree Model
countries
  country
    continent (attribute): Asia
    name: Israel
    population: 6,199,008
      year (attribute): 2001
    city
      capital (attribute): yes
      name: Jerusalem
    city
      capital (attribute): no
      name: Ashdod
  country
    continent (attribute): Europe
    name: France
    population: 60,424,213
      year (attribute): 2004
world.dtd
<!ELEMENT countries (country*)>
<!ELEMENT country (name,population?,city*)>
<!ATTLIST country continent CDATA #REQUIRED>
<!ELEMENT name (#PCDATA)>
<!ELEMENT city (name)>
<!ATTLIST city capital (yes|no) "no">
<!ELEMENT population (#PCDATA)>
<!ATTLIST population year CDATA #IMPLIED>
<!ENTITY eu "Europe">
<!ENTITY as "Asia">
<!ENTITY af "Africa">
<!ENTITY am "America">
<!ENTITY au "Australia">
sales.xml
<?xml version="1.0"?>
<forsale date="12/2/03"
         xmlns:xhtml="http://www.w3.org/1999/xhtml">
  <book>
    <title> <xhtml:em>DBI:</xhtml:em>
      <![CDATA[Where I Learned <xhtml>.]]>
    </title>
    <comment
        xmlns="http://www.cs.huji.ac.il/~dbi/comments">
      <par>My <xhtml:b> favorite </xhtml:b> book!</par>
    </comment>
  </book>
</forsale>
sales.xml
<?xml version="1.0"?>
<forsale date="12/2/03"
         xmlns:xhtml="http://www.w3.org/1999/xhtml">
  <book>
    <title> <xhtml:h1> DBI </xhtml:h1>
      <![CDATA[Where I Learned <xhtml>.]]>
    </title>
    <comment
        xmlns="http://www.cs.huji.ac.il/~dbi/comments">
      <par>My <xhtml:b> favorite </xhtml:b> book!</par>
    </comment>
  </book>
</forsale>
• The element <xhtml:h1>
- Namespace: “http://www.w3.org/1999/xhtml”
- Local name: “h1”
- Qualified name: “xhtml:h1”
• The element <par>
- Namespace: “http://www.cs.huji.ac.il/~dbi/comments”
- Local name: “par”
- Qualified name: “par”
• The element <title>
- Namespace: “”
- Local name: “title”
- Qualified name: “title”
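The three names of an element can be observed directly from a namespace-aware SAX parser. The following is a minimal sketch, not taken from the slides: the class NamespaceNames and its describe method are hypothetical, and the parser is obtained through JAXP's SAXParserFactory, which has to be told explicitly to be namespace-aware.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class; not part of the original slides.
class NamespaceNames {

    /** Returns one "namespace|localName|qualifiedName" entry per start tag. */
    static List<String> describe(String xml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true); // JAXP factories are not namespace-aware by default
        XMLReader reader = factory.newSAXParser().getXMLReader();
        List<String> names = new ArrayList<>();
        reader.setContentHandler(new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                names.add(uri + "|" + localName + "|" + qName);
            }
        });
        reader.parse(new InputSource(new StringReader(xml)));
        return names;
    }

    public static void main(String[] args) throws Exception {
        // A shortened version of sales.xml
        String xml = "<forsale xmlns:xhtml=\"http://www.w3.org/1999/xhtml\">"
                + "<title><xhtml:h1>DBI</xhtml:h1></title>"
                + "<comment xmlns=\"http://www.cs.huji.ac.il/~dbi/comments\">"
                + "<par>My book</par></comment>"
                + "</forsale>";
        for (String line : describe(xml)) {
            System.out.println(line);
        }
    }
}
```

An element without a namespace, such as title, is reported with an empty namespace string and identical local and qualified names.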
SAX – Simple API for XML
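SAX Parser
[Figure: the SAX parser reads the XML document (<?xml version="1.0"?> ...) sequentially and is told in advance how to react to each event]
• When you see the start of the document do …
• When you see the start of an element do …
• When you see the end of an element do …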
[Figure: the components of a SAX application]
• XML-Reader Factory - used to create a SAX parser
• XML Reader - the SAX parser, which reads the XML and the DTD
• Content Handler - handles document events: start tag, end tag, etc.
• Error Handler - handles parser errors
• DTD Handler - handles DTD
• Entity Resolver - handles entities
Creating a Parser
• The SAX interface is an accepted standard
• There are many implementations from many vendors
- The standard API does not include an actual
implementation, but Sun provides one with the JDK
• We would like to be able to change the implementation
used without changing any code in the program
- How is this done?
Factory Design Pattern
• Have a “factory” class that creates the actual parsers
- org.xml.sax.helpers.XMLReaderFactory
• The factory checks configurations, such as the value of a
system property, that specify the implementation
- Can be set outside the Java code: a configuration file, a
command-line argument, etc.
• In order to change the implementation, simply change
the system property
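The same idea can be sketched with JAXP's SAXParserFactory, which (like XMLReaderFactory) picks the concrete parser class at run time. This is an illustrative sketch rather than the slides' own code: the class ParserFactoryDemo is hypothetical, and the system property that JAXP consults here is javax.xml.parsers.SAXParserFactory, not org.xml.sax.driver.

```java
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.XMLReader;

// Hypothetical demo class; not part of the original slides.
class ParserFactoryDemo {

    /** Asks the factory for a parser; the program never names a concrete parser class. */
    static XMLReader newReader() throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        return factory.newSAXParser().getXMLReader();
    }

    public static void main(String[] args) throws Exception {
        XMLReader reader = newReader();
        // Shows which implementation the factory selected on this installation
        System.out.println(reader.getClass().getName());
    }
}
```

Running the program with -Djavax.xml.parsers.SAXParserFactory=<factory class> on the command line would select another implementation, provided that class is on the classpath, without recompiling.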
Creating a SAX Parser
import org.xml.sax.*;
import org.xml.sax.helpers.*;

public class EchoWithSax {
    public static void main(String[] args) throws Exception {
        // Select the parser implementation (this value requires Xerces on the classpath)
        System.setProperty("org.xml.sax.driver",
                "org.apache.xerces.parsers.SAXParser");
        XMLReader reader =
                XMLReaderFactory.createXMLReader();
        reader.parse("world.xml");
    }
}
Implementing the Content Handler
ContentHandler Methods
• startDocument - parsing begins
• endDocument - parsing ends
• startElement - an opening tag is encountered
• endElement - a closing tag is encountered
• characters - text (character data) is encountered
• ignorableWhitespace - white spaces that should
be ignored (according to the DTD)
• and more ...
The Default Handler
A Content Handler Example
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.*;
int depth = 0;
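The body of the handler is not shown above. The following is a sketch of what a depth-tracking echo handler might look like, assuming it extends DefaultHandler and indents its output by the current depth; the class name EchoHandlerSketch and the static helper echo are hypothetical, and the XML is read from a string so that the example is self-contained.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical reconstruction; the handler in the original slides may differ.
class EchoHandlerSketch extends DefaultHandler {
    private final StringBuilder out = new StringBuilder();
    private int depth = 0;

    /** Appends one line, indented by two spaces per nesting level. */
    private void line(String text) {
        for (int i = 0; i < depth; i++) out.append("  ");
        out.append(text).append('\n');
    }

    @Override
    public void startElement(String uri, String localName,
                             String qName, Attributes atts) {
        line("<" + qName + ">");
        depth++;
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        depth--;
        line("</" + qName + ">");
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // A parser may deliver one text run in several calls; each call is echoed separately
        String text = new String(ch, start, length).trim();
        if (!text.isEmpty()) line(text);
    }

    /** Parses the given XML text and returns the echoed structure. */
    static String echo(String xml) throws Exception {
        EchoHandlerSketch handler = new EchoHandlerSketch();
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        reader.setContentHandler(handler);
        reader.parse(new InputSource(new StringReader(xml)));
        return handler.out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.print(echo("<country><name>Israel</name><city/></country>"));
    }
}
```

Note that the empty element <city/> still produces both a start and an end event.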
Fixing The Parser
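import org.xml.sax.*;
import org.xml.sax.helpers.*;

public class EchoWithSax {
    public static void main(String[] args) throws Exception {
        XMLReader reader =
                XMLReaderFactory.createXMLReader();
        // Register the handler so that parsing events are reported to it
        reader.setContentHandler(new EchoHandler());
        reader.parse("world.xml");
    }
}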
Empty Elements
Attributes Interface
• The Attributes interface provides access to all
attributes of an element
- getLength(), getQName(i), getValue(i),
getType(i), getValue(qname), etc.
• The following are possible types for attributes:
CDATA, ID, IDREF, IDREFS, NMTOKEN, NMTOKENS,
ENTITY, ENTITIES, NOTATION
• There is no distinction between attributes that are
specified explicitly in the document and those that are
supplied by the DTD (with a default value)
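A short sketch of iterating over the Attributes of each element follows. It is an illustration under stated assumptions: the class AttributeLister is hypothetical, and the default value for capital is declared in an internal DTD subset so that the example needs no external file.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class; not part of the original slides.
class AttributeLister extends DefaultHandler {
    final List<String> seen = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName,
                             String qName, Attributes atts) {
        // Explicit and DTD-defaulted attributes are reported in the same way
        for (int i = 0; i < atts.getLength(); i++) {
            seen.add(qName + "@" + atts.getQName(i) + "=" + atts.getValue(i));
        }
    }

    /** Parses the XML text and lists every attribute as element@name=value. */
    static List<String> list(String xml) throws Exception {
        AttributeLister handler = new AttributeLister();
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        reader.setContentHandler(handler);
        reader.parse(new InputSource(new StringReader(xml)));
        return handler.seen;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<!DOCTYPE country [<!ATTLIST city capital (yes|no) \"no\">]>"
                + "<country continent=\"Asia\">"
                + "<city capital=\"yes\"/><city/>"
                + "</country>";
        for (String attribute : list(xml)) {
            System.out.println(attribute);
        }
    }
}
```

The second city element has no attribute in the document text, yet the handler still sees capital with the default value from the DTD.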
ErrorHandler Interface
• We implement ErrorHandler to receive error events
(similar to implementing ContentHandler)
• DefaultHandler implements ErrorHandler with minimal
defaults (warning and error do nothing, fatalError throws
the exception), so we can extend it (as before)
• An ErrorHandler is registered with
- reader.setErrorHandler(handler);
• Three methods:
- void error(SAXParseException ex);
- void fatalError(SAXParseException ex);
- void warning(SAXParseException ex);
Parsing Errors
• Fatal errors prevent the parser from continuing to parse
- For example, the document is not well formed, an unknown
XML version is declared, etc.
• Errors occur when the parser is validating and validity
constraints are violated
• Warnings occur when abnormal (yet legal) conditions
are encountered
- For example, an entity is declared twice in the DTD
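The difference between errors and fatal errors can be sketched with a handler that merely counts events. The class CountingErrorHandler and its check method are hypothetical names for this illustration; validation is switched on through JAXP's SAXParserFactory, and the DTD is given as an internal subset.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;

// Hypothetical demo class; not part of the original slides.
class CountingErrorHandler implements ErrorHandler {
    int warnings;
    int errors;
    int fatalErrors;

    @Override
    public void warning(SAXParseException ex) {
        warnings++;
    }

    @Override
    public void error(SAXParseException ex) {
        errors++; // a validity violation: parsing may continue
    }

    @Override
    public void fatalError(SAXParseException ex) throws SAXParseException {
        fatalErrors++;
        throw ex; // the document is not well formed: stop parsing
    }

    /** Parses the XML text and returns the handler with its counters filled in. */
    static CountingErrorHandler check(String xml, boolean validate) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setValidating(validate);
        XMLReader reader = factory.newSAXParser().getXMLReader();
        CountingErrorHandler handler = new CountingErrorHandler();
        reader.setErrorHandler(handler);
        try {
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (SAXParseException ex) {
            // Parsing stopped at a fatal error; the counter already records it
        }
        return handler;
    }

    public static void main(String[] args) throws Exception {
        String invalid = "<!DOCTYPE a [<!ELEMENT a (b)><!ELEMENT b EMPTY>]><a/>";
        System.out.println("errors: " + check(invalid, true).errors);
        System.out.println("fatal errors: " + check("<a><b></a>", false).fatalErrors);
    }
}
```

The first document is well formed but violates its DTD, so it is reported through error only when validation is on; the second has mismatched tags and is reported through fatalError regardless of validation.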
EntityResolver and DTDHandler
Features and Properties
• Features:
- namespaces - are namespaces supported?
- validation - does the parser validate (against the
declared DTD)?
- http://apache.org/xml/features/nonvalidating/load-external-dtd
• Ignore the DTD? (specific to the Xerces implementation)
• Properties:
- xml-string - the actual text that caused the current
event (read-only)
- lexical-handler - see the next slide...
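Features are switched on and off through their full URI names; the standard SAX features share the prefix http://xml.org/sax/features/. The following is a minimal sketch with a hypothetical class FeatureDemo, assuming a parser obtained through JAXP.

```java
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.XMLReader;

// Hypothetical demo class; not part of the original slides.
class FeatureDemo {
    static final String NAMESPACES = "http://xml.org/sax/features/namespaces";
    static final String VALIDATION = "http://xml.org/sax/features/validation";

    /** Creates a reader that is namespace-aware and validates against the DTD. */
    static XMLReader configuredReader() throws Exception {
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        // setFeature fails with a SAXException if the parser cannot honor the request
        reader.setFeature(NAMESPACES, true);
        reader.setFeature(VALIDATION, true);
        return reader;
    }

    public static void main(String[] args) throws Exception {
        XMLReader reader = configuredReader();
        System.out.println("namespaces: " + reader.getFeature(NAMESPACES));
        System.out.println("validation: " + reader.getFeature(VALIDATION));
    }
}
```

Features must be set before parse is called; a parser is not required to accept changes while it is parsing.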
Lexical Events
LexicalHandler Methods
• comment(char[] ch, int start, int length)
• startCDATA()
• endCDATA()
• startEntity(java.lang.String name)
• endEntity(java.lang.String name)
• and more...
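A lexical handler is registered as a property rather than through a setter. The sketch below assumes the standard property name http://xml.org/sax/properties/lexical-handler and uses org.xml.sax.ext.DefaultHandler2, which supplies empty implementations of the LexicalHandler methods; the class LexicalEvents is hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.DefaultHandler2;

// Hypothetical demo class; not part of the original slides.
class LexicalEvents extends DefaultHandler2 {
    final List<String> events = new ArrayList<>();

    @Override
    public void comment(char[] ch, int start, int length) {
        events.add("comment:" + new String(ch, start, length));
    }

    @Override
    public void startCDATA() {
        events.add("startCDATA");
    }

    @Override
    public void endCDATA() {
        events.add("endCDATA");
    }

    /** Parses the XML text and returns the lexical events in document order. */
    static List<String> collect(String xml) throws Exception {
        LexicalEvents handler = new LexicalEvents();
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        reader.setContentHandler(handler);
        // Lexical events are delivered only if the handler is registered as a property
        reader.setProperty("http://xml.org/sax/properties/lexical-handler", handler);
        reader.parse(new InputSource(new StringReader(xml)));
        return handler.events;
    }

    public static void main(String[] args) throws Exception {
        for (String event : collect("<a><!--note--><![CDATA[x < y]]></a>")) {
            System.out.println(event);
        }
    }
}
```

The text inside the CDATA section still arrives through characters of the content handler; startCDATA and endCDATA only mark its boundaries.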
DOM – Document Object Model
DOM Parser
<?xml version="1.0"?>
<!DOCTYPE countries SYSTEM "world.dtd">
<countries>
  <country continent="&as;">
    <name>Israel</name>
    <population year="2001">6,199,008</population>
    <city capital="yes"><name>Jerusalem</name></city>
    <city><name>Ashdod</name></city>
  </country>
  <country continent="&eu;">
    <name>France</name>
    <population year="2004">60,424,213</population>
  </country>
</countries>
The DOM Tree
Document
  countries
    country
      continent (attribute): Asia
      name: Israel
      population: 6,199,008
        year (attribute): 2001
      city
        capital (attribute): yes
        name: Jerusalem
      city
        capital (attribute): no
        name: Ashdod
    country
      continent (attribute): Europe
      name: France
      population: 60,424,213
        year (attribute): 2004
Using a DOM Tree
[Figure: XML File → DOM Parser → DOM Tree → API → Application]
Creating a DOM Tree
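In Java, a DOM tree is commonly created through JAXP's DocumentBuilderFactory. The following is a minimal sketch under that assumption; the class DomTreeBuilder is hypothetical, and the XML is read from a string instead of world.xml so that the example is self-contained.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class; not part of the original slides.
class DomTreeBuilder {

    /** Parses the XML text and returns the root of the resulting DOM tree. */
    static Document parse(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        DocumentBuilder builder = factory.newDocumentBuilder();
        // The whole document is read and the tree is built before parse returns
        return builder.parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse("<countries><country continent=\"Asia\">"
                + "<name>Israel</name></country></countries>");
        System.out.println(doc.getDocumentElement().getNodeName());
    }
}
```

As with SAX, the factory hides the concrete parser class, so the implementation can be replaced without changing the program.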
The Node Interface
• The nodes of the DOM tree include
- a special root (denoted document)
- element nodes
- text nodes and CDATA sections
- attributes
- comments
- and more ...
• Every node in the DOM tree implements the Node
interface
Node Navigation
• getParentNode() - the parent of the node
• getFirstChild(), getLastChild() - the first and last child
• getPreviousSibling(), getNextSibling() - the adjacent siblings
• getChildNodes() - all the children
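A small sketch of these navigation methods in use; the class NodeNavigation and its helper methods are hypothetical and serve only as an illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical demo class; not part of the original slides.
class NodeNavigation {

    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    /** Names of the element children of a node, found by walking the sibling chain. */
    static List<String> childElementNames(Node parent) {
        List<String> names = new ArrayList<>();
        for (Node child = parent.getFirstChild(); child != null;
                child = child.getNextSibling()) {
            if (child.getNodeType() == Node.ELEMENT_NODE) {
                names.add(child.getNodeName());
            }
        }
        return names;
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse("<country><name>Israel</name><city/><city/></country>");
        Node country = doc.getDocumentElement();
        System.out.println(childElementNames(country));
        // Moving down to the last child and back up returns to the same node
        System.out.println(country.getLastChild().getParentNode().isSameNode(country));
    }
}
```

Unlike SAX, the tree can be walked in any direction and as many times as needed.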
Node Properties
• Every node has
- a type
- a name
- a value
- attributes
• The roles of these properties differ according to the
node types
• Nodes of different types implement different interfaces
(that extend Node)
Interfaces in the DOM Tree
[Figure, as appears in “The XML Companion” - Neil Bradley: the DOM interfaces, including Document, DocumentType, Comment, Text, Entity, EntityReference, ProcessingInstruction and NamedNodeMap]
Names, Values and Attributes
Interface              nodeName                    nodeValue               attributes
Attr                   name of attribute           value of attribute      null
CDATASection           "#cdata-section"            content of the section  null
Comment                "#comment"                  content of the comment  null
Document               "#document"                 null                    null
DocumentFragment       "#document-fragment"        null                    null
DocumentType           doc-type name               null                    null
Element                tag name                    null                    NamedNodeMap
Entity                 entity name                 null                    null
EntityReference        name of entity referenced   null                    null
Notation               notation name               null                    null
ProcessingInstruction  target                      entire content          null
Text                   "#text"                     content of the text node  null
Node Types - getNodeType()
ELEMENT_NODE = 1
ATTRIBUTE_NODE = 2
TEXT_NODE = 3
CDATA_SECTION_NODE = 4
ENTITY_REFERENCE_NODE = 5
ENTITY_NODE = 6
PROCESSING_INSTRUCTION_NODE = 7
COMMENT_NODE = 8
DOCUMENT_NODE = 9
DOCUMENT_TYPE_NODE = 10
DOCUMENT_FRAGMENT_NODE = 11
NOTATION_NODE = 12
if (myNode.getNodeType() == Node.ELEMENT_NODE) {
    // process node
    …
}
import org.w3c.dom.*;
import javax.xml.parsers.*;
        depth++;
        for (Node child = n.getFirstChild(); child != null;
                child = child.getNextSibling()) echo(child);
        depth--;
    }
    private int depth = 0;
    private String[] NODE_TYPES = {
        "", "ELEMENT", "ATTRIBUTE", "TEXT", "CDATA",
        "ENTITY_REF", "ENTITY", "PROCESSING_INST",
        "COMMENT", "DOCUMENT", "DOCUMENT_TYPE",
        "DOCUMENT_FRAG", "NOTATION" };
    private void print(Node n) {
        for (int i = 0; i < depth; i++) System.out.print(" ");
        System.out.print(NODE_TYPES[n.getNodeType()] + ":");
        System.out.print("Name: "+ n.getNodeName());
        System.out.print(" Value: "+ n.getNodeValue()+"\n");
    }
}
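Only fragments of the echo program appear above. The following is a complete sketch in the same spirit, assuming that the missing part parses the document and calls print on each node before descending; the class name DomEcho and the method run are hypothetical, and the output is collected in a string rather than printed directly.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical reconstruction; the class in the original slides may differ.
class DomEcho {
    private static final String[] NODE_TYPES = {
        "", "ELEMENT", "ATTRIBUTE", "TEXT", "CDATA",
        "ENTITY_REF", "ENTITY", "PROCESSING_INST",
        "COMMENT", "DOCUMENT", "DOCUMENT_TYPE",
        "DOCUMENT_FRAG", "NOTATION" };

    private final StringBuilder out = new StringBuilder();
    private int depth = 0;

    /** Writes one line describing the node, indented by its depth. */
    private void print(Node n) {
        for (int i = 0; i < depth; i++) out.append("  ");
        out.append(NODE_TYPES[n.getNodeType()]).append(":");
        out.append("Name: ").append(n.getNodeName());
        out.append(" Value: ").append(n.getNodeValue()).append("\n");
    }

    /** Depth-first traversal: the node itself, then each of its children. */
    private void echo(Node n) {
        print(n);
        depth++;
        for (Node child = n.getFirstChild(); child != null;
                child = child.getNextSibling()) echo(child);
        depth--;
    }

    /** Parses the XML text and returns the echoed tree. */
    static String run(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        DomEcho echoer = new DomEcho();
        echoer.echo(doc);
        return echoer.out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.print(run("<country continent=\"Asia\"><name>Israel</name></country>"));
    }
}
```

Attributes are not children of their element, so this traversal does not visit them; they would have to be read separately through getAttributes().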
Another Example
public class WorldParser {
Node Manipulation
• Children of a node in a DOM tree can be manipulated -
added, edited, deleted, moved, copied, etc.
• To construct new nodes, use the methods of
Document
- createElement, createAttribute, createTextNode,
createCDATASection etc.
• To manipulate a node, use the methods of Node
- appendChild, insertBefore, removeChild, replaceChild,
setNodeValue, cloneNode(boolean deep) etc.
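The following sketch exercises several of these methods on a small tree built from scratch. The class NodeManipulation and the element names are chosen only for illustration.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

// Hypothetical demo class; not part of the original slides.
class NodeManipulation {

    /** Builds <country><name>Israel</name><city>Ashdod</city></country>. */
    static Document build() throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
        Element country = doc.createElement("country");
        doc.appendChild(country);

        Element city = doc.createElement("city");
        city.appendChild(doc.createTextNode("Ashdod"));
        country.appendChild(city);

        Element name = doc.createElement("name");
        name.appendChild(doc.createTextNode("Israel"));
        country.insertBefore(name, city); // name becomes the first child
        return doc;
    }

    public static void main(String[] args) throws Exception {
        Document doc = build();
        Element country = doc.getDocumentElement();
        Node city = country.getLastChild();

        Node shallow = city.cloneNode(false); // the element only
        Node deep = city.cloneNode(true);     // the element and its subtree
        System.out.println(shallow.hasChildNodes());
        System.out.println(deep.getTextContent());

        // Replace the city with a new element, then remove the name
        country.replaceChild(doc.createElement("population"), city);
        country.removeChild(country.getFirstChild());
        System.out.println(country.getChildNodes().getLength());
    }
}
```

New nodes are always created by the Document that will own them and only then attached to the tree.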
[Figure, as appears in “The XML Companion” - Neil Bradley: insertBefore places the New node before the Ref node; replaceChild puts the New node in place of the Old node; cloneNode with deep = 'false']
SAX vs. DOM
Parser Efficiency
• The DOM object built by DOM parsers is usually
complicated and requires more memory storage than the
XML file itself
- A lot of time is spent on construction before use
- For some very large documents, this may be impractical
• SAX parsers store only local information that is
encountered during the serial traversal
• Hence, programs that use SAX parsers are, in general,
more efficient
Programming using SAX is Difficult
• In some cases, programming with SAX is
difficult:
- How can we find, using a SAX parser, elements e1
with ancestor e2?
- How can we find, using a SAX parser, elements e1
that have a descendant element e2?
- How can we find the element e1 referenced by the
IDREF attribute of e2?
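The first question can be answered with SAX, but only if the program keeps its own record of the open elements. The following is a sketch of that bookkeeping; the class AncestorFinder and its count method are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.Deque;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class; not part of the original slides.
class AncestorFinder extends DefaultHandler {
    private final String target;
    private final String ancestor;
    private final Deque<String> open = new ArrayDeque<>(); // currently open elements
    private int matches = 0;

    AncestorFinder(String target, String ancestor) {
        this.target = target;
        this.ancestor = ancestor;
    }

    @Override
    public void startElement(String uri, String localName,
                             String qName, Attributes atts) {
        // The stack holds exactly the ancestors of the element being opened
        if (qName.equals(target) && open.contains(ancestor)) matches++;
        open.push(qName);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        open.pop();
    }

    /** Counts the elements named target that have an ancestor named ancestor. */
    static int count(String xml, String target, String ancestor) throws Exception {
        AncestorFinder finder = new AncestorFinder(target, ancestor);
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        reader.setContentHandler(finder);
        reader.parse(new InputSource(new StringReader(xml)));
        return finder.matches;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<countries><country><name/><city><name/></city></country></countries>";
        System.out.println(count(xml, "name", "city"));
    }
}
```

The descendant and IDREF questions are harder, because the answer for an element may depend on content that the parser has not reached yet, so the program would have to buffer elements or parse the document twice.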
Node Navigation
• SAX parsers do not provide access to elements other
than the one currently visited in the serial (DFS)
traversal of the document
• In particular,
- They do not read backwards
- They do not enable access to elements by ID or name
• DOM parsers enable any traversal method
• Hence, using DOM parsers is usually more comfortable
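With a DOM tree, a query such as "elements e1 that have a descendant element e2" takes only a few lines. The following is a sketch with a hypothetical class DescendantQuery, using getElementsByTagName, which searches all descendants of the node it is called on.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo class; not part of the original slides.
class DescendantQuery {

    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    /** Returns every element named e1 that has at least one descendant named e2. */
    static List<Element> withDescendant(Document doc, String e1, String e2) {
        List<Element> result = new ArrayList<>();
        NodeList candidates = doc.getElementsByTagName(e1);
        for (int i = 0; i < candidates.getLength(); i++) {
            Element candidate = (Element) candidates.item(i);
            if (candidate.getElementsByTagName(e2).getLength() > 0) {
                result.add(candidate);
            }
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse("<countries>"
                + "<country continent=\"Asia\"><city/></country>"
                + "<country continent=\"Europe\"/>"
                + "</countries>");
        for (Element country : withDescendant(doc, "country", "city")) {
            System.out.println(country.getAttribute("continent"));
        }
    }
}
```

Each result is an ordinary node of the tree, so the program can continue from it in any direction, for example upward with getParentNode().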
More DOM Advantages
Which should we use? DOM vs. SAX
• If your document is very large and you only need
a few elements – use SAX
• If you need to manipulate (i.e., change) the XML
– use DOM
• If you need to access the XML many times – use
DOM (assuming the file is not too large)