Handout 3
In Practical 2 you looked at ways of producing XML output from Fortran codes. In this
afternoon’s exercises you will look at using the XML in ways beyond reading it into a web
browser or Google Earth. The aim of the session is to give you an understanding of XPath,
a rather high-level interface to data from an XML document used by many XML
technologies. It’s worth designing your documents so that they can be easily used with
XPath expressions. However, there is no Fortran XPath interface, so this exercise will be
done using other tools, including Python or Perl.
Don’t worry though: you don’t need to know either scripting language, and XPath is a
language-agnostic interface; for this exercise you will only be writing XPath expressions,
not code. But this also reflects practicalities, in that even for data produced by Fortran
codes, much analysis and post-processing is done using scripting languages such as Perl
or Python.
The files needed for this practical can be found in the directory ~/Practical_3. Some of
the exercises in this practical can be performed in more than one language (Perl, Python
and sometimes Shell examples are given). Unless you have a particular reason to choose
one of these, we recommend you work on the Python examples, as Python has a cleaner
interface to the XPath library you will be using.
xmllint document.xml
If the document is well formed then the XML file will be printed to the screen (useful for
piping into a second process), but if it is not, an error message will be reported, including
the line number where the error occurred. This can be an incredibly useful tool when debugging!
Is the provided document well formed? If not, how can you fix it?
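The same kind of well-formedness check can be sketched in Python. This is only an illustration and not part of the practical's scripts: it uses the standard-library xml.etree.ElementTree module, and the two small documents are invented for the example (the second has its final '>' missing, like the error described in this exercise).

```python
import xml.etree.ElementTree as ET


def check_well_formed(xml_text):
    """Return (True, None) if xml_text parses, else (False, (line, column))."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        # err.position holds the (line, column) reported by the parser
        return False, err.position
    return True, None


# Hypothetical test documents, not the file supplied with the practical
good = "<parent>\n  <child name='a'/>\n</parent>"
bad = "<parent>\n  <child name='a'/>\n</parent"   # final '>' is missing

print(check_well_formed(good))
print(check_well_formed(bad))
```

As with xmllint, the reported position is where the parser gave up, which is usually at or just after the real mistake.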
For the record, there is an error (a missing ‘>’) at the end of the last line of the provided
document. Once you have fixed this you can use xmllint to explore the XML file
interactively. Do the following:
/ >
You can use this prompt in the same way as a normal shell prompt - type commands into
it, and then press return for them to be executed.
By starting up xmllint in this way, it lets you walk around the XML document, as if it were a
filesystem tree, as described in the lecture. At the start of a session, you are placed at the
top of the document tree.
So you are now in the XML document at the top level. You can navigate it as if it were a
directory tree, using 'cd' to change directory, and 'ls' to look at directory contents. So, as a
first step, do:
/ > cd parent
Note that as you do this, the prompt changes to show you where you are in the tree.
parent > ls
You will see a list of results. These are all the XML elements which are the top-most
children of the XML file. That is, the top-level 'directories' if this were a filesystem tree. There
should be three 'child' elements. (You can ignore the 1st and 2nd columns of output.)
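For comparison, the 'cd' and 'ls' steps have a rough Python analogue. The sketch below is an assumption-laden illustration: the family-tree document is a hypothetical stand-in for the file used in this practical (only the element names parent, child and grandchild and the attribute names name and born are taken from this handout), and it uses the standard-library xml.etree.ElementTree module.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the practical's family-tree document
FAMILY = """<parent name="Henry VII" born="1457">
  <child name="Margaret Tudor" born="1489">
    <grandchild name="James V" born="1512"/>
  </child>
  <child name="Henry VIII" born="1491">
    <grandchild name="Mary I" born="1516"/>
    <grandchild name="Elizabeth I" born="1533"/>
    <grandchild name="Edward VI" born="1537"/>
  </child>
  <child name="Mary Tudor" born="1496"/>
</parent>"""

# Roughly 'cd parent': the root element of the document
root = ET.fromstring(FAMILY)

# Roughly 'ls': iterate over the immediate children of the current element
for element in root:
    print(element.tag, element.get("name"))

# Roughly 'cd' into the first child followed by 'ls'
first_child = root[0]
print([g.get("name") for g in first_child])   # ['James V']
```

Iterating over an element yields only its direct children, which matches what 'ls' shows at each level of the tree.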
Let's look at the first element. There are two ways to do this:
The arguments you have been providing are actually XPath expressions. So you can get
all of the grandchild names by doing:
or:
Here the ‘@’ is asking for an attribute with the name ‘name’. But what if you want to
select the date of birth (born attribute) of, say, Elizabeth I? Or the names of the children of
Margaret Tudor? To do this you will need to add a ‘predicate’, a condition that must be met
for the XPath expression to match. Predicates are written in square brackets. The date of
birth of Elizabeth I can be found by doing:
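The two predicate patterns mentioned here can also be sketched in Python. This is an illustration only: the document is a hypothetical stand-in for the practical's file, and the sketch uses the standard-library xml.etree.ElementTree module, which supports only a subset of XPath. In particular it cannot end a path with an attribute step such as /@born, so the attribute is read with .get() instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the practical's family-tree document
FAMILY = """<parent name="Henry VII">
  <child name="Margaret Tudor">
    <grandchild name="James V" born="1512"/>
  </child>
  <child name="Henry VIII">
    <grandchild name="Mary I" born="1516"/>
    <grandchild name="Elizabeth I" born="1533"/>
  </child>
</parent>"""

root = ET.fromstring(FAMILY)

# Predicate on the final step: only the grandchild with a matching name
born = [g.get("born")
        for g in root.findall("./child/grandchild[@name='Elizabeth I']")]
print(born)    # ['1533']

# Predicate on an intermediate step: grandchildren of one particular child
names = [g.get("name")
         for g in root.findall("./child[@name='Margaret Tudor']/grandchild")]
print(names)   # ['James V']
```

The position of the square brackets matters: a predicate filters the location step it is attached to, and the steps after it are then evaluated only for the nodes that passed.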
In this exercise we have only scratched the surface of xmllint’s capabilities. It can be used
for a wide range of XML-related tasks including validation, changing the text encoding,
canonicalisation and processing XInclude statements, but rather than looking at this in any
depth we’ll now move on to look at building XPath expressions in scripting environments.
One thing we have not touched on is namespaces in xmllint. This is not because they are
not supported, but because the commands needed rapidly get quite involved and (as we’ll
see below) namespaces are easily handled from Python.
python xpather.py
You should see “[‘Monty Python and the Holy Grail’]”, which is Python’s way
of printing a one-element list (array) containing a single string.
For the rest of this part of the exercise you should change the XPath query to extract other
data. To change the xpather.py script you just need to edit and save it; there is no need
to compile a Python script. The only line you will need to change is:
answer = docRoot.xpath("/film/@name")
1. The date of the film encoded in the “<data date='1975'/>” element. Hint: you will need
to add an additional location step in the XPath query, and modify the attribute name.
2. The names of all five of the listed Pythons. Hint: The fact that this requires a list of
names to be returned does not add to the complexity of the XPath query needed, again
you just need to change a location step and modify the attribute name.
3. The quotations of all the characters. Hint: you won’t need any attributes, just three
location steps, and you will need /text() to recover the text.
4. You can also complete part 3 with a single location step. Can you construct such an
XPath expression?
5. Modify your solution to part 2 so that only the name of the character played by Terry
Gilliam is reported. Hint: You will need to use a predicate (in square brackets).
First try some of the solutions to exercise 3.2 with the file monty_ns.xml. A copy of
xpather.py is included in the directory. Note that the file name monty.xml has been
changed to monty_ns.xml on the line to load the XML:
docRoot = lxml.etree.parse(source="monty_ns.xml")
You should find that all the expressions return an empty list (‘[]’). The XPath expressions
have not matched any nodes. This is because we have not specified the namespace and
so the XPath library is only looking for nodes with no defined namespace. All nodes in
monty_ns.xml have a namespace defined, either directly with an ‘xmlns=’ declaration,
or by inheritance.
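The effect of an inherited default namespace can be seen in a small Python sketch. It is an illustration under stated assumptions: the document and its namespace URI are invented placeholders (not monty_ns.xml), and it uses the standard-library xml.etree.ElementTree module, which writes a namespaced name as {URI}localname.

```python
import xml.etree.ElementTree as ET

# Hypothetical document; the namespace URI is a placeholder
DOC = """<library xmlns="http://www.example.org/library">
  <book title="A"/>
  <book title="B"/>
</library>"""

root = ET.fromstring(DOC)

# The default namespace declared on the root is inherited by its children
print(root.tag)       # {http://www.example.org/library}library
print(root[0].tag)    # {http://www.example.org/library}book

# A query that names no namespace matches nothing ...
plain = root.findall("book")
# ... whereas spelling out the namespace matches both elements
qualified = root.findall("{http://www.example.org/library}book")

print(plain)            # []
print(len(qualified))   # 2
```

The empty list is not an error: the query is valid, it simply asks for elements called book in no namespace, and the document contains none.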
The file xpather_ns.py is a Python script set up to do the same work as xpather.py,
but for namespaced documents. Take a look at the xpather_ns.py file. You should
note two changes. First, the line:
namespaces = {'q':'http://www.example.com/quotes',
              'f':'http://www.example.com/films'}
has been added. This declares a Python dictionary mapping short local prefixes (f and q)
to namespaces, so that they can be used in XPath expressions with much less typing. Secondly,
the call to the XPath library has been modified:
to tell the library to use the dictionary of namespaces. One way to think of this is that the
dictionary lists all namespaces this script is designed to know about. XPath expressions
will simply ignore namespaces that the script does not understand.
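The idea of a prefix dictionary can be illustrated with a short sketch. It makes several assumptions: the document, its element names and its namespace URIs are invented placeholders, and the standard-library xml.etree.ElementTree module is used here in place of the XPath library called by xpather_ns.py. Note that the prefixes chosen in the script (b and r) differ from those in the document (lib and rev); only the URIs have to agree.

```python
import xml.etree.ElementTree as ET

# Hypothetical two-namespace document; the URIs are placeholders
DOC = """<lib:library xmlns:lib="http://www.example.org/library"
             xmlns:rev="http://www.example.org/reviews">
  <lib:book title="A"><rev:review>Good</rev:review></lib:book>
  <lib:book title="B"><rev:review>Better</rev:review></lib:book>
</lib:library>"""

# Prefixes are local to this script; they map to the document's URIs
namespaces = {"b": "http://www.example.org/library",
              "r": "http://www.example.org/reviews"}

root = ET.fromstring(DOC)

# Each location step carries the prefix of the namespace it lives in
reviews = [r.text for r in root.findall("b:book/r:review", namespaces)]
print(reviews)   # ['Good', 'Better']
```

Every step in the path needs its own prefix; a prefix given on one step is not carried over to the next.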
Run xpather_ns.py - does it work? You should now repeat Exercise 3.2 with this
namespace aware version.
1. The date of the film encoded in the “<data date='1975'/>” element. Do you need
to give a namespace to each data element in the search path, or is the namespace
inherited through the query?
3. The quotations of all the characters. Hint: remember that <quote> elements are in a
different namespace to <film>, <data> and <comic>.
4. You can also complete part 3 with a single location step. Can you construct such an
XPath expression?
5. Modify your solution to part 2 so that only the name of the character played by Terry
Gilliam is reported.
Exercise 3.4: Matching Nodes
The remainder of this morning’s practical involves using XPath to extract data from the
mixed-namespace KML document produced by yesterday’s final exercise. In case you did
not finish the final part of that exercise, a suitable XML document named hypoDD.kml is
provided in the exercise_4 directory along with all the Python scripts needed for the
remainder of this practical.
Hint: Absolute XPath expressions (starting with a / and including all the elements needed to
find the information) can be used for this exercise. The XPath function count() can be used
to find the number of nodes returned by an XPath query.
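Counting the nodes matched by a fully spelled-out path can be sketched as follows. The assumptions are significant: the KML fragment is invented (it is not hypoDD.kml), the KML 2.2 namespace URI is assumed and may differ from the one in your file, and the standard-library xml.etree.ElementTree module used here does not provide count(), so len() on the result list stands in for it. In a full XPath engine the same number would come from wrapping the path in count(...).

```python
import xml.etree.ElementTree as ET

# Assumed KML namespace; check the xmlns declaration in your own file
KML_NS = {"kml": "http://www.opengis.net/kml/2.2"}

# Hypothetical KML fragment with two folders of placemarks
DOC = """<kml xmlns="http://www.opengis.net/kml/2.2">
  <Document>
    <Folder><name>A</name>
      <Placemark><name>p1</name></Placemark>
      <Placemark><name>p2</name></Placemark>
    </Folder>
    <Folder><name>B</name>
      <Placemark><name>p3</name></Placemark>
    </Folder>
  </Document>
</kml>"""

root = ET.fromstring(DOC)   # the <kml> element

# Every element on the way down is named explicitly, as in an absolute path
placemarks = root.findall("kml:Document/kml:Folder/kml:Placemark", KML_NS)
print(len(placemarks))   # 3
```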
Second XPath query: ______________________________________________________
1. Extract a list of placemarks within the folder with the name ‘Initial positions’.
2. Extract the unique_id from the quakeML location element embedded in each
placemark in turn. This expression involves a change of namespace.
4. Find the latitude and longitude of the event with a matching unique_id from the ‘Final
positions’ folder.
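The steps above can be sketched in Python as a lookup keyed on unique_id. This is a rough illustration, not the script supplied in exercise_4, and almost everything about the data is assumed: the KML 2.2 namespace URI, the placeholder quakeML URI, unique_id being an attribute of the location element, and the coordinate values themselves. The one fixed point is that KML writes coordinates as longitude,latitude,altitude. The standard-library xml.etree.ElementTree module is used.

```python
import xml.etree.ElementTree as ET

NS = {"kml": "http://www.opengis.net/kml/2.2",     # assumed KML version
      "q": "http://www.example.org/quakeml"}       # placeholder URI

# Hypothetical document with one event in each folder
DOC = """<kml xmlns="http://www.opengis.net/kml/2.2"
     xmlns:q="http://www.example.org/quakeml">
  <Document>
    <Folder><name>Initial positions</name>
      <Placemark>
        <q:location unique_id="ev1"/>
        <Point><coordinates>-121.50,36.80,0</coordinates></Point>
      </Placemark>
    </Folder>
    <Folder><name>Final positions</name>
      <Placemark>
        <q:location unique_id="ev1"/>
        <Point><coordinates>-121.51,36.81,0</coordinates></Point>
      </Placemark>
    </Folder>
  </Document>
</kml>"""


def positions(root, folder_name):
    """Map unique_id -> (longitude, latitude) for one named folder."""
    # Predicate on the child <name> element selects the folder
    path = "kml:Document/kml:Folder[kml:name='%s']/kml:Placemark" % folder_name
    result = {}
    for placemark in root.findall(path, NS):
        # Change of namespace: the location element is not a KML element
        uid = placemark.find("q:location", NS).get("unique_id")
        text = placemark.findtext("kml:Point/kml:coordinates", namespaces=NS)
        lon, lat, _alt = text.split(",")
        result[uid] = (float(lon), float(lat))
    return result


root = ET.fromstring(DOC)
initial = positions(root, "Initial positions")
final = positions(root, "Final positions")

# Pair up events that appear in both folders
pairs = {uid: (initial[uid], final[uid]) for uid in initial if uid in final}
print(pairs)
```

Building a dictionary for each folder first means each event is matched by a lookup, which avoids running a fresh query over the 'Final positions' folder for every placemark.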
The script should then print the distance that hypoDD has moved each earthquake during
its refinement process.
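This handout does not say how the distance is calculated. One common choice, shown here purely as an assumption, is the haversine great-circle distance between two latitude/longitude pairs on a spherical Earth. The function name and the Earth radius below are choices made for this sketch, and changes in depth are ignored.

```python
import math

EARTH_RADIUS_KM = 6371.0   # mean Earth radius; an assumption of this sketch


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1 = math.radians(lat1)
    phi2 = math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# One degree of longitude along the equator is roughly 111 km
print(round(haversine_km(0.0, 0.0, 0.0, 1.0), 2))
```

For the small shifts expected from a relocation, a flat-Earth approximation would give almost the same answer; the haversine form is used here only because it stays accurate at any separation.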