XML in R
XML
Wednesday
DTL
Preferred Approach
DOM (with internal C representation and XPath)
Given a node, several operations
xmlName() - element name (with or without namespace prefix)
xmlNamespace()
xmlAttrs() - all attributes
xmlGetAttr() - particular value
xmlValue() - get text content.
xmlChildren(), node[[ i ]], node [[ "el-name" ]]
xmlSApply()
xmlNamespaceDefinitions()
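As a rough illustration of these accessors, here is a minimal sketch using the XML package on a tiny, made-up document. The element and attribute names (Article, Title, Author, lang, id) are invented for the example and are not taken from any real data set.

```r
library(XML)

# A tiny invented document, parsed into the internal (C-level) DOM.
txt <- '<Article lang="en" id="a1"><Title>XML in R</Title><Author>A. Smith</Author><Author>B. Jones</Author></Article>'
doc <- xmlParse(txt, asText = TRUE)
art <- xmlRoot(doc)

node_name  <- xmlName(art)                  # element name of the root
all_attrs  <- xmlAttrs(art)                 # named character vector of attributes
lang       <- xmlGetAttr(art, "lang")       # one particular attribute value
title      <- xmlValue(art[["Title"]])      # text content of the <Title> child
kid_names  <- names(xmlChildren(art))       # names of the child elements
second_kid <- art[[2]]                      # child selected by position
kid_values <- xmlSApply(art, xmlValue)      # apply xmlValue to every child
```

Indexing by name (art[["Title"]]) returns the first matching child, so repeated elements such as the authors are easier to reach through xmlChildren() or xmlSApply().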
Examples
PubMed articles/abstracts
European Bank exchange rates
itunes - CDs, tracks, play lists, ...
PMML - predictive modeling markup language
CIS - Current Index of Statistics/Google Scholar
Google - Page Rank, Natural Language Processing
Wikipedia - History of changes, ....
SBML - Systems biology markup language
Books - Docbook
SOAP - eBay, KEGG, ...
Yahoo Geo/places - given name, get most likely location
PubMed
Professionally archived collection of "medically-related"
articles.
Vast collection of information, including
article abstracts
submission, acceptance and publication date
authors
...
PubMed
We'll use a sample PubMed example article for
simplicity.
Can get very large, rich <ArticleSet> with many articles
via an HTTP query done from within R/XML package
directly.
Take a look at the data, see what is available or read
the documentation
Or explore the contents.
http://www.ncbi.nlm.nih.gov/books/bv.fcgi?rid=helppubmed.section.publisherhelp.XML_Tag_Descriptions
ArticleTitle    ELocationID    AuthorList    History
"ArticleTitle"  "ELocationID"  "AuthorList"  "History"
FirstPage       ELocationID    GroupList     Abstract
"FirstPage"     "ELocationID"  "GroupList"   "Abstract"
names(art)
[1] "Author" "Author" "Author" "Author" "Author" "Author"
names(art[[1]])
[1] "FirstName"   "MiddleName"  "LastName"    "Suffix"
[5] "Affiliation"
Pubmed Dates
The <History> element holds the dates an article was
received, accepted and published ahead of print (aheadofprint).
May want to look at the publication lag (i.e. the time from
receipt to publication) for different journals.
So get these dates for all the articles.
<History>
<PubDate PubStatus="received">
<Year>...</Year><Month>06</Month><Day>15</Day>
</PubDate>
<PubDate PubStatus="accepted">
<Year>.....</Day>
</PubDate>
...
</History>
XPath
XPath is a language for expressing such node subsetting
with rich semantics for identifying nodes
by name
with specific attributes present
with attributes with particular values
with parents, ancestors, children
XPath = YALTL (Yet another language to learn)
XPath language
/node - top-level node
//node - node at any level
node[@attr-name] - node that has an attribute
named "attr-name"
node[@attr-name='bob'] - node that has attribute
named attr-name with value 'bob'
node/@x - value of attribute x in node with such
attr.
Returns a collection of nodes, attributes, etc.
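The patterns above can be tried out with getNodeSet() and xpathSApply() from the XML package. This is a small sketch on an invented <ArticleSet>; the dates are made up, and the element names follow the <History> fragment shown earlier rather than a verified PubMed record.

```r
library(XML)

# Invented sample data shaped like the <History> fragment in these slides.
txt <- '<ArticleSet><Article><History>
  <PubDate PubStatus="received"><Year>2008</Year><Month>06</Month><Day>15</Day></PubDate>
  <PubDate PubStatus="accepted"><Year>2008</Year><Month>09</Month><Day>02</Day></PubDate>
</History></Article></ArticleSet>'
doc <- xmlParse(txt, asText = TRUE)

top         <- getNodeSet(doc, "/ArticleSet")                        # top-level node
all_dates   <- getNodeSet(doc, "//PubDate")                          # node at any level
with_status <- getNodeSet(doc, "//PubDate[@PubStatus]")              # has the attribute
accepted    <- getNodeSet(doc, "//PubDate[@PubStatus='accepted']")   # attribute has this value
status      <- xpathSApply(doc, "//PubDate/@PubStatus")              # attribute values
months      <- xpathSApply(doc, "//PubDate/Month", xmlValue)         # text of matching nodes
```

getNodeSet() returns a list of nodes, while xpathSApply() applies a function to each match and simplifies the result, which is usually what is wanted when pulling values into a vector.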
So I use
library(RCurl)
reply = getForm("http://www.zillow.com/webservice/GetSearchResults.htm",
                'zws-id' = "AB-XXXXXXXXXXX_10312q",
                address = "1093 Zuchini Way",
                citystatezip = "Berkeley, CA, 94212")
reply is text from the Web server containing XML
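Since the reply is just a character string, it can be handed to the XML parser with asText = TRUE. The sketch below uses a hard-coded stand-in for the reply so that it runs without a network connection or an API key. The element names (searchresults, result, zpid, amount) and the values are hypothetical and should not be read as the actual layout of the web service's response.

```r
library(XML)

# Hypothetical stand-in for the text returned by getForm().
reply <- paste0(
  "<searchresults><response><result>",
  "<zpid>12345</zpid>",
  "<amount currency=\"USD\">500000</amount>",
  "</result></response></searchresults>")

# Parse the string rather than a file or URL.
doc <- xmlParse(reply, asText = TRUE)

# Pull individual values out with XPath.
amount   <- as.numeric(xpathSApply(doc, "//result/amount", xmlValue))
currency <- xpathSApply(doc, "//result/amount/@currency")
```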
http://www.princeton.edu/~rvdb/JAVA/election2004/
read.table ?
Within the noise/ads, look for a table whose first cell is
"County"
Actually a
<td><b>County</b></td>
How do we know this? Look at one or two HTML files
out of the 50. Verify the rest.
Then, given the associated <table> element,
we can extract the values row by row and get a
data.frame/....
XPath expression
<table>........<tr>
<td class="notch_medium" width="153"><b>County</b></td>
<td class="notch_medium" align="Right" width="65"><b>Total Precincts</b></td>
<td class="notch_medium" align="Right" width="70"><b>Precincts Reporting</b></td>
<td class="notch_medium" align="Right" width="60"><b>Bush</b></td>
<td class="notch_medium" align="Right" width="60"><b>Kerry</b></td>
<td class="notch_medium" align="Right" width="60"><b>Nader</b></td>
</tr>........
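One XPath expression that would pick out such a table is sketched below. It is an assumption about how the query could be written, not necessarily the expression used in the original analysis. The HTML is a cut-down, invented page with one "noise" table and one results table.

```r
library(XML)

# Invented page: an unrelated table followed by the table of interest.
html <- paste0(
  "<html><body>",
  "<table><tr><td>Advertisement</td></tr></table>",
  "<table>",
  "<tr><td class=\"notch_medium\"><b>County</b></td>",
  "<td class=\"notch_medium\"><b>Bush</b></td>",
  "<td class=\"notch_medium\"><b>Kerry</b></td></tr>",
  "<tr><td>Alpha</td><td>10</td><td>20</td></tr>",
  "</table>",
  "</body></html>")
doc <- htmlParse(html, asText = TRUE)

# Tables whose first row's first cell contains <b>County</b>.
v <- getNodeSet(doc, "//table[.//tr[1]/td[1]/b = 'County']")
first_cell <- xmlValue(getNodeSet(v[[1]], ".//td")[[1]])
```

Using .//tr rather than tr inside the predicate keeps the match working whether or not the rows are wrapped in a <tbody> element.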
Now that we have the <table> node, read the data into
an R data structure
rows = xmlApply(v[[1]],
                function(x)
                  xmlSApply(x, xmlValue))
i.e. for each row, loop over the <td> and get its value.
Got some "\n\t\t\t" and last row is "Updated...."
first row is the County, Total Precincts, ....
So discard the rows without 7 entries
then remove the 7th entry ("\n\t\t\t")
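The clean-up steps just described could look roughly like this. The table is an invented stand-in for one state's page (the county names and counts are not real results), and the column handling assumes every kept row has exactly 7 cells with the junk cell last.

```r
library(XML)

# Invented table: header row, two data rows, and a trailing "Updated" row.
html <- paste0(
  "<table>",
  "<tr><td><b>County</b></td><td><b>Total Precincts</b></td>",
  "<td><b>Precincts Reporting</b></td><td><b>Bush</b></td>",
  "<td><b>Kerry</b></td><td><b>Nader</b></td><td>\n\t\t\t</td></tr>",
  "<tr><td>Alpha</td><td>10</td><td>10</td><td>120</td><td>95</td><td>3</td><td>\n\t\t\t</td></tr>",
  "<tr><td>Beta</td><td>4</td><td>4</td><td>40</td><td>61</td><td>1</td><td>\n\t\t\t</td></tr>",
  "<tr><td>Updated ...</td></tr>",
  "</table>")
doc <- htmlParse(html, asText = TRUE)
v <- getNodeSet(doc, "//table")

# For each row, collect the text of each cell.
rows <- xmlApply(v[[1]], function(x) xmlSApply(x, xmlValue))

# Keep only rows with 7 entries, then drop the 7th (whitespace) entry.
rows <- rows[sapply(rows, length) == 7]
rows <- lapply(rows, function(r) unname(r[-7]))

# First remaining row is the header; the rest are data.
header <- rows[[1]]
body <- do.call(rbind, unname(rows[-1]))
counties <- as.data.frame(body, stringsAsFactors = FALSE)
names(counties) <- header

# Convert the count columns from character to numeric.
num_cols <- setdiff(header, "County")
counties[num_cols] <- lapply(counties[num_cols], as.numeric)
```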
Learning XPath
XPath is another language
part of the XML technologies
XInclude
XPointer
XSL
XQuery
Can't we extract the data from the XML tree/DOM
(Document Object Model) without it, using just R
programming? Yes.
doc = xmlTreeParse("pubmed.xml")
Now have a tree in R
recursive - list of children which are lists of children
or recursive tree of C-level nodes
Write an R function which "visits" each node and
extracts and stores the data from those nodes that are
relevant
e.g. the <Author>, <PubDate> nodes
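A minimal sketch of such a "visitor" follows. The function name collect and the sample document are invented for illustration. It walks the R-level tree returned by xmlTreeParse() and keeps every node whose name is in a set of wanted names.

```r
library(XML)

# Invented sample standing in for pubmed.xml.
txt <- paste0(
  "<ArticleSet><Article>",
  "<AuthorList>",
  "<Author><LastName>Smith</LastName></Author>",
  "<Author><LastName>Jones</LastName></Author>",
  "</AuthorList>",
  "<History><PubDate PubStatus=\"received\"><Year>2008</Year></PubDate></History>",
  "</Article></ArticleSet>")
doc <- xmlTreeParse(txt, asText = TRUE)   # R-level tree: lists of lists of children
root <- xmlRoot(doc)

# Hypothetical helper: visit every node, keep those with a wanted name.
collect <- function(node, wanted) {
  found <- list()
  if (xmlName(node) %in% wanted)
    found <- list(node)
  for (kid in xmlChildren(node))
    found <- c(found, collect(kid, wanted))
  found
}

hits <- collect(root, c("Author", "PubDate"))
hit_names <- sapply(hits, xmlName)
first_author <- xmlValue(hits[[1]][["LastName"]])
```

The recursion visits nodes in document order, so the results come back in the order the elements appear in the file.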
Handlers
Alternative approach
as we read the XML tree into R and convert it to
a list of lists of children ...
when converting each C-level node, see if the caller has a
function registered corresponding to the name/type
of the node
if so, call it and allow it to extract and store the
data.
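With the XML package, handler functions of this kind can be passed to xmlTreeParse() through its handlers argument, as a list named by element. The sketch below is one possible arrangement on an invented document. The wrapper function author_names is hypothetical and exists only to give the handler a place to store its results.

```r
library(XML)

# Invented sample document.
txt <- paste0(
  "<ArticleSet><Article><AuthorList>",
  "<Author><LastName>Smith</LastName></Author>",
  "<Author><LastName>Jones</LastName></Author>",
  "</AuthorList></Article></ArticleSet>")

# Hypothetical wrapper: the handler stores what it sees in 'authors'.
author_names <- function(txt) {
  authors <- character()
  handlers <- list(
    # Called each time an <Author> node is converted to an R object.
    Author = function(node, ...) {
      authors <<- c(authors, xmlValue(node[["LastName"]]))
      node   # return the node so it stays in the tree
    })
  xmlTreeParse(txt, asText = TRUE, handlers = handlers)
  authors
}

found <- author_names(txt)
```

Keeping the state inside a function closure avoids writing to a global variable and lets the same handlers be reused across several documents.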
Efficient Parsing
Problem with previous styles is we have the entire tree
in memory and then extract the data
=> 2 times the data in memory at the end
Bad news for large datasets
All Wikipedia pages - 11 gigabytes
Need to read the XML as it passes as a stream,
extracting and storing the contents
and discarding the XML.
SAX parsing - "Simple API for XML"!
xmlEventParse(content,
              list(startElement = function(name, attrs, ...) ....,
                   endElement = function(name, ...) ...,
                   text = function(x) ...,
                   comment = function(x) ... , ....))
Whenever XML parser sees start/end/text/comment
node, calls R function which maintains state.
Awkward to write, but there to handle very large data.
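To make the idea of "an R function which maintains state" concrete, here is a small sketch that collects the text of every <LastName> element without ever building a tree. The function name last_names_sax and the sample document are invented. The text handler appends to a buffer because a parser may deliver one element's text in several pieces.

```r
library(XML)

# Invented sample document.
txt <- paste0(
  "<ArticleSet><Article><AuthorList>",
  "<Author><LastName>Smith</LastName><FirstName>Ann</FirstName></Author>",
  "<Author><LastName>Jones</LastName><FirstName>Bob</FirstName></Author>",
  "</AuthorList></Article></ArticleSet>")

# Hypothetical helper: state lives in the enclosing function.
last_names_sax <- function(txt) {
  inside <- FALSE         # are we currently inside <LastName>?
  buffer <- ""            # text collected for the current element
  result <- character()   # all last names seen so far

  handlers <- list(
    startElement = function(name, attrs, ...) {
      if (name == "LastName") {
        inside <<- TRUE
        buffer <<- ""
      }
    },
    text = function(x, ...) {
      if (inside) buffer <<- paste0(buffer, x)
    },
    endElement = function(name, ...) {
      if (name == "LastName") {
        result <<- c(result, buffer)
        inside <<- FALSE
      }
    })

  xmlEventParse(txt, handlers = handlers, asText = TRUE)
  result
}

sax_names <- last_names_sax(txt)
```

Only the extracted values are kept in memory, which is the point of the event-driven style for very large inputs.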
Schema....
Just like a database has a schema describing the
characteristics of columns in all tables within a
database, XML documents often have an XML Schema
(or Document Type Definition - DTD) describing the
"template" tree and what elements can/must go where,
attributes, etc.
The XML Schema is written in XML, so we can read it!
And we can actually create R data types to represent
the same elements in XML directly in R.
So we can automate some of the reading of XML
elements into useful, meaningful R objects,
though these are harder to programmatically flatten into data frames.
RCurl
xmlTreeParse() & xmlEventParse() can read from files,
compressed files, URLs, direct text - but limited
connection support.
RCurl package provides very rich ways that extend R's
ability to access content from URLs, etc. over the
Internet.
HTTPS - encrypted/secure HTTP
passwords/authentication
efficient, persistent connections
multiplexing
different protocols
Pass results to XML parser or other consumers.
Exceptions/Conditions