XML in R

This document discusses parsing and extracting information from XML documents using R. It provides an overview of the DOM and SAX parsing models in R and describes how to use the XML package to query nodes, extract information, and process XML files. Examples discussed include parsing PubMed articles and abstracts, scraping real estate data from Zillow, and extracting election results from HTML files.

Uploaded by

xAudiophile

Extracting data from XML

Wednesday
DTL

Parsing - XML package

2 basic models - DOM & SAX
Document Object Model (DOM)
Tree stored internally as C-level nodes, or as regular R objects
Use XPath to query nodes of interest, extract info.
Write recursive functions to "visit" nodes, extracting information as they descend the tree
Or extract information to R data structures via handler functions that are called for particular XML elements by matching the XML element name
Simple API for XML (SAX)
For processing very large XML files with a low-level state machine implemented via R handler functions - closures.

Preferred Approach
DOM (with internal C representation and XPath)
Given a node, several operations are available
xmlName() - element name (with or without namespace prefix)
xmlNamespace()
xmlAttrs() - all attributes
xmlGetAttr() - value of a particular attribute
xmlValue() - get text content.
xmlChildren(), node[[ i ]], node [[ "el-name" ]] - child nodes, by position or by element name
xmlSApply()
xmlNamespaceDefinitions()
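
A minimal sketch of these accessors in use. The author list below is a made-up snippet rather than real PubMed data, and the "complete" and "id" attributes are invented for illustration.

```r
library(XML)

# Hypothetical two-author snippet, invented for illustration.
txt <- paste0(
  '<AuthorList complete="yes">',
  '<Author id="a1"><FirstName>Ann</FirstName><LastName>Lee</LastName></Author>',
  '<Author id="a2"><FirstName>Bo</FirstName><LastName>Chen</LastName></Author>',
  '</AuthorList>')

doc  <- xmlParse(txt, asText = TRUE)   # DOM with internal C-level nodes
root <- xmlRoot(doc)

node_name  <- xmlName(root)                  # element name
attrs      <- xmlAttrs(root)                 # all attributes, as a named character vector
complete   <- xmlGetAttr(root, "complete")   # one attribute value
first      <- root[[1]]                      # child by position
last_name  <- xmlValue(first[["LastName"]])  # child by element name, then its text
n_children <- length(xmlChildren(root))      # number of child nodes
ids        <- xmlSApply(root, xmlGetAttr, "id")  # apply a function to each child
```

xmlSApply() here plays the role of sapply() over the children of a node, which is why it shows up again when the author fields are extracted.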

Examples

Scraping HTML - (you name it!)
zillow - house price estimates
PubMed articles/abstracts
European Bank exchange rates
itunes - CDs, tracks, play lists, ...
PMML - predictive modeling markup language
CIS - Current Index of Statistics/Google Scholar
Google - Page Rank, Natural Language Processing
Wikipedia - History of changes, ....
SBML - Systems biology markup language
Books - Docbook
SOAP - eBay, KEGG, ...
Yahoo Geo/places - given name, get most likely location

PubMed
Professionally archived collection of "medically-related"
articles.
Vast collection of information, including
article abstracts
submission, acceptance and publication date
authors
...

PubMed
We'll use a sample PubMed example article for simplicity.
Can get a very large, rich <ArticleSet> with many articles via an HTTP query done directly from within R with the XML package.
Take a look at the data to see what is available, read the documentation, or explore the contents.
http://www.ncbi.nlm.nih.gov/books/bv.fcgi?rid=helppubmed.section.publisherhelp.XML_Tag_Descriptions

doc = xmlTreeParse("pubmed.xml", useInternal = TRUE)

top = xmlRoot(doc)
xmlName(top)
[1] "ArticleSet"
names(top) - child nodes of this root
[1] "Article" "Article"
- so 2 articles in this set.

Let's fetch the author list for each article.

Do it first for just one and then use "apply" to iterate

names( top[[ 1 ]] )
        Journal    ArticleTitle       FirstPage
      "Journal"  "ArticleTitle"     "FirstPage"
       LastPage     ELocationID     ELocationID
     "LastPage"   "ELocationID"   "ELocationID"
       Language      AuthorList       GroupList
     "Language"    "AuthorList"     "GroupList"
  ArticleIdList         History        Abstract
"ArticleIdList"       "History"      "Abstract"
     ObjectList
   "ObjectList"

AuthorList is what we want

art = top[[ 1 ]] [[ "AuthorList" ]]

names(art)
[1] "Author" "Author" "Author" "Author" "Author" "Author"
names(art[[1]])
[1] "FirstName"   "MiddleName"  "LastName"    "Suffix"
[5] "Affiliation"

So how do we get these values, e.g. to put in a data frame?
Each element is a node with text content.
So loop over the nodes and get the content as a string

xmlSApply(art[[1]], xmlValue)
To do this for all authors of the article
xmlSApply(art, function(x) xmlSApply(x, xmlValue))
How do we deal with the different sets of fields across authors?
e.g. some have First, Middle, Last, Affiliation; others have only CollectiveName
It becomes a data representation/analysis question from here.
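
One possible answer, as a sketch only: fix the set of columns in advance and fill absent fields with NA. The author list is invented, and authorRow() and the fields vector are hypothetical names introduced for this example.

```r
library(XML)

# Hypothetical author list: one personal author, one collective author.
txt <- paste0(
  "<AuthorList>",
  "<Author><FirstName>Ann</FirstName><LastName>Lee</LastName>",
  "<Affiliation>UC Davis</Affiliation></Author>",
  "<Author><CollectiveName>The XML Study Group</CollectiveName></Author>",
  "</AuthorList>")
art <- xmlRoot(xmlParse(txt, asText = TRUE))

# The columns we want, whether or not a given author has them.
fields <- c("FirstName", "MiddleName", "LastName", "Suffix",
            "Affiliation", "CollectiveName")

# Hypothetical helper: one character vector per author, NA for absent fields.
authorRow <- function(author) {
  vals <- xmlSApply(author, xmlValue)   # named by child element
  setNames(vals[fields], fields)        # indexing by a missing name gives NA
}

rows    <- lapply(xmlChildren(art), authorRow)
authors <- as.data.frame(do.call(rbind, unname(rows)),
                         stringsAsFactors = FALSE)
```

This keeps one row per author and leaves the question of how to analyse personal versus collective authors to later steps.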

Pubmed Dates
In the <History> element, have dates for
received, accepted, aheadofprint
May want to look at the publication lag (i.e. time from received to publication) for different journals.
So get these dates for all the articles
<History>
<PubDate PubStatus="received">
<Year>...</Year><Month>06</Month><Day>15</Day>
</PubDate>
<PubDate PubStatus="accepted">
<Year>...</Year><Month>...</Month><Day>...</Day>
</PubDate>
...
</History>

Find the element PubDate within History which has an attribute whose value is "received"
For a single article, top[[1]][["History"]] holds all 3 PubDate elements as its children.
But what if we want to access the 'received' dates for all the articles in a single operation, then the accepted, ...
Need a language to identify nodes with a particular characteristic/condition

XPath
XPath is a language for expressing such node subsetting
with rich semantics for identifying nodes
by name
with specific attributes present
with attributes with particular values
with parents, ancestors, children
XPath = YALTL (Yet another language to learn)

XPath language
/node - top-level node
//node - node at any level
node[@attr-name] - node that has an attribute named "attr-name"
node[@attr-name='bob'] - node that has an attribute named attr-name with value 'bob'
node/@x - value of attribute x in each node that has such an attribute
Returns a collection of nodes, attributes, etc.
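
Each pattern above can be tried on a tiny document. This is a sketch: the element and attribute names (set, item, group, owner, x) are invented to exercise one pattern per line.

```r
library(XML)

# Hypothetical miniature document, invented to exercise each pattern.
txt <- paste0(
  "<set>",
  "<item owner='bob' x='1'>first</item>",
  "<item x='2'>second</item>",
  "<group><item owner='amy'>third</item></group>",
  "</set>")
doc <- xmlParse(txt, asText = TRUE)

top_level <- getNodeSet(doc, "/set")                   # /node
any_level <- getNodeSet(doc, "//item")                 # //node
with_attr <- getNodeSet(doc, "//item[@owner]")         # node[@attr-name]
bobs      <- getNodeSet(doc, "//item[@owner='bob']")   # node[@attr-name='bob']
x_values  <- as.character(xpathSApply(doc, "//item/@x"))  # node/@x
```

getNodeSet() returns a list of nodes, so the usual list tools (length(), [[ ]], lapply()) apply to the result.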

Let's find the date when the articles were received

nodes = getNodeSet(top, "//History/PubDate[@PubStatus='received']")
2 nodes - 1 per article
Extract year, month, day
lapply(nodes, function(x) xmlSApply(x, xmlValue))
Just as easy to get the "accepted" and "aheadofprint" dates: change the attribute value in the XPath expression
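
A sketch of how the publication-lag question could be set up from here. The two articles and their dates are invented, pubDates() is a hypothetical helper, and the code assumes every article carries a PubDate for each status asked for.

```r
library(XML)

# Hypothetical two-article set; the dates are invented for illustration.
txt <- paste0(
  "<ArticleSet>",
  "<Article><History>",
  "<PubDate PubStatus='received'><Year>2008</Year><Month>06</Month><Day>15</Day></PubDate>",
  "<PubDate PubStatus='accepted'><Year>2008</Year><Month>07</Month><Day>01</Day></PubDate>",
  "</History></Article>",
  "<Article><History>",
  "<PubDate PubStatus='received'><Year>2008</Year><Month>01</Month><Day>10</Day></PubDate>",
  "<PubDate PubStatus='accepted'><Year>2008</Year><Month>03</Month><Day>10</Day></PubDate>",
  "</History></Article>",
  "</ArticleSet>")
top <- xmlRoot(xmlParse(txt, asText = TRUE))

# Hypothetical helper: one Date per article for a given PubStatus value.
pubDates <- function(top, status) {
  xp    <- sprintf("//History/PubDate[@PubStatus='%s']", status)
  parts <- lapply(getNodeSet(top, xp), function(x) xmlSApply(x, xmlValue))
  as.Date(sapply(parts, function(p)
    paste(p[["Year"]], p[["Month"]], p[["Day"]], sep = "-")))
}

# Days from received to accepted, one value per article.
lagDays <- as.numeric(pubDates(top, "accepted") - pubDates(top, "received"))
```

If some articles lack one of the dates, the two vectors no longer line up, and the extraction would have to be done article by article instead.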

Text mining of abstract

Content of abstract as words
abstracts = xpathApply(top, "//Abstract", xmlValue)
Now, break up into words, stem the words, remove the stop-words
# split each abstract into a character vector of words
abstractWords = lapply(abstracts, function(x) strsplit(x, "[[:space:]]+")[[1]])
library(Rstem)
# wordStem() takes a character vector of words
abstractWords = lapply(abstractWords, wordStem)
Remove stop words (stopWords is assumed to be a character vector of words to drop)
lapply(abstractWords, function(x) x[!(x %in% stopWords)])

Zillow - house prices

Thanks to Roger, yesterday evening I found the Zillow XML API (Application Programming Interface)
Can register with Zillow, make queries to find estimated house prices for a given house, comparables, demographics, ...
Put address, city-state-zip & Zillow login in the URL request
Can pass this URL directly to xmlTreeParse()
"http://www.zillow.com/...../...?zws-id=...&address=1029%20Bob's%20Way&citystatezip=Berkeley"
But spaces are problematic, as are other characters, and must be escaped.

So I use library(RCurl), which escapes the parameter values
reply = getForm("http://www.zillow.com/webservice/GetSearchResults.htm",
'zws-id' = "AB-XXXXXXXXXXX_10312q",
address = "1093 Zuchini Way",
citystatezip = "Berkeley, CA, 94212")
reply is text from the Web server containing XML

<?xml version=\"1.0\" encoding=\"utf-8\"?>\n<SearchResults:searchresults
xsi:schemaLocation=\"http://www.zillow.com/static/xsd/SearchResults.xsd /vstatic/71a179109333d30cfb3b2de866d9add9/static/xsd/SearchResults.xsd\"
xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\"
xmlns:SearchResults=\"http://www.zillow.com/static/xsd/SearchResults.xsd\">\n\n
<request>\n
<address>112 Bob's Way Avenue</address>\n
<citystatezip>Berkeley, CA, 94212</citystatezip>\n
</request>\n
\n
<message>\n
<text>Request successfully processed</text>\n
<code>0</code>\n\t\t\n
</message>\n\n
\n
<response>\n\t\t<results>\n\t\t\t\n\t\t\t<result>\n\t\t\t\t\t<zpid>24842792</zpid>\n\t<links>\n\t\t
<homedetails>http://www.zillow.com/HomeDetails.htm?city=Berkeley&state=CA&zprop=24842792&s_cid=Pa-Cv-X1-CLz1carc3c49ms_htxqb&partner=X1-CLz1carc3c49ms_htxqb</homedetails>\n\t\t
<graphsanddata>http://www.zillow.com/Charts.htm?chartDuration=5years&zpid=24842792&cbt=8965965681136447050%7E1%7E43-17yrvL7nIj-Y5pqbsoqb_nh1QW4CVIhubJRAXIOkwbPosbEGChw**&s_cid=Pa-Cv-X1-CLz1carc3c49ms_htxqb&partner=X1-CLz1carc3c49ms_htxqb</graphsanddata>\n\t\t
<mapthishome>http://www.zillow.com/search/RealEstateSearch.htm?zpid=24842792#src=url&s_cid=Pa-Cv-X1-CLz1carc3c49ms_htxqb&partner=X1-CLz1carc3c49ms_htxqb</mapthishome>\n\t\t
<myestimator>http://www.zillow.com/myestimator/Edit.htm?zprop=24842792&s_cid=Pa-Cv-X1-CLz1carc3c49ms_htxqb&partner=X1-CLz1carc3c49ms_htxqb</myestimator>\n\t\t
<myzestimator deprecated=\"true\">http://www.zillow.com/myestimator/Edit.htm?zprop=24842792&s_cid=Pa-Cv-X1-CLz1carc3c49ms_htxqb&partner=X1-CLz1carc3c49ms_htxqb</myzestimator>\n\t</links>\n\t
<address>\n\t\t<street>1292 Bob's way</street>\n\t\t<zipcode>94</zipcode>\n\t\t<city>Berkeley</city>\n\t\t<state>CA</state>\n\t\t
<latitude>34.882544</latitude>\n\t\t<longitude>-123.11111</longitude>\n\t</address>\n\t\n\t\n\t
<zestimate>\n\t\t<amount currency=\"USD\">803000</amount>\n\t\t<last-updated>07/14/2008</last-updated>\n\t\t\n\t\t\n\t\t\t
<oneWeekChange deprecated=\"true\"></oneWeekChange>\n\t\t\n\t\t\n\t\t\t
<valueChange currency=\"USD\" duration=\"31\">-33500</valueChange>\n\t\t\n\t\t\n\t\t
<valuationRange>\n\t\t\t<low currency=\"USD\">650430</low>\n\t\t\t

<?xml version="1.0" encoding="utf-8"?>

<SearchResults:searchresults
xsi:schemaLocation="http://www.zillow.com/static/xsd/SearchResults.xsd /vstatic/71a179109333d30cfb3b2de866d9add9/static/xsd/SearchResults.xsd"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:SearchResults="http://www.zillow.com/static/xsd/SearchResults.xsd">
<request>
<address>123 Bob's Way</address>
<citystatezip>Berkeley, CA, 94217</citystatezip>
</request>
<message>
<text>Request successfully processed</text>
<code>0</code>
</message>
<response>
<results>
<result>
<zpid>1111111</zpid>
<links>

Processing the result

We want to get the value of the element
<amount>803000</amount>
doc = xmlTreeParse(reply, asText = TRUE, useInternal = TRUE)
xmlValue(doc[["//amount"]])
[1] "803000"
Other information too
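
A sketch of pulling out that other information. The reply text below is a heavily trimmed, hypothetical stand-in for a real Zillow response (no namespace prefix on the root, most elements dropped), so that the example runs without a Zillow login.

```r
library(XML)

# Hypothetical, trimmed stand-in for the reply text from the Web server.
reply <- paste0(
  '<?xml version="1.0" encoding="utf-8"?>',
  '<searchresults><response><results><result>',
  '<zpid>1111111</zpid>',
  '<zestimate>',
  '<amount currency="USD">803000</amount>',
  '<valuationRange><low currency="USD">650430</low></valuationRange>',
  '</zestimate>',
  '</result></results></response></searchresults>')

doc <- xmlParse(reply, asText = TRUE)

amountNode <- getNodeSet(doc, "//zestimate/amount")[[1]]
estimate   <- as.numeric(xmlValue(amountNode))      # text content as a number
currency   <- xmlGetAttr(amountNode, "currency")    # attribute on the same node
low        <- as.numeric(xpathSApply(doc, "//valuationRange/low", xmlValue))
```

Anchoring the XPath at zestimate rather than using a bare //amount guards against picking up an amount element from some other part of the reply.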

2004 Election Results

http://www.princeton.edu/~rvdb/JAVA/election2004/

Where are the data?

Within days of the election ?

USA Today, CNN, ...
http://www.usatoday.com/news/politicselections/vote2004/results.htm
By state, by county, by senate/house, ...

read.table? Not directly: the data sit inside an HTML page.
Within the noise/ads, look for a table whose first cell is "County"
Actually a
<td><b>County</b></td>
How do we know this? Look at one or two HTML files out of the 50. Verify the rest.
Then, given the associated <table> element, we can extract the values row by row and get a data.frame/....

XPath expression
<table>........<tr>
<td class="notch_medium" width="153"><b>County</b></td>
<td class="notch_medium" align="Right" width="65"><b>Total Precincts</b></td>
<td class="notch_medium" align="Right" width="70"><b>Precincts Reporting</b></td>
<td class="notch_medium" align="Right" width="60"><b>Bush</b></td>
<td class="notch_medium" align="Right" width="60"><b>Kerry</b></td>
<td class="notch_medium" align="Right" width="60"><b>Nader</b></td>
</tr>........

Little bit of trial and error

v = getNodeSet(nj, "//table[tr/td/b/text()='Total Precincts']")
Here nj is assumed to be the parsed HTML document for one state's results page.
Could be more specific, e.g. tr[1] - first row

Now that we have the <table> node, read the data into an R data structure
rows = xmlApply(v[[1]],
function(x)
xmlSApply(x, xmlValue))
i.e. for each row, loop over the <td> elements and get their values.
Got some "\n\t\t\t" and the last row is "Updated...."
The first row is the header: County, Total Precincts, ....
So discard the rows without 7 entries
then remove the 7th entry ("\n\t\t\t")

v = getNodeSet(nj, "//table[tr/td/b/text()='Total Precincts']")

rows = xmlApply(v[[1]], function(x) xmlSApply(x, xmlValue))
# only the rows with 7 elements
rows = rows[sapply(rows, length) == 7]
# Remove the 7th element, and transpose to put back into
# counties as rows; precincts, candidates, ... as columns.
# So get a character matrix with 6 columns and one row per
# county (plus the header row).
rows = t(sapply(rows, "[", -7))
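
The character matrix still has to become a data frame with numeric columns. A sketch of that last step on a made-up table: the counties and counts are invented, the table has 4 columns instead of 7, and the comma thousands separators are an assumption about how the pages format numbers.

```r
library(XML)

# Hypothetical two-county table in the same shape as the results pages.
html <- paste0(
  "<html><body><table>",
  "<tr><td><b>County</b></td><td><b>Total Precincts</b></td>",
  "<td><b>Bush</b></td><td><b>Kerry</b></td></tr>",
  "<tr><td>Atlantic</td><td>10</td><td>1,200</td><td>1,500</td></tr>",
  "<tr><td>Bergen</td><td>20</td><td>3,000</td><td>2,800</td></tr>",
  "<tr><td>Updated 11/03</td></tr>",
  "</table></body></html>")
nj <- htmlParse(html, asText = TRUE)

v    <- getNodeSet(nj, "//table[tr/td/b/text()='Total Precincts']")
rows <- xmlApply(v[[1]], function(x) xmlSApply(x, xmlValue))
rows <- rows[sapply(rows, length) == 4]        # drops the "Updated" row

# Character matrix: header in row 1, one county per remaining row.
cells <- do.call(rbind, lapply(unname(rows), unname))

results <- as.data.frame(cells[-1, , drop = FALSE], stringsAsFactors = FALSE)
names(results) <- cells[1, ]

# Everything except County is a count; strip commas before converting.
num <- setdiff(names(results), "County")
results[num] <- lapply(results[num], function(x) as.numeric(gsub(",", "", x)))
```

The header row becomes the column names rather than a row of data, which is the piece the matrix version leaves undone.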

Learning XPath
XPath is another language
part of the XML technologies, along with
XInclude
XPointer
XSL
XQuery
Can't we extract the data from the XML tree/DOM (Document Object Model) without it and just use R programming? Yes

doc = xmlTreeParse("pubmed.xml")
Now have a tree in R
recursive - list of children which are lists of children
or recursive tree of C-level nodes
Write an R function which "visits" each node and
extracts and stores the data from those nodes that are
relevant
e.g. the <Author>, <PubDate> nodes

Recursive functions are sometimes difficult to write

Have to store the results "globally"/non-locally
leads to closures/lexical scoping - "advanced R"
Have to traverse the entire tree via R code - SLOW!
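
A sketch of such a recursive visitor, with the results kept in a closure rather than a global variable. The document is invented, and makeCollector() is a hypothetical name for this example.

```r
library(XML)

# Hypothetical article set, invented for illustration.
txt <- paste0(
  "<ArticleSet><Article><AuthorList>",
  "<Author><LastName>Lee</LastName></Author>",
  "<Author><LastName>Chen</LastName></Author>",
  "</AuthorList></Article>",
  "<Article><AuthorList>",
  "<Author><LastName>Diaz</LastName></Author>",
  "</AuthorList></Article></ArticleSet>")
doc <- xmlTreeParse(txt, asText = TRUE)   # tree of regular R objects
top <- xmlRoot(doc)

# Hypothetical collector: 'found' lives in the closure's environment,
# so visit() can append to it with <<- on every recursive call.
makeCollector <- function(target) {
  found <- character()
  visit <- function(node) {
    if (xmlName(node) == target)
      found <<- c(found, xmlValue(node))
    for (kid in xmlChildren(node))   # descend into the children
      visit(kid)
    invisible(NULL)
  }
  list(visit = visit, values = function() found)
}

collector <- makeCollector("LastName")
collector$visit(top)
lastNames <- collector$values()
```

Every node is touched from R code, including text nodes, which is where the slowness on large trees comes from.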

Handlers
Alternative approach
when we read the XML tree into R and convert it to a list of lists of children ...
as each C-level node is converted, see if the caller has registered a function corresponding to the name/type of the node
if so, call it and allow it to extract and store the data.
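
A sketch of the handler style, assuming xmlTreeParse() calls a handler whose name matches the element name once that node has been converted. The document is invented and makeHandlers() is a hypothetical name.

```r
library(XML)

# Hypothetical article set, invented for illustration.
txt <- paste0(
  "<ArticleSet><Article><AuthorList>",
  "<Author><LastName>Lee</LastName></Author>",
  "<Author><LastName>Chen</LastName></Author>",
  "</AuthorList></Article>",
  "<Article><AuthorList>",
  "<Author><LastName>Diaz</LastName></Author>",
  "</AuthorList></Article></ArticleSet>")

# Hypothetical factory: the handler and the accessor share 'found'.
makeHandlers <- function() {
  found <- character()
  list(
    handlers = list(
      # Called for each <LastName> node as the tree is converted to R.
      LastName = function(node, ...) {
        found <<- c(found, xmlValue(node))
        node    # hand the node back so it stays in the tree
      }),
    values = function() found)
}

h <- makeHandlers()
invisible(xmlTreeParse(txt, asText = TRUE, handlers = h$handlers, asTree = TRUE))
lastNames <- h$values()
```

No traversal code is written in R here: the parser does the walking and only calls back for the elements of interest.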

Efficient Parsing
Problem with the previous styles is that we have the entire tree in memory and then extract the data
=> 2 times the data in memory at the end
Bad news for large datasets
All of the Wikipedia pages - 11 gigabytes
Need to read the XML as it passes as a stream,
extracting and storing the contents
and discarding the XML.
SAX parsing - "Simple API for XML"!

xmlEventParse(content,
list(startElement = function(name, attrs, ...) ....,
endElement = function(name, ...) ...,
text = function(x, ...) ...,
comment = function(x, ...) ... , ....))
Whenever the XML parser sees a start/end/text/comment event, it calls the corresponding R function, which maintains state.
Awkward to write, but there to handle very large data.
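
A sketch of the state machine this implies, collecting the text of one element type. The document is invented, and makeSax() with its inside/buffer state variables is a hypothetical design for this example.

```r
library(XML)

# Hypothetical article set, invented for illustration.
txt <- paste0(
  "<ArticleSet><Article><AuthorList>",
  "<Author><LastName>Lee</LastName></Author>",
  "<Author><LastName>Chen</LastName></Author>",
  "</AuthorList></Article>",
  "<Article><AuthorList>",
  "<Author><LastName>Diaz</LastName></Author>",
  "</AuthorList></Article></ArticleSet>")

# Hypothetical factory: handlers share state through the closure.
makeSax <- function(target) {
  found  <- character()
  inside <- FALSE   # are we currently within a target element?
  buffer <- ""      # text seen so far within that element
  handlers <- list(
    startElement = function(name, attrs, ...) {
      if (name == target) { inside <<- TRUE; buffer <<- "" }
    },
    text = function(x, ...) {
      # Text may arrive in several pieces, so accumulate it.
      if (inside) buffer <<- paste0(buffer, x)
    },
    endElement = function(name, ...) {
      if (name == target) { found <<- c(found, buffer); inside <<- FALSE }
    })
  list(handlers = handlers, values = function() found)
}

sax <- makeSax("LastName")
invisible(xmlEventParse(txt, handlers = sax$handlers, asText = TRUE))
lastNames <- sax$values()
```

Only the collected strings are kept in memory; the tree itself is never built.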

Schema....
Just like a database has a schema describing the characteristics of columns in all tables within a database, XML documents often have an XML Schema (or Document Type Definition - DTD) describing the "template" tree and what elements can/must go where, attributes, etc.
The XML Schema is written in XML, so we can read it!
And we can actually create R data types to represent the same elements in XML directly in R.
So we can automate some of the reading of XML elements into useful, meaningful R objects,
though these are harder to programmatically flatten into data frames.

RCurl
xmlTreeParse() & xmlEventParse() can read from files,
compressed files, URLs, direct text - but limited
connection support.
RCurl package provides very rich ways that extend R's
ability to access content from URLs, etc. over the
Internet.
HTTPS - encrypted/secure HTTP
passwords/authentication
efficient, persistent connections
multiplexing
different protocols
Pass results to XML parser or other consumers.

Exceptions/Conditions
