A Parallel Approach to XML Parsing
Wei Lu, Kenneth Chiu, Yinfei Pan
Abstract— A language for semi-structured documents, XML has emerged as the core of the web services architecture, and is playing crucial roles in messaging systems, databases, and document processing. However, the processing of XML documents has a reputation for poor performance, and a number of optimizations have been developed to address this performance problem from different perspectives, none of which has been entirely satisfactory. In this paper, we present a seemingly quixotic, but novel approach: parallel XML parsing. Parallel XML parsing leverages the growing prevalence of multicore architectures in all sectors of the computer market, and yields significant performance improvements. This paper presents our design and implementation of parallel XML parsing. Our design consists of an initial preparsing phase to determine the structure of the XML document, followed by a full, parallel parse. The results of the preparsing phase are used to help partition the XML document for data-parallel processing. Our parallel parsing phase is a modification of the libxml2 [1] XML parser, which shows that our approach applies to real-world, production-quality parsers. Our empirical study shows that our parallel XML parsing algorithm can improve XML parsing performance significantly and scales well.

I. INTRODUCTION

XML's emergence as the de facto standard for encoding tree-oriented, semi-structured data has brought significant interoperability and standardization benefits to grid computing. Performance, however, is still a lingering concern for some applications of XML. A number of approaches have been used to address these performance concerns, ranging from binary XML to schema-specific parsing to hardware acceleration.

As manufacturers have encountered difficulties in sustaining exponential increases in clock speeds, they are increasingly using the march of Moore's law to provide multiple cores on a single chip. Tomorrow's computers will have more cores rather than exponentially faster clock speeds, and software will increasingly have to rely on parallelism to take advantage of this trend [2].

In this paper, we investigate the seemingly quixotic idea of parsing XML in parallel on a shared-memory computer, and develop an approach that scales reasonably well to four cores.

Concurrency could be used in a number of ways to improve XML parsing performance. One approach would be to use pipelining. In this approach, XML parsing could be divided into a number of stages, each executed by a different thread. This approach may provide speedup, but software pipelining is often hard to implement well, due to synchronization, load-balancing, and memory access costs.

More promising is a data-parallel approach. Here, the XML document would be divided into some number of chunks, and each thread would work on its chunks independently. As the chunks are parsed, the results are merged.

To divide the XML document into chunks, we could simply treat it as a sequence of characters, and then divide the document into equal-sized chunks, assigning one chunk to each thread (a toy sketch of such a split appears at the end of this section). This requires that each thread begin parsing from an arbitrary point in the XML document, however, which is problematic. Since an XML document is the serialization of a tree-structured data model (called the XML Infoset [3]) traversed in left-to-right, depth-first order, such a division will create chunks corresponding to arbitrary parts of the tree, and thus the parsing results will be difficult to merge back into a single tree. Correctly reconstructing namespace scopes and references will also be challenging. Furthermore, most chunks will begin in the middle of some string whose grammatical role is unknown: it could be a tag name, an attribute name, an attribute value, element content, etc. This could be resolved by extensive backtracking and communication, but that would incur overhead that may negate the advantages of parallel parsing. Clearly, rather than an equal-sized physical decomposition, the ability to decompose the XML document based on its logical structure is the key to efficient parallel XML parsing.

The results of parsing XML can vary from a DOM-style data structure representing the XML document, to a sequence of events manifested as callbacks, as in SAX-style parsing. Our parallel approach in this paper focuses on DOM-style parsing, where a tree data structure is created in memory that represents the document. Our targeted application area is scientific computing, but we believe our approach is broadly applicable. Our implementation is based on the production-quality libxml2 [1] parser, which shows that our work applies to real-world parsers, not just research implementations.

Current programming models for multicore architectures provide access to multiple cores via threads. Thus, in the rest of the paper, we use the term thread rather than core. To avoid scheduling issues that are outside the scope of this paper, we assume that each thread is executing on a separate core.

The rest of the paper is organized as follows. Section II ...
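To make the equal-sized physical decomposition discussed above concrete, the following is a minimal, hypothetical C++ sketch; it is not the PXP algorithm of this paper. It splits a document buffer into equal character chunks, one per thread, and each thread merely counts '<' characters as a stand-in for real parsing, since an actual parser could not safely begin at an arbitrary offset.

// Illustrative sketch only (not the paper's PXP algorithm): a naive
// data-parallel split of an XML document into equal-sized character
// chunks, one per thread. The per-chunk "work" only counts '<'
// characters; a real parser could not start at an arbitrary offset,
// because a chunk may begin in the middle of a tag name, attribute
// value, or text node.
#include <algorithm>
#include <cstddef>
#include <iostream>
#include <string>
#include <thread>
#include <vector>

int main() {
    const std::string doc = "<root><a x='1'>text</a><b/></root>";  // toy document
    const unsigned n = std::max(1u, std::thread::hardware_concurrency());

    std::vector<std::size_t> counts(n, 0);
    std::vector<std::thread> workers;
    const std::size_t chunk = (doc.size() + n - 1) / n;

    for (unsigned i = 0; i < n; ++i) {
        workers.emplace_back([&, i] {
            const std::size_t begin = std::min(doc.size(), i * chunk);
            const std::size_t end   = std::min(doc.size(), begin + chunk);
            for (std::size_t p = begin; p < end; ++p)
                if (doc[p] == '<') ++counts[i];   // placeholder for per-chunk parsing
        });
    }
    for (auto& t : workers) t.join();

    std::size_t total = 0;
    for (std::size_t c : counts) total += c;      // trivial "merge" step
    std::cout << "markup openers seen: " << total << "\n";
}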
Fig. 6. This graph shows the upper bound of the speedup of the PXP algorithms for up to four threads, when used to parse a big XML document which only contains an array structure. (Axes: speedup vs. number of threads.)

Fig. 7. This graph shows the speedup of the dynamic PXP for up to four threads, when used to parse two same-size XML documents, one with irregular tree shape and one with regular array shape. (Axes: speedup vs. number of threads.)

... of the preparsing, and then performed experiments to measure the performance improvement and scalability of the parallel XML parsing (static and dynamic partition) algorithms over different XML documents. The experiments were run on a Linux 2.6.9 machine with two dual-core AMD Opteron processors and 4 GB of RAM. Every test is run five times to obtain an average time, and the measurement of the first run is discarded, so as to measure performance with the file data already cached rather than read from disk. The programs are compiled by g++ 3.4.5 with the option -O3, and ...
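As a rough, hypothetical illustration of the measurement methodology described above (this harness is a sketch under our own assumptions, not the benchmark driver actually used), each timing could be taken as follows, discarding the first of five runs and averaging the rest:

// Hypothetical timing harness sketching the methodology described above:
// run the parse five times, discard the first (cold-cache) run, and
// average the remaining four. parse_document() is a placeholder, not
// part of the paper's code.
#include <chrono>
#include <cstdio>

static void parse_document() {
    // placeholder workload; a real benchmark would parse the XML file here
    volatile long sink = 0;
    for (long i = 0; i < 1000000; ++i) sink = sink + i;
}

int main() {
    using clock = std::chrono::steady_clock;
    const int runs = 5;
    double total_ms = 0.0;

    for (int r = 0; r < runs; ++r) {
        const auto start = clock::now();
        parse_document();
        const auto stop = clock::now();
        const double ms =
            std::chrono::duration<double, std::milli>(stop - start).count();
        if (r > 0) total_ms += ms;   // first run warms the file cache; discard it
    }
    std::printf("average over %d timed runs: %.3f ms\n", runs - 1, total_ms / (runs - 1));
    return 0;
}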
... dominates the overhead, and the static PXP presents the upper bound of the parallel performance.

The speedups of the dynamic PXP are slightly lower than those of the static PXP, which indicates that the cost of communication and synchronization starts to be a factor, but remains relatively minor. As the number of threads increases, the speedup of the PXP (dynamic or static) grows more slowly, because as the workload of each thread decreases, the overhead of the preparsing becomes more significant than before. The dynamic PXP also obtains less speedup than the static PXP due to the increasing communication cost. Furthermore, even the speedup of the static PXP omitting the preparsing cost starts to drop away from the theoretical limit. We speculate that shared memory or cache conflicts are playing a role here.

Unlike the static PXP, the dynamic PXP is able to parse XML documents with any tree shape. To further study the performance improvement of the dynamic PXP, we modified the previous XML document with the big array structure to have an irregular tree shape, consisting of five top-level elements under the root, each with a randomly chosen number of children. Each of these children is an element from the array of the first test, so the total number of these child elements in the modified document is the same as in the original document.

We compare the dynamic PXP on this modified XML document against the dynamic PXP on the original array XML document. This comparison shows how the dynamic PXP scales for XML documents with irregular or regular shape. From the results shown in Figure 7 we can see that there is little difference between the two XML documents, which implies that the dynamic PXP (and our task partitioning of dividing the remaining work in half) is able to effectively handle large XML files with irregular shape.

These tests did not actually further parse the element contents. In a typed parsing scenario, where schema or other information can be used to interpret the element content, we would obtain even better scalability. For example, if we are parsing a large array of doubles, including the ASCII-to-double conversion, each thread has an increased workload relative to the preparsing stage and other overheads, and thus speedup would be improved.

VI. RELATED WORK

As mentioned earlier, parallel XML parsing can essentially be viewed as a particular application of graph partitioning [6] and parallel graph search algorithms [7]. But document parsing and DOM building introduce some new issues, such as preparsing and namespace references, which are not addressed by those general parallel algorithms.

There are a number of approaches that try to address the performance bottleneck of XML parsing. Typical software solutions include pull-based parsing [9], lazy parsing [10], and schema-specific parsing [11], [12], [13]. Pull-based XML parsing is driven by the user, and thus provides flexible performance by allowing the user to build only the parts of the data model that are actually needed by the application. Schema-specific parsing leverages XML schema information, from which a specific parser (automaton) is built to accelerate XML parsing. For XML documents conforming to the schema, the schema-specific parser will run very quickly, whereas for other documents an extra penalty is paid. Most closely related to our work is lazy parsing, because it also needs a skeleton-like structure of the XML document for lazy evaluation. That is, a skeleton is first built from the XML document to indicate the basic tree structure; thereafter, based on the user's access requirements, the corresponding piece of the XML document is located by looking up the skeleton and fully parsed. However, the purposes of lazy parsing and parallel parsing are totally
different, so the structure and the use of the skeleton in the two algorithms differ fundamentally from each other. Hardware-based solutions [14], [15] are also promising, particularly in the industrial arena. But to the best of our knowledge, there is no other work leveraging the data-parallelism model as PXP does.

VII. CONCLUSION AND FUTURE WORK

In this paper, we have described our approach to parallel XML parsing, and shown that it performs well for up to four cores. An efficient parallel XML parsing scheme needs an effective data decomposition method, which implies a better understanding of the tree structure of the XML document. Preparsing is designed to extract the minimal tree structure (i.e., the skeleton) from the XML document as quickly as possible. The key to the high performance of the preparsing is its highly simplified syntax, as well as the omission of full well-formedness checking. Aided by the skeleton, the algorithm can partition the XML document into chunks and parse them in parallel. Depending upon when the document is partitioned, we have the static PXP and dynamic PXP algorithms. The former applies only to XML documents with array structures and gives the best-case benefit of parallelism, while the latter is applicable to any structure, but with some communication and synchronization cost. Our experiments show that the preparsing is much faster than full XML parsing (either SAX or DOM), and that, based on it, the parallel parsing algorithms can speed up parsing and DOM building significantly and scale well. Since the preparsing becomes the bottleneck as the number of threads increases, our future work will investigate the feasibility of parallelism between the preparsing and the real parsing. New approaches for very large XML documents will also be studied under the shared-memory model.
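To illustrate the kind of skeleton such a preparse produces, here is a deliberately simplified, hypothetical C++ sketch; it is not the preparser evaluated above, and it skips attributes, namespaces, comments, CDATA, processing instructions, and well-formedness checking, recording only element byte ranges and depths.

// Simplified sketch of a preparse that extracts a "skeleton": the byte
// ranges of elements and their nesting depth, with no attribute,
// namespace, or well-formedness processing. Comments, CDATA sections,
// and processing instructions are ignored for brevity; this is
// illustrative code, not the preparser used in the paper's experiments.
#include <cstddef>
#include <iostream>
#include <string>
#include <vector>

struct SkeletonItem {
    std::size_t start;   // offset of the '<' that opens the element
    std::size_t end;     // offset just past the '>' that closes it
    int depth;           // nesting depth, root = 0
};

std::vector<SkeletonItem> preparse(const std::string& doc) {
    std::vector<SkeletonItem> skeleton;
    std::vector<std::size_t> open;            // indices of currently open elements

    std::size_t pos = 0;
    while ((pos = doc.find('<', pos)) != std::string::npos) {
        const std::size_t gt = doc.find('>', pos);
        if (gt == std::string::npos) break;   // truncated document

        if (doc[pos + 1] == '/') {            // end tag: close the innermost element
            if (!open.empty()) {
                skeleton[open.back()].end = gt + 1;
                open.pop_back();
            }
        } else {                              // start tag (possibly self-closing)
            skeleton.push_back({pos, gt + 1, static_cast<int>(open.size())});
            if (doc[gt - 1] != '/')           // not "<.../>", so it stays open
                open.push_back(skeleton.size() - 1);
        }
        pos = gt + 1;
    }
    return skeleton;
}

int main() {
    const std::string doc = "<root><a>hi</a><b/></root>";
    for (const SkeletonItem& item : preparse(doc))
        std::cout << "depth " << item.depth << ": [" << item.start
                  << ", " << item.end << ")\n";
}

With such byte ranges in hand, chunks for the parallel parse can be chosen along element boundaries rather than at arbitrary byte offsets, which is the role the skeleton plays above.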
ACKNOWLEDGMENT

We would like to thank Professor Randall Bramley for his insightful suggestions and help on graph partitioning and Metis. We also thank Zongde Liu and Srinath Perera for their useful comments and discussion.
REFERENCES

[1] D. Veillard, “Libxml2 project web page,” http://xmlsoft.org/, 2004.
[2] H. Sutter, “The free lunch is over: A fundamental turn toward concurrency in software,” Dr. Dobb's Journal, vol. 30, 2005.
[3] W3C, “XML information set (second edition),” http://www.w3.org/TR/xml-infoset/, 2003.
[4] J. E. Hopcroft, R. Motwani, and J. D. Ullman, Introduction to Automata Theory, Languages, and Computation. Addison Wesley, 2000.
[5] W3C, “Extensible Markup Language (XML) 1.0 (Third Edition),” http://www.w3.org/TR/2004/REC-xml-20040204/, 2004.
[6] G. Karypis and V. Kumar, “Parallel multilevel k-way partitioning scheme for irregular graphs,” in Supercomputing, 1996.
[7] V. N. Rao and V. Kumar, “Parallel depth first search. Part I. Implementation,” Int. J. Parallel Program., vol. 16, no. 6, pp. 479–499, 1987.
[8] V. Kumar and V. N. Rao, “Parallel depth first search. Part II. Analysis,” Int. J. Parallel Program., vol. 16, no. 6, pp. 501–519, 1987.
[9] A. Slominski, “XML pull parsing,” http://www.xmlpull.org/, 2004.
[10] M. L. Noga, S. Schott, and W. Lowe, “Lazy XML processing,” in DocEng ’02: Proceedings of the 2002 ACM Symposium on Document Engineering, 2002.
[11] K. Chiu and W. Lu, “A compiler-based approach to schema-specific XML parsing,” in The First International Workshop on High Performance XML Processing, 2004.
[12] W. M. Lowe, M. L. Noga, and T. S. Gaul, “Foundations of fast communication via XML,” Ann. Softw. Eng., vol. 13, no. 1–4, 2002.
[13] R. van Engelen, “Constructing finite state automata for high performance XML web services,” in Proceedings of the International Symposium on Web Services (ISWS), 2004.
[14] J. van Lunteren, J. Bostian, B. Carey, T. Engbersen, and C. Larsson, “XML accelerator engine,” in The First International Workshop on High Performance XML Processing, 2004.
[15] “Datapower,” http://www.datapower.com/.

APPENDIX
Structure of the XML document ns att test.xml

<xml xmlns:rs='urn:schemas-microsoft-com:rowset'
     xmlns:z='#RowsetSchema'
     xmlns:tb0='table0' xmlns:tb1='table1'
     xmlns:tb2='table2' xmlns:tb3='table3'>
  <z:row tb1:PRODUCT=... tb0:CCIDATE=...
         tb0:CLASS=... tb2:ADNUMBER=...
         tb0:PRODUCTIONCATEGORYID_FK=...
         tb3:ADVERTISERACCOUNT=...
         tb1:YPOSITION=... tb2:CHEIGHT=...
         tb2:CWIDTH=... tb2:MHEIGHT=...
         tb2:MWIDTH=... tb2:BHEIGHT=...
         tb2:BWIDTH=... tb3:SALESPERSONNUMBER=...
         tb3:SALESPERSONNAME=...
         tb1:PAGENAME=... tb1:PAGENUMBER=...
         tb2:BOOKEDCOLOURINFO=... tb1:EDITION=...
         tb1:MOUNTINGCOMMENT=... tb1:TSNLSALESSYSTEM=...
         tb1:TSNLCLASSID_FK=... tb1:TSNLSUBCLASS=...
         tb1:TSNLACTUALDEPTH=... tb1:XPOSITION=...
         tb0:TSNLCEESRECORDTYPEID_FK=...
         tb0:PRODUCTZONE=... ROWID=.../>
  <z:row ... />
  <z:row ... />
  ...
</xml>