
A Parallel Approach to XML Parsing

Wei Lu, Kenneth Chiu, Yinfei Pan

Wei Lu
Computer Science Department, Indiana University
150 S. Woodlawn Ave., Bloomington, IN 47405, US
[email protected]

Kenneth Chiu, Yinfei Pan
Department of Computer Science, State University of New York - Binghamton
P.O. Box 6000, Binghamton, NY 13902, US
[email protected], [email protected]

Abstract— A language for semi-structured documents, XML has emerged as the core of the web services architecture, and is playing crucial roles in messaging systems, databases, and document processing. However, the processing of XML documents has a reputation for poor performance, and a number of optimizations have been developed to address this performance problem from different perspectives, none of which have been entirely satisfactory. In this paper, we present a seemingly quixotic, but novel approach: parallel XML parsing. Parallel XML parsing leverages the growing prevalence of multicore architectures in all sectors of the computer market, and yields significant performance improvements. This paper presents our design and implementation of parallel XML parsing. Our design consists of an initial preparsing phase to determine the structure of the XML document, followed by a full, parallel parse. The results of the preparsing phase are used to help partition the XML document for data-parallel processing. Our parallel parsing phase is a modification of the libxml2 [1] XML parser, which shows that our approach applies to real-world, production-quality parsers. Our empirical study shows that our parallel XML parsing algorithm can improve XML parsing performance significantly and scales well.

I. INTRODUCTION

XML's emergence as the de facto standard for encoding tree-oriented, semi-structured data has brought significant interoperability and standardization benefits to grid computing. Performance, however, is still a lingering concern for some applications of XML. A number of approaches have been used to address these performance concerns, ranging from binary XML to schema-specific parsing to hardware acceleration.

As manufacturers have encountered difficulties in furthering exponential increases in clock speeds, they are increasingly utilizing the march of Moore's law to provide multiple cores on a single chip. Tomorrow's computers will have more cores rather than exponentially faster clock speeds, and software will increasingly have to rely on parallelism to take advantage of this trend [2].

In this paper, we investigate the seemingly quixotic idea of parsing XML in parallel on a shared-memory computer, and develop an approach that scales reasonably well to four cores.

Concurrency could be used in a number of ways to improve XML parsing performance. One approach would be to use pipelining. In this approach, XML parsing could be divided into a number of stages. Each stage would be executed by a different thread. This approach may provide speedup, but software pipelining is often hard to implement well, due to synchronization, load-balance, and memory access costs.

More promising is a data-parallel approach. Here, the XML document would be divided into some number of chunks, and each thread would work on its chunks independently. As the chunks are parsed, the results are merged.

To divide the XML document into chunks, we could simply treat it as a sequence of characters, and then divide the document into equal-sized chunks, assigning one chunk to each thread. This requires that each thread begin parsing from an arbitrary point in the XML document, however, which is problematic. Since an XML document is the serialization of a tree-structured data model (called the XML Infoset [3]) traversed in left-to-right, depth-first order, such a division will create chunks corresponding to arbitrary parts of the tree, and thus the parsing results will be difficult to merge back into a single tree. Correctly reconstructing namespace scopes and references will also be challenging. Furthermore, most chunks will begin in the middle of some string whose grammatical role is unknown. It could be a tag name, an attribute name, an attribute value, element content, etc. This could be resolved by extensive backtracking and communication, but that would incur overhead that may negate the advantages of parallel parsing. Clearly, instead of an equal-sized physical decomposition, the ability to decompose the XML document based on its logical structure is the key to efficient parallel XML parsing.

The results of parsing XML can vary from a DOM-style data structure representing the XML document, to a sequence of events manifested as callbacks, as in SAX-style parsing. Our parallel approach in this paper focuses on DOM-style parsing, where a tree data structure that represents the document is created in memory. Our targeted application area is scientific computing, but we believe our approach is broadly applicable. Our implementation is based on the production-quality libxml2 [1] parser, which shows that our work applies to real-world parsers, not just research implementations.

Current programming models for multicore architectures provide access to multiple cores via threads. Thus, in the rest of the paper, we use the term thread rather than core. To avoid scheduling issues that are outside the scope of this paper, we assume that each thread is executing on a separate core.

The rest of the paper is organized as follows. Section II describes the general architecture of our approach, PXP. Then, in Sections III and IV, we present the algorithm design and implementation details. We present performance results in Section V. Related work is discussed in Section VI.

Fig. 1. The PXP architecture first uses a preparser to generate a skeleton of the XML document. This is then used to guide the partitioning of the document into chunks, which are then parsed in parallel.

The example document of Figure 2:

<root xmlns="www.indiana.edu">
  <foo id="0">hello</foo>
  <bar>
    <!-- comment -->
    <?name pidata ?>
    <a>world</a>
  </bar>
</root>

Fig. 2. The top diagram shows the XML Infoset model of a simple XML document. The bottom diagram shows the skeleton of the same document.

II. PXP

Any kind of parsing is based on some kind of machine abstraction. The problems of an arbitrary division scheme arise from a lack of information about the state of the parsing machine at the beginning of each chunk. Without this state, the machine does not know how to start parsing the chunk. Unfortunately, the full state of the parser after the Nth character cannot be provided without first considering each of the preceding N − 1 characters.

This leads us to the PXP (Parallel XML Parsing) approach presented in this paper. We first use an initial pass to determine the logical tree structure of an XML document. This structure is then used to divide the XML document such that the divisions between the chunks occur at well-defined points in the XML grammar. This provides enough context so that each chunk can be parsed starting from an unambiguous state.

This seems counterproductive at first glance, since the primary purpose of XML parsing is to build a tree-structured data model (i.e., the XML Infoset) from the XML document. However, the tree structure needed to guide the parallel parsing can be significantly smaller and simpler than that ultimately generated by a normal XML parser, and does not need to include all the information in the XML Infoset data model. We call this simple tree structure, specifically designed for XML data decomposition, the skeleton of the XML document.

To distinguish it from the actual XML parsing, the procedure to parse and generate the skeleton from the XML document is called preparsing. Once the preparsing is complete and we know the logical tree structure of the XML document, we are able to divide the document into balanced chunks and then launch multiple threads to parse the chunks in parallel. Consequently, this parallelism can significantly improve performance. Our overall architecture is shown in Figure 1.

For simplicity and performance, PXP currently maps the entire document into memory with the mmap() system call. Nothing precludes our general approach from working on streamed documents, or documents too large to fit into memory, but the design and implementation would be significantly more complex.

III. PREPARSING

The goal of preparsing is to determine the tree structure of the XML document so that it can be used to guide the data-parallel, full parsing.

A. Skeleton

Conceptually, the XML Infoset represents the tree structure of the XML document. However, since only internal nodes (i.e., the element items) determine the topology of the tree, which is what is meaningful for XML data decomposition, the leaf nodes in the XML Infoset, such as attribute information items, comment information items, and even character information items, can be ignored by the skeleton. Further, the element tag names are also ignored by the skeleton, since they do not affect the topology of the tree at all. So, as shown in Figure 2, the skeleton is essentially a tree of unnamed nodes, isomorphic to the original XML document, and constructed from all start-tag/end-tag pairs. To facilitate XML data decomposition, our skeleton records the location of the start tag and end tag of each element, the parent-child relationships, and the number of children of every element.

B. Implementation

Well-formed XML is not a regular language [4], and it cannot be parsed by a finite-state automaton, but rather requires at least a push-down automaton. So even determining the fundamental structure of the XML document, just for preparsing, requires executing a push-down automaton. However, since preparsing is an additional processing step for parallel parsing, it is an additional overhead not normally incurred during XML parsing. Furthermore, since it is sequential, it fundamentally limits the parallel parsing performance. Hence, a fundamental premise of our work is that preparsing can build the skeleton at minimal cost.
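To make the skeleton concrete, the record below sketches one possible node layout; the type and field names are ours, not from the PXP implementation, and a real skeleton would likely use a more compact representation:

```c
#include <assert.h>
#include <stdlib.h>

/* Illustrative skeleton node: it records only what Section III-A lists:
 * the byte offsets of the start and end tags, the parent-child links,
 * and the number of children, in left-to-right (document) order. */
typedef struct SkelNode {
    long start_off;              /* offset of the '<' of the start tag      */
    long end_off;                /* offset of the '<' of the end tag        */
    int num_children;            /* lets a chunk's element count be derived */
    struct SkelNode *parent;
    struct SkelNode **children;  /* children in document order              */
    int cap;                     /* capacity of the children array          */
} SkelNode;

/* Append a new node under 'parent' (NULL for the root). */
SkelNode *skel_add(SkelNode *parent, long start_off)
{
    SkelNode *n = calloc(1, sizeof *n);
    n->start_off = start_off;
    n->end_off = -1;             /* filled in when the end tag is seen */
    n->parent = parent;
    if (parent) {
        if (parent->num_children == parent->cap) {
            parent->cap = parent->cap ? 2 * parent->cap : 4;
            parent->children = realloc(parent->children,
                                       parent->cap * sizeof *parent->children);
        }
        parent->children[parent->num_children++] = n;
    }
    return n;
}
```

During preparsing, a node would be allocated when a start tag is pushed and its end_off filled in when the matching end tag is popped, so the skeleton falls out of the stack discipline described below.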
According to the XML specification [5], a non-validating¹ XML parser must determine whether or not an XML document is well-formed. An XML document is considered well-formed if it satisfies both requirements below:

1) It conforms to the syntax production rules defined in the XML specification.
2) It meets all the well-formedness constraints given in the specification.

However, since preparsing will be followed by a full-fledged XML parsing stage, the preparsing itself can ignore many errors. That is, for a well-formed XML document, the preparser must generate the correct result, but for an ill-formed XML document, the preparser does not need to detect any errors. Thus, our preparser only detects weak conformance to the XML specification, and hence is simpler to implement and optimize.

¹ DTD and validating XML parsing are not supported by our current system, for simplicity. Also, DTD is being replaced by XML Schema validation, which is usually a separate process after the XML parsing.

As the skeleton only contains the locations of the element nodes in the XML document, preparsing only needs to consider the element tag pairs, and can ignore other syntactic units and production rules, such as those for comments, character data, and attributes. Consequently, the preparsing has a much simpler set of production rules compared to standard XML. For example, the production rule of the start tag in XML 1.0 is defined as:

STag      ::= '<' Name (S Attribute)* S? '>'
Attribute ::= Name Eq AttValue
Name      ::= (Letter | '_' | ':') (NameChar)*
AttValue  ::= '"' ([^<&"] | Reference)* '"'
            | "'" ([^<&'] | Reference)* "'"

Because preparsing can ignore Attribute and AttValue, and even the entire Name production rule, the syntax could seemingly be simplified to just:

STag ::= '<' ([^>])* '>'

However, the above simplified production rule is incorrect due to ambiguity, because AttValue allows the > character by its production rule, which, if it appears, will cause the preparser to misidentify the location of the actual right angle bracket of the tag. Therefore, the correct rules are:

STag     ::= '<' ([^'"])* AttValue* '>'
AttValue ::= '"' ([^'"])* '"' | "'" ([^'"])* "'"

With the same concern of possible ambiguity, the PI, Comment, and CDATA productions should be preserved in the preparsing rule set, because they are allowed to contain any string, including the < character, which would otherwise cause the preparser to misidentify the location of the end tag. The rest of the production rules of standard XML are ignored by the preparsing.

The simplified preparsing syntax results in a much simpler parsing automaton (Figure 3), which requires only six major states, compared to the one needed by complete XML parsing. Predictably, the preparsing automaton runs much faster than the general XML parsing automaton.

Fig. 3. This automaton accepts the syntax needed by preparsing. (To emphasize the major states, we omit the states for the PI, Comment, and CDATA productions by enclosing them in the dashed line box.)

In addition to the simplified syntax, preparsing also benefits from omitting other well-formedness constraints. Usually, in order to check the well-formedness constraints, a general XML parser will perform a number of additional comparisons, transformations, sorting, and buffering, all of which can result in significant performance bottlenecks. For instance, the fundamental well-formedness constraint is that the name in the end-tag of an element must match the name in the start-tag. To check this constraint, a general XML parser might push the start tag name onto a stack whenever a start tag is encountered, and pop the stack to match the name of the end tag. The preparser, however, treats the XML document as a sequence of unnamed open and close tag pairs. Therefore, it can merely increment the top pointer of the stack for any start tag, and decrement it for any end tag. Finally, if the top pointer points to the bottom of the stack, the preparser considers the XML document to be correct, without an expensive string comparison.

Another well-formedness constraint example is that an attribute name must not appear more than once in the same start-tag. To verify that, a full XML parser must perform an expensive uniqueness test, which is not required for preparsing.

Finally, preparsing obviously does not need to resolve namespace prefixes, since it completely ignores the tag names. However, a full XML parser supporting namespaces requires expensive lookup operations to resolve namespace prefixes.

The only constraint the preparsing requires is that each open tag must be paired with a close tag. A simple stack is adopted for this checking, and the skeleton nodes are generated as the result of the pushing and popping of the stack.

Another important source of the performance advantage of preparsing compared to full parsing is that the skeleton is much lighter-weight than the DOM structure. Thus, preparsing is able to generate the skeleton substantially faster than full XML parsing is able to generate the DOM. When compared to SAX, the preparser benefits from avoiding callbacks.
IV. PARALLEL PARSING

During the parallel parsing phase, we use the structural information in the skeleton to divide the document into chunks, each of which contains a forest of subtrees of the XML document. Each chunk is parsed by a thread. For any data-parallel technique to be effective, load-balancing must be used to prevent idle threads. Ideally, we could divide the document into chunks such that there is one chunk for each thread and such that each chunk takes exactly the same amount of time to parse. Depending on when and how the partitioning is performed, we have two strategies: static partitioning and dynamic partitioning.

A. Static Partitioning

Naturally, we can statically partition a tree into several equally-sized subparts by using a graph partitioning tool (e.g., Metis [6]), which can divide the graph/tree into N equally-sized parts. The advantage of static partitioning is that it can generate a very well-balanced load for every thread, thus leading to good parallelism.

However, since the static partitioning occurs before the actual XML parsing, it knows little about the parsing context (e.g., namespace declarations). In other words, cuts made by the static partitioning will create the following problems:

1) The characters of the XML document corresponding to a subgraph may no longer be contiguous. Metis will create connected subgraphs, but a connected subgraph of the logical tree structure does not necessarily correspond to a contiguous sequence of characters in the XML document. In order to parse the resulting characters, we must either reconstruct a contiguous sequence by memory copying, or modify the XML parser to handle non-contiguous character sequences, which may be challenging.
2) The namespace scope may be split between subgraphs, which means a namespace prefix may be used in one subgraph, but defined in another. These inter-chunk references will create strong memory and synchronization dependencies between threads, which will degrade performance.

The static partitioning strategy also suffers because the static partitioning algorithm must be executed sequentially before the parallel parsing; thus the performance gained by the parallelism can easily be offset by the cost of the static partitioning algorithm, which usually is not trivial.

However, for XML documents representing an array structure, such as

<data>
  <item>....</item>
  ...
  <item>....</item>
</data>

which are responsible for the bulk of most large XML documents, static partitioning is able to provide the best parallelism. That is because a linear array can easily be divided into equal-sized ranges (i.e., subgraphs) without an expensive graph-partitioning step. The division is based on the left-to-right order, so every range is contiguous in the XML document.

We have developed the static PXP algorithm, a simple static partitioning and parallel parsing algorithm capable of parsing XML documents with array structures. This serves to provide a baseline against which we can compare more realistic techniques. Conveniently, we are able to leverage a function from libxml2 [1], which is a widely-used and efficient XML parsing library written in C, to perform the parsing:

xmlParseInNodeContext(xmlNodePtr node,
                      const char *data,
                      int datalen,
                      int options,
                      xmlNodePtr *lst)

This function can parse a "well-balanced chunk" of an XML document within the context (DTD, namespaces, etc.) of the given node. A well-balanced chunk is defined as any valid content allowed by the XML grammar. Obviously, any element range generated by static array partitioning is a well-balanced chunk, so we can use the above function to parse each region generated by our static partitioning. The static PXP algorithm then consists of the following steps:

1) Construct a faked XML document in memory containing just an empty root element, by copying the open/close tag pair of the root element from the original XML document. Since we assume that the size of the root element is much smaller than the whole document, the cost of any memory operations used by this step is acceptable.
2) Call the libxml2 function xmlParseMemory() to parse the faked XML document, thus obtaining the root XML node. This node contains the namespace declarations required by its children, and will be treated as the context for the following parses of the ranges of the array.
3) The number of elements in each chunk is calculated by simply dividing the total number of elements in the array, which was calculated during the preparsing stage, by the number of available threads, so that every thread has a balanced workload. The start position and data length of a chunk can be inferred from the location information of its first and last elements.
4) Create a thread to parse each chunk in parallel. Each thread invokes xmlParseInNodeContext() to parse and build the DOM structure.
5) Finally, the parsed results of each thread are spliced back under the root node.

In summary, the static partitioning strategy is not really practical for XML documents with irregular tree structures, due to strong dependencies between the different processing steps. However, for those XML documents containing an array, it provides an upper bound on the performance gain of parallel parsing, and is useful for the evaluation of other parallel parsing approaches as a guideline.
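Step 3 of the static PXP algorithm reduces to simple arithmetic over the skeleton. The sketch below is our own illustration (the names and the starts/ends arrays are hypothetical stand-ins for the skeleton's location records): it divides total sibling elements among nthreads threads and derives each chunk's byte range from the elements' recorded locations:

```c
#include <assert.h>

/* One parsing chunk: a range of sibling elements plus its byte extent. */
typedef struct { int first, count; long off, len; } Chunk;

/* starts[k] holds the offset of element k's start tag; ends[k] the offset
 * just past its end tag. Both come from the skeleton built at preparse time. */
void make_chunks(int total, int nthreads,
                 const long *starts, const long *ends, Chunk *out)
{
    int base = total / nthreads, extra = total % nthreads, next = 0;
    for (int t = 0; t < nthreads; t++) {
        out[t].first = next;
        out[t].count = base + (t < extra ? 1 : 0);  /* spread the remainder */
        next += out[t].count;
        if (out[t].count == 0) {                    /* more threads than work */
            out[t].off = out[t].len = 0;
            continue;
        }
        int last = out[t].first + out[t].count - 1;
        out[t].off = starts[out[t].first];
        out[t].len = ends[last] - starts[out[t].first];
    }
}
```

Because sibling elements are serialized contiguously, each chunk is a single contiguous byte range and a well-balanced chunk in the sense required by xmlParseInNodeContext().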
B. Dynamic Partitioning

In contrast with static partitioning, the dynamic partitioning strategy partitions the XML document and generates the subtasks during the actual XML parsing. After the preparser generates the skeleton, the tree structure is traversed in parallel to complete the parsing. Whenever a node is visited by a thread, its corresponding serialization (start tag) will be parsed and the related DOM node will be built.

The parallel tree traversal is equivalent to a complete, parallel depth-first search (DFS) (in which the desired node is not found), which partitions the tree dynamically and searches for a specific goal in parallel using multiple threads.

Following Rao [7], dynamic partitioning consists of two phases:

• Task partitioning
• Subtask distribution

Task partitioning refers to how a thread splits its current task into subtasks when another thread needs work. A common strategy is node splitting [8], in which each of the n nodes spawned by a node in a tree is itself given away as an individual subtask. However, for parallel XML parsing, node splitting may generate too many small tasks, since most nodes represent a single leaf element in the XML document, thus increasing the communication cost.

Since XML is a depth-first, left-to-right serialization of a tree, a sequence of sibling element nodes in the skeleton corresponds to a contiguous chunk of the XML document. Therefore, if each parsing task covers a sequence of sibling element nodes, this will maximize the size of each workload, with little communication cost. In dynamic partitioning, we adopt a simple but effective policy of splitting the workload in half, as shown in Figure 4. That is, the running thread splits the unparsed siblings of the current element node into two halves in the left-to-right order, whenever the partitioning is requested.

Fig. 4. The left diagram illustrates the general node splitting strategy. Each node becomes a subtask. The right diagram illustrates the split-in-half strategy. The nodes of the current parsing task are split in half, with the first half given to the requesting task, while the current task finishes the second half.

Subtask distribution refers to how and when subtasks are distributed from the donator thread to the requester thread. If work splitting is performed only when an idle processor requests work, it is called requester-initiated subtask distribution. In contrast, if the generation of subtasks is independent of the work requests from idle processors, the scheme is referred to as donator-initiated subtask distribution. For parallel XML parsing, we desire that a parsing thread parse as much XML data as possible without any interruption, unless other threads are idle and asking for tasks, so as to achieve better performance. Also, any thread can be the donator or the requester. We adopt requester-initiated subtask distribution as the partitioning strategy in PXP.

To implement parallel parsing with dynamic partitioning, we again use libxml2. Since dynamic partitioning requires that the parser do the task partitioning and subtask generation during the parsing, however, we cannot simply apply the libxml2 xmlParseInNodeContext() function as in the static partitioning scheme. Instead, we need to change the xmlParseInNodeContext()² source code to integrate the dynamic partitioning and generation logic into the original parsing code. The modified algorithm is called dynamic PXP, and its basic steps are:

1) Create multiple threads, and assign the root node of the skeleton as the initial parsing task to the first thread. The other threads are idle.
2) When a thread is idle, it posts its request on a request queue, and waits for the request to be filled by some donator thread.
3) Every thread, once it begins parsing, parses normally as libxml2 does, except when an open tag is being parsed. At that time, it checks the request queue for threads that need work. If such a requester thread exists, the thread splits the current workload (i.e., the unparsed sibling nodes) into two regions. The first half is donated to the requester thread, and the thread resumes parsing at the beginning location of the second half. Since every skeleton node records the number of its children elements, as well as its location information, it is easy to figure out the begin location and data length of the subtask. Also, to avoid excessively small tasks, the user can set a threshold to prevent task partitioning if the remaining work is less than the threshold.
4) Once the requester thread obtains the parsing task, it begins the parsing at the beginning location of the donated subtask. Due to the dynamic nature, the donator is able to pass its current parsing context (e.g., the namespace declarations) to the requester as the requester's initial parsing context; the requester will in turn make a clone of the parsing context for itself before parsing, to avoid the synchronization cost. Also, the donator will create a dummy node as a "placeholder" for the parsing task; the subtrees generated by the requester will be inserted under the placeholder, and once the parsing task is completed, the placeholder will be spliced into the entire DOM tree.
5) This process continues until all threads are idle.

² In fact, the actual modified function is xmlParseContent(), which is invoked by xmlParseInNodeContext() to parse the XML content.

In summary, dynamic partitioning load-balances during the parsing, and it can be applied to any irregular tree structure
without the need for an extra partitioning algorithm. However, the dynamic nature incurs a synchronization and communication cost among the threads, which is not needed by the static partitioning scheme.
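The split-in-half policy of Figure 4 can be sketched as follows; the types and names are ours, not from the PXP source. A task is a range of unparsed sibling skeleton nodes, and the donator gives the first half of the remainder to the requester while keeping the second half:

```c
#include <assert.h>

/* A parsing task: an inclusive range [first, last] of unparsed siblings. */
typedef struct { int first, last; } Task;

/* Split-in-half: fills 'donated' with the first half of the remaining work
 * and shrinks 'mine' to the second half. Returns 0 (no split) when the
 * remaining work is at or below 'threshold', to avoid tiny tasks. */
int split_in_half(Task *mine, Task *donated, int threshold)
{
    int remaining = mine->last - mine->first + 1;
    if (remaining <= threshold) return 0;
    int half = remaining / 2;
    donated->first = mine->first;        /* first half goes to the requester */
    donated->last = mine->first + half - 1;
    mine->first = donated->last + 1;     /* donator resumes at second half   */
    return 1;
}
```

Because the skeleton records each element's location and child count, mapping the donated index range back to a byte offset and length in the document is a constant-time lookup.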
V. MEASUREMENT

We first performed experiments to measure the performance of the preparsing, and then performed experiments to measure the performance improvement and the scalability of the parallel XML parsing (static and dynamic partitioning) algorithms on different XML documents. The experiments were run on a Linux 2.6.9 machine with two dual-core AMD Opteron processors and 4GB of RAM. Every test is run five times to get the average time, and the measurement of the first run is discarded, so as to measure performance with the file data already cached, rather than being read from disk. The programs are compiled by g++ 3.4.5 with the option -O3, and the libxml2 library we are using is version 2.6.16.

During our initial experiments, we noticed poor speedup during a number of tests that should have performed well. We attributed this to lock contention in malloc(). To avoid this, we wrote a simple, thread-optimized allocator around malloc(). This allocator maintains a separate pool of memory for each thread. Thus, as long as the allocation request can be satisfied from this pool, no locks need to be acquired. To fill the pool initially, we simply run the test once, then free all memory, returning it to each pool.

Our allocator is intended simply to avoid lock contention. A production allocator would use other techniques to reduce lock contention. One possibility is to simply use a two-stage technique, where large chunks of memory are obtained from a global pool, and then managed individually for each thread in a thread-local pool.

A. Preparsing Performance Measurement

Preparsing generates the skeleton, which is necessary for PXP. However, this is an additional step compared to normal XML parsing, which, unfortunately, also needs to be performed sequentially before the actual parallel parsing. Thus, to help determine whether or not this cost is acceptable, and to understand the overall PXP performance, we measured preparsing time and also compared it to full libxml2 parsing.

Since preparsing linearly traverses the XML document without backtracking or other bookkeeping, the time complexity is linear in the size of the document, and independent of the structural complexity of the document. We thus designed the preparsing test to maximize the performance of a full sequential parser, and used a simple array of elements which varied in size. The test document is shown in the Appendix. First, we varied the number of elements in the array to increase document size. Then, for the comparison, we measured the costs of two widely-used parsing methods: building DOM with libxml2, and parsing with the SAX implementation in libxml2. In addition, for the libxml2 SAX implementation, we used empty callback routines. Thus, libxml2 SAX is expected to be extremely fast. The results are shown in Figure 5.

Fig. 5. Performance comparison of preparsing. (The plot shows time in seconds versus document size from 0 to 45 MB, for libxml2 DOM, libxml2 SAX with empty handlers, and preparsing.)

According to Figure 5, we see that preparsing is nearly 12 times faster than sequential parsing with libxml2 to build a DOM. Even compared to libxml2 SAX parsing, preparsing is over 6 times faster. Even though the preparser builds a tree, the tree is simple and does not require expensive memory management. These results show that the preparsing does not occupy much time, and the time left for the actual parallel parsing is enough to result in significant speedup.

B. Parallel XML Parsing Performance Measurement

Speedup measures how well a parallel algorithm scales, and is important for evaluating the efficiency of parallel algorithms. It is calculated by dividing the sequential time by the parallel time. For our experiments, the sequential time refers to the time needed by libxml2 xmlParseInNodeContext() to parse the whole XML document. To be consistent, static PXP, dynamic PXP, and the sequential program are all configured to use the thread-optimized memory allocator. Each program is run five times, and the timing result of the first run is discarded to warm the cache.

We first measure the upper bound of the speedup that the PXP algorithms could achieve. To do that, we select a big XML document used in the previous preparsing experiment as the parsing test document. The array in the XML document has around 50,000 elements, every element includes up to 28 attributes, and the size of the file is 35 MB. Since the test document just contains a simple array structure, we are able to apply both the static PXP and dynamic PXP algorithms to it. Figure 6 shows how the static and dynamic PXP algorithms scale with the number of threads when parsing this test document. The diagonal dashed line shows the theoretical ideal speedup. From the graph we can see that when the number of threads is one or two, the speedups of PXP are sublinear, but if we subtract the preparsing time from the total time, the speedup of static PXP is close to linear. This indicates the preparsing
dominates the overhead, and that the static PXP presents the upper
bound of the parallel performance.

[Figure 6: speedup vs. number of threads (one to four) for static
and dynamic PXP, each with and without the preparsing time included,
against the linear ideal.]

Fig. 6. This graph shows the upper bound of the speedup of the PXP
algorithms for up to four threads, when used to parse a big XML
document which only contains an array structure.

The speedups of dynamic PXP are slightly lower than those of static
PXP, which indicates that the cost of communication and
synchronization starts to be a factor, but is relatively minor. As
the number of threads increases, the speedup of PXP (dynamic or
static) declines, because as the workload of each thread decreases,
the overhead of the preparsing becomes more significant than before.
Dynamic PXP also obtains less speedup than static PXP due to the
increasing communication cost. Furthermore, even the speedup of
static PXP omitting the preparsing cost starts to drop away from the
theoretical limit. We speculate that shared memory or cache
conflicts are playing a role here.

Unlike static PXP, dynamic PXP is able to parse XML documents with
any tree shape. So, to further study the performance improvement of
dynamic PXP, we modified the previous XML document with the big
array structure into an irregular tree shape, which consists of five
top-level elements under the root, each with a randomly chosen
number of children. Each of these children is an element from the
array of the first test, so the total number of these child elements
in the modified document is the same as in the original document.

We compare dynamic PXP on this modified XML document against dynamic
PXP on the original array XML document. This comparison shows how
dynamic PXP scales for XML documents with irregular versus regular
shape. From the results shown in Figure 7 we can see that there is
little difference between the two XML documents, which implies that
dynamic PXP (and our task partitioning of dividing the remaining
work in half) is able to effectively handle large XML files with
irregular shape.

[Figure 7: speedup vs. number of threads (one to four) for dynamic
PXP on the array and non-array documents, each with and without the
preparsing time included.]

Fig. 7. This graph shows the speedup of the dynamic PXP for up to
four threads, when used to parse two same-size XML documents, one
with irregular tree shape and one with regular array shape.

These tests did not actually further parse the element contents. In
a typed parsing scenario, where schema or other information can be
used to interpret the element content, we would obtain even better
scalability. For example, if we are parsing a large array of
doubles, including the ASCII-to-double conversion, each thread has
an increased workload relative to the preparsing stage and other
overheads, and thus the speedup would improve.

VI. RELATED WORK

As mentioned earlier, parallel XML parsing can essentially be viewed
as a particular application of graph partitioning [6] and parallel
graph search algorithms [7]. But document parsing and DOM building
introduce some new issues, such as preparsing and namespace
references, which are not addressed by those general parallel
algorithms.

There are a number of approaches that try to address the performance
bottleneck of XML parsing. Typical software solutions include
pull-based parsing [9], lazy parsing [10], and schema-specific
parsing [11], [12], [13]. Pull-based XML parsing is driven by the
user, and thus provides flexible performance by allowing the user to
build only the parts of the data model that are actually needed by
the application. Schema-specific parsing leverages XML schema
information, from which a specific parser (an automaton) is built to
accelerate the XML parsing. For XML documents conforming to the
schema, schema-specific parsing runs very quickly, whereas other
documents pay an extra penalty.

Most closely related to our work in this paper is lazy parsing,
because it also needs a skeleton-like structure of the XML document
for its lazy evaluation. That is, a skeleton is first built from the
XML document to indicate the basic tree structure; thereafter, based
on the user's access requirements, the corresponding piece of the
XML document is located by looking up the skeleton and fully parsed.
However, the purposes of lazy parsing and parallel parsing are
totally
different, so the structure and the use of the skeleton in the two
algorithms differ fundamentally from each other. Hardware-based
solutions [14], [15] are also promising, particularly in the
industrial arena, but to the best of our knowledge there is no such
work leveraging the data-parallelism model as PXP does.

VII. CONCLUSION AND FUTURE WORK

In this paper, we have described our approach to parallel XML
parsing, and shown that it performs well for up to four cores. An
efficient parallel XML parsing scheme needs an effective data
decomposition method, which implies a better understanding of the
tree structure of the XML document. Preparsing is designed to
extract the minimal tree structure (i.e., the skeleton) from the XML
document as quickly as possible. The key to the high performance of
the preparsing is its highly simplified syntax, as well as the
obviation of full well-formedness constraint checking. Aided by the
skeleton, the algorithm can partition the XML document into chunks
and parse them in parallel. Depending upon when the document is
partitioned, we have the static and dynamic PXP algorithms. The
former applies only to XML documents with array structures and gives
the best-case benefit of parallelism, while the latter is applicable
to any structure, but incurs some communication and synchronization
cost. Our experiments show that the preparsing is much faster than
full XML parsing (either SAX or DOM), and that, based on it, the
parallel parsing algorithms can speed up parsing and DOM building
significantly and scale well. Since the preparsing becomes the
bottleneck as the number of threads increases, our future work will
investigate the feasibility of overlapping the preparsing with the
real parsing. New approaches for very large XML documents will also
be studied under the shared memory model.

ACKNOWLEDGMENT

We would like to thank Professor Randall Bramley for his insightful
suggestions and help on graph partitioning and Metis. We also thank
Zongde Liu and Srinath Perera for their useful comments and
discussion.

REFERENCES

[1] D. Veillard, "Libxml2 project web page," http://xmlsoft.org/, 2004.
[2] H. Sutter, "The free lunch is over: A fundamental turn toward concurrency in software," Dr. Dobb's Journal, vol. 30, 2005.
[3] W3C, "XML Information Set (Second Edition)," http://www.w3.org/TR/xml-infoset/, 2003.
[4] J. E. Hopcroft, R. Motwani, and J. D. Ullman, Introduction to Automata Theory, Languages, and Computation. Addison Wesley, 2000.
[5] W3C, "Extensible Markup Language (XML) 1.0 (Third Edition)," http://www.w3.org/TR/2004/REC-xml-20040204/, 2004.
[6] G. Karypis and V. Kumar, "Parallel multilevel k-way partitioning scheme for irregular graphs," in Supercomputing, 1996.
[7] V. N. Rao and V. Kumar, "Parallel depth first search. Part I: Implementation," Int. J. Parallel Program., vol. 16, no. 6, pp. 479–499, 1987.
[8] V. Kumar and V. N. Rao, "Parallel depth first search. Part II: Analysis," Int. J. Parallel Program., vol. 16, no. 6, pp. 501–519, 1987.
[9] A. Slominski, "XML pull parsing," http://www.xmlpull.org/, 2004.
[10] M. L. Noga, S. Schott, and W. Lowe, "Lazy XML processing," in DocEng '02: Proceedings of the 2002 ACM Symposium on Document Engineering, 2002.
[11] K. Chiu and W. Lu, "A compiler-based approach to schema-specific XML parsing," in The First International Workshop on High Performance XML Processing, 2004.
[12] W. M. Lowe, M. L. Noga, and T. S. Gaul, "Foundations of fast communication via XML," Ann. Softw. Eng., vol. 13, no. 1–4, 2002.
[13] R. van Engelen, "Constructing finite state automata for high performance XML web services," in Proceedings of the International Symposium on Web Services (ISWS), 2004.
[14] J. van Lunteren, J. Bostian, B. Carey, T. Engbersen, and C. Larsson, "XML accelerator engine," in The First International Workshop on High Performance XML Processing, 2004.
[15] "DataPower," http://www.datapower.com/.

APPENDIX

Structure of the XML document ns_att_test.xml:

<xml xmlns:rs='urn:schemas-microsoft-com:rowset'
     xmlns:z='#RowsetSchema'
     xmlns:tb0='table0' xmlns:tb1='table1'
     xmlns:tb2='table2' xmlns:tb3='table3'>
  <z:row tb1:PRODUCT=... tb0:CCIDATE=...
         tb0:CLASS=... tb2:ADNUMBER=...
         tb0:PRODUCTIONCATEGORYID_FK=...
         tb3:ADVERTISERACCOUNT=...
         tb1:YPOSITION=... tb2:CHEIGHT=...
         tb2:CWIDTH=... tb2:MHEIGHT=...
         tb2:MWIDTH=... tb2:BHEIGHT=...
         tb2:BWIDTH=... tb3:SALESPERSONNUMBER=...
         tb3:SALESPERSONNAME=...
         tb1:PAGENAME=... tb1:PAGENUMBER=...
         tb2:BOOKEDCOLOURINFO=... tb1:EDITION=...
         tb1:MOUNTINGCOMMENT=... tb1:TSNLSALESSYSTEM=...
         tb1:TSNLCLASSID_FK=... tb1:TSNLSUBCLASS=...
         tb1:TSNLACTUALDEPTH=... tb1:XPOSITION=...
         tb0:TSNLCEESRECORDTYPEID_FK=...
         tb0:PRODUCTZONE=... ROWID=.../>
  <z:row ... />
  <z:row ... />
  ...
</xml>
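For readers who want to regenerate an array-shaped test input of this general form, a small script along the following lines suffices. This is a sketch only: the element names, attribute names, counts, and output file name below are illustrative placeholders, not our actual test data.

```python
# Sketch: generate an array-shaped XML test document, similar in
# spirit to ns_att_test.xml above -- a single root element holding a
# long run of attribute-heavy sibling elements. All names and counts
# here are invented placeholders, not the real test data.

def make_test_doc(path, rows=1000, attrs=28):
    with open(path, "w") as f:
        f.write("<xml xmlns:z='#RowsetSchema'>\n")
        for i in range(rows):
            # Each row carries many attributes, so attribute handling
            # dominates the parse, as in the 35 MB document of Section V.
            atts = " ".join(f"a{j}='v{i}_{j}'" for j in range(attrs))
            f.write(f"<z:row {atts}/>\n")
        f.write("</xml>\n")

make_test_doc("array_test.xml")
```

Scaling `rows` up to the tens of thousands reproduces a document in the size range reported in Section V-B.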
