Distributed and Scalable XML Document Processing Architecture For E-Commerce Systems
David Cheung, S.D. Lee, Thomas Lee, William Song, C.J. Tan
E-Business Technology Institute,
The University of Hong Kong,
Hong Kong
{dcheung, sdlee, ytlee, wsong, ctan}@eti.hku.hk
0-7695-0610-0/00 $10.00 © 2000 IEEE
Authorized licensed use limited to: AMRITA VISHWA VIDYAPEETHAM AMRITA SCHOOL OF ENGINEERING. Downloaded on March 07,2024 at 12:35:24 UTC from IEEE Xplore. Restrictions apply.
3 System Architecture

The architecture of our Generic XML Document Processing System is depicted in the following diagram.

[Figure: a high-level application and the Document Integrator of the XML Document Processing System.]

add this capability to the DI. This is accomplished by including XSLT (see below) in the DI.

3.1.1 XML Transformation (XSLT). XML Stylesheet Language Transformation [2] specifies a stylesheet language,
3.2.1. Database Access TM (DATM). A DATM is an instance of TM that provides access to ordinary relational databases. The input XML document specifies which tables and fields of which database on which database server are to be accessed. For database inserts and updates, the input XML also contains the data to be used for these operations. The DATM inserts new records or updates existing records in the database accordingly. The DATM may return an XML document to the DI to indicate whether the operation was successful, and possibly the cause of the error in case of failure. For a database query, the query is specified in the input XML document. The DATM queries the database server and returns the query result to the DI after converting it to some XML format.

It is up to the particular implementation of a DATM to design the formats of the input and returned XML documents. For example, it is possible to include SQL statements directly in the input to specify the database operation. Query results can be formatted into XML according to hard-coded logic in the DATM, or according to the specification of DATM-specific script files.

3.2.2. Message Generating TM (MGTM). An MGTM interprets the input XML document as a message. According to the message headers (or other appropriate logic), it sends the message out. It returns an XML document to the DI to indicate whether the message was sent successfully. An MGTM may send out a message via various media, such as e-mail, fax, Usenet newsgroup posting, pager message, a print job, or even a mobile phone short message. Since the outgoing message must conform to the message format of the desired medium, an MGTM must be able to convert the incoming XML message to the format of the target medium before sending it out. The method of sending out the message also depends on the medium type. Most probably, an MGTM has to send out the message via external programs or external network connections, using external (non-XML) formats.

3.2.3. EDI Gateway TM (EDIGTM). Many corporations use EDI to exchange messages efficiently. An EDIGTM accepts input XML documents such as invoices, purchase orders, or any other kind of EDI message. It then converts the document contents into EDI format and sends them out using EDI channels. It thus acts as a gateway for outgoing messages between our XML Document Processing System and EDI systems.

3.2.4. Logging TM (LogTM). Activities can be logged by implementing a TM that writes any submitted XML file onto the file system. Script files can then be modified to select suitable contents from the input or intermediate XML documents and send them to a LogTM.

3.3 Document flow

The way the DI processes an input XML document is driven by the script files written by the application programmer. The processing logic is not hardcoded. The DI only provides the capabilities (with the help of TM’s) for the application programmer to manipulate the input XML file and any intermediately generated XML files.

The following diagram illustrates an example of how the DI processes the XML files.

[Figure: example document flow between the Application, the Document Integrator, and the TM’s; messages are numbered 1 to 10, with (3) and (8) shown in parentheses.]

In this example, the DI first receives XML message 1 from the application program. This message contains two data items (shown in the diagram as different solid shapes). The DI consults its scripts and decides that it should first send the first data item to TM1, encapsulated in message 2. TM1 receives the message and interacts (message (3)) with external program A. Then, it returns message 4 to the DI, containing data in a new XML document. (As a concrete example, imagine XML message 2 to contain an SQL query, TM1 to be a DATM, interactions (3) to be appropriate relational database operations, and message 4 to be the query results encapsulated as
an XML document.) While TM1 is processing message 2, the DI may concurrently send document 5 (which contains the second data item from message 1) to TM3 for processing. Thus TM3 can operate in parallel with TM1. TM3 returns document 6 as its result. (A concrete example: TM3 could be a module which validates a web “certificate”.) Now, the DI sends document 7, which is a combination of the data in documents 4 and 6, to TM2 for further processing. TM2 invokes external program B to perform its job, and returns its result as message 9. After receiving message 9, the DI further uses its XSLT engine to reformat the XML document to produce document 10, which is returned to the application.

It should be noted that the above is just one example of the workflow of the XML Document Processing System. The actual processing logic is programmable by means of writing script files.

4 Implementation

We are implementing the architecture as described above. We chose Java [4] as the programming language because of its platform independence and its rich libraries for handling XML and network connections. Both the DI and the TM’s are implemented as Java Servlets [5]. The application program and the DI communicate via the HTTP protocol. Similarly, the DI and each TM communicate using the HTTP protocol.

4.1 Implementation of DI

Our implementation of the DI is conceptually analogous to the make utility (generally available on UNIX platforms). It requires a script file as input. This file describes rules on how to handle the input XML documents, based on the message types. It resembles a Makefile for the make utility. A Makefile contains a list of targets. A target may depend on other targets. Therefore, before a target is made, make must have made the targets on which it depends. The DI uses the same idea. The script file for the DI instructs the DI how to handle the input XML document by creating a set of intermediate XML documents. To do that, it composes each intermediate document (analogous to a ‘target’ in a Makefile) accordingly. The scripting language is illustrated below. It should be noted that the language is an application of XML. Therefore, a script file is itself an XML file.

A script file is an XML document which describes how to make (or generate) a list of XML documents (i.e. X0, …, Xn-1). The following tags are defined:

    <DOC id="#name" action="action" href="URL">
    …
    </DOC>

This defines a rule section describing how a document is composed. It has the following attributes.

Within a <DOC>…</DOC> element, the following tags can be used to specify how the DI would compose the document to be sent to a TM (in the case of action="submit") or transformed using XSLT (in the case of action="transform").

    <INCLUDE href="#name"/>

    <INLINE>…</INLINE>

    <WAITFOR href="#name"/>

    <INFO>…</INFO>

The <INCLUDE> tag specifies an XML segment to be inserted into the temporarily constructed XML file as a subtree of XML elements. If the href attribute is specified, the temporary XML document identified by name will be inserted. Note that this implies a dependency: the document identified by name must already have been composed before the current document (whose name is specified by the current <DOC> element) is constructed. After examining all rules, the DI can thus determine the sequence of operations required to construct the final result document. The application programmer therefore only needs to declare the set of rules for composing the result and intermediate documents. The DI will figure out what to do.

The <INLINE> tag simply inserts the XML segment it encloses. Note that a property of XML is that elements must be properly nested. So, the segment between <INLINE>…</INLINE> must be well-formed XML.
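The make-like ordering of rules can be illustrated with a small Java sketch. It is not the DI implementation: the class `RuleOrder`, the method `buildOrder`, and the document ids used in `main` are hypothetical, and the rules are assumed to have been reduced already to a map from each document id to the ids it depends on. A depth-first traversal then yields an order in which every document is composed after its dependencies.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper for illustration; not taken from the DI implementation.
class RuleOrder {
    // rules maps a document id to the ids of the documents it depends on.
    // Returns an order in which each document follows all of its dependencies.
    static List<String> buildOrder(Map<String, List<String>> rules) {
        List<String> order = new ArrayList<>();
        Set<String> done = new HashSet<>();
        Set<String> inProgress = new HashSet<>();
        for (String id : rules.keySet()) {
            visit(id, rules, done, inProgress, order);
        }
        return order;
    }

    private static void visit(String id, Map<String, List<String>> rules,
                              Set<String> done, Set<String> inProgress,
                              List<String> order) {
        if (done.contains(id)) {
            return;
        }
        if (!inProgress.add(id)) {
            throw new IllegalStateException("cyclic dependency at " + id);
        }
        // Ids without a rule of their own (such as the input) have no dependencies.
        for (String dep : rules.getOrDefault(id, List.of())) {
            visit(dep, rules, done, inProgress, order);
        }
        inProgress.remove(id);
        done.add(id);
        order.add(id);
    }

    public static void main(String[] args) {
        Map<String, List<String>> rules = new LinkedHashMap<>();
        rules.put("merged", List.of("input", "a", "b")); // hypothetical ids
        rules.put("a", List.of("input"));
        rules.put("b", List.of("input"));
        System.out.println(buildOrder(rules)); // prints [input, a, b, merged]
    }
}
```

Documents that do not depend on each other ("a" and "b" here) may appear in either relative order, which is what leaves room for processing them in parallel.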
The <WAITFOR> tag specifies that before the current document is made, the DI must have made the document name first. This tag may be declared multiple times. It is similar to the <INCLUDE> tag, except that the content of the awaited document is not inserted into the working document.

The <INFO> tag gives human-readable comments, which are not processed by the DI².

The following example illustrates how a script file can be written with the above tags to specify how the DI should process the input XML document.

    <DOC id="descriptions" action="submit"
         href="http://myDATM.eti.hku.hk/servlets/myDATM">
      <INFO>fetch item descriptions from database</INFO>
      <INCLUDE doc="#input"/>
      <INLINE>
        <DATABASE name="catalog"/>
        <TABLE name="description"/>
      </INLINE>
    </DOC>

    <DOC id="price" action="submit"
         href="http://myDATM.eti.hku.hk/servlets/myDATM?query=price">
      <INFO>fetch price from database</INFO>
      <INCLUDE doc="#input"/>
      <INLINE>
        <DATABASE name="catalog"/>
        <TABLE name="description"/>
      </INLINE>
    </DOC>

    <DOC id="merged_list"
         transform="http://xsllib.eti.hku.hk/lib/merge.xsl">
      <INFO>merge model no., description and price information</INFO>
      <INCLUDE doc="#input"/>
      <INCLUDE doc="#descriptions"/>
      <INCLUDE doc="#price"/>
    </DOC>

    <DOC id="send_result" action="submit"
         href="http://MGTM.eti.hku.hk/servlet/MGTM">
      <INFO> </INFO>
      <INCLUDE doc="#merged_list"/>
    </DOC>

Suppose the application now sends the DI a query:

    <PRODUCT_QUERY>
      <MODEL>IBM-300GL</MODEL>
      <MODEL>IBM Intellistation</MODEL>
    </PRODUCT_QUERY>

The DI determines from the root element³ that this document has the type “PRODUCT_QUERY”. It then fetches a suitable script file for this document type. Suppose the script file is the one shown above. Then, according to the rules specified, the DI would perform the following operations:

• Send the input document to the TM servlet at the URL http://myDATM.eti.hku.hk/servlets/myDATM, and store the returned result (an XML document) as document “descriptions”. The TM is supposed to be a DATM, which queries a database and returns the query result after formatting it into XML. The database name and table name are passed by the DI to the TM as specified in the script file. The TM receives the file:

    <DOC>
      <PRODUCT_QUERY>
        <MODEL>IBM-300GL</MODEL>
        <MODEL>IBM Intellistation</MODEL>
      </PRODUCT_QUERY>
      <DATABASE name="catalog"/>
      <TABLE name="description"/>
    </DOC>

• At the same time, the DI sends the input document to the TM servlet (which in this case happens to be the same one as above) to query another table of (possibly) another database. The returned document is stored and named “price”. Note that this step is independent of step 1, and hence can be performed concurrently with it.

• Next, the DI concatenates the input XML document and the result documents from steps 1 and 2, and then applies XSLT to transform the concatenation into a new document, named “merged_list”. The transformation is done according to the XSL file located at URL “http://xsllib.eti.hku.hk/lib/merge.xsl”.

• The merge result (“merged_list”) is then passed to the servlet at URL “http://MGTM.eti.hku.hk/servlet/MGTM” for processing, and the returned result is an XML document named “send_result”. This servlet is supposed to send the document out as e-mail using an external program (or via SMTP).

² Except that this comment may be inserted into log files for tracing and debugging.
³ An XML requirement is that each document contains exactly one root element.
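The four operations above can be sketched in Java as two independent submissions followed by a merge and a final submission. This is only an illustration under stated assumptions: the class `ConcurrentFlow` and its `run` method are hypothetical, documents are represented as plain strings, and the TM servlets and the XSLT step are replaced by stub functions instead of real HTTP calls or a real XSLT engine.

```java
import java.util.concurrent.CompletableFuture;
import java.util.function.Function;

// Hypothetical sketch; the Function arguments stand in for HTTP calls to TM
// servlets and for the XSLT transformation.
class ConcurrentFlow {
    static String run(String input,
                      Function<String, String> descriptionsTm,
                      Function<String, String> priceTm,
                      Function<String, String> mergeXslt,
                      Function<String, String> sendTm) {
        // Steps 1 and 2 do not depend on each other, so both start at once.
        CompletableFuture<String> descriptions =
                CompletableFuture.supplyAsync(() -> descriptionsTm.apply(input));
        CompletableFuture<String> price =
                CompletableFuture.supplyAsync(() -> priceTm.apply(input));
        // Step 3 waits for both results, concatenates them with the input,
        // and applies the (stubbed) transformation.
        CompletableFuture<String> merged = descriptions.thenCombine(price,
                (d, p) -> mergeXslt.apply(input + d + p));
        // Step 4 submits the merged document to the last TM and waits.
        return merged.thenApply(sendTm).join();
    }

    public static void main(String[] args) {
        String result = run("<Q/>",
                in -> "<DESC/>",                       // stub for the first DATM call
                in -> "<PRICE/>",                      // stub for the second DATM call
                doc -> "<MERGED>" + doc + "</MERGED>", // stub for the XSLT step
                doc -> "<SENT>" + doc + "</SENT>");    // stub for the MGTM call
        System.out.println(result);
        // prints <SENT><MERGED><Q/><DESC/><PRICE/></MERGED></SENT>
    }
}
```

With real TM’s, the response time of steps 1 and 2 together would be bounded by the slower of the two calls, not by their sum.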
There are several possibilities for integrating our XML Document Processing System with web servers providing various web services. Firstly, a web server can be an application using our system. As such, it accepts queries from users, generated with web forms, formats each query into an XML document, and sends this XML document to the DI. The DI then does the processing according to the pre-written script file. The DI returns the results to the web server, which then presents them to the user as the web query result. Note that the document returned to the web server by our DI is an XML file, but the web server may need to reformat it so that it can be displayed properly to the client. Nevertheless, this reformatting may be performed using our DI. We can program our DI so that it applies a suitable XSL file to the result just before it returns it to the web server. This XSL file should transform the XML result into a form suitable for presentation. It should use XHTML for this presentation, as XHTML is just an application of XML. In a similar manner, WAP (Wireless Application Protocol) servers can be integrated with our XML Document Processing System by acting as an application in this architecture. With WAP services, the whole system becomes accessible from mobile phones or any other WAP devices.

The other possibility of integration comes from our use of the HTTP protocol between the DI and the TM’s. So far, we have described TM’s as being implemented as Java servlets. But this need not be the case. As long as a program speaks the HTTP protocol, understands the XML documents sent from the DI, and returns XML documents to the DI, it can function as a TM. So, a TM can also be implemented as a CGI program on any web server. It may also be a standalone program that accepts HTTP connections and processes the files as expected.

5 Discussions

5.1 Distribution

A high degree of distribution is a goal of this design. Since the DI and TM’s communicate through HTTP, there is essentially no restriction on the location and platform of the servers on which the DI and TM’s run. However, the servers must support TCP/IP and should not be blocked by any firewall.

5.2 Portability

as EDI gateways. Natively coded TM’s will affect portability.

5.3 Performance

Performance is a concern in this architecture. The response time may be slow for a query that requires multiple data retrievals from other servers. This architecture attempts to address this issue by allowing parallel data retrieval whenever possible. The DI discovers such opportunities from the rules of the script file. Since HTTP is not the most efficient communication protocol, performance may be improved by replacing HTTP with Remote Method Invocation (RMI).

6 Conclusions

In this paper, we identified the need to process XML documents in E-commerce systems. We argue that instead of designing dedicated programs to handle each kind of XML document for each specific application, we could design a generic architecture for handling XML documents. The architecture is designed to be generic and flexible. It uses a Document Integrator (DI) to control the process flow, but delegates most capabilities to various Transformation Modules (TM). For efficiency, the DI is designed to have the capabilities of merging and transforming XML documents with XSLT. We have discussed how to implement the DI and TM’s with a simple example. Since the design is flexible and highly extensible, we believe the architecture is suitable for many E-commerce systems.

7 References

[1] Tim Bray, Jean Paoli, C.M. Sperberg-McQueen, Extensible Markup Language (XML) 1.0, 10 February 1998, “http://www.w3.org/TR/1998/REC-xml-19980210”.
[2] James Clark, XSL Transformations (XSLT) Version 1.0, 16 November 1999, “http://www.w3.org/TR/1999/REC-xslt-19991116”.
[3] Dan Connolly, XML Homepage on W3C website, April 1997, “http://www.w3c.org/XML”.
[4] Sun Microsystems, Java Homepage, “http://java.sun.com/products/servlet/”.
[5] Sun Microsystems, Java Servlet API Homepage, “http://java.sun.com/products/servlet/”.
[6] W3C, World Wide Web Consortium (W3C) Homepage, “http://www.w3c.org/”.