Distributed and Scalable XML Document Processing Architecture

for E-Commerce Systems

David Cheung, S.D. Lee, Thomas Lee, William Song, C.J. Tan
E-Business Technology Institute,
The University of Hong Kong,
Hong Kong
{dcheung, sdlee, ytlee, wsong, ctan}@eti.hku.hk

Abstract

XML has become a very important emerging standard for E-commerce because of its flexibility and universality. Many software designers are actively developing new systems to handle information in XML formats. We propose a generic architecture for processing XML. We have designed an XML processing system using the latest technologies such as XML, XSLT, HTTP and Java Servlets. Our design is generic, flexible, scalable, extensible and also suitable for distributed network environments. A main application of the architecture and the system is to support data exchange in electronic commerce systems.

1 Introduction

Extensible Markup Language (XML) [1] is a highly flexible format for storing and exchanging data. It has recently received much attention from Web application developers, especially for E-commerce applications [3]. Because XML is a machine-architecture independent format, it facilitates the exchange of data between corporations, which probably use very different internal formats for their data. As a recommendation of the World-Wide-Web Consortium (W3C) [3], XML is an open standard, which means that any developer can support XML in their products. So, a corporation using XML for data storage and exchange is not locked into a particular software vendor which uses proprietary formats for data storage and interchange. It can easily exchange data with any other corporation using XML. Whom it can partner with is no longer dictated by the software vendor implementing the proprietary formats.

To enjoy the advantages brought about by XML, a system capable of handling XML files and messages is needed. Currently, many software vendors are actively adding XML capabilities, of varying degrees, to their products. A common approach is to add specially designed modules or enhancements to existing, well-established systems. The program code so developed is usually highly specialized, and hence difficult to reuse in other systems or even in other parts of the same system. In this paper, we take a different approach. We have designed a generic XML processing architecture. The heart of the architecture is a Document Integrator which determines how input XML documents are processed. The processing is based on high-level scripts written by the application programmers. Based on the scripts, the Integrator processes an XML document by passing it to different Transformation Modules. Each such module is designed for handling a special type of task. It processes the XML passed to it by the Integrator, and then returns a result document, also in XML format, to the Integrator. The Integrator then processes the resulting document, and invokes other Modules as necessary. Finally, the Integrator returns the final result to the caller as an XML document.

Under this architecture, the capabilities of the XML processing system can be extended by designing new Transformation Modules. Existing Modules and the Document Integrator need no modification. Thus the design is flexible and extensible.

2 Design objectives

Our design architecture aims at achieving the following objectives:
1. Generality: The system can be easily adapted for the most common XML processing requirements without major modifications.
2. Modularization: Each module is responsible for providing one category of capabilities. This facilitates project management and software maintenance.
3. Distribution: The system is divided into various loosely-coupled modules. The different modules can then be run on different computers to improve processing power.
4. Extensibility: New capabilities can be introduced by adding new modules. The core parts of the system do not need modification when new capabilities are added.
5. Flexibility: Processing logic is specified in script-like files instead of being hard-coded.
6. Reliability.
7. Robustness.
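The extensibility objective can be illustrated with a minimal Java sketch, assuming a much-simplified, in-process view of the architecture. The names TransformationModule, SimpleIntegrator, register and dispatch are hypothetical and are not taken from the system described in this paper; the point is only that a new module is added by registration, without touching the integrator's code.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical contract for a Transformation Module: XML text in, XML text out. */
interface TransformationModule {
    String process(String xmlDocument);
}

/** Hypothetical, much-simplified integrator that forwards documents to registered modules. */
class SimpleIntegrator {
    private final Map<String, TransformationModule> modules = new LinkedHashMap<>();

    /** Adds a new capability; no existing module or integrator code changes. */
    void register(String name, TransformationModule module) {
        modules.put(name, module);
    }

    /** Passes an XML document to the named module and returns the module's XML result. */
    String dispatch(String moduleName, String xmlDocument) {
        TransformationModule module = modules.get(moduleName);
        if (module == null) {
            throw new IllegalArgumentException("No module registered as: " + moduleName);
        }
        return module.process(xmlDocument);
    }
}
```

A caller would register a module, for example with di.register("echo", xml -> "<RESULT>" + xml + "</RESULT>"), and then call di.dispatch("echo", "<ORDER/>"). In the architecture itself the modules run as separate servers reached over HTTP, so this in-process map is only a stand-in for that dispatch step.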

0-7695-0610-0/00 $10.00 © 2000 IEEE
3 System Architecture

The architecture of our Generic XML Document Processing System is depicted in the following diagram.

[Figure: system architecture. A high-level application exchanges XML documents with the Document Integrator of the XML Document Processing System. The Document Integrator exchanges XML documents with the Transformation Modules TM1, TM2, ..., TMn, which exchange other data with external programs (Ext. Prog. A, Ext. Prog. B).]

The system contains a Document Integrator as well as several Transformation Modules (TM).

3.1 The Document Integrator (DI)

The Document Integrator is the core of this architecture. It is responsible for receiving input XML data from the application program. Upon receiving an XML document, which may come from a disk file or from a network connection, the DI processes the input document according to script files written by the application programmer. According to the logic described in the script file, the DI communicates with various Transformation Modules (TM), passing to them appropriate XML documents. The documents returned from the TM's are also XML documents. The DI may store them temporarily for further processing. It may pass these temporary XML files to other TM's if necessary. After collecting all the results from the TM's, the DI combines the results by means of XSLT (see below) and then returns the final result to the application program.

The major role of the DI is to act as a bridge between the application program and the TM's, as well as to pass data among the TM's. It thus acts as a document switchbox. The actual processing of XML documents is delegated to the various TM's. However, the DI has to be able to massage the data in XML documents in order to pass them among the TM's. This means it has to transform XML documents frequently. Of course, the task of such a transformation could be delegated to an appropriate TM. However, for efficiency, we decided to add this capability to the DI. This is accomplished by including XSLT (see below) in the DI.

3.1.1 XML Transformation (XSLT). The XSL Transformations (XSLT) recommendation [2] specifies a stylesheet language, based on XML, which can be used to describe rules for converting one XML document to another. The language is expressive enough for describing all the transformations required in the DI of our system. Therefore, we adopt XSLT in our DI system for manipulating the input and intermediate XML files. Concatenation and merging of several intermediate work files (in XML format) can be achieved by creating a temporary "master" XML file which consists of one root element (refer to the XML specification [1] for the definition of "element") with the intermediate files as child nodes immediately below the root node. This master file can then be manipulated with an XSL engine to produce the merged result, which is a new XML file.

3.2 Transformation Modules (TM)

Each Transformation Module is responsible for handling one category of tasks. A TM typically receives an XML document from the DI and then processes it according to the logic built into the TM (which may be configurable by means of TM-specific script programs). The results of the processing are then encapsulated as an XML document, which is returned to the DI. The exact formats of the input and returned XML documents are up to the TM, although they must be applications of XML.

It is up to the system implementers to determine how a TM would process an incoming XML document. Below, we identify some frequently needed functionalities and suggest how they can be handled using TM's.
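The master-file merging described in Section 3.1.1 can be sketched in Java with the standard javax.xml.transform API. This is a minimal sketch under stated assumptions, not the system's actual code: the class name MasterMerge, the method names, and the root element name MASTER are hypothetical, and each intermediate document is assumed to be well-formed and to carry no XML declaration.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.List;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class MasterMerge {
    /** Wraps every intermediate document as a child of a single root element (MASTER is an assumed name). */
    static String buildMaster(List<String> documents) {
        StringBuilder master = new StringBuilder("<MASTER>");
        for (String doc : documents) {
            master.append(doc); // assumes doc is well-formed and has no XML declaration
        }
        return master.append("</MASTER>").toString();
    }

    /** Applies an XSL stylesheet (given as text) to an XML document (given as text). */
    static String transform(String xml, String xsl) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(xsl)));
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            // Wrapped so that callers do not need to handle a checked exception.
            throw new IllegalStateException("XSLT processing failed", e);
        }
    }
}
```

A stylesheet whose template matches /MASTER can then copy or rearrange the children it needs, so the merge logic lives in the XSL file rather than in Java code. Building the master file by string concatenation keeps the sketch short; a production version would more likely assemble a DOM tree.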

3.2.1. Database Access TM (DATM). A DATM is an instance of TM which provides access to ordinary relational databases. The input XML document contains information specifying which tables and fields of which database on which database server are to be accessed. In the case of database inserts and updates, the input XML also contains the data to be used for these operations. The DATM inserts new records or updates existing records in the database appropriately. The DATM may return an XML document to the DI to indicate whether the operation was successful, and possibly the cause of the error in case of failure. In the case of a database query, the query is specified in the input XML document. The DATM queries the database server, and returns the query result to the DI, after converting it to some XML format.

It is up to the particular implementation of the DATM to design the formats of the input and returned XML documents. For example, it is possible to directly include SQL statements in the input to specify the database operation. Query results can be formatted into XML according to hard-coded logic in the DATM, or according to the specification of DATM-specific script files.

3.2.2. Message Generating TM (MGTM). An MGTM interprets the input XML document as a message. According to the message headers (or other appropriate logic), it sends the message out. It returns an XML document to the DI to indicate whether the message was sent successfully. An MGTM may send out a message via various media, such as e-mail, fax, Usenet newsgroup posting, pager message, a print job or even a mobile phone short message. Since the outgoing message must conform to the message format of the desired medium, an MGTM must be able to convert the incoming XML message to the format of the target medium before sending it out. The method of sending out the message is also dependent on the medium type. Most probably, an MGTM has to send out the message via external programs or external network connections, using external (non-XML) formats.

3.2.3. EDI Gateway TM (EDIGTM). Many corporations are using EDI to efficiently exchange messages. An EDIGTM accepts input XML documents such as invoices, purchase orders, or any other kind of EDI message. It then converts the document contents into EDI format and sends them out using EDI channels. This acts as a gateway for outgoing messages between our XML Document Processing System and EDI systems.

3.2.4. Logging TM (LogTM). Activities can be logged by implementing a TM which writes any submitted XML file onto the file system. Then, script files can be modified to select suitable contents from the input or intermediate XML documents and send them to a LogTM.

3.3 Document flow

The actual means by which the DI processes an input XML document is driven by the script files written by the application programmer. The processing logic is not hardcoded. The DI only provides the capabilities (with the help of TM's) for the application programmer to manipulate the input XML file and any intermediately generated XML files.

The following diagram illustrates an example of how the DI processes the XML files.

[Figure: example document flow. The Application sends message 1 to the Document Integrator and receives message 10. The Document Integrator exchanges messages 2 and 4 with TM1, messages 7 and 9 with TM2, and messages 5 and 6 with TM3. TM1 interacts (3) with Ext. Prog. A, and TM2 interacts (8) with Ext. Prog. B.]

In this example, the DI first receives XML message 1 from the application program. This message contains two data items (shown in the diagram as different solid shapes). The DI consults its scripts and decides that it should first send the first data item to TM1, encapsulated in message 2. TM1 receives the message, and interacts (message (3)) with external program A. Then, it returns message 4 to the DI, containing data in a new XML document. (As a concrete example, imagine XML message 2 to contain an SQL query, TM1 to be a DATM, interactions (3) to be appropriate relational database operations and message 4 to be the query results encapsulated as

an XML document.) While TM1 is processing message 2, the DI may concurrently send document 5 (which contains the second data item from message 1) to TM3 for processing. Thus TM3 can operate in parallel with TM1. TM3 returns document 6 as its result. (A concrete example: TM3 could be a module which validates a web "certificate".) Now, the DI sends document 7, which is a combination of the data in documents 4 and 6, to TM2 for further processing. TM2 invokes external program B to perform its job, and returns its result as message 9. After receiving message 9, the DI further uses its XSLT engine to reformat the XML document to produce document 10, which is returned to the application.

Note that the above is just an example of the workflow of the XML Document Processing System. The actual processing logic is programmable by means of writing script files.

4 Implementation

We are implementing the architecture as described above. We chose Java [4] as the programming language because of its platform-independence and richness of libraries for handling XML and network connections. Both the DI and the TM's are implemented as Java Servlets [5]. The communication between the application program and the DI is done via the HTTP protocol. Similarly, the DI and each TM communicate using the HTTP protocol.

4.1 Implementation of DI

Our implementation of the DI is conceptually analogous to the make utility (generally available on UNIX platforms). It requires a script file as input. This file describes rules on how to handle the input XML documents, based on the message types. It resembles a Makefile for the make utility. A Makefile contains a list of targets. A target may depend on other targets. Therefore, before a target is made, make must have made the targets on which it depends. The DI uses the same idea. The script file for the DI instructs the DI how to handle the input XML document by creating a set of intermediate XML documents. To do that, it composes each intermediate document (analogous to a "target" in a Makefile) accordingly. The scripting language is illustrated below. It should be noted that the language is an application of XML. Therefore, a script file is itself an XML file.

A script file is an XML document which contains the descriptions of how to make (or generate) a list of XML documents (i.e. X0, ..., Xn-1). The following tags are defined:

    <DOC id="name" action="action" href="URL">
    …
    </DOC>

This defines a rule section describing how a document is composed. It has the following attributes.

    Attribute(s)                      Meaning
    id="name"                         name is a unique identifier of the document. The input document that enters the DI has the name "input".
    action="submit" href="URL"        The composed document will be submitted to URL, where a servlet implementing a TM is ready to receive and process the document.
    action="transform" href="URL"     The composed document will be transformed (internally by the DI) by applying the XSL file located by URL.
    action="return"                   The composed document will be returned to the caller.

Within a <DOC>…</DOC> element, the following tags can be used to specify how the DI would compose the document to be sent to a TM (in the case of action="submit") or transformed using XSLT (in the case of action="transform").

    <INCLUDE href="#name"/>
    <INLINE>…</INLINE>
    <WAITFOR href="#name"/>
    <INFO>…</INFO>

The <INCLUDE> tag specifies an XML segment to be inserted into the temporarily constructed XML file as a subtree of XML elements. If the href attribute is specified, the temporary XML document identified by name will be inserted. Note that this implies a dependency: the document identified by name must have already been composed before the current document (with a name specified by the current <DOC> element) is constructed. After examining all rules, the DI can thus determine the sequence of operations required to construct the final result document. The application programmer thus only needs to declare the set of rules for composing the result and intermediate documents. The DI will figure out what to do.

The <INLINE> tag simply inserts the XML segment it encloses. Note that a property of XML is that elements must be properly nested. So, the segment between <INLINE>…</INLINE> must be well-formed XML.
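The make-like ordering implied by <INCLUDE> and <WAITFOR> references can be sketched as a dependency scan followed by a depth-first topological sort. The following Java fragment is only an illustration under stated assumptions: the class and method names are hypothetical, the rules are assumed to be wrapped in a single root element (here called SCRIPT, a name not given in the paper) so that the script is one well-formed document, and references are read from the href attribute as in the tag definitions.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

class ScriptPlanner {
    /** Maps each DOC id to the ids it references through INCLUDE or WAITFOR. */
    static Map<String, List<String>> dependencies(String scriptXml) {
        try {
            Document dom = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(scriptXml.getBytes(StandardCharsets.UTF_8)));
            Map<String, List<String>> deps = new LinkedHashMap<>();
            NodeList docs = dom.getElementsByTagName("DOC");
            for (int i = 0; i < docs.getLength(); i++) {
                Element doc = (Element) docs.item(i);
                List<String> needed = new ArrayList<>();
                for (String tag : new String[] {"INCLUDE", "WAITFOR"}) {
                    NodeList refs = doc.getElementsByTagName(tag);
                    for (int j = 0; j < refs.getLength(); j++) {
                        String href = ((Element) refs.item(j)).getAttribute("href");
                        if (href.startsWith("#")) {
                            needed.add(href.substring(1)); // "#price" refers to document "price"
                        }
                    }
                }
                deps.put(doc.getAttribute("id"), needed);
            }
            return deps;
        } catch (Exception e) {
            throw new IllegalArgumentException("Cannot parse script file", e);
        }
    }

    /** Returns the DOC ids in an order where every document follows the documents it needs. */
    static List<String> buildOrder(Map<String, List<String>> deps) {
        List<String> order = new ArrayList<>();
        Set<String> done = new HashSet<>();
        Set<String> inProgress = new HashSet<>();
        done.add("input"); // the input document exists before any rule runs
        for (String id : deps.keySet()) {
            visit(id, deps, done, inProgress, order);
        }
        return order;
    }

    private static void visit(String id, Map<String, List<String>> deps,
                              Set<String> done, Set<String> inProgress, List<String> order) {
        if (done.contains(id)) {
            return;
        }
        if (!deps.containsKey(id)) {
            throw new IllegalStateException("Undefined document: " + id);
        }
        if (!inProgress.add(id)) {
            throw new IllegalStateException("Cyclic dependency at: " + id);
        }
        for (String dep : deps.get(id)) {
            visit(dep, deps, done, inProgress, order);
        }
        inProgress.remove(id);
        done.add(id);
        order.add(id);
    }
}
```

Calling ScriptPlanner.buildOrder(ScriptPlanner.dependencies(script)) yields one valid sequential order. Documents with no dependency on each other could equally be composed in parallel, which this sequential sketch does not attempt.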

The <WAITFOR> tag specifies that before the current document is made, the DI must have made the document name first. Multiple declarations of this tag are allowed. It is similar to the <INCLUDE> tag except that the content of the waited-for document will not be inserted into the working document.

The <INFO> tag gives human-readable comments, which are not processed by the DI (except that such a comment may be inserted into log files for tracing and debugging).

The following example illustrates how a script file can be written with the above tags to specify how the DI should process the input XML document.

    <DOC id="descriptions" action="submit"
         href="http://myDATM.eti.hku.hk/servlets/myDATM">
      <INFO>fetch item descriptions from database</INFO>
      <INCLUDE href="#input"/>
      <INLINE>
        <DATABASE name="catalog"/>
        <TABLE name="description"/>
      </INLINE>
    </DOC>

    <DOC id="price" action="submit"
         href="http://myDATM.eti.hku.hk/servlets/myDATM?query=price">
      <INFO>fetch price from database</INFO>
      <INCLUDE href="#input"/>
      <INLINE>
        <DATABASE name="catalog"/>
        <TABLE name="description"/>
      </INLINE>
    </DOC>

    <DOC id="merged_list" action="transform"
         href="http://xsllib.eti.hku.hk/lib/merge.xsl">
      <INFO>merge model no., description and price information</INFO>
      <INCLUDE href="#input"/>
      <INCLUDE href="#descriptions"/>
      <INCLUDE href="#price"/>
    </DOC>

    <DOC id="send_result" action="submit"
         href="http://MGTM.eti.hku.hk/servlet/MGTM">
      <INFO> </INFO>
      <INCLUDE href="#merged_list"/>
    </DOC>

    <DOC id="output" action="return">
      <INFO>reports whether the mail has been sent correctly</INFO>
      <INCLUDE href="#send_result"/>
    </DOC>

Suppose the application now sends the DI a query:

    <PRODUCT_QUERY>
      <MODEL>IBM-300GL</MODEL>
      <MODEL>IBM Intellistation</MODEL>
    </PRODUCT_QUERY>

Then, the DI determines from the root element (an XML requirement is that each document contains exactly one root element) that this document has a type of "PRODUCT_QUERY". It will then fetch a suitable script file for this document type. Suppose the script file is the one shown above. Then, according to the rules specified, the DI would perform the following operations:

1. Send the input document to the TM servlet at the URL http://myDATM.eti.hku.hk/servlets/myDATM, and store the returned result (an XML document) as document "descriptions". The TM is supposed to be a DATM, which queries a database and returns the query result after formatting it into XML. The database name and table name are passed by the DI to the TM as specified in the script file. The TM receives the file:

    <DOC>
      <PRODUCT_QUERY>
        <MODEL>IBM-300GL</MODEL>
        <MODEL>IBM Intellistation</MODEL>
      </PRODUCT_QUERY>
      <DATABASE name="catalog"/>
      <TABLE name="description"/>
    </DOC>

2. At the same time, the DI sends the input document to the TM servlet (which in this case happens to be the same one as above) to query another table of (possibly) another database. The returned document is stored and named "price". Note that this step is independent of step 1, and hence can be performed concurrently with it.

3. Next, the DI concatenates the input XML document and the result documents from steps 1 and 2 and then applies XSLT to transform the combination into a new document, named "merged_list". The transformation is done according to the XSL file located at the URL "http://xsllib.eti.hku.hk/lib/merge.xsl".

4. The merged result ("merged_list") is then passed to the servlet at the URL "http://MGTM.eti.hku.hk/servlet/MGTM" for processing, and the returned result is an XML document named "send_result". This servlet is supposed to send the document out as e-mail using an external program (or via SMTP).

5. Finally, the DI returns the document "send_result" to the application program.

4.2 Integration with Web server and WAP server

There are several possibilities for integrating our XML Document Processing System with web servers providing various web services. Firstly, a web server can be an application using our system. As such, it accepts queries from users, generated with web forms, formats each query into an XML document and sends this XML document to the DI. The DI then does the processing according to the pre-written script file and returns the results to the web server, which then presents them to the user as the web query result. Note that the document returned to the web server by our DI is an XML file, but the web server may need to reformat it so that it can be displayed properly to the client. Nevertheless, this reformatting may be performed using our DI. We can program our DI so that it applies a suitable XSL file to the result just before it returns it to the web server. This XSL file should transform the XML result into a form suitable for presentation. It should use XHTML for this presentation, as XHTML is just an application of XML. In a similar manner, WAP (Wireless Application Protocol) servers can be integrated with our XML Document Processing System by acting as an application in this architecture. With WAP services, the whole system becomes accessible from mobile phones or any other WAP devices.

The other possibility of integration comes from our use of the HTTP protocol between the DI and the TM's. Above, we have assumed that TM's are implemented as Java servlets. But actually, this need not be the case. As long as a program speaks the HTTP protocol, understands the XML documents sent from the DI and returns XML documents to the DI, it can function as a TM. So, a TM can also be implemented as a CGI program on any web server. It may also be a standalone program that accepts HTTP connections and processes the files as expected.

5 Discussions

5.1 Distribution

High distribution is a goal of this design, and this architecture is highly distributed. Since the DI and TM's communicate through HTTP, there is essentially no restriction on the location and platform of the servers where the DI and TM's run. However, the servers must support TCP/IP and should not be blocked by any firewall.

5.2 Portability

We strongly suggest that the DI and TM's be written in Java, so that they are platform independent. However, if a TM needs to interact with other systems (e.g. databases, external servers), then the platform-independence will be determined by these other systems. There may be some cases in which TM's require native code or platform-dependent code to interact with native applications such as EDI gateways. Natively coded TM's will affect portability.

5.3 Performance

Performance is a concern in this architecture. The response time may be slow for a query that requires multiple data retrievals from other servers. This architecture attempts to address this issue by allowing parallel data retrieval whenever possible. The DI discovers such opportunities for parallelism from the rules of the script file. Since HTTP is not the most efficient communication protocol, performance may be improved by replacing HTTP with Remote Method Invocation (RMI).

6 Conclusions

In this paper, we identified the need to process XML documents in E-commerce systems. We argue that instead of designing dedicated programs to handle each kind of XML document for each specific application, we could design a generic architecture for handling XML documents. The architecture is designed to be generic and flexible. It uses a Document Integrator (DI) to control the process flow, but delegates most capabilities to various Transformation Modules (TM). For efficiency, the DI is designed to have the capabilities of merging and transforming XML documents with XSLT. We have discussed how to implement the DI and TM's with a simple example. Since the design is flexible and highly extensible, we believe the architecture is suitable for many E-commerce systems.

7 References

[1] Tim Bray, Jean Paoli, C.M. Sperberg-McQueen, Extensible Markup Language (XML) 1.0, 10 February 1998, "http://www.w3.org/TR/1998/REC-xml-19980210".
[2] James Clark, XSL Transformations (XSLT) Version 1.0, 16 November 1999, "http://www.w3.org/TR/1999/REC-xslt-19991116".
[3] Dan Connolly, XML Homepage on W3C website, April 1997, "http://www.w3c.org/XML".
[4] Sun Microsystems, Java Homepage, "http://java.sun.com/products/servlet/".
[5] Sun Microsystems, Java Servlet API Homepage, "http://java.sun.com/products/servlet/".
[6] W3C, World Wide Web Consortium (W3C) Homepage, "http://www.w3c.org/".

