Data Integration With XML ETL Processing
BOUZEROUATA Kawther
November 14, 2024
Abstract
This paper summarizes the article "Data integration XML ETL processing" by G. Jayashree, a
research scholar at the VELS Institute of Science, Technology and Advanced Studies (Department
of Computer Science) in Chennai, India, written under the supervision of Dr. C. Priya. The article
presents data integration methods and their challenges, explains how XML can help overcome
these challenges, details ETL (Extract-Transform-Load) processing of XML data, and provides a
case study of XML data integration for an OLTP system.
1 Introduction
Data integration (DI) is the process of collecting and consolidating data from different systems and
sources to provide a unified view of that information. Integration can happen at three levels: instance
(data), model, and schema. With the proliferation of applications in companies, the volume and
heterogeneity of data are growing rapidly. Flexible solutions to organize and understand this data are
essential to extract value from it. XML (eXtensible Markup Language) is a standard format for exchanging or
storing data while retaining structure and semantics.
• E-commerce and B2B data interchange (EDI): B2B and B2C transactions
• Storage of business information: forms, reports, test data...
• Pervasive computing: XML provides structured, portable data for display on wireless (pervasive)
computing devices such as cellular phones and personal digital assistants (PDAs). Wireless Markup
Language (WML) and VoiceXML are evolving standards for visual and speech-driven
wireless device interfaces.
• Metadata representation: standardized descriptions
• Consolidation via ETL (Extract Transform Load): extraction from N sources, transformation
then loading into a centralized data warehouse. ETL enables batch integration.
• Propagation via EAI (Enterprise Application Integration) and EDR (Enterprise Data Replica-
tion): real-time exchanges between applications, useful for business transactions.
• Data Virtualization: unified view of heterogeneous sources without moving or copying the data.
Queries are translated to the original sources.
• Data Federation: grouping of homogeneous sources into a virtual database with a single global
schema. A query is decomposed to the sources.
• Data Warehousing: storage server dedicated to analytics. Data is cleaned, reformatted, aggre-
gated and organized there to enable analysis.
These issues arise at all integration stages (schema, model, data). The XML format brings flexibility
to represent this diversity through its generic approach. However, fine-grained semantic management
of data in context remains a major challenge, requiring advanced algorithmic processing.
1. Hierarchical and flexible data model to represent all types of content (structured, semi-structured,
or unstructured)
2. Tag extensibility with ability to create new ones based on business needs
The semantic contribution of XML therefore lies in encapsulating data with rich information about
structure and meaning of content. This facilitates exchanges and transformations during integration
projects into a common format.
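The two properties listed above (hierarchical model, extensible tags) can be illustrated with a minimal sketch. The purchase-order document and all of its tag names (`purchaseOrder`, `loyaltyTier`, and so on) are hypothetical and do not come from the summarized article; the parsing uses Python's standard `xml.etree.ElementTree` module.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: every tag name here is an illustrative assumption.
# <loyaltyTier> stands for a business-specific tag added without changing any parser.
ORDER_XML = """
<purchaseOrder id="PO-1001">
  <customer>
    <name>Acme Corp</name>
    <loyaltyTier>gold</loyaltyTier>
  </customer>
  <items>
    <item sku="A-1" quantity="2">Widget</item>
    <item sku="B-7" quantity="1">Gasket</item>
  </items>
</purchaseOrder>
"""

def describe(element, depth=0):
    """Return one indented line per element, showing the document hierarchy."""
    text = (element.text or "").strip()
    line = "  " * depth + element.tag
    if element.attrib:
        line += " " + str(dict(element.attrib))
    if text:
        line += ": " + text
    lines = [line]
    for child in element:
        lines.extend(describe(child, depth + 1))
    return lines

root = ET.fromstring(ORDER_XML)
outline = describe(root)
print("\n".join(outline))
```

The same generic walk handles any tag the producer invents, which is the sense in which structure and meaning travel with the data.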
7 ETL and XML Data Processing
7.1 XML Data Processing methods
The ETL approach is widely used to integrate information from various sources into a data warehouse
or decision-oriented data mart. Two models are possible when processing XML files in ETL:
2. Full XML parsing: the ETL tool must understand and interpret the overall document structure
to ensure complete data extraction.
A use case is presented with customer order data in XML format to load into the operational OLTP
database of a company. Two technical methods are also explained for programming XML content
parsing within the ETL tool: reading in multiple columns or processing in a single wide text column.
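One possible reading of the two methods is sketched below, assuming a hypothetical `<order>` document and an in-memory SQLite database standing in for the ETL staging area. The table names `orders_columns` and `orders_raw` are illustrative assumptions, not names from the article.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical order document; element names are assumptions for illustration.
ORDER_XML = '<order id="42"><customer>Acme Corp</customer><total>199.90</total></order>'

conn = sqlite3.connect(":memory:")
# Method 1: the XML is parsed on read and each field lands in its own column.
conn.execute("CREATE TABLE orders_columns (order_id TEXT, customer TEXT, total REAL)")
# Method 2: the whole document is kept in a single wide text column.
conn.execute("CREATE TABLE orders_raw (xml_doc TEXT)")

def load_as_columns(xml_text):
    """Parse the document up front and insert one value per column."""
    root = ET.fromstring(xml_text)
    row = (root.get("id"), root.findtext("customer"), float(root.findtext("total")))
    conn.execute("INSERT INTO orders_columns VALUES (?, ?, ?)", row)

def load_as_text(xml_text):
    """Store the document unparsed; parsing is deferred to a later step."""
    conn.execute("INSERT INTO orders_raw VALUES (?)", (xml_text,))

load_as_columns(ORDER_XML)
load_as_text(ORDER_XML)

columns_row = conn.execute(
    "SELECT order_id, customer, total FROM orders_columns").fetchone()
raw_doc = conn.execute("SELECT xml_doc FROM orders_raw").fetchone()[0]
# With the single-column method, fields are extracted only when needed.
customer_from_raw = ET.fromstring(raw_doc).findtext("customer")
```

The multi-column form makes fields queryable immediately but ties the load to a fixed layout; the single-column form tolerates changing document structure at the cost of a later parsing step.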
• Load the purchase order data from the multiple XML files into an operational OLTP table
• Move each file to a "processed" folder after loading to keep track
1. XML source files are moved from the web server to an ETL server
2. Each XML file is parsed and additional data like filenames and timestamps is added
3. Data transformations are applied, such as customer validation checks and order number assignment
4. Insert the final purchase order datasets into the target OLTP tables
5. Update status tracking tables for auditing
6. Move processed XML files to archive directory
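The steps above can be sketched as a small script. This is a minimal illustration under stated assumptions, not the article's implementation: the `<order>` layout, the table names `purchase_orders` and `load_audit`, the function `run_pipeline`, and the use of SQLite in place of the OLTP database are all hypothetical. Step 1 (transfer from the web server) is simulated by writing two files into a temporary directory, and rejected files are archived along with loaded ones as a design choice of this sketch.

```python
import shutil
import sqlite3
import tempfile
import xml.etree.ElementTree as ET
from datetime import datetime, timezone
from pathlib import Path

def run_pipeline(incoming: Path, archive: Path, conn: sqlite3.Connection) -> int:
    """Load every XML file in `incoming`, audit it, then move it to `archive`."""
    conn.execute("CREATE TABLE IF NOT EXISTS purchase_orders "
                 "(order_no INTEGER PRIMARY KEY AUTOINCREMENT, customer TEXT, "
                 "amount REAL, source_file TEXT, loaded_at TEXT)")
    conn.execute("CREATE TABLE IF NOT EXISTS load_audit (source_file TEXT, status TEXT)")
    archive.mkdir(parents=True, exist_ok=True)
    loaded = 0
    for xml_file in sorted(incoming.glob("*.xml")):
        # Step 2: parse the file; filename and timestamp are added at insert time.
        root = ET.parse(xml_file).getroot()
        customer = (root.findtext("customer") or "").strip()
        amount = root.findtext("amount")
        # Step 3: a simple customer validation check.
        if not customer or amount is None:
            status = "rejected"
        else:
            # Step 4: insert into the target table; order_no is assigned by the database.
            conn.execute(
                "INSERT INTO purchase_orders (customer, amount, source_file, loaded_at) "
                "VALUES (?, ?, ?, ?)",
                (customer, float(amount), xml_file.name,
                 datetime.now(timezone.utc).isoformat()))
            status = "loaded"
            loaded += 1
        # Step 5: record the outcome for auditing.
        conn.execute("INSERT INTO load_audit VALUES (?, ?)", (xml_file.name, status))
        # Step 6: move the handled file out of the incoming directory.
        shutil.move(str(xml_file), str(archive / xml_file.name))
    conn.commit()
    return loaded

with tempfile.TemporaryDirectory() as tmp:
    incoming = Path(tmp) / "incoming"
    archive = Path(tmp) / "processed"
    incoming.mkdir()
    # Step 1 (simulated): files arrive on the ETL server.
    (incoming / "po_001.xml").write_text(
        "<order><customer>Acme Corp</customer><amount>120.50</amount></order>")
    (incoming / "po_002.xml").write_text(
        "<order><customer></customer><amount>10</amount></order>")
    conn = sqlite3.connect(":memory:")
    loaded_count = run_pipeline(incoming, archive, conn)
    orders = conn.execute(
        "SELECT order_no, customer, amount, source_file FROM purchase_orders").fetchall()
    audit = conn.execute(
        "SELECT source_file, status FROM load_audit ORDER BY source_file").fetchall()
    remaining = list(incoming.glob("*.xml"))
    archived = sorted(p.name for p in archive.iterdir())
    conn.close()
```

In this run the first file is loaded and the second is rejected by the validation check, while both end up in the archive directory and in the audit table.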
8 Conclusion
The flexibility and expressiveness of XML have made it a key standard for representing diverse business
data and enabling large-scale integration. ETL and data integration software vendors have heavily
invested in recent years in capabilities for XML document and stream analysis, transformation and
loading at the core of their solutions. The challenge now lies in semantic and contextual management
of information elements for increased business value.