Iiwas 02 DBB

WAREHOUSING WEB DATA

Jérôme Darmont, Omar Boussaid, and Fadila Bentayeb


Équipe BDD, Laboratoire ERIC
Université Lumière – Lyon 2
5 avenue Pierre Mendès-France
69676 Bron Cedex
France
E-mail: {jdarmont|boussaid|fbentaye}@univ-lyon2.fr

KEYWORDS

Web, Multimedia data, Integration, Modeling process, UML, XML, Mapping, Data warehousing, Data analysis.

ABSTRACT

In a data warehousing process, mastering the data preparation phase allows substantial gains in terms of time and performance when performing multidimensional analysis or using data mining algorithms. Furthermore, a data warehouse can require external data, and the web is a prevalent data source in this context. In this paper, we propose a modeling process for integrating diverse and heterogeneous (so-called multiform) data into a unified format. Furthermore, the very schema definition provides first-rate metadata in our data warehousing context. At the conceptual level, a complex object is represented in UML. Our logical model is an XML schema that can be described with a DTD or the XML-Schema language. Eventually, we have designed a Java prototype that transforms our multiform input data into XML documents representing our physical model. The XML documents we obtain are then mapped into a relational database we view as an ODS (Operational Data Storage), whose content will have to be re-modeled in a multidimensional way to allow its storage in a star schema-based warehouse and, later, its analysis.

1. INTRODUCTION

The end of the 20th century has seen the rapid development of new technologies such as web-based, communication, and decision support technologies. Companies face new economic challenges such as e-commerce or mobile commerce, and are drastically changing their information systems design and management methods. They are developing various technologies to manage their data and their knowledge; these technologies constitute what is called "business intelligence". These new means allow them to improve their productivity and to support competitive information monitoring. In this context, the web is now the main farming source for companies, whose challenge is to build web-based information and decision support systems. Our work lies in this field. We present in this paper an approach to build a Decision Support Database (DSDB) whose main data source is the web.

The data warehousing and OLAP (On-Line Analytical Processing) technologies are now considered mature in management applications, especially when data are numerical. With the development of the Internet, the availability of various types of data (images, texts, sounds, videos, data from databases…) has increased. These data, which are extremely diverse in nature (we name them "multiform data"), may be unstructured, structured, or even already organized in databases. Their availability in large quantities and their complexity render their structuring and exploitation difficult. Nonetheless, the concepts of data warehousing (Kimball 1996; Inmon 1996; Chaudhuri and Dayal 1997) remain valid for multimedia data (Thuraisingham 2001). In this context, the web may be considered as a farming system providing input to a data warehouse (Hackathorn 2000). Large data volumes and their dating are other arguments in favor of this data webhouse approach (Kimball and Mertz 2000).

Hence, data from the web can be stored into a DSDB such as a data warehouse, in order to be explored by on-line analysis or data mining techniques. However, these multiform data must first be structured into a database, and then integrated in the particular architecture of a data warehouse (fact tables, dimension tables, data marts, data cubes). Yet, the classical warehousing approach is not very adequate when dealing with multiform data. The multidimensional modeling of these data is tricky and may necessitate the introduction of new concepts. Classical OLAP operators may indeed prove inefficient or ill-adapted. Administering warehoused multiform data also requires adequate refreshment strategies when new data pop up, as well as specific physical reorganization policies depending on data usage (to optimize query performance). In order to address these issues, we adopted a step-by-step strategy that helps us handle our problem's complexity.

In a first step, our approach consists in physically integrating multiform data into a relational database playing the role of a buffer ahead of the data warehouse. In a second step, we aim at modeling these data multidimensionally to prepare them for analysis. The final phase in our process consists in exploring the data with OLAP or data mining techniques.

The aim of this paper is to address the issue of web data integration into a database, which constitutes the first phase in building a multiform data warehouse. We propose a modeling process to achieve this goal. We first designed a conceptual UML model for a complex object representing a superclass of all the types of multiform data we consider (Miniaoui et al. 2001). Note that our objective is not only to store data, but also to truly prepare them for analysis, which is more complex than a mere ETL (Extracting, Transforming, and Loading) task. Then, we translated our UML conceptual model into an XML schema definition that represents our logical model. Eventually, this logical model has been instantiated into a physical model that is an XML document. The XML documents we obtain with the help of a Java prototype are mapped into a (MySQL) relational database with a PHP script. We consider this database as an ODS (Operational Data Storage), i.e., a temporary data repository that is typically used in an ETL process before the data warehouse proper is constituted.

The remainder of this paper is organized as follows. Section 2 presents our unified conceptual model for multiform data. Section 3 outlines how this conceptual model is translated into a logical, XML schema definition. Section 4 details how our input data are transformed into an XML document representing our physical model. We finally conclude the paper and discuss future research issues.

2. CONCEPTUAL MODEL

The data types we consider for integration in a data warehouse (text, multimedia documents, relational views from databases) all bear characteristics that can be used for indexing. The UML class diagram shown in Figure 1 represents a complex object generalizing all these data types. Note that our goal here is to propose a general data structure: the list of attributes for each class in this diagram is willingly not exhaustive.

A complex object is characterized by its name and its source. The date attribute introduces the notion of successive versions and dating that is crucial in data warehouses. Each complex object is composed of several subdocuments. Each subdocument is identified by its name, its type, its size, and its location (i.e., its physical address). The document type (text, image, etc.) will be helpful later, when selecting an appropriate analysis tool (text mining tools are different from standard data mining tools, for instance). The language class is important for text mining and information retrieval purposes, since it characterizes both documents and keywords.

Eventually, keywords represent a semantic description of a document. They are metadata describing the object to integrate (medical image, press article...) or its content. Keywords are essential in the indexing process that helps guarantee good performance at data retrieval time. Note that we consider only logical indexing here, and not the physical issues raised by very large amounts of data, which are still quite open as far as we know. Keywords are typically manually captured, but it would be very interesting to mine them automatically with text mining (Tan 1999), image mining (Zhang et al. 2001), or XML mining (Edmonds 2001) techniques, for instance.
[Figure 1 is a UML class diagram. A COMPLEX OBJECT (Name, Date, Source) is composed of SUBDOCUMENTs (Name, Type, Size, Location); subdocuments are associated with the LANGUAGE (Name) and KEYWORD (Term) classes. SUBDOCUMENT is specialized into TEXT (Nb_char, Nb_lines), RELATIONAL VIEW (Query), IMAGE (Format, Compression, Width, Length, Resolution), and CONTINUOUS (Duration, Speed). TEXT is further specialized into PLAIN TEXT and TAGGED TEXT (Content), the latter associated with LINKs (URL); RELATIONAL VIEW is composed of ATTRIBUTEs (Name, Domain) and TUPLEs, whose intersections hold ATOMIC VALUEs (Value); CONTINUOUS is specialized into SOUND and VIDEO.]

Figure 1: Multiform Data Conceptual Model
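The generalization hierarchy of Figure 1 can be sketched as plain Java classes. This is a hypothetical illustration: class and attribute names follow the diagram, not the actual web2xml source code, and attribute lists are as deliberately non-exhaustive as in the figure.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the Figure 1 generalization hierarchy; attribute
// names follow the diagram, not the actual web2xml source code.
abstract class Subdocument {
    String name, type, location;                // identification of the subdocument
    long size;
    String language;                            // optional (multiplicity 0..1 in Figure 1)
    List<String> keywords = new ArrayList<>();  // semantic description (terms)
}

class PlainText extends Subdocument {
    int nbChar, nbLines;
}

class TaggedText extends Subdocument {
    int nbChar, nbLines;
    String content;
    List<String> links = new ArrayList<>();     // URLs the page points to
}

class RelationalView extends Subdocument {
    String query;                               // the query that helped building the view
    List<String> attributes = new ArrayList<>();
    List<List<String>> tuples = new ArrayList<>();
}

class Image extends Subdocument {
    String format, compression, resolution;
    int width, length;
}

class Continuous extends Subdocument {          // sounds and video clips
    double duration, speed;
}

// The complex object generalizes all multiform data types.
class ComplexObject {
    String name, date, source;
    List<Subdocument> subdocuments = new ArrayList<>();
}
```

In this sketch, the composition of a complex object is a simple list of polymorphic subdocuments, mirroring the diagram's whole/part association.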


All the following classes are subclasses of the subdocument class. They represent the basic data types and/or documents we want to integrate. Text documents are subdivided into plain texts and tagged texts (namely HTML, XML, or SGML documents). Tagged texts are further associated with a certain number of links. Since a web page may point to external data (other pages, images, multimedia data, files...), those links help relating these data to their referring page.

Relational views are actually extractions from any type of database (relational, object, or object-relational; we suppose a view can be extracted whatever the data model) that will be materialized in the data warehouse. A relational view is a set of attributes (columns, classically characterized by their name and their domain) and a set of tuples (rows). At the intersection of tuples and attributes is a data value. In our model, these values appear as ordinal, but in practice they can be texts or BLOBs containing multimedia data. The query that helped building the view is also stored. Depending on the context, either everything can be stored, or only the query and the intention (attribute definitions). For instance, it might be inadequate to duplicate huge amounts of data, especially if the data source is not regularly updated. On the other hand, if successive snapshots of an evolving view are needed, data will have to be stored.

Images may bear two types of attributes: some that are usually found in the image file header (format, compression rate, size, resolution), and some that need to be extracted by a program, such as color or texture distributions.

Eventually, sounds and video clips are part of a same class because they share continuous attributes that are absent from the other (still) types of data we consider. As far as we know, these types of data are not currently analyzed by mining algorithms, but they do contain knowledge. This is why we take them into account here (though in little detail), anticipating advances in multimedia mining techniques (Thuraisingham 2001).

3. LOGICAL MODEL

Our UML model can be directly translated into an XML schema, whether it is expressed as a DTD or in the XML-Schema language. We considered using the XMI method (Cover 2001) to assist us in the translation process, but given the relative simplicity of our models, we proceeded directly. The schema we obtained, expressed as a DTD, is shown in Figure 2.

<!ELEMENT COMPLEX_OBJECT (OBJ_NAME, DATE, SOURCE, SUBDOCUMENT+)>


<!ELEMENT OBJ_NAME (#PCDATA)>
<!ELEMENT DATE (#PCDATA)>
<!ELEMENT SOURCE (#PCDATA)>
<!ELEMENT SUBDOCUMENT (DOC_NAME, TYPE, SIZE, LOCATION, LANGUAGE?,
KEYWORD*, (TEXT | RELATIONAL_VIEW | IMAGE | CONTINUOUS))>
<!ELEMENT DOC_NAME (#PCDATA)>
<!ELEMENT TYPE (#PCDATA)>
<!ELEMENT SIZE (#PCDATA)>
<!ELEMENT LOCATION (#PCDATA)>
<!ELEMENT LANGUAGE (#PCDATA)>
<!ELEMENT KEYWORD (#PCDATA)>
<!ELEMENT TEXT (NB_CHAR, NB_LINES, (PLAIN_TEXT | TAGGED_TEXT))>
<!ELEMENT NB_CHAR (#PCDATA)>
<!ELEMENT NB_LINES (#PCDATA)>
<!ELEMENT PLAIN_TEXT (#PCDATA)>
<!ELEMENT TAGGED_TEXT (CONTENT, LINK*)>
<!ELEMENT CONTENT (#PCDATA)>
<!ELEMENT LINK (#PCDATA)>
<!ELEMENT RELATIONAL_VIEW (QUERY?, ATTRIBUTE+, TUPLE*)>
<!ELEMENT QUERY (#PCDATA)>
<!ELEMENT ATTRIBUTE (ATT_NAME, DOMAIN)>
<!ELEMENT ATT_NAME (#PCDATA)>
<!ELEMENT DOMAIN (#PCDATA)>
<!ELEMENT TUPLE (ATT_NAME_REF, VALUE)+>
<!ELEMENT ATT_NAME_REF (#PCDATA)>
<!ELEMENT VALUE (#PCDATA)>
<!ELEMENT IMAGE (COMPRESSION, FORMAT, RESOLUTION, LENGTH, WIDTH)>
<!ELEMENT FORMAT (#PCDATA)>
<!ELEMENT COMPRESSION (#PCDATA)>
<!ELEMENT LENGTH (#PCDATA)>
<!ELEMENT WIDTH (#PCDATA)>
<!ELEMENT RESOLUTION (#PCDATA)>
<!ELEMENT CONTINUOUS (DURATION, SPEED, (SOUND | VIDEO))>
<!ELEMENT DURATION (#PCDATA)>
<!ELEMENT SPEED (#PCDATA)>
<!ELEMENT SOUND (#PCDATA)>
<!ELEMENT VIDEO (#PCDATA)>

Figure 2: Logical Model (DTD)
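Any XML document produced in our process can be checked against this schema with a standard validating parser. The sketch below is a hypothetical illustration in Java: to stay self-contained, it inlines only a reduced fragment of the Figure 2 DTD as an internal subset (dropping the SUBDOCUMENT part); a real document would reference the full external DTD instead.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Sketch: DTD validation of a multiform-data document. Only a fragment of the
// Figure 2 DTD is reproduced here (as an internal subset) so the example is
// self-contained.
public class ValidateMlfd {

    static final String DOC =
        "<?xml version=\"1.0\"?>\n" +
        "<!DOCTYPE COMPLEX_OBJECT [\n" +
        "  <!ELEMENT COMPLEX_OBJECT (OBJ_NAME, DATE, SOURCE)>\n" +
        "  <!ELEMENT OBJ_NAME (#PCDATA)>\n" +
        "  <!ELEMENT DATE (#PCDATA)>\n" +
        "  <!ELEMENT SOURCE (#PCDATA)>\n" +
        "]>\n" +
        "<COMPLEX_OBJECT><OBJ_NAME>Sample image</OBJ_NAME>" +
        "<DATE>2002-06-15</DATE><SOURCE>Local</SOURCE></COMPLEX_OBJECT>";

    public static Document parse() throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setValidating(true);  // enforce the DTD while parsing
        DocumentBuilder builder = factory.newDocumentBuilder();
        // Fail on validity errors instead of merely reporting them.
        builder.setErrorHandler(new DefaultHandler() {
            @Override
            public void error(SAXParseException e) throws SAXException { throw e; }
        });
        return builder.parse(new InputSource(new StringReader(DOC)));
    }

    public static void main(String[] args) throws Exception {
        System.out.println(parse().getDocumentElement().getTagName());
    }
}
```

Removing the DATE element from the sample document, for instance, would make the parse throw a validity error.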


We applied minor shortcuts not to overload this schema. Since the language, keyword, link, and value classes only bear one attribute each, we mapped them to single XML elements, rather than having them be composed of another, single element. For instance, the language class became the language element, but this element is not further composed of a name element. Eventually, since the attribute and tuple elements share the same sub-element "attribute name", we labeled it ATT_NAME in the attribute element and ATT_NAME_REF (reference to an attribute name) in the tuple element, to avoid any confusion or processing problem.

4. PHYSICAL MODEL

We have developed a Java prototype capable of taking as input a data source from the web, fitting it in our model, and producing an XML document. The source code of this application, which we baptized web2xml, is available on-line: http://bdd.univ-lyon2.fr/download/web2xml.zip. We view the XML documents we generate as the final physical models in our process.

The first step of our multiform data integration approach consists in extracting the attributes of the complex object that has been selected by the user. A particular treatment is applied depending on the subdocument class (image, sound, etc.), since each subdocument class bears different attributes. We used three ways to extract the actual data: (1) manual capture by the user, through graphical interfaces; (2) use of standard Java methods and packages; (3) use of ad-hoc automatic extraction algorithms. Our objective is to progressively reduce the number of manually-captured attributes and to add new attributes that would be useful for later analysis and that could be obtained with data mining techniques.

The second step when producing our physical model consists in generating an XML document. The algorithm's principle is to parse the schema introduced in Figure 2 recursively, fetching the elements it describes, and to write them into the output XML document on the fly, along with the associated values extracted from the original data. Missing values are currently treated by inserting an empty element, but strategies could be devised to solve this problem, either by prompting the user or automatically.

At this point, our prototype is able to process all the data classes we identified in Figure 1. Figure 3 illustrates how one single document (namely, an image) is transformed using our approach. A composite document (such as a web page including pieces of text, XML data, data from a relational database, and an audio file) would bear the same form; it would just have several different subdocument elements instead of one (namely plain and tagged text, relational view, and continuous/sound subdocuments, in our example).

Eventually, in order to map XML documents into a (MySQL) relational database, we designed a prototype baptized xml2rdb. This PHP script is also available on-line: http://bdd.univ-lyon2.fr/xml2rdb/. It operates in two steps. First, a DTD parser exploits our logical model (Figure 2) to build a relational schema, i.e., a set of tables into which any XML document that is valid with regard to our DTD can be mapped. To achieve this goal, we mainly used the techniques proposed by Anderson et al. (2000) and Kappel et al. (2000). Note that our DTD parser is a generic tool: it can operate on any DTD. It takes into account all the XML element types we need, e.g., elements with +, *, or ? multiplicity, element lists, selections, etc. The last and easiest step consists in loading a valid XML document into the previously built relational structure.
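The first step of xml2rdb, deriving a set of tables from the DTD, can be sketched as follows. This is a deliberately simplified, hypothetical illustration in Java rather than PHP: every composite element becomes a table with a surrogate key and a foreign key to its parent table, every #PCDATA-only child becomes a column, and multiplicities (+, *, ?) and selections, which the actual tool handles, are ignored; table and column names are assumptions, not the actual xml2rdb output.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified, hypothetical sketch of xml2rdb's first step: every composite DTD
// element becomes a table with a surrogate key and a foreign key to its parent
// table, while #PCDATA-only child elements become columns.
public class DtdToTables {

    static class Element {
        final String name;
        final List<Element> children = new ArrayList<>();
        Element(String name) { this.name = name; }
        boolean isPcdata() { return children.isEmpty(); }
    }

    static void emit(Element e, String parent, List<String> ddl) {
        StringBuilder cols = new StringBuilder(e.name + "_id INT PRIMARY KEY");
        if (parent != null)
            cols.append(", ").append(parent).append("_id INT");   // parent link
        for (Element c : e.children) {
            if (c.isPcdata())
                cols.append(", ").append(c.name).append(" TEXT"); // leaf -> column
            else
                emit(c, e.name, ddl);                             // composite -> table
        }
        ddl.add("CREATE TABLE " + e.name + " (" + cols + ");");
    }

    public static void main(String[] args) {
        Element obj = new Element("COMPLEX_OBJECT");
        Element sub = new Element("SUBDOCUMENT");
        obj.children.add(new Element("OBJ_NAME"));
        sub.children.add(new Element("DOC_NAME"));
        obj.children.add(sub);
        List<String> ddl = new ArrayList<>();
        emit(obj, null, ddl);
        ddl.forEach(System.out::println);
    }
}
```

Loading a valid document then amounts to walking it and inserting one row per element occurrence into the matching table.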
Image → XML Model (the keywords were prompted from the user):

<?xml version="1.0"?>
<!DOCTYPE COMPLEX_OBJECT SYSTEM "mlfd.dtd">
<COMPLEX_OBJECT>
  <OBJ_NAME>Sample image</OBJ_NAME>
  <DATE>2002-06-15</DATE>
  <SOURCE>Local</SOURCE>
  <SUBDOCUMENT>
    <DOC_NAME>Surf</DOC_NAME>
    <TYPE>Image</TYPE>
    <SIZE>4407</SIZE>
    <LOCATION>gewis_surfer2.gif</LOCATION>
    <KEYWORD>surf</KEYWORD>
    <KEYWORD>black and white</KEYWORD>
    <KEYWORD>wave</KEYWORD>
    <IMAGE>
      <COMPRESSION/>
      <FORMAT>Gif</FORMAT>
      <RESOLUTION>72dpi</RESOLUTION>
      <LENGTH>219</LENGTH>
      <WIDTH>344</WIDTH>
    </IMAGE>
  </SUBDOCUMENT>
</COMPLEX_OBJECT>

Figure 3: Sample Physical Model for an Image
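The recursive generation principle that produces documents such as the one in Figure 3 can be sketched as follows. This is a hypothetical simplification of web2xml's second step, not its actual code: the schema tree is walked depth-first and each element is written out with its extracted value on the fly, a missing value yielding an empty element.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical simplification of web2xml's generation step: walk a schema tree
// depth-first and emit one XML element per node, filling in the value extracted
// for each leaf element; a missing value yields an empty element.
public class XmlGenerator {

    static class SchemaNode {
        final String name;
        final SchemaNode[] children;
        SchemaNode(String name, SchemaNode... children) {
            this.name = name;
            this.children = children;
        }
    }

    // values maps a leaf element name to its extracted value.
    static String generate(SchemaNode node, Map<String, String> values) {
        StringBuilder sb = new StringBuilder();
        if (node.children.length == 0) {                        // leaf element
            String v = values.get(node.name);
            if (v == null)
                sb.append('<').append(node.name).append("/>");  // missing value
            else
                sb.append('<').append(node.name).append('>').append(v)
                  .append("</").append(node.name).append('>');
        } else {                                                // composite element
            sb.append('<').append(node.name).append('>');
            for (SchemaNode child : node.children)
                sb.append(generate(child, values));             // recurse depth-first
            sb.append("</").append(node.name).append('>');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        SchemaNode schema = new SchemaNode("COMPLEX_OBJECT",
            new SchemaNode("OBJ_NAME"), new SchemaNode("DATE"), new SchemaNode("SOURCE"));
        Map<String, String> values = new LinkedHashMap<>();
        values.put("OBJ_NAME", "Sample image");
        values.put("DATE", "2002-06-15");                       // SOURCE deliberately missing
        System.out.println(generate(schema, values));
        // → <COMPLEX_OBJECT><OBJ_NAME>Sample image</OBJ_NAME><DATE>2002-06-15</DATE><SOURCE/></COMPLEX_OBJECT>
    }
}
```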


5. CONCLUSION AND PERSPECTIVES

We presented in this paper a modeling process for integrating multiform data from the web into a Decision Support Database such as a data warehouse. Our conceptual UML model represents a complex object that generalizes the different multiform data that can be found on the web and that are interesting to integrate in a data warehouse as external data sources. Our model allows the unification of these different data into a single framework, for purposes of storage and preparation for analysis. Data must indeed be properly "formatted" before OLAP or data mining techniques can be applied to them.

Our UML conceptual model is then directly translated into an XML schema (DTD or XML-Schema), which we view as a logical model. The last step in our (classical) modeling process is the production of a physical model in the form of an XML document. XML is the format of choice for both storing and describing the data: the schema indeed represents the metadata. XML is also very interesting because of its flexibility and extensibility, while allowing straight mapping into a more conventional database if strong structuring and retrieval efficiency are needed for analysis purposes.

The aim of the first step in our approach was to structure multiform data and integrate them in a database. At this point, data are managed in a transactional way: it is the first modeling step. Since the final objective of our approach is to analyze multimedia data, and more generally multiform data, it is necessary to add a layer of multidimensional modeling to allow easy and efficient analysis.

A first improvement on our work could be the use of the XML-Schema language instead of a DTD to describe our logical model. We could indeed take advantage of XML-Schema's greater richness, chiefly regarding data type diversity and (more importantly) inheritance. This would facilitate the transition between the UML data representation and the XML data representation.

Both the XML and XML-Schema formalisms could also help us in the multidimensional modeling of multiform data, which constitutes the second level of structuring. In opposition to the classical approach, designing a multidimensional model of multiform data is not easy at all. A couple of studies deal with this multidimensional representation and have demonstrated the feasibility of UML snowflake diagrams (Jensen et al. 2001), but they remain few.

We are currently working on the determination of facts in terms of measures and dimensions with the help of imaging, statistical, and data mining techniques. These techniques are not only useful as analysis support tools, but also as modeling support tools. Note that it is not easy to refer to measures and dimensions when dealing with multimedia documents, for instance, without having some previous knowledge about their content. Extracting semantics from multimedia documents is currently an open issue that is addressed by numerous research studies. We work on these semantic attributes to build multidimensional models.

The diversity of multiform data also requires adapted operators. Indeed, how can we aggregate data that are not quantitative, such as most multimedia attributes? Classical OLAP operators are not adapted to this purpose. We envisage integrating some data mining tasks (e.g., clustering) as new OLAP aggregation operators. Multiform data analysis with data mining techniques can also be complemented with OLAP's data navigation operators.

Eventually, using data mining techniques helps us address certain aspects of data warehouse auto-administration. They indeed allow us to design refreshment strategies when new data pop up, and to physically reorganize the data depending on their usage in order to optimize query performance. This aspect is yet another axis in our work that is necessary to our global approach.

REFERENCES

Anderson, R. et al. 2000. Professional XML Databases. Wrox Press.

Chaudhuri, S. and Dayal, U. 1997. "Data Warehousing and OLAP for Decision Support". ACM SIGMOD International Conference on Management of Data (SIGMOD 97), Tucson, USA, 507-508.

Cover, R. 2001. "XML Metadata Interchange (XMI)". http://xml.coverpages.org/xmi.html.

Edmonds, A. 2001. "A General Background to Supervised Learning in Combination with XML". Technical paper, Scientio Inc., http://www.metadatamining.com.

Hackathorn, R. 2000. Web Farming for the Data Warehouse. Morgan Kaufmann.

Inmon, W.H. 1996. Building the Data Warehouse. John Wiley & Sons.

Jensen, M.R. et al. 2001. "Specifying OLAP Cubes On XML Data". 13th International Conference on Scientific and Statistical Database Management (SSDBM 2001), Fairfax, USA, 101-112.

Kappel, G. et al. 2000. "X-Ray – Towards Integrating XML and Relational Database Systems". 19th International Conference on Conceptual Modeling, 339-353.

Kimball, R. 1996. The Data Warehouse Toolkit. John Wiley & Sons.

Kimball, R. and Mertz, R. 2000. The Data Webhouse: Building the Web-enabled Data Warehouse. John Wiley & Sons.

Miniaoui, S. et al. 2001. "Web data modeling for integration in data warehouses". First International Workshop on Multimedia Data and Document Engineering (MDDE 01), Lyon, France, 88-97.

Tan, A.H. 1999. "Text Mining: The state of the art and the challenges". PAKDD 99 Workshop on Knowledge Discovery from Advanced Databases (KDAD 99), Beijing, China, 71-76.

Thuraisingham, B. 2001. Managing and Mining Multimedia Databases. CRC Press.

Zhang, J. et al. 2001. "An Information-Driven Framework for Image Mining". 12th International Conference on Database and Expert Systems Applications (DEXA 2001), Munich, Germany; LNCS 2113, 232-242.
