Parent 1998 Issues and Approaches of Database Integration
[Figure: the integration process. Local heterogeneous schemas are turned into local homogenized schemas by schema transformation (transformation rules); correspondence investigation (similarity rules, DBA) yields interschema correspondence assertions; schema integration (integration rules) yields the integrated schema and mappings.]
[Figure 2 diagram not recoverable. Surviving labels: S1 (ER), S2 (OO); Author, Paper, Write; name, affiliation, title, keywords, authors.
Legend: monovalued and mandatory link (cardinalities 1:1); monovalued and optional link (cardinalities 0:1); multivalued and mandatory link (cardinalities 1:n); multivalued and optional link (cardinalities 0:n); generalization/specialization link; reference link.]
Figure 2. The publication example
Correspondences between properties may be very complex, due to differences in coding, scales, units, extents, etc.
Ad hoc mapping functions may be needed [5].
The set of stated ICAs should be checked for consistency and minimality. Assume one schema has an A is-a B
construct and the other schema has a C is-a D construct. An example of an inconsistent ICA specification is: A ≡ D, B ≡ C.
Both cannot be true: together they would make A both a subtype and a supertype of B. Moreover, if A ≡ C is asserted,
B ⊇ C and D ⊇ A may be inferred. Only ICAs bearing non-derivable correspondences need to be explicitly stated.
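The consistency check can be sketched in a few lines of Python. This is a minimal illustration, not the authors' algorithm: it assumes that is-a denotes strict inclusion of extents, that ICAs are restricted to equivalences, and the function names `icas_consistent` and `derived_inclusions` are hypothetical.

```python
def _find(parent, x):
    # Follow parent links up to the representative of x's equivalence class.
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def icas_consistent(is_a, equivalences):
    """Return False if the equivalences contradict the is-a links.

    is_a: list of (subtype, supertype) pairs, read as strict inclusion
          (an assumption of this sketch).
    equivalences: list of (type, type) pairs asserted equivalent.
    """
    names = {n for pair in list(is_a) + list(equivalences) for n in pair}
    parent = {n: n for n in names}
    for a, b in equivalences:
        parent[_find(parent, a)] = _find(parent, b)

    # Strict-inclusion edges between equivalence classes (sub -> super).
    edges = {}
    for sub, sup in is_a:
        edges.setdefault(_find(parent, sub), set()).add(_find(parent, sup))

    def reaches(start, target, seen):
        for nxt in edges.get(start, ()):
            if nxt == target:
                return True
            if nxt not in seen and reaches(nxt, target, seen | {nxt}):
                return True
        return False

    # A cycle of strict inclusions means some type strictly contains itself.
    return not any(reaches(c, c, {c}) for c in edges)


def derived_inclusions(is_a, equivalences):
    """Inclusions (supertype, type) that follow from one equivalence and one is-a link."""
    derived = set()
    for x, y in equivalences:
        for sub, sup in is_a:
            if sub == x:
                derived.add((sup, y))  # sup ⊇ x and x ≡ y, hence sup ⊇ y
            if sub == y:
                derived.add((sup, x))
    return derived


links = [("A", "B"), ("C", "D")]
print(icas_consistent(links, [("A", "D"), ("B", "C")]))
print(icas_consistent(links, [("A", "C")]))
print(derived_inclusions(links, [("A", "C")]))
```

With the two is-a links of the text, the pair A ≡ D, B ≡ C is rejected, while A ≡ C is accepted and yields the derivable inclusions B ⊇ C and D ⊇ A, which therefore need not be stated.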
An appropriate set of ICAs for the publication example is given in Figure 3. Figure 4 illustrates an integrated
schema built according to these ICAs.
[Diagram not recoverable. Surviving type names: Author, Paper, Publication, Journal, Proceedings, Conference.]
[Table 1a diagrams not recoverable. Surviving labels: E1, E2, E1 ∪ E2, E1 ∩ E2, E1 ≠ E2.]
1a: the standard solution: preserve the local types

Conflict     Integrated schema
             Merge technique    Exhaustive technique
E1 ⊇ E2      E1                 E1, with subtypes E2 and E1 - E2
E1 ∩ E2      E1 ∪ E2            E1 ∪ E2, with subtypes E1 and E2, themselves over E1 - E2, E1 ∩ E2 and E2 - E1
E1 ≠ E2      E1 ∪ E2
1b: alternative solutions
Table 1. Solving classification conflicts
Alternatively, the simplicity principle calls for the insertion into the IS of a unique type which describes the union
of the extents ("merge technique" in Table 1b). Mappings will then use selection operators to relate the input populations to
the integrated type. Finally, the exhaustiveness principle leads to the inclusion into the IS of both input types, together
with their union (one supertype) and their intersection plus the complements of the intersection (three subtypes).
In the integrated schema of Figure 4, the merge technique has been applied. For instance, the S1.Paper type has
been merged with the S2.Paper type.
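The two alternative techniques can be illustrated with extents modeled as plain Python sets. This is only a sketch under that assumption: the function names, the dictionary keys and the `in_s1`/`in_s2` source flags used by the selection are hypothetical, introduced here for illustration.

```python
def merge_technique(e1, e2):
    """One integrated type whose population is the union of both extents.

    Each instance remembers which input schema(s) it comes from, so that a
    selection can relate the input populations to the integrated type.
    """
    return {obj: {"in_s1": obj in e1, "in_s2": obj in e2} for obj in e1 | e2}


def select_input(population, flag):
    """Selection operator: recover one input population from the merged type."""
    return {obj for obj, source in population.items() if source[flag]}


def exhaustive_technique(e1, e2):
    """Both input types, their union (supertype) and three subtypes."""
    return {
        "E1_union_E2": e1 | e2,
        "E1": set(e1),
        "E2": set(e2),
        "E1_minus_E2": e1 - e2,
        "E1_and_E2": e1 & e2,
        "E2_minus_E1": e2 - e1,
    }


s1_papers = {"p1", "p2"}
s2_papers = {"p2", "p3"}
merged = merge_technique(s1_papers, s2_papers)
print(select_input(merged, "in_s1") == s1_papers)
print(exhaustive_technique(s1_papers, s2_papers)["E1_and_E2"])
```

The merge technique keeps the integrated schema small at the price of a selection in every mapping; the exhaustive technique makes every sub-population directly addressable at the price of six types instead of one.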
Strategies for classification conflicts have been extended to the integration of generalization hierarchies related
by multiple ICAs. The hierarchies are merged by taking each class from one hierarchy and determining the place where this
class fits in the other hierarchy. Placement is based on class inclusion semantics.
A structural conflict arises whenever corresponding types are described with constructs which have different
representational power: an object class and an attribute, for instance, or an entity type and a relationship type. In the
publication example, ICAs #1 and #4 about authors and about conferences identify structural conflicts. Few contributions
discuss this kind of conflict, mostly in the scope of a particular data model [4].
The integrated schema must describe the populations of the two conflicting types [12]. Hence, the integrated type
must subsume both input types: subsume their capacity to describe information (adopting the least upper bound) and
subsume the attached constraints (adopting the greatest lower bound). Capacity denotes which combination of
identity/value/link a construct expresses. For instance, an ER attribute holds a value, while an OO or relational attribute
holds either a value or a link. A relational relation holds a value and possibly links (through foreign keys), not identity.
Therefore, if an ICA between relational schemas identifies a relation (value+links) as corresponding to an attribute (value
or link), the IS will retain the relation.
Typical constraints to be considered are cardinality constraints and existence dependencies. For instance, an
attribute is existence dependent on its owner, while an object is generally not constrained by existence dependencies.
Considering the ICA relating the S1.Author entity type to the S2.Paper.authors attribute, the IS will retain the object
class. The greatest lower bound in this case is: no constraint.
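The constraint side of this rule can be sketched as follows. The sketch makes two assumptions that go beyond the text: constraints are compared as plain labels, and the greatest lower bound of two cardinalities is read as the tightest (min, max) range that covers both. The names `glb_constraints`, `glb_cardinality` and `N` are hypothetical.

```python
import math

N = math.inf  # stands for the unbounded maximum "n" of a cardinality


def glb_constraints(constraints1, constraints2):
    """Keep only the constraints that hold on both corresponding constructs."""
    return set(constraints1) & set(constraints2)


def glb_cardinality(c1, c2):
    """Tightest (min, max) cardinality that both inputs satisfy."""
    return (min(c1[0], c2[0]), max(c1[1], c2[1]))


# An attribute is existence dependent on its owner; an object class is not.
attribute_constraints = {"existence dependent on owner"}
object_constraints = set()
print(glb_constraints(attribute_constraints, object_constraints))  # no constraint left

# A mandatory monovalued link (1:1) against an optional multivalued one (0:n).
print(glb_cardinality((1, 1), (0, N)))
```

For the S1.Author entity type and the S2.Paper.authors attribute, the intersection is empty, which matches the "no constraint" result stated above.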
Table 2 shows which integrated construct may be chosen for each structural conflict. In the example, ICAs #1 and
#4 denote case ① conflicts: the S2.Paper.authors and S1.Conference.name attributes are changed into classes in the IS.
[Table 2 diagrams not recoverable. Columns: case, Schema 1, Schema 2, Integrated schema. Cases 1 to 4 relate a type T0 to types T1, T2 and T3 through different constructs. Legend fragment: one of the symbols represents an attribute.]
More complex structural conflicts (involving for instance join expressions) are possible [2, 4]. A general solution
remains to be found.
A descriptive conflict arises whenever there is some difference between properties of corresponding types. These
conflicts have been widely discussed [4, 5]. Table 3 summarizes many existing taxonomies.
Conflict                                Solution
names: homonyms                         prefix the names
names: synonyms                         provide aliases
different keys                          use a conversion function or a correspondence table
different sets of attributes            do the union of the sets of attributes (following the least upper bound approach)
conflicting corresponding attributes    use a conversion function or a correspondence table
integrity constraints                   take the greatest lower bound of the integrity constraints
Table 4. Classical solutions for descriptive conflicts
Table 4 shows traditional solutions for the most frequently discussed conflicts. A global solution has been proposed
in [8] by attaching to each attribute a context which describes its semantics. Up to now, however, the attribute structure
conflict has no real solution. Existing methodologies deal with simple monovalued attributes, rarely with multivalued
attributes [5], and ignore complex attributes (attributes composed of other attributes).
In the publication example:
• A naming conflict appears in ICA #3: S1.Conference has been renamed Proceedings in the IS.
• Different sets of attributes occur in ICAs #1, 3 and 4. In each case the integrated type bears all the attributes of
the two corresponding types.
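A few of the classical solutions listed for descriptive conflicts are simple enough to sketch in code. The attribute names, key values and helper names below are hypothetical and chosen only for illustration; they are not taken from the figures.

```python
def union_of_attributes(attrs1, attrs2):
    """Different sets of attributes: keep every attribute once, in order."""
    return list(dict.fromkeys([*attrs1, *attrs2]))


def prefix_homonyms(schema, names, homonyms):
    """Homonyms: prefix the clashing names with the name of their schema."""
    return [f"{schema}.{n}" if n in homonyms else n for n in names]


def translate_key(key, correspondence_table):
    """Different keys: look the key up in a correspondence table."""
    return correspondence_table[key]


# Hypothetical attribute lists for two corresponding Paper types.
print(union_of_attributes(["title", "keywords"], ["title", "authors"]))

# Hypothetical homonym: "name" means different things in the two schemas.
print(prefix_homonyms("S1", ["name", "title"], {"name"}))

# Hypothetical correspondence table between the keys of the two databases.
print(translate_key("P-001", {"P-001": 17, "P-002": 42}))
```

A conversion function plays the same role as the correspondence table when the relation between the two encodings can be computed rather than enumerated.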
Current integration methodologies only integrate schemas which are expressed in their own data model. Schemas
which are not expressed in this data model have to be translated during the pre-integration step.
We advocate that problems and solutions for conflicts are basically independent of data models. It should
therefore be feasible to identify the set of fundamental integration rules which are needed and to define, for any specific
data model, how each rule can be applied by reformulating the rule according to the peculiarities of the model under
consideration [12]. A tool can then be built, capable of supporting direct integration of heterogeneous schemas and of
producing an integrated schema in any data model.
Another proposal suggests higher order logic to allow users to directly define the integrated schema over
heterogeneous input schemas.
Data/metadata conflicts arise when data (values) in one database correspond to metadata (type names) in the
other database. Discussions of these conflicts have been illustrated using a relational Stock example (Figure 5), where a
value for stockcode in DB1 corresponds to an attribute name in DB2 and to a relation name in DB3.
DB1:
Stock ( date ,   stockcode , price )
        951001   IBM         77.72
        951002   IBM         79.23
        ...      ...         ...
        951001   HP          60.02
        951002   HP          61.45
        ...      ...         ...

DB2:
Stock ( date ,   IBM ,   HP ,    ... )
        951001   77.72   60.02   ...
        951002   79.23   61.45   ...
        ...      ...     ...     ...

DB3:
IBM ( date ,   price )      HP ( date ,   price )      ...
      951001   77.72             951001   60.02
      951002   79.23             951002   61.45
      ...      ...               ...      ...
Normal relational languages cannot turn an attribute or relation name into a value, nor vice versa. With such languages
it is therefore not possible to map DB2 or DB3 into DB1. These conflicts can be solved by introducing mapping languages
which support simultaneous manipulation of both data and metadata. Examples of such languages are an extended
relational algebra and a higher order logic.
Alternatively, it is possible to get rid of these conflicts by using a data model inhibiting any usage of meaningful
type names. More likely, OO approaches will provide schema transformation operations to perform all needed mappings:
namely, partitioning a class into subclasses according to the value of a specialization attribute (mapping DB1 into DB3),
creating a common superclass, with a new classifying attribute, over a set of given classes (mapping DB3 into DB1) ...
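The mappings between the three Stock databases can be sketched in a general-purpose language, where names and values are both ordinary data. The following Python sketch uses the sample rows of the Stock example; the function names and the dictionary-based representation of relations are assumptions made for illustration.

```python
DB1 = [
    {"date": "951001", "stockcode": "IBM", "price": 77.72},
    {"date": "951002", "stockcode": "IBM", "price": 79.23},
    {"date": "951001", "stockcode": "HP", "price": 60.02},
    {"date": "951002", "stockcode": "HP", "price": 61.45},
]

DB2 = [
    {"date": "951001", "IBM": 77.72, "HP": 60.02},
    {"date": "951002", "IBM": 79.23, "HP": 61.45},
]


def db1_to_db3(rows, classifier="stockcode"):
    """Partition one relation into one relation per value of the classifier."""
    relations = {}
    for row in rows:
        rest = {k: v for k, v in row.items() if k != classifier}
        relations.setdefault(row[classifier], []).append(rest)
    return relations


def db3_to_db1(relations, classifier="stockcode"):
    """Gather the relations under one type; relation names become values."""
    return [{classifier: name, **row}
            for name, rows in relations.items() for row in rows]


def db2_to_db1(rows, key="date", name_attr="stockcode", value_attr="price"):
    """Turn attribute names into values of a new attribute."""
    return [{key: row[key], name_attr: column, value_attr: value}
            for row in rows for column, value in row.items() if column != key]


DB3 = db1_to_db3(DB1)
print(sorted(DB3))            # relation names obtained from stockcode values
print(db3_to_db1(DB3) == DB1)
```

`db1_to_db3` corresponds to partitioning a class according to a specialization attribute, and `db3_to_db1` to creating a common superclass with a new classifying attribute.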
A data conflict occurs at the instance level if corresponding occurrences have conflicting values for corresponding
attributes. For instance, the same paper is stored in the two publication databases with different keywords. Sources of
data conflicts may be: typing errors, variety of information sources, different versioning, deferred updates ...
Data conflicts are normally found during query processing. The system may just report the conflict to the
user, or may apply some heuristic to determine the appropriate value. Common heuristics are: choosing the value
from "the most reliable" database, or uniting conflicting values in some way (through union for sets of values, through
aggregation for single values). Another possibility is to provide users with a manipulation language with facilities to
manipulate the set of possible tuples or values generated by data conflicts.
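The two common heuristics can be sketched as follows. The reliability scores, the keyword values and the choice of the arithmetic mean as the aggregation are assumptions of this illustration, as are the function names.

```python
from statistics import mean


def most_reliable(candidates, reliability):
    """Pick the value held by the database with the highest reliability score.

    candidates: {database name: value}
    reliability: {database name: score}, a hypothetical ranking of the sources.
    """
    best = max(candidates, key=lambda db: reliability[db])
    return candidates[best]


def unite(values):
    """Unite conflicting values: union for sets, aggregation for single values.

    The aggregation used here is the arithmetic mean, one possible choice.
    """
    values = list(values)
    if all(isinstance(v, (set, frozenset)) for v in values):
        return set().union(*values)
    return mean(values)


# The same paper stored with different keywords in the two databases.
keywords = {"DB1": {"databases", "integration"}, "DB2": {"integration", "schemas"}}
print(most_reliable(keywords, {"DB1": 0.9, "DB2": 0.5}))
print(unite(keywords.values()))
print(unite([10, 20]))
```

Reporting the conflict to the user instead of resolving it would amount to returning the `candidates` dictionary unchanged.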
Conclusion
Integrating existing databases is certainly not an easy task. Still, it is something that enterprises probably cannot avoid if
they want to launch new applications or to reorganize the existing information system for a better profit.
We have shown in this article that a basic understanding of the issues and of the solutions is available. We
focused on the fundamental concepts and techniques, insisting on the alternatives and on criteria for choice. More details
are easily found in the literature.
Several important problems remain to be investigated. Examples are: integration of complex objects, n-n
correspondences (fragmentation conflicts), integration of integrity constraints and methods, integration of heterogeneous
databases. Theoretical work is needed to assess integration rules and their behavior (commutativity, associativity, ...). It is
therefore important that the effort to solve integration issues be continued and that proposed methodologies be evaluated
through experiments with real applications.
References
1. Brodie M.L., Stonebraker M. Migrating Legacy Systems: Gateways, Interfaces & The Incremental Approach. Morgan
Kaufmann, 1995.
2. Dupont Y. Resolving Fragmentation Conflicts in Schema Integration. In Entity-Relationship Approach - ER'94. P.
Loucopoulos, Ed. LNCS 881, Springer-Verlag, Germany, 1994, pp. 513–532.
3. Gotthard W., Lockemann P.C., Neufeld A. System-guided view integration for object-oriented databases. IEEE Trans.
Knowl. Data Eng. 4, 1 (Feb. 1992), pp. 1–22.
4. Kim W. (Ed.) Modern Database Systems: The Object Model, Interoperability and Beyond. ACM Press and Addison
Wesley, Reading, Mass., 1995.
5. Larson J.A., Navathe S.B., Elmasri R. A Theory of attribute equivalence in databases with application to schema
integration. IEEE Trans. Softw. Eng. 15, 4 (Apr. 1989), pp. 449–463.
6. Litwin W., Mark L., Roussopoulos N. Interoperability of multiple autonomous databases. ACM Comput. Surveys 22,
3 (Sept. 1990), pp. 267–293.
7. Motro A. Superviews: Virtual integration of multiple databases. IEEE Trans. Softw. Eng. 13, 7 (July 1987), pp. 785–798.
8. Sciore E., Siegel M., Rosenthal A. Using semantic values to facilitate interoperability among heterogeneous
information systems. ACM TODS 19, 2 (June 1994), pp. 254–290.
9. Sheth A., Larson J. Federated database systems for managing distributed, heterogeneous, and autonomous databases.
ACM Comput. Surveys 22, 3 (Sept. 1990), pp. 183–236.
10. Sheth A., Kashyap V. So Far (Schematically) yet So Near (Semantically). In Interoperable Database Systems (DS-5).
D.K. Hsiao, E.J. Neuhold, and R. Sacks-Davis, Eds. IFIP Trans. A-25, North-Holland, 1993, pp. 283–312.
11. Scheuermann P., Chong E.I. Role-Based Query Processing in Multidatabase Systems. In Advances in Database
Technology - EDBT'94. M. Jarke, J. Bubenko, and K. Jeffery, Eds. LNCS 779, Springer-Verlag, Germany, 1994, pp. 95–108.
12. Spaccapietra S., Parent C., Dupont Y. Model independent assertions for integration of heterogeneous schemas. VLDB
J. 1, 1 (July 1992), pp. 81–126.
CHRISTINE PARENT ([email protected]) is a professor in the computer science department at the University of
Burgundy in Dijon, France. She is currently leading a research project on spatial database modeling at the Swiss Federal
Institute of Technology in Lausanne, Switzerland.
STEFANO SPACCAPIETRA ([email protected]) is a professor in the computer science department at the Swiss
Federal Institute of Technology in Lausanne, Switzerland. He is currently involved in the development of visual user
interfaces and cooperative design methodologies.