Conceptual Modeling For Data Integration
All content following this page was uploaded by Maurizio Lenzerini on 23 May 2014.
1 Introduction
The goal of data integration is to provide uniform access to a set of heterogeneous
data sources, freeing a client from the knowledge about where the data are, how they
are stored, and how they can be accessed. The problem of designing effective data
integration solutions has been addressed by several research and development projects in
recent years. However, data integration is still one of the major challenges in
Information Technology [5]. One of the reasons is that large amounts of heterogeneous data
are nowadays available within an organization, but these data have often been collected
and stored by different applications and systems. Therefore, on the one hand the need
to access data by means of flexible and unified mechanisms is becoming more and
more important, and, on the other hand, current commercial data integration tools have
several drawbacks.
Starting from the late 90s, research in data integration has mostly focused on declar-
ative approaches (as opposed to procedural ones) [32,26]. One of the outcomes of this
A.T. Borgida et al. (Eds.): Mylopoulos Festschrift, LNCS 5600, pp. 173–197, 2009.
© Springer-Verlag Berlin Heidelberg 2009
174 D. Calvanese et al.
thus amortizing the cost of integration. Therefore, the overall design can be regarded as
the incremental process of understanding and representing the domain on the one hand,
and the available data on the other hand.
We believe that all the advantages outlined above represent convincing arguments
supporting the conceptual approach to data integration. However, to fully pursue such
an approach, several challenging issues are to be addressed. The goal of this paper is
to analyze one of them, namely, how to express the conceptual model representing the
global schema.
We start our analysis with the case where the global schema of the data integration
system is expressed in terms of a UML class diagram. Notably, we show that the
expressive power of UML class diagrams is enough to make various tasks intractable,
including query answering. We then present a specific proposal of a logic-based
language for expressing conceptual models. The language, called DL-LiteA,id, is a tractable
Description Logic, specifically defined for achieving tractability of the reasoning tasks
that are relevant in data integration. The proposed data integration framework, based
on such a logic, has several interesting properties, including the fact that both
reasoning at design time and answering queries at run time can be done efficiently. Also, we
study possible extensions to the data integration framework based on DL-LiteA,id, and
show that our proposal basically represents an optimal compromise between expressive
power and computational complexity.
The paper is organized as follows. In Section 2, we describe a general architecture for
data integration, and the basic features of Description Logics, which are the logics we use
to formally express conceptual models. In Section 3, we analyze the case where the global
schema of the data integration system is expressed in terms of a UML class diagram. In
Section 4, we illustrate the basic characteristics of the Description Logic DL-LiteA,id, and
in Section 5, we illustrate a specific proposal of a data integration system based on such
a logic. In Section 6, we study possible extensions to such a data integration framework,
whereas in Section 7, we conclude the paper with a discussion on related and future work.
$q_S \leadsto q_G$,
$q_G \leadsto q_S$,
where $q_S$ and $q_G$ are two queries of the same arity, respectively over the source
schema S and over the global schema G. Queries $q_S$ are expressed in a query
language $L_{M,S}$ over the alphabet $A_S$, and queries $q_G$ are expressed in a query
language $L_{M,G}$ over the alphabet $A_G$. Intuitively, an assertion $q_S \leadsto q_G$ specifies
that the concept represented by the query $q_S$ over the sources corresponds to the
concept in the global schema represented by the query $q_G$ (similarly for an assertion
of type $q_G \leadsto q_S$).
The global schema provides a description of the domain of interest, and not simply
a unified representation of the source data. The source schema describes the structure
of the sources, where the real data are. The assertions in the mapping establish the
connection between the elements of the global schema and those of the source schema.
The semantics of a data integration system is based on the notion of interpretation
in logic. Indeed, in this paper we assume that G is formalized as a logical theory, and
therefore, given a source database D (i.e., a database for S), the semantics of the whole
system coincides with the set of interpretations that satisfy all the assertions of G (i.e.,
they are logical models of G) and all the assertions of M with respect to D. Such a set
of interpretations, denoted $sem_D(J)$, is called the set of models of J relative to D.
There are two basic tasks concerning a data integration system that we consider in
this paper. The first task is relevant in the design phase of the system, and concerns
the possibility of reasoning over the global schema: given G and a logical assertion
α, check whether α holds in every model of G. The second task is query answering,
which is crucial during the run-time of the system. Queries to J are posed in terms
of the global schema G, and are expressed in a query language $L_Q$ over the alphabet
$A_G$. A query is intended to provide the specification of which extensional information
to extract from the domain of interest in the data integration system. More precisely,
given a source database D, the answer $q^{J,D}$ to a query q in J with respect to D is the
set of tuples t of objects such that $t \in q^B$ (i.e., t is an answer to q over B) for every
model B of J relative to D. The set $q^{J,D}$ is called the set of certain answers to q in J
with respect to D. Note that, from the point of view of logic, finding certain answers is
a logical implication problem: check whether the fact that t satisfies the query logically
follows from the information on the sources and on the mapping.
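The definition of certain answers can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that the set of models is given as a short finite list (in general $sem_D(J)$ is infinite, so this is not how a real system computes answers); the names `certain_answers`, `played_matches`, and `MODELS` are hypothetical.

```python
from typing import Callable, FrozenSet, Iterable, Set, Tuple

Fact = Tuple[str, ...]      # e.g. ("playedMatch", "m1")
Model = FrozenSet[Fact]     # a model, simplified to a finite set of facts
Answer = Tuple[str, ...]


def certain_answers(query: Callable[[Model], Set[Answer]],
                    models: Iterable[Model]) -> Set[Answer]:
    """Return the tuples that are answers to `query` in every given model."""
    result = None
    for model in models:
        answers = query(model)
        result = answers if result is None else result & answers
    # Simplification: with no models at all we return the empty set, whereas
    # the formal definition would make every tuple a certain answer.
    return result if result is not None else set()


def played_matches(model: Model) -> Set[Answer]:
    """Hypothetical query q(x) <- playedMatch(x)."""
    return {(fact[1],) for fact in model if fact[0] == "playedMatch"}


# Two hypothetical models that agree on m1 but disagree on m2.
MODELS = [
    frozenset({("playedMatch", "m1"), ("playedMatch", "m2")}),
    frozenset({("playedMatch", "m1"), ("scheduledMatch", "m2")}),
]

print(certain_answers(played_matches, MODELS))
```

Only `m1` is returned, because `m2` is a played match in one model but not in the other.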
The above definition of data integration system is general enough to capture virtually
all approaches in the literature. Obviously, the nature of a specific approach depends on
the characteristics of the mapping, and on the expressive power of the various schema
and query languages. For example, the language $L_G$ may be very simple (basically
allowing for the definition of a set of relations), or may allow for various forms of
Description Logics [2] (DLs) were introduced in the early 80s in the attempt to provide
a formal ground to Semantic Networks and Frames. Since then, they have evolved into
knowledge representation languages that are able to capture virtually all class-based
representation formalisms used in Artificial Intelligence, Software Engineering, and
Databases. One of the distinguishing features of the work on these logics is the detailed
computational complexity analysis both of the associated reasoning algorithms, and of
the logical implication problem that the algorithms are supposed to solve. By virtue
of this analysis, most of these logics have optimal reasoning algorithms, and practical
systems implementing such algorithms are now used in several projects. In DLs, the
domain of interest is modeled by means of concepts and roles (i.e., binary relationships),
which denote classes of objects and binary relations between classes of objects,
respectively. Concepts and roles can be denoted using expressions of a specified language,
and the various DLs differ in the expressive power of such a language. The DLs
considered in this paper are subsets of a DL called ALCQIbid. ALCQIbid is an expressive
DL that extends the basic DL language AL (attributive language) with negation of
arbitrary concepts (indicated by the letter C), qualified number restrictions (indicated by
the letter Q), inverse of roles (indicated by the letter I), boolean combinations of roles
(indicated by the letter b), and identification assertions (indicated by the subscript id).
In more detail, concepts and roles in ALCQIbid are formed according to the following
syntax:
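$$
\begin{array}{rcl}
C, C' & \longrightarrow & A \mid \neg C \mid C \sqcap C' \mid C \sqcup C' \mid \forall R.C \mid \exists R.C \mid {\geq}\,n\,R.C \mid {\leq}\,n\,R.C \\
R, R' & \longrightarrow & P \mid P^- \mid R \cap R' \mid R \cup R' \mid R \setminus R'
\end{array}
$$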
where A denotes an atomic concept, P an atomic role, $P^-$ the inverse of an atomic role,
$C, C'$ arbitrary concepts, and $R, R'$ arbitrary roles. Furthermore, $\neg C$ denotes concept
negation, $C \sqcap C'$ concept intersection, $C \sqcup C'$ concept union, $\forall R.C$ value restriction,
$\exists R.C$ qualified existential quantification on roles, and ${\geq}\,n\,R.C$ and ${\leq}\,n\,R.C$ so-called
number restrictions. We use $\top$, denoting the top concept, as an abbreviation for $A \sqcup \neg A$,
for some concept A. An arbitrary role can be an atomic role or its inverse, or a role
obtained combining roles through set theoretic operators, i.e., intersection (“∩”), union
(“∪”), and difference (“\”). W.l.o.g., we assume difference applied only to atomic roles
and their inverses.
As an example, consider the atomic concepts Man and Woman, and the atomic
roles HAS-HUSBAND, representing the relationship between a woman and the man
with whom she is married, and HAS-CHILD, representing the parent-child
relationship. Then, intuitively, the inverse of HAS-HUSBAND, i.e., $\text{HAS-HUSBAND}^-$,
represents the relationship between a man and his wife. Also, $\text{Man} \sqcup \text{Woman}$ is a
concept representing people (considered the union of men and women), whereas the
concept $\exists\text{HAS-CHILD}.\text{Woman}$ represents those having a daughter, and the concept
an inclusion assertion states that, in every model of T, each instance of the left-hand
side expression is also an instance of the right-hand side expression. For example,
the inclusions $\text{Woman} \sqsubseteq {\leq}\,1\,\text{HAS-HUSBAND}.\top$ and $\exists\text{HAS-HUSBAND}^-.\top \sqsubseteq \text{Man}$
respectively specify that women may have at most one husband and that
husbands are men.
– A local identification assertion (or, simply, identification assertion or identification
constraint – IdC) makes use of the notion of path. A path is an expression built
according to the following syntax,
$\pi \longrightarrow S \mid D? \mid \pi \circ \pi$    (1)
$(id\ C\ \pi_1, \ldots, \pi_n)$
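[Figure 1: semantics of ALCQIbid concepts and roles]
$$
\begin{array}{ll}
A^I \subseteq \Delta^I & P^I \subseteq \Delta^I \times \Delta^I \\
(\neg C)^I = \Delta^I \setminus C^I & (P^-)^I = \{ (o, o') \mid (o', o) \in P^I \} \\
(C \sqcap C')^I = C^I \cap {C'}^I & (R \cap R')^I = R^I \cap {R'}^I \\
(C \sqcup C')^I = C^I \cup {C'}^I & (R \cup R')^I = R^I \cup {R'}^I \\
(\forall R.C)^I = \{ o \mid \forall o'.\ (o, o') \in R^I \supset o' \in C^I \} & (R \setminus R')^I = R^I \setminus {R'}^I \\
(\exists R.C)^I = \{ o \mid \exists o'.\ (o, o') \in R^I \wedge o' \in C^I \} & \\
({\geq}\,n\,R.C)^I = \{ o \mid |\{ o' \in C^I \mid (o, o') \in R^I \}| \geq n \} & \\
({\leq}\,n\,R.C)^I = \{ o \mid |\{ o' \in C^I \mid (o, o') \in R^I \}| \leq n \} &
\end{array}
$$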
the same husband, whereas the identification assertion (id Man HAS-CHILD) says
that a man is identified by his children, i.e., there are not two men with a child in
common. We can also say that there are not two men with the same daughters by
means of the identification (id Man HAS-CHILD ◦ Woman?).
The ABox consists of a set of extensional assertions, which are used to state the
instances of concepts and roles. Each such assertion has the form $A(a)$, $P(a, b)$, $a = b$,
or $a \neq b$, with A and P respectively an atomic concept and an atomic role occurring in
T, and a, b constants.
We now turn to the semantics of ALCQIbid, which is given in terms of interpretations.
An interpretation $I = (\Delta^I, \cdot^I)$ consists of a non-empty interpretation domain
$\Delta^I$ and an interpretation function $\cdot^I$, which assigns to each concept C a subset $C^I$ of
$\Delta^I$, and to each role R a binary relation $R^I$ over $\Delta^I$ in such a way that the conditions
specified in Figure 1 are satisfied. The semantics of an ALCQIbid KB $K = \langle T, A \rangle$ is
the set of models of K, i.e., the set of interpretations satisfying all assertions in T and
A. It remains to specify when an interpretation satisfies an assertion.
An interpretation I satisfies an inclusion assertion $C \sqsubseteq C'$ (resp., $R \sqsubseteq R'$), if
$C^I \subseteq {C'}^I$ (resp., $R^I \subseteq {R'}^I$).
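As a hedged illustration of these semantic conditions, the following Python sketch computes the extension of a concept expression over a small finite interpretation and checks an inclusion assertion. The tuple encoding of concepts, the string `"TOP"` for the top concept, and all function and class names are assumptions made for this example; role constructors other than inverse are left out to keep it short.

```python
from dataclasses import dataclass
from typing import Dict, Set, Tuple


@dataclass
class Interpretation:
    domain: Set[str]
    concepts: Dict[str, Set[str]]           # atomic concept -> extension
    roles: Dict[str, Set[Tuple[str, str]]]  # atomic role -> extension


def role_ext(I: Interpretation, role) -> Set[Tuple[str, str]]:
    """Extension of an atomic role "P" or of its inverse ("inv", "P")."""
    if isinstance(role, tuple) and role[0] == "inv":
        return {(b, a) for (a, b) in I.roles.get(role[1], set())}
    return I.roles.get(role, set())


def ext(I: Interpretation, c) -> Set[str]:
    """Extension of a concept: an atomic name, "TOP", or a nested tuple."""
    if isinstance(c, str):
        return set(I.domain) if c == "TOP" else I.concepts.get(c, set())
    op = c[0]
    if op == "not":
        return I.domain - ext(I, c[1])
    if op == "and":
        return ext(I, c[1]) & ext(I, c[2])
    if op == "or":
        return ext(I, c[1]) | ext(I, c[2])
    if op in ("exists", "forall"):        # ("exists", R, C)
        n, role, filler = None, c[1], c[2]
    elif op in ("atleast", "atmost"):     # ("atmost", n, R, C)
        n, role, filler = c[1], c[2], c[3]
    else:
        raise ValueError(f"unknown constructor: {op}")
    r, f = role_ext(I, role), ext(I, filler)
    result = set()
    for o in I.domain:
        successors = {b for (a, b) in r if a == o}
        hits = len(successors & f)
        if ((op == "exists" and hits >= 1) or
                (op == "forall" and successors <= f) or
                (op == "atleast" and hits >= n) or
                (op == "atmost" and hits <= n)):
            result.add(o)
    return result


def satisfies_inclusion(I: Interpretation, lhs, rhs) -> bool:
    """I satisfies lhs ⊑ rhs iff the extension of lhs is contained in that of rhs."""
    return ext(I, lhs) <= ext(I, rhs)


I1 = Interpretation(
    domain={"ann", "bob", "carl"},
    concepts={"Woman": {"ann"}, "Man": {"bob", "carl"}},
    roles={"HAS-HUSBAND": {("ann", "bob")}},
)
# Woman ⊑ ≤1 HAS-HUSBAND.⊤  and  ∃HAS-HUSBAND⁻.⊤ ⊑ Man
print(satisfies_inclusion(I1, "Woman", ("atmost", 1, "HAS-HUSBAND", "TOP")))
print(satisfies_inclusion(I1, ("exists", ("inv", "HAS-HUSBAND"), "TOP"), "Man"))
```

Adding a second pair such as `("ann", "carl")` to HAS-HUSBAND would make the first check fail, since ann would then have two husbands.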
In order to define the semantics of IdCs, we first define the semantics of paths, and
then specify the conditions for an interpretation to satisfy an IdC. The extension $\pi^I$ of
a path $\pi$ in an interpretation I is defined as follows:
– if $\pi = S$, then $\pi^I = S^I$,
– if $\pi = D?$, then $\pi^I = \{ (o, o) \mid o \in D^I \}$,
– if $\pi = \pi_1 \circ \pi_2$, then $\pi^I = \pi_1^I \circ \pi_2^I$, where $\circ$ denotes the composition operator on
relations.
As a notation, we write $\pi^I(o)$ to denote the set of $\pi$-fillers for o in I, i.e., $\pi^I(o) =
\{ o' \mid (o, o') \in \pi^I \}$. Then, an interpretation I satisfies the IdC $(id\ C\ \pi_1, \ldots, \pi_n)$ if for
all $o, o' \in C^I$, $\pi_1^I(o) \cap \pi_1^I(o') \neq \emptyset \wedge \cdots \wedge \pi_n^I(o) \cap \pi_n^I(o') \neq \emptyset$ implies $o = o'$. Observe
that this definition is coherent with the intuitive reading of IdCs discussed above, in
particular by sanctioning that two different instances $o, o'$ of C differ in the set of their
$\pi_i$-fillers when such sets are disjoint.
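The path semantics and the satisfaction condition for IdCs can be checked mechanically on a finite interpretation. The sketch below is only an illustration: the tuple encoding of paths (`"role"`, `"inv"`, `"test"`, `"comp"`) and the function names are assumptions, and the data are made up.

```python
from typing import Dict, Set, Tuple

Pair = Tuple[str, str]


def path_ext(path, roles: Dict[str, Set[Pair]],
             concepts: Dict[str, Set[str]]) -> Set[Pair]:
    """Extension of a path: a role, an inverse role, a test D?, or a composition."""
    kind = path[0]
    if kind == "role":
        return roles[path[1]]
    if kind == "inv":
        return {(b, a) for (a, b) in roles[path[1]]}
    if kind == "test":                       # D? relates each instance of D to itself
        return {(o, o) for o in concepts[path[1]]}
    if kind == "comp":                       # relational composition of two paths
        left = path_ext(path[1], roles, concepts)
        right = path_ext(path[2], roles, concepts)
        return {(a, c) for (a, b) in left for (b2, c) in right if b == b2}
    raise ValueError(f"unknown path constructor: {kind}")


def fillers(extension: Set[Pair], o: str) -> Set[str]:
    return {b for (a, b) in extension if a == o}


def satisfies_idc(concept: str, paths, roles, concepts) -> bool:
    """True iff no two distinct instances of `concept` share a filler on every path."""
    exts = [path_ext(p, roles, concepts) for p in paths]
    members = sorted(concepts[concept])
    for i, o1 in enumerate(members):
        for o2 in members[i + 1:]:
            if all(fillers(e, o1) & fillers(e, o2) for e in exts):
                return False
    return True


ROLES = {"HAS-CHILD": {("bob", "tom"), ("carl", "tom"), ("bob", "dana")}}
CONCEPTS = {"Man": {"bob", "carl"}, "Woman": {"dana"}}

# (id Man HAS-CHILD): violated, bob and carl have the child tom in common.
print(satisfies_idc("Man", [("role", "HAS-CHILD")], ROLES, CONCEPTS))
# (id Man HAS-CHILD ∘ Woman?): satisfied, they have no daughter in common.
daughters = ("comp", ("role", "HAS-CHILD"), ("test", "Woman"))
print(satisfies_idc("Man", [daughters], ROLES, CONCEPTS))
```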
Finally, to specify the semantics of ALCQIbid ABox assertions, we extend the
interpretation function to constants, by assigning to each constant a an object $a^I \in \Delta^I$.
[Figure: diagram of the football leagues example, with the concepts team, match, and round and the roles HOME, HOST, PLAYED-IN, and BELONGS-TO]
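[Figure 3: ALCQIbid TBox for the football leagues example]
INCLUSION ASSERTIONS
$\text{league} \sqsubseteq \exists\text{OF}.\top$
$\exists\text{OF}.\top \sqsubseteq \text{league}$
$\exists\text{OF}^-.\top \sqsubseteq \text{nation}$
$\text{round} \sqsubseteq \exists\text{BELONGS-TO}.\top$
$\exists\text{BELONGS-TO}.\top \sqsubseteq \text{round}$
$\exists\text{BELONGS-TO}^-.\top \sqsubseteq \text{league}$
$\text{match} \sqsubseteq \exists\text{PLAYED-IN}.\top$
$\exists\text{PLAYED-IN}.\top \sqsubseteq \text{match}$
$\exists\text{PLAYED-IN}^-.\top \sqsubseteq \text{round}$
$\text{match} \sqsubseteq \exists\text{HOME}.\top$
$\exists\text{HOME}.\top \sqsubseteq \text{match}$
$\exists\text{HOME}^-.\top \sqsubseteq \text{team}$
$\text{match} \sqsubseteq \exists\text{HOST}.\top$
$\exists\text{HOST}.\top \sqsubseteq \text{match}$
$\exists\text{HOST}^-.\top \sqsubseteq \text{team}$
$\text{playedMatch} \sqsubseteq \text{match}$
$\text{scheduledMatch} \sqsubseteq \text{match}$
$\text{playedMatch} \sqsubseteq \neg\text{scheduledMatch}$
$\text{match} \sqsubseteq \text{playedMatch} \sqcup \text{scheduledMatch}$
$\top \sqsubseteq {\leq}\,1\,\text{OF}.\top$
$\top \sqsubseteq {\leq}\,1\,\text{BELONGS-TO}.\top$
$\top \sqsubseteq {\leq}\,1\,\text{PLAYED-IN}.\top$
$\top \sqsubseteq {\leq}\,1\,\text{HOME}.\top$
$\top \sqsubseteq {\leq}\,1\,\text{HOST}.\top$
IDENTIFICATION ASSERTIONS
(id match HOME, PLAYED-IN)
(id match HOST, PLAYED-IN)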
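[Figure 4: UML associations between classes C1 and C2 with multiplicities $n_\ell..n_u$ and $m_\ell..m_u$; panel (a) shows a binary association A, and a further panel shows an association class A with components $R_{A,1}$ and $R_{A,2}$]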
To specify that the attribute is mandatory (i.e., multiplicity [1..∗]), we add the assertion
$C \sqsubseteq \exists a_C$,
which specifies that each instance of C necessarily participates at least once in the role
$a_C$. To specify that the attribute is single-valued (i.e., multiplicity [0..1]), we add the
assertion
$(funct\ a_C)$,
which is an abbreviation for $\top \sqsubseteq {\leq}\,1\,a_C.\top$. Finally, if the attribute is both mandatory
and single-valued (i.e., multiplicity [1..1]), we use both assertions together:
$C \sqsubseteq \exists a_C$, $(funct\ a_C)$.
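The translation of attribute multiplicities into DL assertions described above is mechanical, as the following small Python sketch suggests. The function name `attribute_assertions` and the rendering of assertions as plain strings are assumptions made for illustration; only the four multiplicities discussed in the text are handled.

```python
from typing import List, Union


def attribute_assertions(cls: str, attr: str, lower: int,
                         upper: Union[int, str]) -> List[str]:
    """DL assertions for an attribute `attr` of class `cls` with multiplicity [lower..upper].

    `upper` is either 1 or "*"; `attr` stands for the role written a_C in the text.
    """
    if lower not in (0, 1) or upper not in (1, "*"):
        raise ValueError("only [0..*], [1..*], [0..1] and [1..1] are covered")
    assertions = []
    if lower == 1:       # mandatory attribute
        assertions.append(f"{cls} ⊑ ∃{attr}")
    if upper == 1:       # single-valued attribute
        assertions.append(f"(funct {attr})")
    return assertions


print(attribute_assertions("match", "code", 1, 1))
print(attribute_assertions("match", "code", 0, "*"))
```

A multiplicity of [0..∗] yields no assertion at all, since it places no constraint on the attribute.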
An association in UML is a relation between the instances of two (or more) classes.
An association often has a related association class that describes properties of the
association, such as attributes, operations, etc. A binary association A between the instances
of two classes C1 and C2 is graphically rendered as in Figure 4(a), where the
multiplicity $m_\ell..m_u$ specifies that each instance of class C1 can participate at least $m_\ell$ times and
at most $m_u$ times to association A. The multiplicity $n_\ell..n_u$ has an analogous meaning
for class C2.
An association A between classes C1 and C2 is formalized in DL by means of a role
A on which we enforce the assertions
$\exists A \sqsubseteq C_1$, $\exists A^- \sqsubseteq C_2$.
To represent that the association A is between classes C1 and C2, we use the assertions
$\exists R_{A,1} \sqsubseteq A$, $\exists R_{A,1}^- \sqsubseteq C_1$, $\exists R_{A,2} \sqsubseteq A$, $\exists R_{A,2}^- \sqsubseteq C_2$.
In UML, one can use generalization between a parent class and a child class to
specify that each instance of the child class is also an instance of the parent class. Hence,
the instances of the child class inherit the properties of the parent class, but typically
they satisfy additional properties that in general do not hold for the parent class.
Generalization is naturally supported in DLs. If a UML class C2 generalizes a class
C1, we can express this by the DL assertion
$C_1 \sqsubseteq C_2$.
Often, when defining generalizations between classes, we need to add additional
assertions among the involved classes. For example, for the class hierarchy in Figure 5, an
assertion may express that C1, ..., Cn are mutually disjoint. In DL, such a relationship
can be expressed by the assertions
$C_i \sqsubseteq \neg C_j$, for each $i \neq j$,
while the covering of C by C1, ..., Cn (the "complete" annotation in Figure 5) corresponds to the assertion
$C \sqsubseteq C_1 \sqcup \cdots \sqcup C_n$.
[Figure 5: class hierarchy in which class C generalizes classes C1, C2, ..., Cn, annotated {disjoint, complete}]
Footnote 3: If the roles of the association are not available, we may use an arbitrary DL role name.
Footnote 4: Notice that such an approach can immediately be used to represent an association of any arity:
it suffices to repeat the above for every component.
of classes and associations. Hence, query answering over even moderately large data
sets is again infeasible in practice. It is not difficult to see that this implies that, in a data
integration system where the global schema is expressed as a UML diagram, answering
conjunctive queries is coNP-hard with respect to the size of the source data.
Actually, as we will see in the next section, the culprit of such a high complexity
is mainly the ability to express covering assertions, which induces reasoning by
cases. Once we disallow covering and suitably restrict the simultaneous use of subset
constraints between associations and multiplicities, not only do the sources of exponential
complexity disappear, but query answering actually becomes reducible to standard SQL
evaluation over a relational database.
We have seen that in a data integration system where the global schema is expressed as
a UML class diagram, reasoning is too complex. Thus, a natural question arising at this
point is: which is the right language to express the global schema of a data integration
system?
In this section, we present DL-LiteA,id, a DL of the DL-Lite family [12,11], enriched
with identification constraints (IdCs) [12], and show that it is very well suited for
conceptual modeling in data integration, in particular for its ability to balance
expressive power with efficiency of reasoning, i.e., query answering, which can be managed
through relational database technology.
DL-LiteA,id is essentially a subset of ALCQIbid , but, contrary to the DL presented
in Section 2, it distinguishes concepts from value-domains, which denote sets of (data)
values, and roles from attributes, which denote binary relations between objects and
values. Concepts, roles, attributes, and value-domains in this DL are formed according
to the following syntax (see footnote 5):
$$
\begin{array}{rcl@{\qquad}rcl}
B & \longrightarrow & A \mid \exists Q \mid \delta(U) & E & \longrightarrow & \rho(U) \\
C & \longrightarrow & B \mid \neg B & F & \longrightarrow & \top_D \mid T_1 \mid \cdots \mid T_n \\
Q & \longrightarrow & P \mid P^- & V & \longrightarrow & U \mid \neg U \\
R & \longrightarrow & Q \mid \neg Q & & &
\end{array}
$$
In such rules, A, P, and $P^-$ respectively denote an atomic concept, an atomic role, and
the inverse of an atomic role, Q and R respectively denote a basic role and an arbitrary
role, whereas B denotes a basic concept, C an arbitrary concept, U an atomic attribute,
V an arbitrary attribute, E a basic value-domain, and F an arbitrary value-domain.
Furthermore, $\delta(U)$ denotes the domain of U, i.e., the set of objects that U relates to
values; $\rho(U)$ denotes the range of U, i.e., the set of values that U relates to objects;
$\top_D$ is the universal value-domain; $T_1, \ldots, T_n$ are n pairwise disjoint unbounded
value-domains, corresponding to RDF data types, such as xsd:string, xsd:integer,
etc.
Footnote 5: The results mentioned in this paper apply also to DL-LiteA,id extended with role attributes
(cf. [9]), which are not considered here for the sake of simplicity.
$B \sqsubseteq C \qquad Q \sqsubseteq R \qquad E \sqsubseteq F \qquad U \sqsubseteq V$
$(funct\ Q) \qquad (funct\ U) \qquad (id\ C\ \pi_1, \ldots, \pi_n)$
The assertions above, from left to right, respectively denote inclusions between
concepts, roles, value-domains, and attributes, (global) functionality on roles and on
attributes, and identification constraints (see footnote 6). Notice that paths occurring in DL-LiteA,id
identification assertions may also involve attributes and value-domains, which are
instead not among the constructs present in ALCQIbid. More precisely, the symbol S in
equation (1) can now also be an attribute or the inverse of an attribute, and the symbol
D in (1) can now also be a basic or an arbitrary value-domain.
As for the ABox, besides assertions of the form $A(a)$, $P(a, b)$, with A an atomic
concept, P an atomic role, and a, b constants, in DL-LiteA,id we may also have assertions
of the form $U(a, v)$, where U is an atomic attribute, a a constant, and v a value. Notice
however that assertions of the form $a = b$ or $a \neq b$ are not allowed.
We are now ready to define what a DL-LiteA,id KB is.
Definition 1. A DL-LiteA,id KB K is a pair $\langle T, A \rangle$, where T is a DL-LiteA,id TBox, A
is a DL-LiteA,id ABox, and the following conditions are satisfied:
(1) for each atomic role P, if either $(funct\ P)$ or $(funct\ P^-)$ occur in T, then T does
not contain assertions of the form $Q \sqsubseteq P$ or $Q \sqsubseteq P^-$, where Q is a basic role;
(2) for each atomic attribute U, if $(funct\ U)$ occurs in T, then T does not contain
assertions of the form $U' \sqsubseteq U$, where $U'$ is an atomic attribute;
(3) all concepts identified in T are basic concepts, i.e., in each IdC $(id\ C\ \pi_1, \ldots, \pi_n)$
of T, the concept C is of the form A, $\exists Q$, or $\delta(U)$;
(4) all concepts or value-domains appearing in the test relations in T are of the form
A, $\exists Q$, $\delta(U)$, $\rho(U)$, $\top_D$, $T_1$, ..., or $T_n$;
(5) for each IdC $\alpha$ in T, every role or attribute that occurs (in either direct or inverse
direction) in a path of $\alpha$ is not specialized in T, i.e., it does not appear in the
right-hand side of assertions of the form $Q \sqsubseteq Q'$ or $U \sqsubseteq U'$.
Intuitively, the conditions stated at points (1-2) (resp., (5)) say that, in DL-LiteA,id
TBoxes, roles and attributes occurring in functionality assertions (resp., in paths of
IdCs) cannot be specialized. All the above conditions are crucial for the tractability of
reasoning in our logic.
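Conditions (1), (2), and (5) are purely syntactic, so they can be checked by a simple scan of the TBox. The following Python sketch is a hedged illustration: the encoding of inverse roles with the suffix `^-`, the function names, and the role `FINAL-IN` are assumptions, and only positive inclusions between roles and between attributes are passed in.

```python
from typing import Iterable, List, Tuple

INV = "^-"  # assumed textual marker for the inverse of a role, e.g. "HOME^-"


def atomic(role: str) -> str:
    """Strip the inverse marker, so that P and P^- map to the same atomic role."""
    return role[:-len(INV)] if role.endswith(INV) else role


def definition1_violations(functional: Iterable[str],
                           idc_symbols: Iterable[str],
                           role_inclusions: Iterable[Tuple[str, str]],
                           attr_inclusions: Iterable[Tuple[str, str]]) -> List[str]:
    """Report roles/attributes that are specialized although Definition 1 forbids it.

    Inclusions are (lhs, rhs) pairs; a symbol is specialized when it occurs,
    directly or inverted, on the right-hand side of a positive inclusion.
    """
    specialized = {atomic(rhs) for _, rhs in role_inclusions}
    specialized |= {rhs for _, rhs in attr_inclusions}
    problems = []
    for name in sorted({atomic(f) for f in functional} & specialized):
        problems.append(f"{name} is functional but specialized (conditions 1-2)")
    for name in sorted({atomic(s) for s in idc_symbols} & specialized):
        problems.append(f"{name} occurs in an IdC path but is specialized (condition 5)")
    return problems


FUNCTIONAL = {"PLAYED-IN", "code"}
IDC_SYMBOLS = {"HOME", "PLAYED-IN"}

print(definition1_violations(FUNCTIONAL, IDC_SYMBOLS, [], []))
# FINAL-IN is a hypothetical role added only to trigger the check.
print(definition1_violations(FUNCTIONAL, IDC_SYMBOLS, [("FINAL-IN", "PLAYED-IN")], []))
```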
The semantics of a DL-LiteA,id TBox is standard, except that it adopts the unique
name assumption: for every interpretation I, and distinct constants a, b, we have that
$a^I \neq b^I$. Moreover, it takes into account the distinction between objects and values by
partitioning the interpretation domain into two sets, containing objects and values,
respectively. Note that the adoption of the unique name assumption in DL-LiteA,id makes it
meaningless to use ABox assertions of the form $a = b$ and $a \neq b$, which instead occur in
ALCQIbid knowledge bases. Indeed, assertions of the first form cannot be satisfied by
DL-LiteA,id interpretations, thus immediately making the knowledge base inconsistent,
whereas assertions of the second form are always satisfied and are therefore implicit.
Footnote 6: We remind the reader that the identification constraints referred to in this paper are local.
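[Figure 6: schematic representation of (part of) the ontology for the football leagues domain, with the concepts league, nation, round, match, scheduledMatch, playedMatch, and team, the roles OF, BELONGS-TO, PLAYED-IN, HOME, and HOST, and the attributes code, year, playedOn, homeGoals, and hostGoals]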
We finally recall a notable result given in [13], characterizing the complexity of
answering UCQs over DL-LiteA,id knowledge bases. We remind the reader that $AC^0$
is the complexity class that corresponds to the complexity in the size of the data of
evaluating a first-order (i.e., SQL) query over a relational database (see, e.g., [1]).
Theorem 1 ([13]). Answering UCQs in DL-LiteA,id can be done in $AC^0$ with respect
to the size of the ABox.
The above result is proved by showing that it is possible to reduce the query answer-
ing problem to the evaluation of a FOL query, directly translatable to SQL, over the
database corresponding to the ABox assertions, thus exploiting standard commercial
relational database technology.
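To make the reduction to SQL concrete, the following Python sketch stores a tiny ABox in an in-memory SQLite database and evaluates a hand-written first-order rewriting of the query q(x) ← match(x). Everything here is an assumption made for illustration: the table layout (one table per atomic concept or role), the table name `match_concept` (chosen because MATCH is an SQL keyword), the sample constants, and the fact that only four of the inclusions with match on the right-hand side are compiled into the query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE match_concept(x TEXT);   -- ABox assertions match(a)
    CREATE TABLE playedMatch(x TEXT);     -- ABox assertions playedMatch(a)
    CREATE TABLE scheduledMatch(x TEXT);  -- ABox assertions scheduledMatch(a)
    CREATE TABLE HOME(x TEXT, y TEXT);    -- ABox assertions HOME(a, b)
    INSERT INTO match_concept VALUES ('m1');
    INSERT INTO playedMatch VALUES ('m2');
    INSERT INTO scheduledMatch VALUES ('m3');
    INSERT INTO HOME VALUES ('m4', 't1');
""")

# Hand-written rewriting of q(x) <- match(x) under the inclusions
#   playedMatch ⊑ match, scheduledMatch ⊑ match, ∃HOME ⊑ match:
# each disjunct of the resulting UCQ becomes one branch of a UNION.
REWRITTEN_SQL = """
    SELECT x FROM match_concept
    UNION SELECT x FROM playedMatch
    UNION SELECT x FROM scheduledMatch
    UNION SELECT x FROM HOME
"""

answers = sorted(row[0] for row in conn.execute(REWRITTEN_SQL))
print(answers)
```

The point of the example is that the TBox is not consulted at evaluation time: its relevant content has been compiled into the SQL query, which the database engine then evaluates over the ABox alone.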
Let us consider again the example on football leagues introduced in Section 2, and
model it as a DL-LiteA,id TBox. By virtue of the characteristics of DL-LiteA,id we can
now explicitly consider also attributes of concepts. In particular, we assume that when a
scheduled match takes place, it is played on a specific date, and that for every match that
has been played, the number of goals scored by the home team and by the host team
are given. Note that different matches scheduled for the same round can be played on
different dates. Also, we want to distinguish football championships on the basis of the
nation and the year in which a championship takes place (e.g., the 2008 Spanish Liga).
Finally, we assume that both matches and rounds have codes. In Figure 6, we show a
schematic representation of (part of) the new ontology for the football leagues domain,
whereas in Figure 7 the TBox assertions in DL-LiteA,id capturing the above aspects are
shown. Note that, besides the new assertions involving attributes, Figure 7 lists all
assertions given in Figure 3 (see footnote 7), which provide the ALCQIbid TBox modeling of the football
ontology, with the exception of the assertion $\text{match} \sqsubseteq \text{scheduledMatch} \sqcup \text{playedMatch}$.
This is actually the price to pay to keep reasoning tractable in DL-LiteA,id, and in
particular conjunctive query answering in $AC^0$. Indeed, the above assertion expresses
the covering of the concept match with the concepts scheduledMatch and playedMatch,
but as said in Section 3, the presence of covering assertions makes query answering
coNP-hard in the size of the ABox.
Footnote 7: We have used $\exists R$ instead of $\exists R.\top$, and inclusions of the form $\top \sqsubseteq {\leq}\,1\,R.\top$ are expressed as
functional assertions of the form $(funct\ R)$.
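[Figure 7: DL-LiteA,id TBox for the football leagues example]
INCLUSION ASSERTIONS
$\text{league} \sqsubseteq \exists\text{OF}$
$\exists\text{OF} \sqsubseteq \text{league}$
$\exists\text{OF}^- \sqsubseteq \text{nation}$
$\text{round} \sqsubseteq \exists\text{BELONGS-TO}$
$\exists\text{BELONGS-TO} \sqsubseteq \text{round}$
$\exists\text{BELONGS-TO}^- \sqsubseteq \text{league}$
$\text{match} \sqsubseteq \exists\text{PLAYED-IN}$
$\exists\text{PLAYED-IN} \sqsubseteq \text{match}$
$\exists\text{PLAYED-IN}^- \sqsubseteq \text{round}$
$\text{match} \sqsubseteq \exists\text{HOME}$
$\exists\text{HOME} \sqsubseteq \text{match}$
$\exists\text{HOME}^- \sqsubseteq \text{team}$
$\text{match} \sqsubseteq \exists\text{HOST}$
$\exists\text{HOST} \sqsubseteq \text{match}$
$\exists\text{HOST}^- \sqsubseteq \text{team}$
$\text{playedMatch} \sqsubseteq \text{match}$
$\text{scheduledMatch} \sqsubseteq \text{match}$
$\text{playedMatch} \sqsubseteq \neg\text{scheduledMatch}$
$\text{league} \sqsubseteq \delta(\text{year})$
$\text{match} \sqsubseteq \delta(\text{code})$
$\text{round} \sqsubseteq \delta(\text{code})$
$\text{playedMatch} \sqsubseteq \delta(\text{playedOn})$
$\text{playedMatch} \sqsubseteq \delta(\text{homeGoals})$
$\text{playedMatch} \sqsubseteq \delta(\text{hostGoals})$
$\rho(\text{playedOn}) \sqsubseteq \text{xsd:date}$
$\rho(\text{homeGoals}) \sqsubseteq \text{xsd:nonNegativeInteger}$
$\rho(\text{hostGoals}) \sqsubseteq \text{xsd:nonNegativeInteger}$
$\rho(\text{code}) \sqsubseteq \text{xsd:positiveInteger}$
$\rho(\text{year}) \sqsubseteq \text{xsd:positiveInteger}$
FUNCTIONAL ASSERTIONS
(funct OF)          (funct year)
(funct BELONGS-TO)  (funct code)
(funct PLAYED-IN)   (funct playedOn)
(funct HOME)        (funct homeGoals)
(funct HOST)        (funct hostGoals)
IDENTIFICATION CONSTRAINTS
1. (id league OF, year)
2. (id round BELONGS-TO, code)
3. (id match PLAYED-IN, code)
4. (id match HOME, PLAYED-IN)
5. (id match HOST, PLAYED-IN)
6. (id playedMatch playedOn, HOST)
7. (id playedMatch playedOn, HOME)
8. (id league year, $\text{BELONGS-TO}^- \circ \text{PLAYED-IN}^- \circ \text{HOME}$)
9. (id league year, $\text{BELONGS-TO}^- \circ \text{PLAYED-IN}^- \circ \text{HOST}$)
10. (id match HOME, HOST, $\text{PLAYED-IN} \circ \text{BELONGS-TO} \circ \text{year}$)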
The mapping assertions keep data value constants separate from object identifiers,
and construct identifiers as (logic) terms over data values. More precisely, object
identifiers in our approach are terms of the form $f(d_1, \ldots, d_n)$, called object terms, where
f is a function symbol of arity $n > 0$, and $d_1, \ldots, d_n$ are data values stored at the
sources. Note that this idea traces back to the work done in deductive object-oriented
databases [24].
We detail below the above ideas. The mapping component is specified through a set
of mapping assertions, each of the form
$\Phi(\vec{v}) \leadsto G(\vec{w})$
where
– $\Phi(\vec{v})$, called the body of the mapping, is a first-order logic (FOL) query of arity
$n > 0$, with distinguished variables $\vec{v}$, over the source schema S (we will write
such query in the SQL syntax), and
– $G(\vec{w})$, called the head, is an atom where G can be an atomic concept, an atomic
role, or an atomic attribute occurring in the global schema G, and $\vec{w}$ is a sequence
of terms.
We define three different types of mapping assertions:
– Concept mapping assertions, in which the head is a unary atom of the form
$A(f(\vec{v}))$, where A is an atomic concept and f is a function symbol of arity n;
– Role mapping assertions, in which the head is a binary atom of the form
$P(f_1(\vec{v}'), f_2(\vec{v}''))$, where P is an atomic role, $f_1$ and $f_2$ are function symbols of
arity $n_1, n_2 > 0$, and $\vec{v}'$ and $\vec{v}''$ are sequences of variables appearing in $\vec{v}$;
– Attribute mapping assertions, in which the head is a binary atom of the form
$U(f(\vec{v}'), v'' : T_i)$, where U is an atomic attribute, f is a function symbol of
arity $n' > 0$, $\vec{v}'$ is a sequence of variables appearing in $\vec{v}$, $v''$ is a variable appearing
in $\vec{v}$, and $T_i$ is an RDF data type.
In words, such mapping assertions are used to map source relations (and the tuples
they store), to concepts, roles, and attributes of the ontology (and the objects and the
values that constitute their instances), respectively. Note that an attribute mapping also
specifies the type of values retrieved from the source database, in order to guarantee
coherency of the system.
We conclude this section with an example of mapping assertions, referring again
to the football domain. Suppose that the source schema contains the relational
table TABLE(mcode,league,round,home,host), where a tuple $(m, l, r, h_1, h_2)$
with $l > 0$ represents a match with code m of league l and round r, and with home
team $h_1$ and host team $h_2$. If we want to map the tuples from the table TABLE to the
global schema shown in Figure 7, the mapping assertions might be as shown in Figure 8.
M1 is a concept mapping assertion that selects from TABLE the code and the round of
matches (only for the appropriate tuples), and then uses such data to build instances of
the concept match, using the function symbol m. M2 is an attribute mapping assertion
that is used to “populate” the attribute code for the objects that are instances of match.
Finally, M3 is a role mapping assertion relating TABLE to the atomic role PLAYED-IN,
where instances of round are denoted by means of the function symbol r. We notice that
in the mapping assertion M2, the mapping designer had to specify a correct DL-LiteA,id
data type for the values extracted from the source.

[Figure 8: mapping assertions]
M1: SELECT T.mcode, T.round, T.league
    FROM TABLE T
    WHERE T.league > 0
    ⇝ match(m(T.mcode, T.round, T.league))
M2: SELECT T.mcode, T.round, T.league
    FROM TABLE T
    WHERE T.league > 0
    ⇝ code(m(T.mcode, T.round, T.league), T.mcode : xsd:string)
M3: SELECT T.mcode, T.round, T.league
    FROM TABLE T
    WHERE T.league > 0
    ⇝ PLAYED-IN(m(T.mcode, T.round, T.league), r(T.round, T.league))
We point out that, during query answering, the body of each mapping assertion is
never really evaluated in order to extract values from the sources to build instances of
the global schema, but rather it is used to unfold queries posed over the global schema,
rewriting them into queries posed over the source schema. We discuss this aspect next.
Rewriting. Given a UCQ Q over a data integration system $J = \langle G, S, M \rangle$, and a
source database D for J, the rewriting step computes a new UCQ $Q_1$ over J, where the
assertions of G are compiled in. In computing the rewriting, only inclusion assertions
of the form $B_1 \sqsubseteq B_2$, $Q_1 \sqsubseteq Q_2$, and $U_1 \sqsubseteq U_2$ are taken into account, where $B_i$,
$Q_i$, and $U_i$, with $i \in \{1, 2\}$, are a basic concept, a basic role, and an atomic attribute,
respectively. Intuitively, the query Q is rewritten according to the knowledge specified
in G that is relevant for answering Q, in such a way that the rewritten query $Q_1$ is such
that $Q_1^{\langle \emptyset, S, M \rangle, D} = Q^{J,D}$, i.e., the rewriting allows us to get rid of G.
We refer to [30,12] for a formal description of the query rewriting algorithm and
for a proof of its soundness and completeness. We only notice here that the rewriting
procedure does not depend on the source database D, runs in polynomial time in the
size of G, and returns a query Q1 whose size is at most exponential in the size of Q.
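The flavor of the rewriting step can be conveyed by a deliberately simplified Python sketch. It is not the algorithm of [30,12]: it only rewrites unary atoms A(x) using inclusions whose right-hand side is the atomic concept A, and it never applies inclusions to binary atoms, so in general it computes only part of the full rewriting. The encoding of atoms and basic concepts, the use of `"_"` for an anonymous existential variable, and the name `rewrite_unary` are assumptions.

```python
from typing import FrozenSet, List, Set, Tuple, Union

Atom = Tuple[str, Tuple[str, ...]]          # ("match", ("x",)) or ("HOME", ("x", "_"))
CQ = FrozenSet[Atom]
# A basic concept on the left-hand side: an atomic name "A",
# or ("exists", "P", inverse) standing for ∃P (inverse=False) or ∃P⁻ (inverse=True).
Basic = Union[str, Tuple[str, str, bool]]


def rewrite_unary(cq: CQ, inclusions: List[Tuple[Basic, str]]) -> Set[CQ]:
    """Close `cq` under replacing A(x) by the atom for B(x), for each B ⊑ A."""
    seen: Set[CQ] = {cq}
    todo = [cq]
    while todo:
        q = todo.pop()
        for atom in q:
            pred, args = atom
            if len(args) != 1:
                continue                      # binary atoms are left untouched here
            x = args[0]
            for lhs, rhs in inclusions:
                if rhs != pred:
                    continue
                if isinstance(lhs, str):      # A' ⊑ A  gives  A'(x)
                    new_atom: Atom = (lhs, (x,))
                else:                         # ∃P ⊑ A gives P(x, _); ∃P⁻ ⊑ A gives P(_, x)
                    _, role, inverse = lhs
                    new_atom = (role, ("_", x) if inverse else (x, "_"))
                new_q = (q - {atom}) | {new_atom}
                if new_q not in seen:
                    seen.add(new_q)
                    todo.append(new_q)
    return seen


INCLUSIONS: List[Tuple[Basic, str]] = [
    ("playedMatch", "match"),
    ("scheduledMatch", "match"),
    (("exists", "HOME", False), "match"),     # ∃HOME ⊑ match
    (("exists", "HOME", True), "team"),       # ∃HOME⁻ ⊑ team
]

query: CQ = frozenset({("match", ("x",))})    # q(x) <- match(x)
ucq = rewrite_unary(query, INCLUSIONS)
for disjunct in sorted(sorted(d) for d in ucq):
    print(disjunct)
```

Note that the procedure never looks at the data: it depends only on the query and on the inclusions, in line with the observation that the rewriting does not depend on the source database.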
Filtering. Let $Q_1$ be the UCQ produced by the rewriting step above. In the filtering
step we take care of a particular problem that the disjuncts, i.e., conjunctive queries,
in $Q_1$ might have. Specifically, a conjunctive query cq is called ill-typed if it has at
least one join variable x appearing in two incompatible positions in cq, i.e., such that
the TBox G of our data integration system logically implies that x is both of type $T_i$,
and of type $T_j$, with $T_i \neq T_j$ (remember that in DL-LiteA,id data types are pairwise
disjoint). The purpose of the filtering step is to remove from the UCQ $Q_1$ all the
ill-typed conjunctive queries. Intuitively, such a step is needed because the query $Q_1$ has
to be unfolded and then evaluated over the source database D (cf. the next two steps of
the query answering algorithm, described below). These last two steps, performed for
an ill-typed conjunctive query, might produce incorrect results.
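A minimal Python sketch of the filtering idea follows. As a simplifying assumption, the type of a value position is read directly from range assertions of the form ρ(U) ⊑ T, rather than being derived by full logical implication from the TBox; the names `RANGE_TYPE`, `is_ill_typed`, and `filter_ucq` are hypothetical.

```python
from typing import Dict, List, Tuple

Atom = Tuple[str, Tuple[str, ...]]
CQ = List[Atom]

# Range types taken from assertions ρ(U) ⊑ T (cf. the football TBox).
RANGE_TYPE: Dict[str, str] = {
    "code": "xsd:positiveInteger",
    "year": "xsd:positiveInteger",
    "playedOn": "xsd:date",
}


def is_ill_typed(cq: CQ, range_type: Dict[str, str]) -> bool:
    """True if some variable occurs in value positions of two different data types."""
    type_of: Dict[str, str] = {}
    for pred, args in cq:
        t = range_type.get(pred)
        if t is None or len(args) != 2:
            continue                          # not an attribute atom with a known range
        value_var = args[1]
        if type_of.setdefault(value_var, t) != t:
            return True
    return False


def filter_ucq(ucq: List[CQ], range_type: Dict[str, str]) -> List[CQ]:
    """Keep only the well-typed disjuncts of the UCQ."""
    return [cq for cq in ucq if not is_ill_typed(cq, range_type)]


well_typed: CQ = [("code", ("x", "v")), ("year", ("l", "v"))]
ill_typed: CQ = [("code", ("x", "v")), ("playedOn", ("y", "v"))]

print(filter_ucq([well_typed, ill_typed], RANGE_TYPE))
```

In the second disjunct the variable v would have to be both a positive integer and a date, which is impossible because the data types are pairwise disjoint, so that disjunct is dropped.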
Unfolding. Given the UCQ Q2 over J computed by the filtering step, the unfolding
step computes, by using logic programming techniques, an SQL query Q3 over the
source schema S, that possibly returns object terms. It can be shown [30] that Q3 is
such that $Q_3^{D} = Q_2^{\langle \emptyset, S, M \rangle, D}$, i.e., unfolding allows us to get rid of M. Moreover, the
unfolding procedure does not depend on D, runs in polynomial time in the size of M,
and returns a query whose size is polynomial in the size of Q2 .
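To make the unfolding step more concrete, here is a minimal sketch under strong assumptions: each global predicate has exactly one mapping, given as an SQL query whose columns are named c0, c1, ... by argument position, and object terms are built by string concatenation. The actual procedure of [30] relies on logic programming techniques and is more general. The table `src_people`, the `MAPPINGS` dictionary, and the function names are hypothetical.

```python
# Hypothetical mappings: global predicate -> SQL query over the source schema.
# Object terms such as pers(123) are encoded here as plain strings.
MAPPINGS = {
    "Person": "SELECT 'pers(' || ssn || ')' AS c0 FROM src_people",
    "name": "SELECT 'pers(' || ssn || ')' AS c0, full_name AS c1 FROM src_people",
}


def unfold_cq(head_vars, atoms, mappings):
    """Unfold one conjunctive query into an SQL query over the sources.

    `atoms` is a list of (predicate, args); `head_vars` lists the answer
    variables, which are assumed to be valid SQL column aliases.
    """
    from_parts, where_parts, first_seen = [], [], {}
    for i, (predicate, args) in enumerate(atoms):
        from_parts.append(f"({mappings[predicate]}) AS t{i}")
        for j, var in enumerate(args):
            column = f"t{i}.c{j}"
            if var in first_seen:
                # A repeated variable becomes a join condition.
                where_parts.append(f"{first_seen[var]} = {column}")
            else:
                first_seen[var] = column
    select = ", ".join(f"{first_seen[v]} AS {v}" for v in head_vars)
    sql = f"SELECT DISTINCT {select} FROM " + ", ".join(from_parts)
    if where_parts:
        sql += " WHERE " + " AND ".join(where_parts)
    return sql


def unfold_ucq(head_vars, ucq, mappings):
    """Unfold a union of conjunctive queries as an SQL UNION."""
    return " UNION ".join(unfold_cq(head_vars, cq, mappings) for cq in ucq)
```

In this sketch the resulting SQL text grows linearly with the number of atoms of the query, each atom contributing one copy of its mapping query.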
Evaluation. The evaluation step consists in simply delegating the evaluation of the SQL
query Q3 , produced by the unfolding step, to the data federation tool managing the data
sources. Formally, such a tool returns the set $Q_3^{D}$, i.e., the set of tuples obtained from
the evaluation of Q3 over D.
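The evaluation step amounts to handing an SQL string to an external engine. The sketch below uses an in-memory SQLite database as a stand-in for the data federation tool, which is an assumption made purely for illustration. The table `src_people` and the query `q3` are hypothetical.

```python
import sqlite3


def evaluate(sql, connection):
    """Delegate the SQL query to the engine and return its answers as a set of tuples."""
    return set(connection.execute(sql).fetchall())


def demo():
    """Build a tiny source database D and evaluate a hand-written query over it."""
    connection = sqlite3.connect(":memory:")
    connection.executescript(
        """
        CREATE TABLE src_people (ssn TEXT, dept TEXT);
        INSERT INTO src_people VALUES ('1', 'CS'), ('2', 'Math');
        """
    )
    # q3 plays the role of the query produced by unfolding; it returns object terms.
    q3 = "SELECT 'pers(' || ssn || ')' FROM src_people WHERE dept = 'CS'"
    return evaluate(q3, connection)
```

Calling `demo()` returns the single answer tuple built from the row whose department is 'CS'. No reasoning over G or M takes place at this point, since both have already been compiled into the query.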
In other words, the above theorem says that UCQs in our approach are FO-rewritable.
Finally, we remark that, as we said at the beginning of this section, we have assumed
that the data integration system J is consistent with respect to the database D, i.e.,
$sem_D(J)$ is non-empty. Notably, it can be shown that all the machinery we have
devised for query answering can also be used for checking consistency of J with respect
to D. Therefore, checking consistency can also be reduced to sending appropriate SQL
queries to the source database [30,13].
We now analyze the possibility of extending the data integration setting presented above
without affecting the complexity of query answering. In particular, we investigate pos-
sible extensions for the language for expressing the global schema, the language for
expressing the mappings, and the language for expressing the source schema.
We start by dealing with extending the global schema language. There are two pos-
sible ways of extending DL-LiteA,id . The first one corresponds to a proper language
extension, i.e., adding new DL constructs to DL-LiteA,id , while the second one consists
of changing/strengthening the semantics of the formalism.
Concerning language extensions, the results in [11] show that it is not possible to
add any of the usual DL constructs to DL-LiteA,id while keeping the data complexity
of query answering within $AC^0$. This means that DL-LiteA,id is essentially the most
expressive DL allowing for data integration systems where query answering is FO-
rewritable.
Concerning the possibility of strengthening the semantics, we briefly analyze the
consequences of removing the unique name assumption (UNA), i.e., the assumption
that, in every interpretation of a data integration system, two distinct object terms and
two distinct value constants denote two different domain elements. Unfortunately, this
takes query answering out of LOGSPACE, and therefore leads to losing
FO-rewritability of queries.
based on query rewriting in Datalog enriched with negation and disjunction, under sta-
ble model semantics [8,23].
A second interesting issue for further work is looking at “write-also” data integration
tools. Indeed, while the techniques presented in this paper provide support for answer-
ing queries posed to the data integration system, it could be of interest to also deal
with updates expressed on the global schema (e.g., according to the approach described
in [16,17]). The most challenging issue to be addressed in this context is to design
mechanisms for correctly reformulating an update expressed over the ontology into a
series of insert and delete operations on the data sources.
Acknowledgements. This research has been partially supported by the IP project On-
toRule (ONTOlogies meet Business RULEs ONtologiES), funded by the EC under ICT
Call 3 FP7-ICT-2008-3, contract number FP7-231875, by project DASIbench (Data and
Service Integration workbench), funded by IBM through a Faculty Award grant, and by
the MIUR FIRB 2005 project “Tecnologie Orientate alla Conoscenza per Aggregazioni
di Imprese in Internet” (TOCAI.IT).
References
1. Abiteboul, S., Hull, R., Vianu, V.: Foundations of Databases. Addison Wesley Publ. Co.,
Reading (1995)
2. Baader, F., Calvanese, D., McGuinness, D., Nardi, D., Patel-Schneider, P.F. (eds.): The De-
scription Logic Handbook: Theory, Implementation and Applications. Cambridge University
Press, Cambridge (2003)
3. Berardi, D., Calvanese, D., De Giacomo, G.: Reasoning on UML class diagrams. Artificial
Intelligence 168(1–2), 70–118 (2005)
4. Bernstein, P.A., Giunchiglia, F., Kementsietsidis, A., Mylopoulos, J., Serafini, L., Zaihrayeu,
I.: Data management for peer-to-peer computing: A vision. In: Proc. of the 5th Int. Workshop
on the Web and Databases, WebDB 2002 (2002)
5. Bernstein, P.A., Haas, L.: Information integration in the enterprise. Communications of the
ACM 51(9), 72–79 (2008)
6. Brodie, M.L., Mylopoulos, J., Schmidt, J.W. (eds.): On Conceptual Modeling: Perspectives
from Artificial Intelligence, Databases, and Programming Languages. Springer, Heidelberg
(1984)
7. Calì, A., Calvanese, D., De Giacomo, G., Lenzerini, M.: On the expressive power of data
integration systems. In: Spaccapietra, S., March, S.T., Kambayashi, Y. (eds.) ER 2002. LNCS,
vol. 2503, pp. 338–350. Springer, Heidelberg (2002)
8. Calì, A., Lembo, D., Rosati, R.: Query rewriting and answering under constraints in data
integration systems. In: Proc. of the 18th Int. Joint Conf. on Artificial Intelligence (IJCAI 2003),
pp. 16–21 (2003)
9. Calvanese, D., De Giacomo, G., Lembo, D., Lenzerini, M., Poggi, A., Rosati, R.: Linking
data to ontologies: The description logic DL-LiteA . In: Proc. of the 2nd Int. Workshop on
OWL: Experiences and Directions (OWLED 2006). CEUR Electronic Workshop Proceedings,
vol. 216 (2006), http://ceur-ws.org/
10. Calvanese, D., De Giacomo, G., Lembo, D., Lenzerini, M., Poggi, A., Rosati, R., Ruzzi,
M.: Data integration through DL-LiteA ontologies. In: Schewe, K.-D., Thalheim, B. (eds.)
SDKB 2008. LNCS, vol. 4925, pp. 26–47. Springer, Heidelberg (2008)
11. Calvanese, D., De Giacomo, G., Lembo, D., Lenzerini, M., Rosati, R.: Data complexity of
query answering in description logics. In: Proc. of the 10th Int. Conf. on the Principles of
Knowledge Representation and Reasoning (KR 2006), pp. 260–270 (2006)
12. Calvanese, D., De Giacomo, G., Lembo, D., Lenzerini, M., Rosati, R.: Tractable reasoning
and efficient query answering in description logics: The DL-Lite family. J. of Automated
Reasoning 39(3), 385–429 (2007)
13. Calvanese, D., De Giacomo, G., Lembo, D., Lenzerini, M., Rosati, R.: Path-based identifi-
cation constraints in description logics. In: Proc. of the 11th Int. Conf. on the Principles of
Knowledge Representation and Reasoning (KR 2008), pp. 231–241 (2008)
14. Calvanese, D., De Giacomo, G., Lenzerini, M., Nardi, D., Rosati, R.: Data integration in data
warehousing. Int. J. of Cooperative Information Systems 10(3), 237–271 (2001)
15. Carey, M.J., Haas, L.M., Schwarz, P.M., Arya, M., Cody, W.F., Fagin, R., Flickner, M., Lu-
niewski, A., Niblack, W., Petkovic, D., Thomas, J., Williams, J.H., Wimmers, E.L.: Towards
heterogeneous multimedia information systems: The Garlic approach. In: Proc. of the 5th
Int. Workshop on Research Issues in Data Engineering – Distributed Object Management
(RIDE-DOM 1995), pp. 124–131. IEEE Computer Society Press, Los Alamitos (1995)
16. De Giacomo, G., Lenzerini, M., Poggi, A., Rosati, R.: On the update of description logic
ontologies at the instance level. In: Proc. of the 21st Nat. Conf. on Artificial Intelligence
(AAAI 2006), pp. 1271–1276 (2006)
17. De Giacomo, G., Lenzerini, M., Poggi, A., Rosati, R.: On the approximation of instance
level update and erasure in description logics. In: Proc. of the 22nd Nat. Conf. on Artificial
Intelligence (AAAI 2007), pp. 403–408 (2007)
18. Duschka, O.M., Genesereth, M.R.: Answering recursive queries using views. In: Proc. of the
16th ACM SIGACT SIGMOD SIGART Symp. on Principles of Database Systems (PODS
1997), pp. 109–116 (1997)
19. Duschka, O.M., Genesereth, M.R., Levy, A.Y.: Recursive query plans for data integration. J.
of Logic Programming 43(1), 49–73 (2000)
20. Garcia-Molina, H., Papakonstantinou, Y., Quass, D., Rajaraman, A., Sagiv, Y., Ullman, J.D.,
Vassalos, V., Widom, J.: The TSIMMIS approach to mediation: Data models and languages.
J. of Intelligent Information Systems 8(2), 117–132 (1997)
21. Genesereth, M.R., Keller, A.M., Duschka, O.M.: Infomaster: An information integration
system. In: Proc. of the ACM SIGMOD Int. Conf. on Management of Data, pp. 539–542 (1997)
22. Goh, C.H., Bressan, S., Madnick, S.E., Siegel, M.D.: Context interchange: New features
and formalisms for the intelligent integration of information. ACM Trans. on Information
Systems 17(3), 270–293 (1999)
23. Grieco, L., Lembo, D., Ruzzi, M., Rosati, R.: Consistent query answering under key and
exclusion dependencies: Algorithms and experiments. In: Proc. of the 14th Int. Conf. on
Information and Knowledge Management (CIKM 2005), pp. 792–799 (2005)
24. Hull, R.: A survey of theoretical research on typed complex database objects. In: Paredaens,
J. (ed.) Databases, pp. 193–256. Academic Press, London (1988)
25. Kirk, T., Levy, A.Y., Sagiv, Y., Srivastava, D.: The Information Manifold. In: Proceedings
of the AAAI 1995 Spring Symp. on Information Gathering from Heterogeneous, Distributed
Environments, pp. 85–91 (1995)
26. Lenzerini, M.: Data integration: A theoretical perspective. In: Proc. of the 21st ACM
SIGACT SIGMOD SIGART Symp. on Principles of Database Systems (PODS 2002), pp.
233–246 (2002)
27. Leone, N., Eiter, T., Faber, W., Fink, M., Gottlob, G., Greco, G., Kalka, E., Ianni, G., Lembo,
D., Lenzerini, M., Lio, V., Nowicki, B., Rosati, R., Ruzzi, M., Staniszkis, W., Terracina,
G.: The INFOMIX system for advanced integration of incomplete and inconsistent data. In:
Proc. of the ACM SIGMOD Int. Conf. on Management of Data, pp. 915–917 (2005)
28. Levy, A.Y., Rajaraman, A., Ordille, J.J.: Querying heterogeneous information sources using
source descriptions. In: Proc. of the 22nd Int. Conf. on Very Large Data Bases, VLDB 1996
(1996)
29. Levy, A.Y., Srivastava, D., Kirk, T.: Data model and query evaluation in global information
systems. J. of Intelligent Information Systems 5, 121–143 (1995)
30. Poggi, A., Lembo, D., Calvanese, D., De Giacomo, G., Lenzerini, M., Rosati, R.: Linking
data to ontologies. In: Spaccapietra, S. (ed.) Journal on Data Semantics X. LNCS, vol. 4900,
pp. 133–173. Springer, Heidelberg (2008)
31. Tomasic, A., Raschid, L., Valduriez, P.: Scaling access to heterogeneous data sources with
DISCO. IEEE Trans. on Knowledge and Data Engineering 10(5), 808–823 (1998)
32. Ullman, J.D.: Information integration using logical views. In: Afrati, F.N., Kolaitis, P.G.
(eds.) ICDT 1997. LNCS, vol. 1186, pp. 19–40. Springer, Heidelberg (1996)