Database
History
• Typical IT assets of a modern company include ERP
systems, sales tracking systems, HR systems, etc.
• Over the last twenty years these were implemented
based on a client-server computing model where the
DBMS runs at the server level, accessed by a
collection of applications which run on the client
desktop.
• Recently, classic client-server computing has become
obsolete: the client tier moves inside a web browser,
and three-tier architectures are built in which the
client may be thick or thin.
• In either case, the application is executed on the
middle tier (Application Server).
Evolution from Application Servers
[Diagram: evolution from data-driven application servers to application-driven Enterprise Application Integration (EAI)]
Problems in traditional DB architectures
Many different kinds of heterogeneity arise among DBs that
must be used together:
1. Different platforms → technological heterogeneity
2. Different data models at the participating DBMSs → model
heterogeneity
3. Different query languages → language heterogeneity
4. Different data schemas and different conceptual
representations in DBs previously developed at the
participating DBMSs → schema (or semantic)
heterogeneity
5. Errors in data, which result in different values for the same
information → instance heterogeneity
6. Dependencies exist among databases, between databases and
applications, and among applications
Virtual vs materialized integration
A distinction is made according to how the needed
information is retrieved: the integration approach may be
1. virtual, or
2. materialized.
Materialized
[Diagram: a materialized integrated store built from a relational DB, an OO DB, and an Excel sheet]
Virtual
[Diagram: on-line (virtual) access to a relational DB, an OO DB, and an Excel sheet]
Relevant Types of
Integrated Database Systems
Materialized integration
• Large common machines came to be known as
warehouses, and the software to access, scrape,
transform, and load data into warehouses became
known as extract, transform, and load (ETL) systems.
• In a dynamic environment, one must perform the ETL
periodically (say, once a day or once a week), thereby
building up a history of the enterprise.
• The main purpose of a data warehouse is to allow
systematic or ad-hoc data mining.
• Not appropriate when one needs to integrate the
operational systems (keeping data up-to-date).
• Will be dealt with in the second part of the course; a
minimal sketch of a periodic ETL load is given below.
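As an illustration only (all schema, table, and column names here are assumptions, not from the slides), a periodic ETL step might extract yesterday's rows from an operational source, transform them, and append them to a warehouse history table:

-- hypothetical daily ETL step: extract, transform (currency conversion), load
insert into warehouse.orders_history (order_id, amount_eur, order_date, load_date)
select o.order_id,
       o.amount * x.rate_to_eur,  -- transform: convert to a common currency
       o.order_date,
       current_date               -- record when this snapshot was loaded
from operational.orders o
join operational.exchange_rates x on x.currency = o.currency
where o.order_date = current_date - interval '1' day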
Virtual integration
• The virtual integration approach leaves the requested
information in the local sources, so a query always
returns a fresh answer. The query posed to the global
schema is reformulated into the formats of the local
information systems, and the retrieved information is
combined to answer the query.
Rationale
The conventional wisdom is to use data warehousing and ETL
products to perform data integration. However, there is a
serious flaw in one aspect of this wisdom.
Suppose one wants to integrate current (operational) data
rather than historical information. Consider, for example, an
e-commerce web site which wishes to sell hotel rooms over
the web. The actual inventory of available hotel rooms
exists in 100 or so information systems. After all, Hilton,
Hyatt and Marriott all run their own reservation systems.
Applying ETL and warehousing to this problem will create a
copy of hotel availability data, which is quickly out of date.
If a web site sells a hotel room, based on this data, it has
no way of guaranteeing delivery of the room, because
somebody else may have sold the room in the meantime.
A simple example (Batini)
Query execution in the
integrated database
When a DML statement, such as a query, is submitted,
the system has to decompose it into queries
against the two component databases. It must determine:
1. which tables come from which database,
2. which predicates apply to only a single database, and
3. which predicates apply to tuples from both databases.
• Syntax:
create view ViewName [ (AttList) ] as
SQLquery
A hedged example of such a decomposition is sketched below.
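As an illustrative sketch (the view, table, and database names are assumptions, with the DB1./DB2. prefixes used as informal notation for the two component databases), suppose EMP lives in DB1 and DEPT in DB2, joined by a global view:

create view GlobalEmp (EmpName, DeptName, City) as
select E.Name, D.DName, D.City
from DB1.EMP E, DB2.DEPT D
where E.DeptNo = D.DeptNo

A query on GlobalEmp with the predicate City = 'Milano' can push that predicate down to DB2 alone, whereas the join predicate E.DeptNo = D.DeptNo involves tuples from both databases and must be evaluated globally.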
Query execution: example
• The join needs to be evaluated at a global level.
• One of the main challenges in developing virtual data
integration systems is to find good strategies for
decomposing and evaluating queries against multiple
databases in an efficient manner.
The Data Integration
problem
• Combining data coming from different
data sources, providing the user with a
unified vision of the data
• Detecting correspondences between
similar concepts that come from
different sources, and resolving conflicts
The Data Integration problem
[Diagram: a QUERY FRONT-END receives queries and returns answers, drawing on DATA SOURCE 1 (RDBMS), DATA SOURCE 2 (XML), and DATA SOURCE 3 (WWW)]
More systematically,
the data integration problem concerns…
• Autonomy:
– Design, or representation, autonomy: which data, and how
– Communication autonomy: which services should be provided to the users or
to the other DB systems
– Execution autonomy: which algorithms for query processing and in general for
data access
which causes
• Heterogeneity:
– Different platforms
– Different data models
– Different query languages
– Different data schemas, i.e., modeling styles (conflicts…)
– Different values for the same info (inconsistency)
The possible situations
• Even in a single, centralized DB, there
is an integration problem
• Distributed or federated DB
– Homogeneous data: same data model
– Heterogeneous data: different data
models
– Semi-structured data
• The extreme case: data integration for
transient, initially unknown data sources
An orthogonal classification
• Centralized architecture: the traditional
architecture for centralized, virtual or
materialized data integration
• Data exchange: pairwise exchange of data
between two data sources
• Peer-to-peer: decentralized, dynamic,
data-centric coordination between
autonomous organizations
DATA INTEGRATION
[Diagram: a classification of data integration along the axes “one vs multiple data sources”, “global schema vs point-to-point”, and “materialized vs virtual”, covering: SINGLE USER VIEW OVER A CENTRALIZED DB, DATA WAREHOUSE, DATA EXCHANGE (materialized), DATA EXCHANGE (virtual)]
The possible solutions
➢ Data integration problems arise even in the
simplest situation: a unique, centralized DB…
UNIQUE DB
An integrated DB
[Diagram: a single DB accessed by programs P1…P5 through views V1…V5]
UNIQUE DB: view integration
• Each functionality in the company will
have its own personalized view
• This is achieved by using view
definitions
• Views allow personalization as well as
access control
Design steps for a unique DB
by view integration
(mixed strategy)
Query processing in a centralized DB
(ANSI/SPARC architecture)
[Diagram (ANSI/SPARC): users query external views over the GLOBAL LOGICAL SCHEMA; the DBMS maps it to the INTERNAL SCHEMA (records), file management (OS pages), and disk management (blocks), down to the stored data]
Point 4 above:
View Integration
a. Related concept identification
b. Conflict analysis and resolution
c. Conceptual Schema integration
4.b Conflict analysis
• Name conflicts
– HOMONYMS
e.g., product “price” (sale price vs. production price)
– SYNONYMS
e.g., “Department” and “Division” naming the same concept
Conflict analysis
TYPE CONFLICTS
• in a single attribute (e.g. NUMERIC, ALPHANUMERIC, ...)
e.g. the attribute “gender”:
– Male/Female
– M/F
– 0/1
– In Italy, it is implicit in the “codice fiscale” (SSN)
• in an entity type
different abstractions of the same real world concept produce
different sets of properties (attributes)
Conflict analysis
DATA SEMANTICS
• different currencies (euros, US dollars,
etc.)
• different measure systems (kilos vs
pounds, Celsius vs. Fahrenheit)
• different granularities (grams, kilos, etc.)
Conflict analysis
• STRUCTURE CONFLICTS
[Diagram: in one schema, EMPLOYEE is related to both DEPARTMENT and PROJECT; in the other, EMPLOYEE is related to PROJECT only]
Conflict analysis
• DEPENDENCY (OR CARDINALITY) CONFLICTS
[Diagram: the same relationships among EMPLOYEE, DEPARTMENT, and PROJECT appear with different cardinalities (1:1 vs 1:n) in the two schemas]
Conflict analysis
• KEY CONFLICTS
[Diagram: the entity PRODUCT is identified by different keys in the two schemas (CODE and LINE vs. CODE and DESCRIPTION)]
4.c Schema Integration
• Conflict resolution
• Production of a new conceptual schema
which expresses (as much as possible) the
same semantics as the schemata we wanted
to integrate
• Production of the transformations between
the original schemata and the integrated one:
V1(DB), V2(DB), …, Vn(DB)
Exercise
We want to define the database of a ski school having different sites in Italy. Each
site has a name, a location, a phone number and an e-mail address. We want
to store information about customers, employees and teachers (SSN, name,
surname, birth date, address, and phone). For each teacher we store also the
technique (cross-country, downhill, snow board). The personnel office has a
view over the personnel data, that is, ski teachers and employees.
The school organizes courses in the different locations of the school. The course
organization office has a view over these courses. The courses have a code
(unique for each site), the starting date, the day of the week in which the
course takes place, the time, and the kind of course (cross-country, downhill,
snow board), the level, the number of lessons, the cost, and the minimal age for
participants. Each course is associated with a unique teacher and with its
participants.
Query processing in a centralized DB
(ANSI/SPARC architecture)
[Diagram (ANSI/SPARC): users query external views over the GLOBAL LOGICAL SCHEMA; the DBMS maps it to the INTERNAL SCHEMA (records), file management (OS pages), and disk management (blocks), down to the stored data]
Query processing in a
centralized DB
• User issues a query on the view → Q(Vi)
• Query composition → Q ∘ Vi (DB)
• The answer is in terms of the base
relations (global schema) or of the
viewed relations (external schema),
depending on how sophisticated the
system is; a minimal sketch of the
composition is given below
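As a minimal sketch (view, table, and attribute names are assumptions), composing a user query Q with the view definition Vi amounts to substituting the definition into the query:

create view YoungEmp as
select Name, Age from EMP where Age < 30

-- the user query Q(Vi), posed on the view:
select Name from YoungEmp where Age > 25

-- after composition Q ∘ Vi, evaluated directly on the base relation:
select Name from EMP where Age < 30 and Age > 25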
Distributed DB
The simplest case of a non-centralized DB
• Often data for the same organization
• Integrated a priori: same design pattern as in
the centralized situation; indeed, we have
homogeneous technology and data model, and
the same schema integration problems as above
• For the instance, design decisions concern:
• Fragmentation:
➢ Vertical
➢ Horizontal
• Allocation
• Replication
A hedged sketch of fragmentation expressed as views is given below.
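As an illustration (table, attribute, and site names are assumptions, not from the slides), fragmentation of a global EMP relation can be expressed with views:

-- horizontal fragmentation: rows split by site
create view EMP_Milano as select * from EMP where Site = 'Milano'
create view EMP_Roma as select * from EMP where Site = 'Roma'
-- the global relation is reconstructed by union of the fragments

-- vertical fragmentation: columns split, the key SSN kept in both fragments
create view EMP_Public as select SSN, Name from EMP
create view EMP_Payroll as select SSN, Salary from EMP
-- the global relation is reconstructed by join on SSN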
NON-CENTRALIZED DBs
More heterogeneities:
• Same data model, different systems, e.g. relational
(Oracle, Sybase, DB2…) → technological
heterogeneity
• Different data models, e.g. hierarchical or network
(IMS, Codasyl…), relational, OO → model
heterogeneity
• Same data model, different query languages (SQL,
Quel) → language heterogeneity
• Semi- or unstructured data (HTML, XML,
multimedia…) → again, model heterogeneity
Approaches
• Data conversion (materialization): data are
converted and stored into a unique system
– Multiple copies: redundancy, inconsistency
– Application rewriting if one system is discarded
• Data exchange: creation of gateways
between system pairs
– Only appropriate when there are just two systems;
no support for queries over data coming from
multiple systems (e.g. peer-to-peer)
– The number of gateways increases rapidly
(quadratically in the number of systems)
• Multidatabase: creation of a global schema
Data integration in the
Multidatabase
1. Source schema identification (when present)
2. Source schema reverse engineering (data source
conceptual schemata)
3. Conceptual schemata integration and restructuring
4. Conceptual to logical translation (of the obtained
global schema)
5. Mapping between the global logical schema and
the single schemata (logical view definition)
6. After integration: query-answering through data
views
Source schema identification:
a first approximation
• Same data model
• Adoption of a global schema
• The global schema will provide a reconciled,
integrated, virtual view of the data sources
Architecture with a
homogeneous data model
Global query processing:
an example of “unfolding”
A global, nested query. Given R1 in DB1 and
R2 in DB2,
Select R1.A
From R1
Where R1.B in
(Select R2.B From R2)
In a different (Datalog-like) notation:
Q(A) :- R1(A,B,_,…), R2(B,_,…)
Questions
• How much data has to be sent
“upwards” for global evaluation?
• How much data processing has to be
done locally?
➢ AN OPTIMIZATION PROBLEM (a sketch of the
trade-off is given below)
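For the nested query above, two simple strategies illustrate the trade-off (a sketch, assuming R1 in DB1 and R2 in DB2 as before):

-- Option 1: ship the whole of R2 upwards and evaluate everything globally
-- (no local processing, maximal data transfer)
Select * From R2

-- Option 2: project and deduplicate locally at DB2, shipping only the B values
-- (more local processing, much less data sent upwards)
Select distinct R2.B From R2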
Point 2 above: Source
schema reverse engineering
Reverse engineering of the source schemata,
and their translation into a conceptual
model (e.g. ER)
• CASE tools may partially help us to reconstruct the
original conceptual schema
• However, conceptual schemata are more expressive
than logical ones; thus, we must know the reality of
interest to really be able to recover the initial
knowledge, lost in the logical design phase
Point 3 above: Conceptual schemata
integration and restructuring
Mapping between data
sources and global (mediated)
schema
• A data integration system is a triple (G, S, M)
• Queries to the integrated system are posed
in terms of G and specify which data of the
virtual database we are interested in
• The problem is understanding which real data
(in the data sources) correspond to those
virtual data
GAV (Global As View)
• The relationship (mapping) between sources and
global schema is obtained by defining each element
of the global schema as a view over the data sources
The other possible ways
LAV (Local As View)
• The global schema has been designed
independently of the data source schemata
• The relationship (mapping) between sources and
global schema is obtained by defining each data
source as a view over the global schema
GLAV (Global and Local As View)
• The relationship (mapping) between sources and
global schema is obtained by defining a set of
views, some over the global schema and some
over the data sources
Mapping between data
sources and global schema
• Global schema G
• Source schemata S
• Mapping M between sources and global
schema: a set of assertions
qS → qG
qG → qS
where qS and qG are queries over the sources and
over the global schema, respectively; a small example
of both kinds of assertion is sketched below
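As a small illustration in the Datalog-like notation used earlier (the relations S1, S2, S3, and G are assumptions):

GAV assertion — a global relation defined by queries over the sources:
G(Name, Age) :- S1(Name, Age)
G(Name, Age) :- S2(Name, Age)

LAV assertion — a source described as a view over the global schema:
S3(Name) :- G(Name, Age), Age > 18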
GAV example
[Diagram: two sources, SOURCE 1 and SOURCE 2, and the GLOBAL SCHEMA defined as a view over them]
GAV
• Suppose now we introduce a new source
• The simple view we have just created
must be modified
• In the simplest case, we only need to
add a union with a new SELECT-FROM-
WHERE block, as sketched below
• This is not true in general: view
definitions may be much more complex
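A minimal sketch (source and attribute names are assumptions, in the style of the GAV example later in these slides): extending a union view G when a third source S3 is added.

create view G as
select Name, Age from S1
union
select Name, Age from S2
union
select Name, Age from S3  -- the new SELECT-FROM-WHERE block for the new source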
GAV
• Quality depends on how well we have
compiled the sources into the global
schema through the mapping
• Whenever a source changes or a new
one is added, the global schema needs
to be reconsidered
• Query processing is based on unfolding
• Example: one already seen
Query processing in GAV:
Unfolding
Query sent to DB2:
Select R2.B into X
From R2
Global query, using the shipped values X:
Select R1.A
From R1
Where R1.B in X
Query processing in GAV
[Diagram: a QUERY OVER THE GLOBAL SCHEMA is unfolded into queries over the sources]
How do we write the GAV views?
OPERATORS:
EXAMPLES
R1:
SSN NAME AGE SALARY
123456789 JOHN 34 30K
234567891 KETTY 27 25K
345678912 WANG 39 32K

R2:
SSN NAME SALARY PHONE
234567891 KETTY 20K 1234567
345678912 WANG 22K 2345678
456789123 MARY 34K 3456789
OPERATORS: EXAMPLES (2)
With R1 and R2 as in the previous slide, the operator R1 Ge R2 yields:
SSN NAME SALARY
234567891 KETTY ??
345678912 WANG ??
123456789 JOHN 30K
456789123 MARY 34K
More will be said about these “uncertain” values
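A hedged SQL approximation of this operator (assuming R1 and R2 as above): a full outer join on SSN, where conflicting SALARY values are left unresolved, with NULL standing in for “??”.

select coalesce(R1.SSN, R2.SSN) as SSN,
       coalesce(R1.NAME, R2.NAME) as NAME,
       case when R1.SALARY is null then R2.SALARY
            when R2.SALARY is null then R1.SALARY
            when R1.SALARY = R2.SALARY then R1.SALARY
            else null  -- conflicting values: “uncertain”
       end as SALARY
from R1 full outer join R2 on R1.SSN = R2.SSN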
LAV
A LAV mapping is a set of assertions, one for each
element s of each source S:
s → qG
Thus the content of each source is characterized in
terms of a view qG over the global schema; a small
sketch is given below.
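As an illustrative sketch (the source name and the selection condition are assumptions), with the global relation GLOB-PROD used in the next slides, a source listing only cheap red products could be described by the LAV assertion

SOURCE-RED(PCode, Name, Price) :-
  GLOB-PROD(PCode, _, Name, _, 'red', _, _, Price, _), Price < 100

i.e., the source is characterized as (possibly a subset of) the answer to this query over the global schema.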
LAV
• Quality depends on how well we have
characterized the sources
• High modularity and extensibility (if the
global schema is well designed, when a
source changes or is added, only its
definition is to be updated)
• Query processing needs reasoning
[Diagram: SOURCE 1 and SOURCE 2 defined as views over the GLOBAL SCHEMA]
GLOBAL SCHEMA:
GLOB-PROD (PCode, VCode, Name, Size, Color, Description, CatID, Price, Stock)
[Diagram: the same example, showing the mapping of SOURCE 1 and SOURCE 2 to the GLOBAL SCHEMA GLOB-PROD (PCode, VCode, Name, Size, Color, Description, CatID, Price, Stock)]
Observations
• Do GAV and LAV provide exact views?
• Usually we assume so, but this need not hold
(e.g., in GAV, if we define integrity
constraints on the global schema)
GAV approach
S1 (Name, Age)
S2 (Name, Age)
G (Name, Age)

Create view G as
Select S1.Name as Name, S1.Age as Age From S1
Union
Select S2.Name as Name, S2.Age as Age From S2

with the following global integrity constraint:
G.Age > 18

Source instances:
S1: (Rossi, 17), (Verdi, 21)
S2: (Verdi, 21), (Bianchi, 29)

Tuples accessible from the data sources:
GProf: (Rossi, 17), (Verdi, 21), (Bianchi, 29)
This view is the union of the two data sources, but it does not satisfy the integrity constraint.

Tuples accessible from the global schema:
GProf: (Verdi, 21), (Bianchi, 29)
The mapping is sound, not complete, thus not exact.
GAV
with integrity constraints
[Diagram (Venn): the tuples accessible from the global schema form a subset of the tuples accessible from source 1 and source 2]
➢ Data Models
Heterogeneous information
sources:
Tax_Position source (XML)
<!ELEMENT ListOfStudent (Student*)>
<!ELEMENT Student
(name,s_code,school_name,e_mail,tax_fee)>
<!ELEMENT name (#PCDATA)>
Heterogeneous information
sources:
Computer_Science source (OO)
CS_Person(first_name, last_name)
Professor : CS_Person(belongs_to: Division, rank)
Student : CS_Person(year, takes: set<Course>, rank, e_mail)
Division(description, address: Location)
Location(city, street, number, country)
Course(course_name, taught_by: Professor)
General multidatabase model
(see C. Yu)
Steps
1. Reverse engineering
2. Conceptual schemata integration
3. Choice of the target data model and translation
of the global conceptual schema
4. Definition of the language translation
5. Definition of the data views (as usual)
according to the chosen paradigm (GAV,
LAV,…)
Step 4.
WRAPPERS (translators)
• Convert queries into queries/commands
understandable by the specific data source
– they can extend the query possibilities of a data
source
• Convert query results from the source’s format
into a format understandable by the application
• We will say more when talking about semi-
structured information
The new application
context
• A (possibly large) number of data sources
• Time-variant data “form” (e.g. the Web)
• Heterogeneous data sources
• Mobile, transient data sources
• Mobile users
• Different levels of data structure
– Databases (relational, OO…)
– Semistructured data sources (XML, HTML, other
markups…)
– Unstructured data (text, multimedia, etc.)
• Different terminologies and different operation
contexts
Heterogeneous, dynamic
systems
In a general setting, we would like a uniform,
as-transparent-as-possible interface to many
autonomous and heterogeneous data sources.
This interface should:
• Take care of finding, for us, the data sources
relevant to the issue we are interested in
• Interact with the single sources
• Combine the results obtained from the single sources
A more dynamic solution: no
global schema, but mediators
• An interface from the users/applications to the
database servers that only defines communication
protocols and formats does not deal with the
abstraction and representation problems existing in
today’s data and knowledge resources
• The interfaces must take on an active role
• We will refer to the dynamic interface function as
mediation. This term includes (Wiederhold):
– the processing needed to make the interfaces work
– the knowledge structures that drive the transformations
needed to transform data to information
– any intermediate storage that is needed
Mediators
Types of mediation functions
that have been developed
• Transformation and subsetting of databases using
view definitions and object templates
• Methods to access and merge data from multiple
databases
• Computations that support abstraction and
generalization over underlying data
• Intelligent directories to information bases such as
library catalogs, indexing aids and thesaurus
structures
• Methods to deal with uncertainty and missing data
because of incomplete or mismatched sources
AN EXAMPLE OF ARCHITECTURE WITH
MEDIATORS
(TSIMMIS)
[Diagram: APPLICATION 1 and APPLICATION 2 access the data sources through MEDIATOR 1 and MEDIATOR 2]
Mediators
• No unique global schema is required
• Each mediator works in its own way
• One mediator may or may not use a
global schema
• E.g.: in the TSIMMIS project, the DataGuide
P2P data integration
• Several peers
• Each peer with local and external
sources
• Queries are posed over one peer
• Answers integrate the peer’s own
data plus the other peers’ data
Instance Heterogeneity
• At query processing time, when a real-world object is
represented by instances in different databases, they
may have different values:

In one database:
SSN NAME AGE SALARY
234567891 Ketty 48 18k

In another database:
SSN NAME AGE SALARY
234567891 Ketty 48 25k
Resolution function
Data inconsistency may depend on
different causes:
– One (or both) of the sources is incorrect
– Each source has a correct but partial view,
e.g. databases from different workplaces
→ the full salary is the sum of the two
– In general, the correct value may be
obtained as a function of the original ones
(maybe: 0*value1 + 1*value2 !!)
A hedged sketch of such a function in SQL is given below.
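A minimal SQL sketch (assuming two relations R1 and R2 like those in the example that follows, each holding a partial view of the salary): the resolution function here is the sum of the two partial salaries.

select R1.SSN, R1.NAME,
       R1.SALARY + R2.SALARY as SALARY  -- resolution function: sum of the partial views
from R1 join R2 on R1.SSN = R2.SSN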
RESOLUTION FUNCTION:
EXAMPLE
R1:
SSN NAME AGE SALARY POSITION
123456789 JOHN 34 30K ENGINEER
234567891 KETTY 27 25K ENGINEER
345678912 WANG 39 32K MANAGER

R2:
SSN NAME AGE SALARY PHONE
234567891 KETTY 25 20K 1234567
345678912 WANG 38 22K 2345678
456789123 MARY 42 34K 3456789