Database
History
• Typical IT assets of a modern company include ERP
systems, sales tracking systems, HR systems, etc.
• Over the last twenty years these were implemented
based on a client-server computing model where the
DBMS runs at the server level, accessed by a
collection of applications which run on the client
desktop.
• Recently, classic client-server computing has become
obsolete: the client tier moves inside a web browser,
and three-tier architectures are built in which the
client may be thick or thin.
• In either case, the application is executed on the
middle tier (Application Server).
Evolution from Application Servers
[Diagram: evolution from data-driven application servers to application-driven Enterprise Application Integration (EAI)]
Problems in traditional DB architectures
Many different kinds of heterogeneity arise among DBs that
must be used together:
1. Different platforms → technological heterogeneity
2. Different data models at the participating DBMSs → model
heterogeneity
3. Different query languages → language heterogeneity
4. Different data schemas and different conceptual
representations in DBs previously developed at the
participating DBMSs → schema (or semantic)
heterogeneity
5. Errors in data, which result in different values for the same
information → instance heterogeneity
6. Dependencies exist among databases, between databases and
applications, and among applications
Virtual vs materialized integration
A distinction is made according to how the needed
information is retrieved: the integration approach may be
1. virtual, or
2. materialized.
Materialized
[Diagram: a materialized integrated store built from a relational DB, an OO DB, and an Excel sheet]
Virtual
[Diagram: on-line (virtual) access to a relational DB, an OO DB, and an Excel sheet]
Relevant Types of
Integrated Database Systems
Materialized integration
• Large common machines came to be known as
warehouses, and the software to access, scrape,
transform, and load data into warehouses became
known as extract, transform, and load (ETL) systems.
• In a dynamic environment, one must perform the ETL
periodically (say, once a day or once a week), thereby
building up a history of the enterprise.
• The main purpose of a data warehouse is to allow
systematic or ad-hoc data mining.
• Not appropriate when one needs to integrate the
operational systems (keeping data up-to-date).
• Will be dealt with in the second part of the course; a
minimal sketch of a periodic ETL load is given below.
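As an illustration only (all schema, table, and column names here are assumptions, not from the slides), a periodic ETL step might extract yesterday's rows from an operational source, transform them, and append them to a warehouse history table:

-- hypothetical daily ETL step: extract, transform (currency conversion), load
insert into warehouse.orders_history (order_id, amount_eur, order_date, load_date)
select o.order_id,
       o.amount * x.rate_to_eur,  -- transform: convert to a common currency
       o.order_date,
       current_date               -- record when this snapshot was loaded
from operational.orders o
join operational.exchange_rates x on x.currency = o.currency
where o.order_date = current_date - interval '1' day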
Virtual integration
• The virtual integration approach leaves the requested
information in the local sources, so a query always
returns a fresh answer. The query posed to the global
schema is reformulated into the formats of the local
information systems, and the retrieved information is
combined to answer the query.
Rationale
The conventional wisdom is to use data warehousing and ETL
products to perform data integration. However, there is a
serious flaw in one aspect of this wisdom.
Suppose one wants to integrate current (operational) data
rather than historical information. Consider, for example, an
e-commerce web site which wishes to sell hotel rooms over
the web. The actual inventory of available hotel rooms
exists in 100 or so information systems. After all, Hilton,
Hyatt and Marriott all run their own reservation systems.
Applying ETL and warehousing to this problem will create a
copy of hotel availability data, which is quickly out of date.
If a web site sells a hotel room, based on this data, it has
no way of guaranteeing delivery of the room, because
somebody else may have sold the room in the meantime.
A simple example (Batini)
Query execution in the
integrated database
When a DML statement, such as a query, is submitted,
the system has to decompose it into queries
against the two component databases. It must determine:
1. which tables come from which database,
2. which predicates apply to only a single database, and
3. which predicates apply to tuples from both databases.
• Syntax:
create view ViewName [ (AttList) ] as
SQLquery
A hedged example of such a decomposition is sketched below.
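As an illustrative sketch (the view, table, and database names are assumptions, with the DB1./DB2. prefixes used as informal notation for the two component databases), suppose EMP lives in DB1 and DEPT in DB2, joined by a global view:

create view GlobalEmp (EmpName, DeptName, City) as
select E.Name, D.DName, D.City
from DB1.EMP E, DB2.DEPT D
where E.DeptNo = D.DeptNo

A query on GlobalEmp with the predicate City = 'Milano' can push that predicate down to DB2 alone, whereas the join predicate E.DeptNo = D.DeptNo involves tuples from both databases and must be evaluated globally.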
Query execution: example
• The join needs to be evaluated at a global level.
• One of the main challenges in developing virtual data
integration systems is to find good strategies for
decomposing and evaluating queries against multiple
databases in an efficient manner.
The Data Integration
problem
• Combining data coming from different
data sources, providing the user with a
unified vision of the data
• Detecting correspondences between
similar concepts that come from
different sources, and resolving conflicts
The Data Integration problem
[Diagram: a QUERY FRONT-END receives queries and returns answers, drawing on DATA SOURCE 1 (RDBMS), DATA SOURCE 2 (XML), and DATA SOURCE 3 (WWW)]
More systematically,
the data integration problem concerns…
• Autonomy:
– Design, or representation, autonomy: which data, and how
– Communication autonomy: which services should be provided to the users or
to the other DB systems
– Execution autonomy: which algorithms for query processing and in general for
data access
which causes
• Heterogeneity:
– Different platforms
– Different data models
– Different query languages
– Different data schemas, i.e., modeling styles (conflicts…)
– Different values for the same info (inconsistency)
The possible situations
• Even in a single, centralized DB, there
is an integration problem
• Distributed or federated DB
– Homogeneous data: same data model
– Heterogeneous data: different data
models
– Semi-structured data
• The extreme case: data integration for
transient, initially unknown data sources
An orthogonal classification
• Centralized architecture: the traditional
architecture for centralized, virtual or
materialized data integration
• Data exchange: pairwise exchange of data
between two data sources
• Peer-to-peer: decentralized, dynamic,
data-centric coordination between
autonomous organizations
DATA INTEGRATION
[Diagram: a classification of data integration along the axes “one vs multiple data sources”, “global schema vs point-to-point”, and “materialized vs virtual”, covering: SINGLE USER VIEW OVER A CENTRALIZED DB, DATA WAREHOUSE, DATA EXCHANGE (materialized), DATA EXCHANGE (virtual)]
The possible solutions
➢ Data integration problems arise even in the
simplest situation: a unique, centralized DB…
UNIQUE DB
An integrated DB
[Diagram: a single DB accessed by programs P1…P5 through views V1…V5]
UNIQUE DB: view integration
• Each functionality in the company will
have its own personalized view
• This is achieved by using view
definitions
• Views allow personalization as well as
access control
Design steps for a unique DB
by view integration
(mixed strategy)
Query processing in a centralized DB
(ANSI/SPARC architecture)
[Diagram (ANSI/SPARC): users query external views over the GLOBAL LOGICAL SCHEMA; the DBMS maps it to the INTERNAL SCHEMA (records), file management (OS pages), and disk management (blocks), down to the stored data]
Point 4 above:
View Integration
a. Related concept identification
b. Conflict analysis and resolution
c. Conceptual Schema integration
4.b Conflict analysis
• Name conflicts
– HOMONYMS
e.g., product “price” (sale price vs. production price)
– SYNONYMS
e.g., “Department” and “Division” naming the same concept
Conflict analysis
TYPE CONFLICTS
• in a single attribute (e.g. NUMERIC, ALPHANUMERIC, ...)
e.g. the attribute “gender”:
– Male/Female
– M/F
– 0/1
– In Italy, it is implicit in the “codice fiscale” (SSN)
• in an entity type
different abstractions of the same real world concept produce
different sets of properties (attributes)
Conflict analysis
DATA SEMANTICS
• different currencies (euros, US dollars,
etc.)
• different measure systems (kilos vs
pounds, Celsius vs. Fahrenheit)
• different granularities (grams, kilos, etc.)
Conflict analysis
• STRUCTURE CONFLICTS
[Diagram: in one schema, EMPLOYEE is related to both DEPARTMENT and PROJECT; in the other, EMPLOYEE is related to PROJECT only]
Conflict analysis
• DEPENDENCY (OR CARDINALITY) CONFLICTS
[Diagram: the same relationships among EMPLOYEE, DEPARTMENT, and PROJECT appear with different cardinalities (1:1 vs 1:n) in the two schemas]
Conflict analysis
• KEY CONFLICTS
[Diagram: the entity PRODUCT is identified by different keys in the two schemas (CODE and LINE vs. CODE and DESCRIPTION)]
4.c Schema Integration
• Conflict resolution
• Production of a new conceptual schema
which expresses (as much as possible) the
same semantics as the schemata we wanted
to integrate
• Production of the transformations between
the original schemata and the integrated one:
V1(DB), V2(DB), …, Vn(DB)
Exercise
We want to define the database of a ski school having different sites in Italy. Each
site has a name, a location, a phone number and an e-mail address. We want
to store information about customers, employees and teachers (SSN, name,
surname, birth date, address, and phone). For each teacher we store also the
technique (cross-country, downhill, snow board). The personnel office has a
view over the personnel data, that is, ski teachers and employees.
The school organizes courses in the different locations of the school. The course
organization office has a view over these courses. The courses have a code
(unique for each site), the starting date, the day of the week in which the
course takes place, the time, and the kind of course (cross-country, downhill,
snow board), the level, the number of lessons, the cost, and the minimal age for
participants. Each course is associated with a unique teacher and with its
participants.
Query processing in a centralized DB
(ANSI/SPARC architecture)
[Diagram (ANSI/SPARC): users query external views over the GLOBAL LOGICAL SCHEMA; the DBMS maps it to the INTERNAL SCHEMA (records), file management (OS pages), and disk management (blocks), down to the stored data]
Query processing in a
centralized DB
• User issues a query on the view → Q(Vi)
• Query composition → Q ∘ Vi (DB)
• The answer is in terms of the base
relations (global schema) or of the
viewed relations (external schema),
depending on how sophisticated the
system is; a minimal sketch of the
composition is given below
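As a minimal sketch (view, table, and attribute names are assumptions), composing a user query Q with the view definition Vi amounts to substituting the definition into the query:

create view YoungEmp as
select Name, Age from EMP where Age < 30

-- the user query Q(Vi), posed on the view:
select Name from YoungEmp where Age > 25

-- after composition Q ∘ Vi, evaluated directly on the base relation:
select Name from EMP where Age < 30 and Age > 25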
Distributed DB
The simplest case of a non-centralized DB
• Often data for the same organization
• Integrated a priori: same design pattern as in
the centralized situation; indeed, we have
homogeneous technology and data model, and
the same schema integration problems as above
• For the instance, design decisions concern:
• Fragmentation:
➢ Vertical
➢ Horizontal
• Allocation
• Replication
A hedged sketch of fragmentation expressed as views is given below.
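As an illustration (table, attribute, and site names are assumptions, not from the slides), fragmentation of a global EMP relation can be expressed with views:

-- horizontal fragmentation: rows split by site
create view EMP_Milano as select * from EMP where Site = 'Milano'
create view EMP_Roma as select * from EMP where Site = 'Roma'
-- the global relation is reconstructed by union of the fragments

-- vertical fragmentation: columns split, the key SSN kept in both fragments
create view EMP_Public as select SSN, Name from EMP
create view EMP_Payroll as select SSN, Salary from EMP
-- the global relation is reconstructed by join on SSN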
NON-CENTRALIZED DBs
More heterogeneities:
• Same data model, different systems, e.g. relational
(Oracle, Sybase, DB2…) → technological
heterogeneity
• Different data models, e.g. hierarchical or network
(IMS, Codasyl…), relational, OO → model
heterogeneity
• Same data model, different query languages (SQL,
Quel) → language heterogeneity
• Semi- or unstructured data (HTML, XML,
multimedia…) → again, model heterogeneity
Approaches
• Data conversion (materialization): data are
converted and stored into a unique system
– Multiple copies: redundancy, inconsistency
– Application rewriting if one system is discarded
• Data exchange: creation of gateways
between system pairs
– Only appropriate when there are just two systems;
no support for queries over data coming from
multiple systems (e.g. peer-to-peer)
– The number of gateways increases rapidly
(quadratically in the number of systems)
• Multidatabase: creation of a global schema
Data integration in the
Multidatabase
1. Source schema identification (when present)
2. Source schema reverse engineering (data source
conceptual schemata)
3. Conceptual schemata integration and restructuring
4. Conceptual to logical translation (of the obtained
global schema)
5. Mapping between the global logical schema and
the single schemata (logical view definition)
6. After integration: query-answering through data
views
Source schema identification:
a first approximation
• Same data model
• Adoption of a global schema
• The global schema will provide a reconciled,
integrated, virtual view of the data sources
Architecture with a
homogeneous data model
Global query processing:
an example of “unfolding”
A global, nested query. Given R1 in DB1 and
R2 in DB2,
Select R1.A
From R1
Where R1.B in
(Select R2.B From R2)
In a different (Datalog-like) notation:
Q(A) :- R1(A,B,_,…), R2(B,_,…)
Questions
• How much data has to be sent
“upwards” for global evaluation?
• How much data processing has to be
done locally?
➢ AN OPTIMIZATION PROBLEM (a sketch of the
trade-off is given below)
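For the nested query above, two simple strategies illustrate the trade-off (a sketch, assuming R1 in DB1 and R2 in DB2 as before):

-- Option 1: ship the whole of R2 upwards and evaluate everything globally
-- (no local processing, maximal data transfer)
Select * From R2

-- Option 2: project and deduplicate locally at DB2, shipping only the B values
-- (more local processing, much less data sent upwards)
Select distinct R2.B From R2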
Point 2 above: Source
schema reverse engineering
Reverse engineering of the source schemata,
and their translation into a conceptual
model (e.g. ER)
• CASE tools may partially help us to reconstruct the
original conceptual schema
• However, conceptual schemata are more expressive
than logical ones; thus, we must know the reality of
interest to really be able to recover the initial
knowledge, lost in the logical design phase
Point 3 above: Conceptual schemata
integration and restructuring
Mapping between data
sources and global (mediated)
schema
• A data integration system is a triple (G, S, M)
• Queries to the integrated system are posed
in terms of G and specify which data of the
virtual database we are interested in
• The problem is understanding which real data
(in the data sources) correspond to those
virtual data
GAV (Global As View)
• The relationship (mapping) between sources and
global schema is obtained by defining each element
of the global schema as a view over the data sources
The other possible ways
LAV (Local As View)
• The global schema has been designed
independently of the data source schemata
• The relationship (mapping) between sources and
global schema is obtained by defining each data
source as a view over the global schema
GLAV (Global and Local As View)
• The relationship (mapping) between sources and
global schema is obtained by defining a set of
views, some over the global schema and some
over the data sources
Mapping between data
sources and global schema
• Global schema G
• Source schemata S
• Mapping M between sources and global
schema: a set of assertions
qS → qG
qG → qS
where qS and qG are queries over the sources and
over the global schema, respectively; a small example
of both kinds of assertion is sketched below
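As a small illustration in the Datalog-like notation used earlier (the relations S1, S2, S3, and G are assumptions):

GAV assertion — a global relation defined by queries over the sources:
G(Name, Age) :- S1(Name, Age)
G(Name, Age) :- S2(Name, Age)

LAV assertion — a source described as a view over the global schema:
S3(Name) :- G(Name, Age), Age > 18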
GAV example
[Diagram: two sources, SOURCE 1 and SOURCE 2, and the GLOBAL SCHEMA defined as a view over them]
GAV
• Suppose now we introduce a new source
• The simple view we have just created
must be modified
• In the simplest case, we only need to
add a union with a new SELECT-FROM-
WHERE block, as sketched below
• This is not true in general: view
definitions may be much more complex
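A minimal sketch (source and attribute names are assumptions, in the style of the GAV example later in these slides): extending a union view G when a third source S3 is added.

create view G as
select Name, Age from S1
union
select Name, Age from S2
union
select Name, Age from S3  -- the new SELECT-FROM-WHERE block for the new source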
GAV
• Quality depends on how well we have
compiled the sources into the global
schema through the mapping
• Whenever a source changes or a new
one is added, the global schema needs
to be reconsidered
• Query processing is based on unfolding
• Example: one already seen
Query processing in GAV:
Unfolding
Query sent to DB2:
Select R2.B into X
From R2
Global query, using the shipped values X:
Select R1.A
From R1
Where R1.B in X
Query processing in GAV
[Diagram: a QUERY OVER THE GLOBAL SCHEMA is unfolded into queries over the sources]
How do we write the GAV views?
OPERATORS:
EXAMPLES
R1:
SSN NAME AGE SALARY
123456789 JOHN 34 30K
234567891 KETTY 27 25K
345678912 WANG 39 32K

R2:
SSN NAME SALARY PHONE
234567891 KETTY 20K 1234567
345678912 WANG 22K 2345678
456789123 MARY 34K 3456789
OPERATORS: EXAMPLES (2)
With R1 and R2 as in the previous slide, the operator R1 Ge R2 yields:
SSN NAME SALARY
234567891 KETTY ??
345678912 WANG ??
123456789 JOHN 30K
456789123 MARY 34K
More will be said about these “uncertain” values
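A hedged SQL approximation of this operator (assuming R1 and R2 as above): a full outer join on SSN, where conflicting SALARY values are left unresolved, with NULL standing in for “??”.

select coalesce(R1.SSN, R2.SSN) as SSN,
       coalesce(R1.NAME, R2.NAME) as NAME,
       case when R1.SALARY is null then R2.SALARY
            when R2.SALARY is null then R1.SALARY
            when R1.SALARY = R2.SALARY then R1.SALARY
            else null  -- conflicting values: “uncertain”
       end as SALARY
from R1 full outer join R2 on R1.SSN = R2.SSN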
LAV
A LAV mapping is a set of assertions, one for each
element s of each source S:
s → qG
Thus the content of each source is characterized in
terms of a view qG over the global schema; a small
sketch is given below.
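As an illustrative sketch (the source name and the selection condition are assumptions), with the global relation GLOB-PROD used in the next slides, a source listing only cheap red products could be described by the LAV assertion

SOURCE-RED(PCode, Name, Price) :-
  GLOB-PROD(PCode, _, Name, _, 'red', _, _, Price, _), Price < 100

i.e., the source is characterized as (possibly a subset of) the answer to this query over the global schema.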
LAV
• Quality depends on how well we have
characterized the sources
• High modularity and extensibility (if the
global schema is well designed, when a
source changes or is added, only its
definition is to be updated)
• Query processing needs reasoning
[Diagram: SOURCE 1 and SOURCE 2 defined as views over the GLOBAL SCHEMA]
GLOBAL SCHEMA:
GLOB-PROD (PCode, VCode, Name, Size, Color, Description, CatID, Price, Stock)
[Diagram: the same example, showing the mapping of SOURCE 1 and SOURCE 2 to the GLOBAL SCHEMA GLOB-PROD (PCode, VCode, Name, Size, Color, Description, CatID, Price, Stock)]
Observations
• Do GAV and LAV provide exact views?
• Usually we assume so, but this need not hold
(e.g., in GAV, if we define integrity
constraints on the global schema)
GAV approach
S1 (Name, Age)
S2 (Name, Age)
G (Name, Age)

Create view G as
Select S1.Name as Name, S1.Age as Age From S1
Union
Select S2.Name as Name, S2.Age as Age From S2

with the following global integrity constraint:
G.Age > 18

Source instances:
S1: (Rossi, 17), (Verdi, 21)
S2: (Verdi, 21), (Bianchi, 29)

Tuples accessible from the data sources:
GProf: (Rossi, 17), (Verdi, 21), (Bianchi, 29)
This view is the union of the two data sources, but it does not satisfy the integrity constraint.

Tuples accessible from the global schema:
GProf: (Verdi, 21), (Bianchi, 29)
The mapping is sound, not complete, thus not exact.
GAV
with integrity constraints
[Diagram (Venn): the tuples accessible from the global schema form a subset of the tuples accessible from source 1 and source 2]
➢ Data Models
Heterogeneous information
sources:
Tax_Position source (XML)
<!ELEMENT ListOfStudent (Student*)>
<!ELEMENT Student
(name,s_code,school_name,e_mail,tax_fee)>
<!ELEMENT name (#PCDATA)>
Heterogeneous information
sources:
Computer_Science source (OO)
CS_Person(first_name, last_name)
Professor : CS_Person(belongs_to: Division, rank)
Student : CS_Person(year, takes: set<Course>, rank, e_mail)
Division(description, address: Location)
Location(city, street, number, country)
Course(course_name, taught_by: Professor)
General multidatabase model
(see C. Yu)
Steps
1. Reverse engineering
2. Conceptual schemata integration
3. Choice of the target data model and translation
of the global conceptual schema
4. Definition of the language translation
5. Definition of the data views (as usual)
according to the chosen paradigm (GAV,
LAV,…)
Step 4.
WRAPPERS (translators)
• Convert queries into queries/commands
understandable by the specific data source
– they can extend the query possibilities of a data
source
• Convert query results from the source’s format
into a format understandable by the application
• We will say more when talking about semi-
structured information
The new application
context
• A (possibly large) number of data sources
• Time-variant data “form” (e.g. the Web)
• Heterogeneous data sources
• Mobile, transient data sources
• Mobile users
• Different levels of data structure
– Databases (relational, OO…)
– Semistructured data sources (XML, HTML, other
markups…)
– Unstructured data (text, multimedia, etc.)
• Different terminologies and different operation
contexts
Heterogeneous, dynamic
systems
In a general setting, we would like a uniform,
as-transparent-as-possible interface to many
autonomous and heterogeneous data sources.
This interface should:
• Take care of finding, for us, the data sources
relevant to the issue we are interested in
• Interact with the single sources
• Combine the results obtained from the single sources
A more dynamic solution: no
global schema, but mediators
• An interface from the users/applications to the
database servers that only defines communication
protocols and formats does not deal with the
abstraction and representation problems existing in
today’s data and knowledge resources
• The interfaces must take on an active role
• We will refer to the dynamic interface function as
mediation. This term includes (Wiederhold):
– the processing needed to make the interfaces work
– the knowledge structures that drive the transformations
needed to transform data to information
– any intermediate storage that is needed
Mediators
Types of mediation functions
that have been developed
• Transformation and subsetting of databases using
view definitions and object templates
• Methods to access and merge data from multiple
databases
• Computations that support abstraction and
generalization over underlying data
• Intelligent directories to information bases such as
library catalogs, indexing aids and thesaurus
structures
• Methods to deal with uncertainty and missing data
because of incomplete or mismatched sources
AN EXAMPLE OF ARCHITECTURE WITH
MEDIATORS
(TSIMMIS)
[Diagram: APPLICATION 1 and APPLICATION 2 access the data sources through MEDIATOR 1 and MEDIATOR 2]
Mediators
• No unique global schema is required
• Each mediator works in its own way
• One mediator may or may not use a
global schema
• E.g.: in the TSIMMIS project, the DataGuide
P2P data integration
• Several peers
• Each peer with local and external
sources
• Queries are posed over one peer
• Answers integrate the peer’s own
data plus the other peers’ data
Instance Heterogeneity
• At query processing time, when a real-world object is
represented by instances in different databases, they
may have different values:

In one database:
SSN NAME AGE SALARY
234567891 Ketty 48 18k

In another database:
SSN NAME AGE SALARY
234567891 Ketty 48 25k
Resolution function
Data inconsistency may depend on
different causes:
– One (or both) of the sources is incorrect
– Each source has a correct but partial view,
e.g. databases from different workplaces
→ the full salary is the sum of the two
– In general, the correct value may be
obtained as a function of the original ones
(maybe: 0*value1 + 1*value2 !!)
A hedged sketch of such a function in SQL is given below.
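A minimal SQL sketch (assuming two relations R1 and R2 like those in the example that follows, each holding a partial view of the salary): the resolution function here is the sum of the two partial salaries.

select R1.SSN, R1.NAME,
       R1.SALARY + R2.SALARY as SALARY  -- resolution function: sum of the partial views
from R1 join R2 on R1.SSN = R2.SSN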
RESOLUTION FUNCTION:
EXAMPLE
R1:
SSN NAME AGE SALARY POSITION
123456789 JOHN 34 30K ENGINEER
234567891 KETTY 27 25K ENGINEER
345678912 WANG 39 32K MANAGER

R2:
SSN NAME AGE SALARY PHONE
234567891 KETTY 25 20K 1234567
345678912 WANG 38 22K 2345678
456789123 MARY 42 34K 3456789