Data Integration
Application Area: Science
[Figure: linked biological data sources (Swiss-Prot, OMIM, HUGO, GO, GeneClinics, LocusLink, Entrez) covering sequenceable entities (nucleotide sequence, protein), phenotype, gene, and structured vocabulary]
Create a single site to search for jobs/rentals/…
Goals of Data Integration
Uniform query access to a set of data sources
Handle:
Scale of sources: from tens to millions
Heterogeneity
Autonomy
Semi-structured
Unstructured
Different Levels of Integration
Why is it Hard?
Schema heterogeneity
Data integration architectures
Data warehousing:
integrate by bringing the data into a single
physical warehouse
Advantages of Mediated Schema approaches
Data Freshness (low latency - almost realtime)
Higher Agility
Lower cost:
Much of the infrastructure cost is saved because the data
does not have to be copied into a central store
Disadvantages of Mediated Schema approaches
Semantic conflicts :
The meaning of "net profit" can be different in
different systems
Virtual Data Integration Architecture
Leave the data in the sources.
Data is fresh
Virtual Data Integration Architecture (2)
[Figure: mediated schema or warehouse; query reformulation / query over materialized data; sources: RDBMS 1, RDBMS 2, HTML1, XML1]
Virtual Data Integration Architecture(3)
Example
S1: Movies(name, actors, director, genre)
S2: Cinemas(place, movie, start)
S3: CinemasInNYC(cinema, title, startTime)
S4: CinemasInSF(location, movie, startingTime)
S5: Reviews(title, date, grade, review)
Wrappers
Sources export data in different formats
Wrappers are custom-built programs that
transform data from the source native format to
something acceptable to the mediator
HTML:
<b> Introduction to DB </b>
<i> Phil Bernstein </i>
<i> Eric Newcomer </i>
Addison Wesley, 1999
XML:
<book>
<title> Introduction to DB </title>
<author> Phil Bernstein </author>
<author> Eric Newcomer </author>
<publisher> Addison Wesley </publisher>
<year> 1999 </year>
</book>
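The HTML-to-XML transformation above can be sketched as a small wrapper. This is only an illustration under stated assumptions: the function name `wrap_book` is hypothetical, and the layout rules (title in `<b>`, authors in `<i>`, trailing text of the form "publisher, year") are read off this single example rather than taken from any real source.

```python
import re
import xml.etree.ElementTree as ET


def wrap_book(html: str) -> ET.Element:
    """Hypothetical wrapper: turn one HTML book listing into a <book> element."""
    book = ET.Element("book")

    # Assumption: the title is the text inside the first <b>...</b>.
    title = re.search(r"<b>(.*?)</b>", html, re.S)
    if title:
        ET.SubElement(book, "title").text = title.group(1).strip()

    # Assumption: every <i>...</i> holds exactly one author.
    for author in re.findall(r"<i>(.*?)</i>", html, re.S):
        ET.SubElement(book, "author").text = author.strip()

    # Assumption: the text left after removing tagged spans is "publisher, year".
    tail = re.sub(r"<[^>]+>.*?</[^>]+>", "", html, flags=re.S).strip()
    match = re.match(r"(.*),\s*(\d{4})$", tail)
    if match:
        ET.SubElement(book, "publisher").text = match.group(1).strip()
        ET.SubElement(book, "year").text = match.group(2)
    return book


html = (
    "<b> Introduction to DB </b>\n"
    "<i> Phil Bernstein </i>\n"
    "<i> Eric Newcomer </i>\n"
    "Addison Wesley, 1999"
)
book = wrap_book(html)
xml_text = ET.tostring(book, encoding="unicode")
print(xml_text)
```

A wrapper written this way is tied to the page layout: if the source changes its markup, the patterns have to be revised.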
Wrappers(2)
Complexity of wrapper depends on nature of data
source
Data Source Catalog
Contains meta-information about sources
Query Reformulation
Users pose queries over the mediated schema
Reformulation:
Queries over the mediated schema have to be
rewritten as queries over the source schemas
Approaches for Schema Mapping
[Figure: query Q posed over the mediated schema]
Approaches for Schema Mapping (2)
Global-as-view (GAV):
express the mediated schema relations as a set of
views over the data source relations
Local-as-view (LAV):
express the source relations as views over the
mediated schema.
Global-as-View: Example 1
Express mediator schema relations as views over source relations.
Mediated schema:
Movie(title, dir, year, genre).
Schedule(cinema, title, time).
Sources:
[S1(title,dir,year,genre)]
[S2(title, dir,year,genre)]
S3 [T1(title,dir), T2(title,year,genre)]
Create View Movie
where S3.title=S4.title
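A minimal sketch of what a GAV definition of Movie over these sources might look like, run through Python's `sqlite3` module. The view body (a union of S1, S2, and a join of T1 and T2 on title) is an assumption, since the slide's own definition is only partly legible; the sample rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE S1 (title TEXT, dir TEXT, year INTEGER, genre TEXT);
CREATE TABLE S2 (title TEXT, dir TEXT, year INTEGER, genre TEXT);
CREATE TABLE T1 (title TEXT, dir TEXT);                 -- first table of source S3
CREATE TABLE T2 (title TEXT, year INTEGER, genre TEXT); -- second table of source S3

-- Made-up sample rows.
INSERT INTO S1 VALUES ('Movie A', 'Director X', 1999, 'Comedy');
INSERT INTO S2 VALUES ('Movie B', 'Director Y', 2005, 'Drama');
INSERT INTO T1 VALUES ('Movie C', 'Director Z');
INSERT INTO T2 VALUES ('Movie C', 2010, 'Comedy');

-- Assumed GAV mapping: the mediated relation is a view over the sources.
CREATE VIEW Movie AS
    SELECT title, dir, year, genre FROM S1
    UNION
    SELECT title, dir, year, genre FROM S2
    UNION
    SELECT T1.title, T1.dir, T2.year, T2.genre
    FROM T1 JOIN T2 ON T1.title = T2.title;
""")

# A query over the mediated schema is answered by unfolding the view.
comedies = conn.execute(
    "SELECT title FROM Movie WHERE genre = 'Comedy' ORDER BY title"
).fetchall()
print(comedies)
```

With GAV, reformulation amounts to view unfolding: the query over Movie is replaced by the view body, which refers only to source relations.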
Global-as-View: Example 1
Mediated schema:
Movie(title, dir, year, genre).
Schedule(cinema, title, time).
Sources:
[S1(title,dir,year,genre)]
[S2(title, dir,year,genre)]
S3 [T1(title,dir), T2(title,year,genre)]
Global-as-View: Example 2
Mediated schema:
Movie(title, dir, year, genre).
Schedule(cinema, title, time).
Sources:
[S1(title,dir,year)]
[S2(title, dir,genre)]
Global-as-View: Example 3
Mediated schema:
Movie(title, dir, year, genre).
Schedule(cinema, title, time).
Source : S4(cinema, genre)
Create View Movie AS
select NULL, NULL, NULL, genre
from S4
Create View Schedule AS
select cinema, NULL, NULL
from S4.
But what if we want to find which cinemas are
playing comedies?
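The problem can be made concrete with a small `sqlite3` sketch. It defines the two views exactly as above and then tries to answer the comedy question through the mediated schema. The sample rows and cinema names are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE S4 (cinema TEXT, genre TEXT);
INSERT INTO S4 VALUES ('Cinema 1', 'Comedy'), ('Cinema 2', 'Drama');  -- made-up rows

CREATE VIEW Movie AS
    SELECT NULL AS title, NULL AS dir, NULL AS year, genre FROM S4;
CREATE VIEW Schedule AS
    SELECT cinema, NULL AS title, NULL AS time FROM S4;
""")

# Over the mediated schema, cinema and genre can only be connected via title,
# but title is NULL in both views, so the join condition is never true.
via_mediated = conn.execute("""
    SELECT s.cinema
    FROM Schedule s JOIN Movie m ON s.title = m.title
    WHERE m.genre = 'Comedy'
""").fetchall()

# The source itself does hold the answer.
via_source = conn.execute(
    "SELECT cinema FROM S4 WHERE genre = 'Comedy'"
).fetchall()

print(via_mediated, via_source)
```

The GAV views split each S4 tuple into two unrelated tuples, so the association between a cinema and a genre that the source records is lost in the mediated schema.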
Local-as-View: Example 1
Express source schema relations as views over mediator relations.
Mediated schema:
Movie(title, dir, year, genre).
Schedule(cinema, title, time).
S1(title,dir,year,genre)
Create View S1 AS
select * from Movie
S3(title,dir)
Create View S3 AS
select title, dir from Movie
S5(title,dir,year), year > 1960, genre = 'Comedy'
Create View S5 AS
select title, dir, year
from Movie
where year > 1960 AND genre = 'Comedy'
Sources are "materialized views" of the
mediator schema
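The three source descriptions can be tried out with `sqlite3`. This is a sketch under one simplifying assumption: the mediated relation Movie is stored as a real table here so that the views have something to range over, whereas in an actual LAV system the mediated schema is virtual and a source may hold only part of what its view describes. The sample rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Stand-in for the (normally virtual) mediated relation, with made-up rows.
CREATE TABLE Movie (title TEXT, dir TEXT, year INTEGER, genre TEXT);
INSERT INTO Movie VALUES
    ('Movie A', 'Director X', 1955, 'Comedy'),
    ('Movie B', 'Director Y', 1972, 'Comedy'),
    ('Movie C', 'Director Z', 1985, 'Drama');

-- LAV: each source is described as a view over the mediated schema.
CREATE VIEW S1 AS SELECT * FROM Movie;
CREATE VIEW S3 AS SELECT title, dir FROM Movie;
CREATE VIEW S5 AS
    SELECT title, dir, year FROM Movie
    WHERE year > 1960 AND genre = 'Comedy';
""")

s3_rows = conn.execute("SELECT * FROM S3 ORDER BY title").fetchall()
s5_rows = conn.execute("SELECT * FROM S5 ORDER BY title").fetchall()
print(s3_rows)
print(s5_rows)
```

The description of S5 records that every tuple it returns is a comedy made after 1960, even though genre is not one of its columns. A reformulation algorithm can use that knowledge when deciding which sources are relevant to a query.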
Local-as-View: Example 2
Express source schema relations as views over mediator relations.
Mediated schema:
Movie(title, dir, year, genre),
Query Processing
[Figure: query processing pipeline: query, query reformulation, query optimizer, execution engine, with replanning requests]
Reformulation Algorithms
Bucket Algorithm: builds a bucket of candidate views for each query subgoal, then checks all combinations that take one view from each bucket
Bucket Algorithm : Example
Q(ID, dir) :- Movie(ID, title, year, genre), Revenues(ID, amount), Director(ID, dir), amount ≥ $100M
V1(I, Y) :- Movie(I, T, Y, G), Revenues(I, A), I ≥ 5000, A ≥ $200M
V2(I, A) :- Movie(I, T, Y, G), Revenues(I, A)
V3(I, A) :- Revenues(I, A), A ≤ $50M
V4(I, D, Y) :- Movie(I, T, Y, G), Director(I, D), I ≤ 3000
Bucket entries (partial):
Movie(ID, title, year, genre): V2(ID, A'), V4(ID, D', year)
Revenues(ID, amount): V2(ID, amount)
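The bucket-construction step for this example can be sketched in Python. This is a simplified illustration, not a full implementation. It assumes the comparisons in the example read amount ≥ $100M, I ≥ 5000, A ≥ $200M, A ≤ $50M, and I ≤ 3000, expresses amounts in millions, supports only bounds of the form var ≥ c or var ≤ c, and assumes that subgoals contain distinct variables and that each predicate occurs once in the query. The names `Rule`, `satisfiable`, and `build_buckets` are hypothetical.

```python
from dataclasses import dataclass, field
from itertools import product


@dataclass
class Rule:
    name: str
    head: tuple                                      # head variables
    body: list                                       # subgoals: (predicate, args)
    constraints: list = field(default_factory=list)  # (variable, op, constant)


def satisfiable(constraints):
    """Return False if some variable has a lower bound above its upper bound."""
    lo, hi = {}, {}
    for var, op, c in constraints:
        if op == ">=":
            lo[var] = max(lo.get(var, c), c)
        else:  # "<="
            hi[var] = min(hi.get(var, c), c)
    return all(lo[v] <= hi[v] for v in lo if v in hi)


def build_buckets(query, views):
    buckets = {}
    for pred, q_args in query.body:
        entries = []
        for view in views:
            for v_pred, v_args in view.body:
                if v_pred != pred:
                    continue
                # psi maps the view's variables onto the query's variables.
                psi = dict(zip(v_args, q_args))
                # A head variable of the query must be exposed in the view head.
                if any(q in query.head and v not in view.head
                       for v, q in psi.items()):
                    continue
                # Unmapped view variables get a fresh (primed) name.
                mapped = [(psi.get(v, v + "'"), op, c)
                          for v, op, c in view.constraints]
                if not satisfiable(query.constraints + mapped):
                    continue
                head = ", ".join(psi.get(v, v + "'") for v in view.head)
                entries.append(f"{view.name}({head})")
        buckets[pred] = entries
    return buckets


query = Rule("Q", ("ID", "dir"),
             [("Movie", ("ID", "title", "year", "genre")),
              ("Revenues", ("ID", "amount")),
              ("Director", ("ID", "dir"))],
             [("amount", ">=", 100)])
views = [
    Rule("V1", ("I", "Y"),
         [("Movie", ("I", "T", "Y", "G")), ("Revenues", ("I", "A"))],
         [("I", ">=", 5000), ("A", ">=", 200)]),
    Rule("V2", ("I", "A"),
         [("Movie", ("I", "T", "Y", "G")), ("Revenues", ("I", "A"))]),
    Rule("V3", ("I", "A"), [("Revenues", ("I", "A"))], [("A", "<=", 50)]),
    Rule("V4", ("I", "D", "Y"),
         [("Movie", ("I", "T", "Y", "G")), ("Director", ("I", "D"))],
         [("I", "<=", 3000)]),
]

buckets = build_buckets(query, views)
# Second phase (not implemented here): each combination of one entry per
# bucket is a candidate rewriting that still needs a containment check.
candidates = list(product(*buckets.values()))
for subgoal, entries in buckets.items():
    print(subgoal, entries)
```

Under these assumptions V3 is kept out of the Revenues bucket because A ≤ $50M contradicts amount ≥ $100M. The six combinations in `candidates` are only candidates: for instance, any combination that uses both V1 (I ≥ 5000) and V4 (I ≤ 3000) would be discarded by the containment and satisfiability check that follows.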
Global-as-View (GAV)
P2P Data Integration
Question