0% found this document useful (0 votes)
5 views39 pages

Chapter 1

Uploaded by

soltanihajer098
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
5 views39 pages

Chapter 1

Uploaded by

soltanihajer098
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 39

Data warehouse & Exploration

PRINCIPLES OF
DATA INTEGRATION
Sellami Mokhtar
CHAPTER 1: INTRODUCTION
TO DATA INTEGRATION

PRINCIPLES OF
DATA INTEGRATION
ANHAI DOAN ALON HALEVY ZACHARY IVES
Outline
➢ Introduction: data integration as a new abstraction
▪ Examples of data integration applications
▪ Schema heterogeneity
▪ Goal of data integration, why it’s a hard problem
▪ Data integration architectures
Data Integration
▪ Databases are great: they let us manage huge
amounts of data
▪ Assuming you’ve put it all into your schema.
▪ In reality, data sets are often created independently
▪ Only to discover later that they need to combine their
data!
▪ At that point, they’re using different systems, different
schemata and have limited interfaces to their data.
▪ The goal of data integration: tie together different
sources, controlled by many people, under a
common schema.
DBMS: it’s all about abstraction
▪ Logical vs. Physical; What vs. How.
Students: Takes:
SSN Name Category SSN CID
123-45-6789 Charles undergrad 123-45-6789 CSE444
234-56-7890 Dan grad 123-45-6789 CSE444
… … 234-56-7890 CSE142
Courses: …
CID Name Quarter
CSE444 Databases fall
CSE541 Operating systems winter

SELECT C.name
FROM Students S, Takes T, Courses C
WHERE S.name=“Mary” and
S.ssn = T.ssn and T.cid = C.cid
Data Integration:
A Higher-level Abstraction

Query Independence of:


• source & location
Mediated Schema • data model, syntax
• semantic variations
•…
Semantic
Mappings
S1 S2 S3
<cd> <title> The best of … </title>
SSN
123-45-6789
234-56-7890
Name
Charles
Dan

Category
undergrad
grad

SSN
123-45-6789
123-45-6789
234-56-7890
CID
CSE444
CSE444
CSE142
… <artist> Carreras </artist>
<artist> Pavarotti </artist>

… <artist> Domingo </artist>
CID Name Quarter
CSE444 Databases fall <price> 19.95 </price>
CSE541 Operating systems winter
</cd>
Outline
✓ Introduction: data integration as a new abstraction
➢ Examples of data integration applications
▪ Schema heterogeneity
▪ Goal of data integration, why it’s a hard problem
▪ Data integration architectures
Applications of Data Integration
▪ Business
▪ Science
▪ Government
▪ The Web
▪ Pretty much everywhere
Application Area 1: Business

Enterprise Databases
EII Apps:

CRM
ERP
Single Mediated View
Portals

Legacy Databases
Services and Applications

50% of all IT $$$ spent here!


Application Area 2: Science
Sequenceable Structured
Phenotype Gene Experiment
Entity Vocabulary

Nucleotide Microarray
Protein
Sequence Experiment

Swiss-
OMIM HUGO GO
Prot

Gene- Locus-
Entrez GEO
Clinics Link

Hundreds of biomedical data sources available; growing rapidly!


Application Area 3: The Web
Hundreds of millions of high-quality
tables on the Web
The Deep Web
▪ Millions of high quality HTML forms out there
▪ Each form has its own special interface
▪ Hard to explore data across sites.
▪ Goal (for some domains):
▪ A single interface into a multitude of deep-web sources.
Create a single site to search for jobs/rentals/…
Easily traverse between the site by clicking its name
Outline
✓ Introduction: data integration as a new abstraction
✓ Examples of data integration applications
➢ Schema heterogeneity
▪ Goal of data integration, why it’s a hard problem
▪ Data integration architectures
Enterprise Data Integration:
FullServe Corporation

Employees Resumes
FullTimeEmp Interview
Hire CV
TempEmployees
Training Services
Services
Courses
Customers
Enrollments
Contracts
Sales HelpLine
Products Calls
Sales
EuroCard Corporation

Employees Resumes
Employees Interview
Hire

Credit Cards HelpLine


Customer Calls
CustDetail
Examples of Heterogeneity

FullServe EuroCard
FullTimeEmp Employees
ssn, empId, firstName ID, firstNameMiddleInitial,
middleName, lastName lastName

Hire Hire
empId, hireDate, recruiter ID, hireDate, recruiter

TempEmployees
ssn, hireStart, hireEnd

Find all employees (making over $100K)


Customer Call Center

Agents should have a full view of customer when they call


in.

Sales
Credit Cards
Products
Customer
Sales
CustDetail
Services
Services
Customers
Contracts
Other Reasons to Integrate Data
▪ Create a (useful) web site for tracking services
▪ Collaborate with third parties
▪ E.g., create branded services
▪ Comply with government regulations
▪ Find “risky” employees
▪ Business intelligence
▪ What’s really wrong with our products?
Outline
✓ Introduction: data integration as a new abstraction
✓ Examples of data integration applications
✓ Schema heterogeneity
➢ Goal of data integration, why it’s a hard problem
▪ Data integration architectures
Goal of Data Integration
▪ Uniform query access to a set of data sources
▪ Handle:
▪ Scale of sources: from tens to millions to billions
▪ Heterogeneity
▪ Autonomy
▪ Semi-structure
▪ Non Structured
Why is it Hard?
▪ Systems-level reasons:
▪ Managing different platforms
▪ SQL across multiple systems is not so simple
▪ Distributed query processing
▪ Logical reasons:
▪ Schema (and data) heterogeneity
▪ ‘Social’ reasons:
▪ Locating and capturing relevant data in the enterprise.
▪ Convincing people to share (data fiefdoms)
❖ Security, privacy and performance implications.
Setting Expectations
Data integration is AI-Complete.
▪ Completely automated solutions unlikely.
Goal 1:
▪ Reduce the effort needed to set up an integration
application.
Goal 2:
▪ Enable the system to perform gracefully with uncertainty
(e.g., on the web)
Goal 3:
▪ Enable the system to secure integration of heretegous
sources : privacy aware security data integration
Data Integration Smorgasbord
Something for everyone:
▪ Theory of modeling data sources
▪ Systems aspects of data integration
▪ Architectural issues: e.g., P2P data sharing, Microservice
▪ AI @ work: automated schema matching
▪ IA@ work: AI Based schema matching
▪ Web: latest on data integration & web
▪ Commercial products: BEA, IBM
▪ Semantic Web: what does it have to offer?
▪ New trends in DBMS: uncertainty, dataspaces, big data,
nosql, cloud, security privacy.
Outline
✓ Introduction: data integration as a new abstraction
✓ Examples of data integration applications
✓ Schema heterogeneity
✓ Goal of data integration, why it’s a hard problem
➢ Data integration architectures
Virtual, Warehousing and in Between
▪ Data warehousing: integrate by bringing the data
into a single physical warehouse
▪ Virtual data integration: leave the data at the
sources and access it at query time.

▪ Some differences, but semantic heterogeneity arises


in both cases.
▪ Numerous intermediate architectures.
▪ The course illustrates data integration technology
mostly through the virtual architecture.
Virtual Data Integration Architecture
Mediated Schema
or Warehouse Query reformulation/
Query over materialized data

Source
descriptions/
Transforms

Wrapper / Wrapper / Wrapper / Wrapper /


Extractor Extractor Extractor Extractor

RDBMS 1 RDBMS 2
HTML1 XML1
Example

Movie(title, director, year, genre)


Actors(title, actor)
Plays(movie, location, startTime)
Reviews(title, rating, description)

S1 S2 S3 S4 S5
Movies (name, Cinemas (place, CinemasInNYC CinemasInSF Reviews (title,
actors, director, movie, start) (cinema, title, (location, movie, date, grade,
genre) startTime) startingTime) review)
Wrappers

<cd> <title> The best of … </title>


<artist> Abiteboul </artist>
<artist> Pavarotti </artist>
<artist> Domingo </artist>
<price> 19.95 </price>
</cd>

Send queries to data


sources and transform
answers into tuples (or
other internal data
model). (Chapter 9)
Mediation Languages
Mediated Schema
Describe CD: ASIN, Title, Genre,…
Artist: ASIN, name, …
relationships
between
logic
mediated
schema and CDs Books
Album Title Authors
data sources ASIN
Price
ISBN
Price
ISBN
FirstName
DiscountPrice DiscountPrice LastName
(Chapter 3). Studio Edition

Artists
ASIN
CDCategories BookCategories ArtistName
ASIN ISBN GroupName
Category Category
Woody Allen Comedies in NY

Mediated schema:
Movie: Title, director, year, genre
Actors: title, actor
Plays: movie, location, startTime
Reviews: title, rating, description

select title, startTime


from Movie, Plays
where Movie.title=Plays.movie AND
location=“New York” AND
director=“Woody Allen”
Movie: Title, director, year, genre
Actors: title, actor
Plays: movie, location, startTime
Reviews: title, rating, description

select title, startTime


from Movie, Plays
where Movie.title=Plays.movie AND
location=“New York” AND
director=“Woody Allen”
Sources S1 and S3 are relevant, sources S4 and S5 are
irrelevant, and source S2 is relevant but possibly
redundant.
S1 S2 S3 S4 S5
Movies: Cinemas: Cinemas in NYC: Cinemas in SF: Reviews:
name, actors, place, movie, cinema, title, location, movie, title, date
director, genre start startTime startingTime grade, review
Query Processing
Query Query LLM reformulation

Logical query plan

Chapter 8 Query LLM optimizer

Physical query plan

Replanning request
Execution LLM engine

LLM Wrapper LLM wrapper wrapper wrapper wrapper

source source source source source


Data Warehouses – Offline Replication

▪ Determine physical schema


▪ Define a database with this Query Results

schema Data Warehouse


▪ Define procedural mappings
in an “ETL tool” to import
the data and clean it.
▪ Periodically copy all of the
data from the data sources
▪ Note that the sources
and the warehouse are
basically independent at
this point 37
Pros and Cons of Data Warehouses
 Need to spend time to design the physical database
layout, as well as logical
 This actually takes a lot of effort!
 Data is generally not up-to-date (lazy or offline
refresh)

✓ Queries over the warehouse don’t disrupt the data


sources
✓ Can run very heavy-duty computations, including
data mining and cleaning

38
Summary of Chapter 1
▪ Data integration: abstract away the fact that data
comes from multiple sources in varying schemata.
▪ Problem occurs everywhere: it’s key to business,
science, Web and government.
▪ Goal: reduce the effort involved in integrating.
▪ Regardless of the architecture, heterogeneity is a key
issue.
▪ Architectures range from warehousing to virtual
integration.

You might also like