Chapter 1
PRINCIPLES OF
DATA INTEGRATION
Sellami Mokhtar
CHAPTER 1: INTRODUCTION
TO DATA INTEGRATION
ANHAI DOAN ALON HALEVY ZACHARY IVES
Outline
➢ Introduction: data integration as a new abstraction
▪ Examples of data integration applications
▪ Schema heterogeneity
▪ Goal of data integration, why it’s a hard problem
▪ Data integration architectures
Data Integration
▪ Databases are great: they let us manage huge
amounts of data
▪ Assuming you’ve put it all into your schema.
▪ In reality, data sets are often created independently.
▪ Their owners discover only later that they need to combine
their data!
▪ At that point, they’re using different systems and different
schemata, and have only limited interfaces to their data.
▪ The goal of data integration: tie together different
sources, controlled by many people, under a
common schema.
DBMS: it’s all about abstraction
▪ Logical vs. Physical; What vs. How.
Students:
SSN          Name     Category
123-45-6789  Charles  undergrad
234-56-7890  Dan      grad
…            …

Takes:
SSN          CID
123-45-6789  CSE444
123-45-6789  CSE444
234-56-7890  CSE142
…

Courses:
CID     Name               Quarter
CSE444  Databases          fall
CSE541  Operating systems  winter
SELECT C.name
FROM Students S, Takes T, Courses C
WHERE S.name = 'Mary' AND
      S.ssn = T.ssn AND T.cid = C.cid
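The query above can be tried out with a small script. This is a minimal sketch using Python's built-in sqlite3 module; it loads only the rows visible in the Students, Takes, and Courses tables, and the function names `build_db` and `courses_taken_by` are made up for illustration. The student name is passed as a parameter, and DISTINCT is added so that repeated Takes rows do not repeat a course.

```python
import sqlite3


def build_db():
    """Create an in-memory database holding only the rows shown on the slide."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE Students (ssn TEXT, name TEXT, category TEXT);
        CREATE TABLE Takes    (ssn TEXT, cid TEXT);
        CREATE TABLE Courses  (cid TEXT, name TEXT, quarter TEXT);
    """)
    conn.executemany("INSERT INTO Students VALUES (?, ?, ?)", [
        ("123-45-6789", "Charles", "undergrad"),
        ("234-56-7890", "Dan", "grad"),
    ])
    conn.executemany("INSERT INTO Takes VALUES (?, ?)", [
        ("123-45-6789", "CSE444"),
        ("123-45-6789", "CSE444"),
        ("234-56-7890", "CSE142"),
    ])
    conn.executemany("INSERT INTO Courses VALUES (?, ?, ?)", [
        ("CSE444", "Databases", "fall"),
        ("CSE541", "Operating systems", "winter"),
    ])
    return conn


def courses_taken_by(conn, student_name):
    """Run the slide's three-way join for one student name."""
    rows = conn.execute(
        """
        SELECT DISTINCT C.name
        FROM Students S, Takes T, Courses C
        WHERE S.name = ? AND S.ssn = T.ssn AND T.cid = C.cid
        ORDER BY C.name
        """,
        (student_name,),
    ).fetchall()
    return [name for (name,) in rows]


if __name__ == "__main__":
    conn = build_db()
    print(courses_taken_by(conn, "Charles"))
    print(courses_taken_by(conn, "Mary"))
```

The query says what to retrieve, not how: the join order and access paths are left to the DBMS, which is the "What vs. How" separation the slide refers to.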
Data Integration:
A Higher-level Abstraction
[Figure: EII apps (CRM, ERP, portals, …) query a single mediated view over enterprise databases, legacy databases, and services and applications.]
[Figure: interlinked biological data sources: Nucleotide Sequence, Microarray Experiment, Protein, Swiss-Prot, OMIM, HUGO, GO, GeneClinics, LocusLink, Entrez, GEO.]
[Figure: company databases and their tables]
Employees: FullTimeEmp, Hire, TempEmployees
Resumes: Interview, CV
Training: Courses, Enrollments
Services: Services, Customers, Contracts
Sales: Products, Sales
HelpLine: Calls
EuroCard Corporation
[Figure: EuroCard databases and their tables]
Employees: Employees, Hire
Resumes: Interview
FullServe                                    EuroCard
FullTimeEmp(ssn, empId, firstName,           Employees(ID, firstNameMiddleInitial,
            middleName, lastName)                      lastName)
Hire(empId, hireDate, recruiter)             Hire(ID, hireDate, recruiter)
TempEmployees(ssn, hireStart, hireEnd)
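The mismatch between the two employee schemas can be made concrete with a small sketch. The `Employee` record and the two mapping functions below are hypothetical and are not part of either company's schema. The sketch also assumes that EuroCard's `firstNameMiddleInitial` stores the first name followed by an optional initial separated by a space, which the slides do not specify.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Employee:
    """Hypothetical common record; the slides do not define one."""
    source: str
    source_id: str
    first_name: str
    middle: Optional[str]
    last_name: str


def from_fullserve(row: dict) -> Employee:
    """Map a FullServe FullTimeEmp row; name parts are already separate."""
    return Employee(
        source="FullServe",
        source_id=str(row["empId"]),
        first_name=row["firstName"],
        middle=row.get("middleName") or None,
        last_name=row["lastName"],
    )


def from_eurocard(row: dict) -> Employee:
    """Map a EuroCard Employees row.

    Assumption: firstNameMiddleInitial looks like "Anna K." (first name,
    then an optional initial). Real data may need a different split.
    """
    parts = row["firstNameMiddleInitial"].split()
    first = parts[0] if parts else ""
    middle = " ".join(parts[1:]) or None
    return Employee(
        source="EuroCard",
        source_id=str(row["ID"]),
        first_name=first,
        middle=middle,
        last_name=row["lastName"],
    )
```

Even this tiny case shows why heterogeneity is hard: the identifier has a different name in each schema, one side splits a name that the other side concatenates, and FullServe's `ssn` has no EuroCard counterpart at all.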
[Figure: further databases and their tables]
Sales: Products, Sales
Credit Cards: Customer, CustDetail
Services: Services, Customers, Contracts
Other Reasons to Integrate Data
▪ Create a (useful) web site for tracking services
▪ Collaborate with third parties
▪ E.g., create branded services
▪ Comply with government regulations
▪ Find “risky” employees
▪ Business intelligence
▪ What’s really wrong with our products?
Outline
✓ Introduction: data integration as a new abstraction
✓ Examples of data integration applications
✓ Schema heterogeneity
➢ Goal of data integration, why it’s a hard problem
▪ Data integration architectures
Goal of Data Integration
▪ Uniform query access to a set of data sources
▪ Handle:
▪ Scale of sources: from tens to millions, or even billions
▪ Heterogeneity
▪ Autonomy
▪ Semi-structured data
▪ Unstructured data
Why is it Hard?
▪ Systems-level reasons:
▪ Managing different platforms
▪ SQL across multiple systems is not so simple
▪ Distributed query processing
▪ Logical reasons:
▪ Schema (and data) heterogeneity
▪ ‘Social’ reasons:
▪ Locating and capturing relevant data in the enterprise.
▪ Convincing people to share (data fiefdoms)
❖ Security, privacy and performance implications.
Setting Expectations
Data integration is AI-Complete.
▪ Completely automated solutions unlikely.
Goal 1:
▪ Reduce the effort needed to set up an integration
application.
Goal 2:
▪ Enable the system to perform gracefully with uncertainty
(e.g., on the web)
Goal 3:
▪ Enable the system to integrate heterogeneous sources
securely: privacy-aware, secure data integration
Data Integration Smorgasbord
Something for everyone:
▪ Theory of modeling data sources
▪ Systems aspects of data integration
▪ Architectural issues: e.g., P2P data sharing, microservices
▪ AI @ work: automated, AI-based schema matching
▪ Web: latest on data integration & web
▪ Commercial products: BEA, IBM
▪ Semantic Web: what does it have to offer?
▪ New trends in DBMS: uncertainty, dataspaces, big data,
NoSQL, cloud, security, and privacy.
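As a taste of the schema matching item in the list above, here is a deliberately naive sketch that pairs attributes by name similarity alone, using Python's standard difflib module. The function names and the 0.6 threshold are arbitrary choices for illustration. Real matchers also use data values, types, and learned models; this is not the method any particular system uses.

```python
from difflib import SequenceMatcher


def name_similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_attributes(source_attrs, target_attrs, threshold=0.6):
    """For each source attribute, pick the most similar target name.

    Returns None for an attribute when no target name reaches the
    threshold (an arbitrary cut-off chosen for this sketch).
    """
    matches = {}
    for s in source_attrs:
        best = max(target_attrs, key=lambda t: name_similarity(s, t), default=None)
        if best is not None and name_similarity(s, best) >= threshold:
            matches[s] = best
        else:
            matches[s] = None
    return matches


if __name__ == "__main__":
    # Attribute names of two cinema listing sources with different schemas.
    print(match_attributes(
        ["cinema", "title", "startTime"],
        ["location", "movie", "startingTime"],
    ))
```

Name similarity finds the near-identical pair (`startTime` and `startingTime`) but leaves `cinema` unmatched, even though it plausibly corresponds to `location`. Gaps like this are why fully automated matching is unlikely and why the goal is to reduce human effort.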
Outline
✓ Introduction: data integration as a new abstraction
✓ Examples of data integration applications
✓ Schema heterogeneity
✓ Goal of data integration, why it’s a hard problem
➢ Data integration architectures
Virtual, Warehousing and in Between
▪ Data warehousing: integrate by bringing the data
into a single physical warehouse
▪ Virtual data integration: leave the data at the
sources and access it at query time.
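The difference between the two architectures shows up when a source changes. The classes below are a toy sketch with hypothetical names (`Source`, `Warehouse`, `VirtualIntegrator`); they ignore schemas, transformations, and query optimization and illustrate only where the data lives at query time.

```python
class Source:
    """A stand-in for an autonomous data source."""

    def __init__(self, rows):
        self.rows = list(rows)

    def scan(self):
        return list(self.rows)


class Warehouse:
    """Warehousing: copy the data into one store, then query the copy."""

    def __init__(self, sources):
        self.sources = sources
        self.store = []

    def load(self):
        # Runs periodically; the store is stale until the next load.
        self.store = [row for src in self.sources for row in src.scan()]

    def query(self, predicate):
        return [row for row in self.store if predicate(row)]


class VirtualIntegrator:
    """Virtual integration: leave data at the sources, fetch at query time."""

    def __init__(self, sources):
        self.sources = sources

    def query(self, predicate):
        return [row for src in self.sources for row in src.scan() if predicate(row)]


if __name__ == "__main__":
    s1, s2 = Source([{"title": "A"}]), Source([{"title": "B"}])
    warehouse = Warehouse([s1, s2])
    warehouse.load()
    virtual = VirtualIntegrator([s1, s2])

    s1.rows.append({"title": "C"})  # a source changes after the load
    print(len(warehouse.query(lambda r: True)))  # copy is unchanged until reload
    print(len(virtual.query(lambda r: True)))    # reflects the new row
```

The trade-off is visible even here: the warehouse answers from local data but can be stale, while the virtual approach is always current but touches every source on every query.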
[Figure: source descriptions/transforms over the sources RDBMS 1, RDBMS 2, HTML1, and XML1.]
Example
S1: Movies(name, actors, director, genre)
S2: Cinemas(place, movie, start)
S3: CinemasInNYC(cinema, title, startTime)
S4: CinemasInSF(location, movie, startingTime)
S5: Reviews(title, date, grade, review)
Wrappers
Artists(ASIN, ArtistName, GroupName)
CDCategories(ASIN, Category)
BookCategories(ISBN, Category)
Woody Allen Comedies in NY
Mediated schema:
Movie: title, director, year, genre
Actors: title, actor
Plays: movie, location, startTime
Reviews: title, rating, description
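To make the mediated schema concrete, the sketch below wraps sources S1, S3, and S4 into the mediated Movie, Actors, and Plays relations and answers the "Woody Allen comedies in NY" query against them. It is an illustration under stated assumptions, not the book's algorithm: the sample rows and cinema names are placeholders, and encoding the city as a suffix of `location` is a choice made here so the query can filter on it.

```python
# Placeholder sample rows in each source's own schema (hypothetical data).
S1_MOVIES = [
    {"name": "Annie Hall", "actors": ["Woody Allen", "Diane Keaton"],
     "director": "Woody Allen", "genre": "comedy"},
]
S3_CINEMAS_IN_NYC = [{"cinema": "Cinema A", "title": "Annie Hall", "startTime": "19:30"}]
S4_CINEMAS_IN_SF = [{"location": "Cinema B", "movie": "Annie Hall", "startingTime": "20:00"}]


def wrap_s1(rows):
    """S1 Movies -> mediated Movie and Actors. S1 has no year, so it stays None."""
    movies = [{"title": r["name"], "director": r["director"],
               "year": None, "genre": r["genre"]} for r in rows]
    actors = [{"title": r["name"], "actor": a} for r in rows for a in r["actors"]]
    return movies, actors


def wrap_s3(rows):
    """S3 CinemasInNYC -> mediated Plays; the city is implied by the source."""
    return [{"movie": r["title"], "location": f'{r["cinema"]}, New York',
             "startTime": r["startTime"]} for r in rows]


def wrap_s4(rows):
    """S4 CinemasInSF -> mediated Plays, renaming attributes along the way."""
    return [{"movie": r["movie"], "location": f'{r["location"]}, San Francisco',
             "startTime": r["startingTime"]} for r in rows]


def woody_allen_comedies_in_ny(movies, plays):
    """Query posed against the mediated schema only, never the raw sources."""
    wanted = {m["title"] for m in movies
              if m["director"] == "Woody Allen" and m["genre"] == "comedy"}
    return [p for p in plays
            if p["movie"] in wanted and p["location"].endswith("New York")]


if __name__ == "__main__":
    movies, _actors = wrap_s1(S1_MOVIES)
    plays = wrap_s3(S3_CINEMAS_IN_NYC) + wrap_s4(S4_CINEMAS_IN_SF)
    print(woody_allen_comedies_in_ny(movies, plays))
```

The query function never mentions `cinema`, `startingTime`, or any other source-specific name; all of that knowledge is confined to the wrappers, which is the point of the mediated schema.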
[Figure: query-processing diagram; surviving labels: "Replanning request", "Execution LLM engine".]
Summary of Chapter 1
▪ Data integration: abstract away the fact that data
comes from multiple sources in varying schemata.
▪ Problem occurs everywhere: it’s key to business,
science, Web and government.
▪ Goal: reduce the effort involved in integrating.
▪ Regardless of the architecture, heterogeneity is a key
issue.
▪ Architectures range from warehousing to virtual
integration.