
Data Fusion

This document summarizes a tutorial on data fusion presented at VLDB 2009. The tutorial addressed resolving conflicts when integrating data from multiple sources. It discussed various origins of data conflicts both within and between data sources, such as differences in naming conventions, data types and values. The tutorial also covered strategies for resolving conflicts, including error correction, standardization, and utilizing metadata and domain knowledge. An overview of foundations of data fusion, advanced truth discovery techniques and existing data fusion systems was provided.


DATA FUSION – RESOLVING DATA CONFLICTS IN INTEGRATION
Tutorial at VLDB 2009
Xin Luna Dong – AT&T Labs-Research
Felix Naumann – Hasso Plattner Institute (HPI)
Origins of Data Conflicts
[Figure: ACM Computing Surveys article [BN08], original vs. scanned copy]
Data Fusion | VLDB 2009 Tutorial | Luna Dong & Felix Naumann
[Figure: Schering CRM and Bayer CRM combined into integrated data]
Origins of Data Conflicts: German Names
Origins of Data Conflicts: Difficult Names
Origins of Intra-Source Conflicts
• No integrity or consistency checks
• Redundant schemata
• Typos, transmission errors, incorrect calculations
• Variants
  - Kantstr. / Kantstrasse / Kant Str. / Kant Strasse
  - Kolmogorov / Kolmogoroff / Kolmogorow
• Typical confusion (OCR)
  - U<->V, 0<->o, 1<->l, etc.
• Obsolete values
  - Different update frequencies, forgotten update
Origins of Inter-Source Conflicts
• Locally consistent but globally inconsistent
• Different data types
• Local spelling variations and conventions
  - Addresses
    - St → Street, Ave → Avenue, etc.
    - R.-Breitscheid-Str. 72 a → Rudolf-Breitscheid-Str. 72A
    - 128 spellings for Frankfurt am Main: Frankfurt a.M., Frankfurt/M, Frankfurt, Frankfurt a. Main, …
  - Names
    - Dr. Ing. h.c. F. Porsche AG
    - Hewlett-Packard Development Company, L.P.
  - Numerical data
    - 10.000 € = 10T EURO = 10k EUR = 10.000,00€ = 10,000.- €
  - Phone numbers, birth dates, etc.
Resolution of Data Conflicts? tables.googlelabs.com

• “… focus is on fusing data management and collaboration: merging multiple data sources, discussion of the data, querying, visualization, and Web publishing.”
• “The power of data is truly harnessed when you combine data from multiple sources. Fusion Tables enables you to fuse multiple sets of data when they are about the same entities. In database speak, we call this a join on a primary key but the data originates from multiple independent sources.”
Web Integration—Google Fusion Tables
• Allows discussion of values between users
Data Conflict Elimination
• Error correction
  - Reference tables
    - Cities, countries, products ...
  - Similarity measures
• Standardization and transformation
• Domain-knowledge (meta data)
  - Conventions (country/region-specific spelling)
  - Ontologies
  - Thesauri, dictionaries for homonyms, synonyms, ...
• Outlier detection and elimination
• And data fusion…
Overview
• Data fusion in the integration process
• Foundations of data fusion
  - Conflict resolution strategies and functions
  - Conflict resolution operators
• Advanced truth-discovery techniques
• Existing data fusion systems
• Open problems
Information Integration
Source A:
<pub>
  <Titel> Federated Database Systems </Titel>
  <Autoren>
    <Autor> Amit Sheth </Autor>
    <Autor> James Larson </Autor>
  </Autoren>
</pub>

Source B:
<publication>
  <title> Federated Database Systems for Managing Distributed, Heterogeneous, and Autonomous Databases </title>
  <author> Scheth & Larson </author>
  <year> 1990 </year>
</publication>

Integration steps: Schema Mapping → Data Transformation → Duplicate Detection → Data Fusion
Information Integration: Schema Integration and Schema Mapping

Target schema, to which the schemata of Source A (<pub>) and Source B (<publication>) are mapped:
<pub>
  <title> </title>
  <Autoren>
    <author> </author>
    <author> </author>
  </Autoren>
  <year> </year>
</pub>
Information Integration: Transformation queries or views (XQuery)

Source A, transformed:
<pub>
  <title> Federated Database Systems </title>
  <Autoren>
    <author> Amit Sheth </author>
    <author> James Larson </author>
  </Autoren>
</pub>

Source B, transformed:
<pub>
  <title> Federated Database Systems for Managing Distributed, Heterogeneous, and Autonomous Databases </title>
  <Autoren>
    <author> Scheth & Larson </author>
  </Autoren>
  <year> 1990 </year>
</pub>
Information Integration: Data Fusion (preserve lineage)

The transformed <pub> records of Source A and Source B are fused into one record:
<pub>
  <title> Federated Database Systems for Managing Distributed, Heterogeneous, and Autonomous Databases </title>
  <Autoren>
    <author> Amit Sheth </author>
    <author> James Larson </author>
  </Autoren>
  <year> 1990 </year>
</pub>
Completeness, Conciseness, and Correctness
• Schema matching: same attribute semantics
• Duplicate detection: same real-world entities
• Data fusion: resolve uncertainties and contradictions
[Figure labels: intensional completeness, intensional conciseness, extensional completeness, extensional conciseness]
Schema Matching
• Problem
  - Given two schemata, find all correspondences between their attributes
• Difficulties
  - Schematic heterogeneity (synonyms & homonyms)
  - Data heterogeneity
  - n:m mappings
  - Transformation functions
  - User interaction
• Then: Derive a schema mapping
Duplicate Detection
• Problem
  - Given one or more data sets, find all sets of objects that represent the same real-world entity.
• Difficulties
  - Duplicates are not identical
    - Similarity measures – Levenshtein, Soundex, Jaccard, etc.
  - Large volume, cannot compare all pairs
    - Partitioning strategies – Sorted neighborhood, Blocking, etc.
[Figure: CRM1 and CRM2 form the pair space CRM1 x CRM2, which is partitioned and scored by a similarity measure into duplicates, non-duplicates, and "???" pairs]
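The two ingredients named on this slide, a similarity measure and a partitioning strategy, can be sketched together. The following is a minimal illustration only, assuming plain strings as records, Levenshtein distance as the measure, and a sorted-neighborhood window of size 3; the function names, the window size, and the distance threshold are hypothetical choices, not taken from the tutorial.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute ca by cb
        prev = curr
    return prev[-1]


def sorted_neighborhood(records, key, window=3, max_dist=2):
    """Sort by key and compare each record only with its window neighbors."""
    ordered = sorted(records, key=key)
    pairs = set()
    for i, rec in enumerate(ordered):
        for other in ordered[i + 1:i + window]:
            if levenshtein(key(rec), key(other)) <= max_dist:
                pairs.add((rec, other))
    return pairs


# Spelling variants from the "Origins of Intra-Source Conflicts" slide.
names = ["Kolmogorov", "Kolmogoroff", "Kolmogorow", "Naumann", "Dong"]
duplicates = sorted_neighborhood(names, key=str.lower)
```

Only neighbors in sort order are compared, so the number of comparisons grows linearly with the data size for a fixed window, at the price of missing duplicates whose keys sort far apart.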
Ironically, “Duplicate Detection” has many Duplicates
Doubles, Household matching, Duplicate detection, Mixed and split citation problem, Record linkage, Object identification, Match, Deduplication, Fuzzy match, Object consolidation, Entity resolution, Entity clustering, Approximate match, Identity uncertainty, Reference reconciliation, Merge/purge, Hardening soft databases, Householding, Reference matching
Data Fusion
• Problem
  - Given a duplicate, create a single object representation while resolving conflicting data values.
• Difficulties
  - Null values: Subsumption and complementation
  - Contradictions in data values
  - Uncertainty & truth: Discover the true value and model uncertainty in this process
  - Metadata: Preferences, recency, correctness
  - Lineage: Keep original values and their origin
  - Implementation in DBMS: SQL, extended SQL, UDFs, etc.
The Field of Data Fusion
Data Fusion
  Conflict types: Uncertainty, Contradiction
  Resolution strategies
    Ignorance
    Avoidance: Instance-based, Metadata-based
    Resolution: Instance-based, Metadata-based
  Operators: Join-based, Union-based, Possible worlds, Consistent answers, Subsumption, Complementation
  Resolution functions: Aggregation, Advanced functions
Uncertainty and Contradiction
• Uncertainty
  - NULL value vs. non-NULL value
  - “Easy” case
• Contradiction
  - Non-NULL value vs. (different) non-NULL value
[Figure: example tuples with attribute pairs marked "Uncertainty", "Contradiction", "Uncertainty"]
Semantics of NULL
• “unknown”
  - There is a value, but I do not know it.
  - E.g.: Unknown date-of-birth
• “not applicable”
  - There is no meaningful value.
  - E.g.: Spouse for singles
• “withheld”
  - There is a value, but we are not authorized to see it.
  - E.g.: Private phone line
Classification of Strategies
conflict resolution strategies
  conflict ignorance
  conflict avoidance
    instance based
    metadata based
  conflict resolution
    instance based: deciding, mediating
    metadata based: deciding, mediating
Conflict Resolution Functions
Function | Description | Examples
Min, Max, Sum, Count, Avg | Standard aggregation | NumChildren, Salary, Height
Random | Random choice | Shoe size
Longest, Shortest | Longest/shortest value | First_name
Choose(source) | Value from a particular source | DoB (DMV), CEO (SEC)
ChooseDepending(val, col) | Value depends on value chosen in other column | city & zip, e-mail & employer
Vote | Majority decision | Rating
Coalesce | First non-null value | First_name
Group, Concat | Group or concatenate all values | Book_reviews
MostRecent | Most recent (up-to-date) value | Address
MostAbstract, MostSpecific, CommonAncestor | Use a taxonomy / ontology | Location
Escalate | Export conflicting values | gender
… | … | …
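A few of the functions in this table can be sketched as ordinary Python functions. This is an illustration under assumptions: each function receives the conflicting values of one attribute for one real-world object, `None` stands for NULL, and the signatures (for example the `(timestamp, value)` pairs for `most_recent`) are hypothetical, not the catalog interface of any of the systems discussed here.

```python
from collections import Counter


def coalesce(values):
    """Coalesce: first non-null value."""
    return next((v for v in values if v is not None), None)


def vote(values):
    """Vote: majority decision; ties go to the value seen first."""
    non_null = [v for v in values if v is not None]
    return Counter(non_null).most_common(1)[0][0] if non_null else None


def longest(values):
    """Longest: the longest non-null value."""
    non_null = [v for v in values if v is not None]
    return max(non_null, key=len) if non_null else None


def most_recent(values_with_time):
    """MostRecent: newest non-null value among (timestamp, value) pairs."""
    dated = [(t, v) for t, v in values_with_time if v is not None]
    return max(dated, key=lambda tv: tv[0])[1] if dated else None


def choose(values_by_source, source):
    """Choose(source): value from one particular, preferred source."""
    return values_by_source.get(source)
```

For example, `coalesce([None, "Felix", "F."])` yields `"Felix"`, while `longest` applied to the same list also prefers `"Felix"` but for a different reason, which is why the choice of function is made per attribute.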
Classification of Functions
conflict resolution strategies
  conflict ignorance: Escalate
  conflict avoidance
    instance based: Coalesce, Concat
    metadata based: Choose, ChooseDepending
  conflict resolution
    instance based
      deciding: MIN, MAX, Random, Vote
      mediating: AVG, SUM
    metadata based
      deciding: MostRecent
      mediating: CommonAncestor, MostAbstract, MostSpecific
Data Fusion in MS Outlook 2007
Data Fusion Goals
Source 1(A,B,C) and Source 2(A,B,D); "-" marks a NULL value.

Case | Source 1 | Source 2 | Padded to (A,B,C,D) | Fused result
Identical tuples | a, b, - | a, b, - | a, b, -, - and a, b, -, - | a, b, -, -
Subsumed tuples | a, b, c | a, b, - | a, b, c, - and a, b, -, - | a, b, c, -
Complementing tuples | a, b, c | a, b, d | a, b, c, - and a, b, -, d | a, b, c, d
Conflicting tuples | a, b, c | a, e, d | a, b, c, - and a, e, -, d | a, f(b,e), c, d
Relational Operators – Overview
• Identical tuples
  - UNION, OUTER UNION
• Subsumed tuples (uncertainty)
  - MINIMUM UNION
• Complementing tuples (uncertainty)
  - COMPLEMENT UNION, MERGE
• Conflicting tuples (contradiction)
  - Relational approaches: Match, Group, Fuse, …
• Other approaches
  - Possible worlds, probabilistic answers, consistent answers
Minimum Union
• Union: Elimination of exact duplicates
• Minimum Union: Elimination of subsumed tuples
  - Outerunion
  - Subsumption
• Rewriting in SQL using DWH extensions (Windows) and assuming existence of favorable ordering [RPZ04]

A tuple t1 subsumes a tuple t2 if it has the same schema, has fewer NULL values, and coincides with t2 in all of t2's non-NULL values.

Example (⊥ = NULL):
Left input (A, B, C): (a, b, c), (e, f, g), (m, n, o)
Right input (A, B, D): (a, b, ⊥), (e, f, h), (m, p, ⊥)
Outer union (A, B, C, D): (a, b, c, ⊥), (a, b, ⊥, ⊥), (e, f, g, ⊥), (e, f, ⊥, h), (m, n, o, ⊥), (m, p, ⊥, ⊥)
Minimum union R (A, B, C, D): (a, b, c, ⊥), (e, f, g, ⊥), (e, f, ⊥, h), (m, n, o, ⊥), (m, p, ⊥, ⊥)
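The subsumption test and the minimum union can be written down directly from the definition. The sketch below is illustrative only: it assumes tuples as Python tuples with `None` for NULL, hard-codes the two schemata (A, B, C) and (A, B, D) of the slide's example, and uses a quadratic pairwise check rather than the SQL rewriting of [RPZ04].

```python
def subsumes(t1, t2):
    """t1 subsumes t2: same schema, fewer NULLs, equal wherever t2 is non-NULL."""
    if len(t1) != len(t2):
        return False
    fewer_nulls = t1.count(None) < t2.count(None)
    agrees = all(b is None or a == b for a, b in zip(t1, t2))
    return fewer_nulls and agrees


def outer_union(left, right):
    """Pad (A,B,C) and (A,B,D) tuples to (A,B,C,D); the set drops exact duplicates."""
    padded = {(a, b, c, None) for a, b, c in left}
    padded |= {(a, b, None, d) for a, b, d in right}
    return padded


def minimum_union(left, right):
    """Outer union followed by removal of every subsumed tuple."""
    tuples = outer_union(left, right)
    return {t for t in tuples if not any(subsumes(u, t) for u in tuples)}


# The example relations from the Minimum Union slide.
left = [("a", "b", "c"), ("e", "f", "g"), ("m", "n", "o")]
right = [("a", "b", None), ("e", "f", "h"), ("m", "p", None)]
result = minimum_union(left, right)
```

On this input only (a, b, ⊥, ⊥) is removed, because it is subsumed by (a, b, c, ⊥); the two e-tuples have the same number of NULLs and therefore neither subsumes the other.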
Full Disjunction
• Represents all possible combinations of source tuples
  - Full outer join on all common attributes
  - All combinations for more than two sources
  - Minimum union over results
• Combines complementing tuples (only inter-source)
• Algorithms: [GL94,RU96,CS05]

Example (⊥ = NULL):
Left input (A, B, C): (a, b, c), (e, f, g), (k, ⊥, o), (k, m, ⊥)
Right input (A, B, D): (a, b, ⊥), (e, f, h), (m, p, ⊥), (k, q, r)
Full outer join |⋈| (A, B, C, D): (a, b, c, ⊥), (e, f, g, h), (m, p, ⊥, ⊥), (k, ⊥, o, ⊥), (k, m, ⊥, ⊥), (k, q, ⊥, r)
Full disjunction R (A, B, C, D): (a, b, c, ⊥), (e, f, g, h), (m, p, ⊥, ⊥), (k, ⊥, o, ⊥), (k, m, ⊥, ⊥), (k, q, ⊥, r)
Complement Union – Proposal
• Elimination of complementing tuples
  - Outerunion
  - Complementation
• No known SQL rewriting
• Includes duplicate removal and subsumption

A tuple t1 complements a tuple t2 if it has the same schema and coincides with t2 in all attributes where both have non-NULL values.

Example (⊥ = NULL):
Left input (A, B, C): (a, b, c), (e, f, g), (m, n, o)
Right input (A, B, D): (a, b, ⊥), (e, f, h), (m, p, ⊥)
Outer union (A, B, C, D): (a, b, c, ⊥), (a, b, ⊥, ⊥), (e, f, g, ⊥), (e, f, ⊥, h), (m, n, o, ⊥), (m, p, ⊥, ⊥)
Complement union R⇅ (A, B, C, D): (a, b, c, ⊥), (e, f, g, h), (m, n, o, ⊥), (m, p, ⊥, ⊥)
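The pairwise complementation test can be sketched as follows. This is a simplified illustration, assuming tuples with `None` for NULL; it additionally requires that each tuple contributes a value the other lacks, so that identical and subsumed tuples are left to duplicate removal and subsumption. It covers a single pair only and does not implement the full complement union operator over a relation.

```python
def complements(t1, t2):
    """True if t1 and t2 never contradict and each fills a NULL of the other."""
    if len(t1) != len(t2):
        return False
    no_conflict = all(a is None or b is None or a == b for a, b in zip(t1, t2))
    t1_adds = any(a is not None and b is None for a, b in zip(t1, t2))
    t2_adds = any(b is not None and a is None for a, b in zip(t1, t2))
    return no_conflict and t1_adds and t2_adds


def complement(t1, t2):
    """Merge two complementing tuples attribute by attribute."""
    return tuple(a if a is not None else b for a, b in zip(t1, t2))


# (e, f, g, NULL) and (e, f, NULL, h) from the slide's example.
fused = complement(("e", "f", "g", None), ("e", "f", None, "h"))
```

The m-tuples of the example are not merged, because they contradict each other in attribute B (n vs. p).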
Merge and Prioritized Merge
• Mixes Join and Union to a new operator [GPZ01]
  - Idea: Build two versions for each common attribute, one “favoring” S1, the other “favoring” S2.
• Nulls in a source are replaced using COALESCE.
• Fuses complementing tuples, but only for inter-source duplicates
• Prioritization possible: Removes conflicting tuples from right relation.

( SELECT R.A, COALESCE(R.B, S.B), R.C, S.D
  FROM R LEFT OUTER JOIN S ON R.A = S.A )
UNION
( SELECT S.A, COALESCE(S.B, R.B), R.C, S.D
  FROM R RIGHT OUTER JOIN S ON R.A = S.A )
Merge and Prioritized Merge: Example

A is real-world ID; ⊥ = NULL; COAL = COALESCE.
R (A, B, C): (a, b, c), (e, f, g), (m, n, o), (m, n, ⊥), (q, r, s)
S (A, B, D): (a, b, ⊥), (e, f, h), (m, p, ⊥)

R |⋈ S (left outer join, favoring R):
  (a, COAL(b,b), c, ⊥) = (a, b, c, ⊥)
  (e, COAL(f,f), g, h) = (e, f, g, h)
  (m, COAL(n,p), o, ⊥) = (m, n, o, ⊥)
  (m, COAL(n,p), ⊥, ⊥) = (m, n, ⊥, ⊥)
  (q, r, s, ⊥) = (q, r, s, ⊥)

R ⋈| S (right outer join, favoring S):
  (a, COAL(b,b), c, ⊥) = (a, b, c, ⊥)
  (e, COAL(f,f), g, h) = (e, f, g, h)
  (m, COAL(p,n), o, ⊥) = (m, p, o, ⊥)
  (m, COAL(p,n), ⊥, ⊥) = (m, p, ⊥, ⊥)
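The Merge query can be tried on the example relations with Python's built-in `sqlite3` module. This is a sketch under assumptions: the table and column names follow the slide, all columns are stored as text, and the right outer join is written as a left outer join with swapped operands, which is equivalent here and avoids depending on a SQLite version that supports `RIGHT OUTER JOIN`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE R (A TEXT, B TEXT, C TEXT);
    CREATE TABLE S (A TEXT, B TEXT, D TEXT);
    INSERT INTO R VALUES ('a','b','c'), ('e','f','g'), ('m','n','o'),
                         ('m','n',NULL), ('q','r','s');
    INSERT INTO S VALUES ('a','b',NULL), ('e','f','h'), ('m','p',NULL);
""")

# First branch favors R for the common attribute B, second branch favors S.
# UNION (without ALL) removes exact duplicates between the two branches.
merged = set(conn.execute("""
    SELECT R.A, COALESCE(R.B, S.B), R.C, S.D
    FROM R LEFT OUTER JOIN S ON R.A = S.A
    UNION
    SELECT S.A, COALESCE(S.B, R.B), R.C, S.D
    FROM S LEFT OUTER JOIN R ON R.A = S.A
"""))
```

The result contains both versions of the contradicting m-tuples, (m, n, …) and (m, p, …): Merge fuses complementing values but leaves the contradiction in B visible unless the prioritized variant is used.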
Match Join
• Context: AURORA Project [YÖ99]
• Handles columns individually using projections (with IDs)
• Performs UNION on each column across all sources
• Reassembles using FULL OUTER JOINS
• Uses “conflict-tolerant query model” to query these possible worlds.

WITH OU(A,B,C,D) AS (
  ( SELECT A, B, C, NULL AS D FROM U1 )
  UNION
  ( SELECT A, B, NULL AS C, D FROM U2 ) ),
B_V (A,B) AS ( SELECT DISTINCT A, B FROM OU ),
C_V (A,C) AS ( SELECT DISTINCT A, C FROM OU ),
D_V (A,D) AS ( SELECT DISTINCT A, D FROM OU )
SELECT B_V.A, B, C, D
FROM B_V FULL OUTER JOIN C_V ON B_V.A = C_V.A
         FULL OUTER JOIN D_V ON C_V.A = D_V.A
Match Join: Conflict-Tolerant Query Model

• Conflict-tolerant query model
  - Chooses tuples from result of MatchJoin
  - Three semantics
    - HighConfidence, RandomEvidence, PossibleAtAll
  - Resolution functions
    - SUM, AVG, MAX, MIN, ANY, DISCARD

SELECT ID, Name[ANY], Age[MAX]
FROM MatchJoin(U1,U2)
WHERE Age>22
WITH PossibleAtAll
Grouping and Aggregation
• Outer union then group by real-world ID
• Aggregate all other columns using conflict resolving aggregate function
• Efficient implementations
• Catches inter- and intra-source duplicates
• Restricted to built-in aggregate-functions
  - MAX, MIN, AVG, VAR, STDDEV, SUM, COUNT

WITH OU AS (
  ( SELECT A, B, C, NULL AS D FROM U1 )
  UNION ALL   -- or UNION, to drop exact duplicates first
  ( SELECT A, B, NULL AS C, D FROM U2 ) )
SELECT A, MAX(B), MIN(C), SUM(D)
FROM OU
GROUP BY A
FUSE BY
• SQL extensions to resolve uncertainties and contradictions [BN05,BBB+05]
  - FUSE FROM implies OUTER UNION
  - Removes subsumed and duplicate tuples by default
  - FUSE BY declares real-world ID
  - RESOLVE specifies conflict resolution function from catalog
    - Default: COALESCE
• Implemented on top of relational DBMS “XXL”

SELECT ID,
  RESOLVE(Title, Choose(IMDB)),
  RESOLVE(Year, Max), RESOLVE(Director),
  RESOLVE(Rating), RESOLVE(Genre, Concat)
FUSE FROM IMDB, Filmdienst
FUSE BY (ID)
ON ORDER Year DESC
Summary of Operators
Operator | Duplicates | Subsumed tuples | Complementing tuples | Contradictions
Union, Outer Union    
Minimum Union    
Full Disjunction   
(inter-source)
Complement Union    
Merge    
(inter-source) (inter-source)
MatchJoin    
+ CTQM 
Group By    
Fuse By    

FuSem
• Tool to query and fuse data from diverse data sources [BDN07]
• Based on HumMer project [BBB+05].
• http://www.hpi.uni-potsdam.de/naumann/sites/fusem/
• Explore data and find interesting subsets
• Execute, explore and compare five different data fusion semantics, specified in their respective syntax:
  - SQL (and extensions, such as Subsumption)
  - Merge
  - MatchJoin
  - FuseBy
  - ConQuer
What else is there?
• Consistent Query Answering
  - Avoid conflicts and report only certain tuples
  - Those that appear in every repair [FFM05]
• “Possible worlds” models
  - Build all possible solutions, annotated with likelihood
  - Yes/No/Maybe [DeM89]
  - Probability value [LSS94]
• Probabilistic databases [SD05]
  - Extend algebra to produce probabilities
  - Extend query language to query and export probabilities
Outline
• Data fusion in the integration process
• Foundations of data fusion
  - Conflict resolution strategies and functions
  - Conflict resolution operators
• Advanced truth-discovery techniques
• Data fusion in existing integration systems
• Open problems
Basic Strategies
conflict resolution strategies
  conflict ignorance
  conflict avoidance
    instance based
    metadata based
  conflict resolution
    instance based: deciding, mediating
    metadata based: deciding, mediating
Intuitions
• Data sources are of different quality and we trust data from accurate sources more
Intuitions
• Data sources are of different quality and we trust data from accurate sources more
• The real world is dynamic and the true value often evolves over time
  - E.g., person affiliation, business contact phone
Intuitions
• Data sources are of different quality and we trust data from accurate sources more
• The real world is dynamic and the true value often evolves over time
  - E.g., person affiliation, business contact phone
• Data sources can copy from each other and errors can be propagated quickly
Advanced Truth-Discovery Techniques
• Data sources are of different quality and we trust data from accurate sources more → Consider accuracy of sources
• The real world is dynamic and the true value often evolves over time → Consider freshness of sources
  - E.g., person affiliation, business contact phone
• Data sources can copy from each other and errors can be propagated quickly → Consider dependence between sources
Trust Accurate Sources

• Considering accuracy can often improve truth discovery

Object | S1 | S2 | S3
Stonebraker | MIT | Berkeley | MIT
Dewitt | MSR | MSR | UWisc
Bernstein | MSR | MSR | MSR
Carey | UCI | AT&T | BEA
Halevy | Google | Google | UW

S1 is more accurate; trusting it more can help find the correct affiliation for Carey
Find Trustable Sources (I)
• Deciding authority based on link analysis and source popularity
  - Survey: “Link analysis ranking: algorithms, theory, and experiments” [Borodin et al., 05]
  - PageRank [Brin and Page, 98]
  - Authority-hub analysis [Kleinberg, 98]
Find Trustable Sources (II)
• Assign a global trust rating to each data source based on its behavior in a P2P network
  - TrustMe [Singh and Liu, 03]
  - EigenTrust [Kamvar et al., 03]
    - Peer i & j: $s_{ij} = \mathrm{sat}(i,j) - \mathrm{unsat}(i,j)$
    - $c_{ij} = \dfrac{\max(s_{ij}, 0)}{\sum_{j} \max(s_{ij}, 0)}$
    - $t_{ij} = \sum_{k} c_{ik}\, c_{kj}$
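The two EigenTrust formulas on this slide translate into a few lines of Python. The sketch below is an illustration only: it assumes the local scores s_ij are given as a square list of lists, the score matrix `s` is made-up example data, and the uniform fallback for a peer without any positive experience is an assumption of this sketch (it is not stated on the slide).

```python
def normalize_local_trust(s):
    """c_ij = max(s_ij, 0) / sum_j max(s_ij, 0), computed row by row."""
    c = []
    for row in s:
        clipped = [max(x, 0) for x in row]
        total = sum(clipped)
        if total > 0:
            c.append([x / total for x in clipped])
        else:
            # Assumption: with no positive experience, trust all peers equally.
            c.append([1 / len(row)] * len(row))
    return c


def transitive_trust(c):
    """t_ij = sum_k c_ik * c_kj: trust of i in j via i's acquaintances k."""
    n = len(c)
    return [[sum(c[i][k] * c[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


# Hypothetical local scores s_ij = sat(i, j) - unsat(i, j) for three peers.
s = [[0, 3, 1],
     [2, 0, -1],
     [0, 4, 0]]
c = normalize_local_trust(s)
t = transitive_trust(c)
```

Applying `transitive_trust` repeatedly corresponds to asking friends of friends, which is how a global trust rating can emerge from purely local scores.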
Find Trustable Sources (III)
• Compute accuracy of sources
  - Corroborating answers from web sources [Wu and Marian, 07]
  - TruthFinder [Yin et al., 07]
  - Solomon [Dong et al., 09a]

$A(S) = \underset{v \in V(S)}{\mathrm{Avg}}\; P(v)$

$V(S)$: values provided by $S$; $P(v)$: probability of value $v$ being true
How to compute $P(v)$?
Apply Source Accuracy in Truth Discovery [Yin et al., 07] [Dong et al., 09a]

• Input:
  - Object $O$
  - $Dom(O) = \{v_0, v_1, \ldots, v_n\}$
  - Observation $\Phi$ on $O$
• Output: $\Pr(v_i \text{ true} \mid \Phi)$ for each $i = 0, \ldots, n$ (sum up to 1)
• According to the Bayes Rule, we need to know $\Pr(\Phi \mid v_i \text{ true})$
  - Assuming independence of sources, we need to know $\Pr(\Phi(S) \mid v_i \text{ true})$
  - If $S$ provides $v_i$: $\Pr(\Phi(S) \mid v_i \text{ true}) = A(S)$
  - If $S$ does not provide $v_i$: $\Pr(\Phi(S) \mid v_i \text{ true}) = (1 - A(S))/n$
Model and Algorithm [Dong et al., 09a]

Properties
• A value provided by more accurate sources has a higher probability to be true
• Assuming uniform accuracy, a value provided by more sources has a higher probability to be true

Source accuracy: $A(S) = \underset{v \in V(S)}{\mathrm{Avg}}\; P(v)$
Source trustworthiness: $A'(S) = \ln \dfrac{n A(S)}{1 - A(S)}$
Value confidence: $C(v) = \sum_{S \in S(v)} A'(S)$
Value probability: $P(v) = \dfrac{e^{C(v)}}{\sum_{v_0 \in D(O)} e^{C(v_0)}}$

• Consider value similarity: $C^*(v) = C(v) + \rho \sum_{v' \neq v} C(v') \cdot sim(v, v')$
• Continue until source accuracy converges


An Example

Object | S1 | S2 | S3
Stonebraker | MIT | Berkeley | MIT
Dewitt | MSR | MSR | UWisc
Bernstein | MSR | MSR | MSR
Carey | UCI | AT&T | BEA
Halevy | Google | Google | UW

Source accuracy per round:
Round | S1 | S2 | S3
1 | .69 | .57 | .45
2 | .81 | .63 | .41
3 | .87 | .65 | .40
4 | .90 | .64 | .39
5 | .93 | .63 | .40
6 | .95 | .62 | .40
7 | .96 | .62 | .40
8 | .97 | .61 | .40

Value confidence for Carey per round:
Round | UCI | AT&T | BEA
1 | 1.61 | 1.61 | 1.61
2 | 2.40 | 1.89 | 1.42
3 | 3.05 | 2.16 | 1.26
4 | 3.51 | 2.23 | 1.19
5 | 3.86 | 2.20 | 1.18
6 | 4.17 | 2.15 | 1.19
7 | 4.47 | 2.11 | 1.20
8 | 4.76 | 2.09 | 1.20
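The iteration between source accuracy and value probability can be sketched in Python on the three-source table. This is a simplified illustration, not the authors' implementation: the initial accuracy of 0.8, the number of false values `n_false = 10`, the fixed number of rounds, the clamping of accuracies to [0.01, 0.99], and the normalization over observed values only are all assumptions of this sketch, and value similarity is ignored. The numbers it produces are therefore not expected to match the table above exactly.

```python
import math


def discover_truth(claims, n_false=10, rounds=10, init_accuracy=0.8):
    """claims: {source: {object: value}}. Returns (truth, accuracy, prob)."""
    accuracy = {s: init_accuracy for s in claims}
    objects = {o for values in claims.values() for o in values}
    prob = {}
    for _ in range(rounds):
        # Source trustworthiness A'(S) = ln(n A(S) / (1 - A(S))).
        score = {s: math.log(n_false * a / (1 - a)) for s, a in accuracy.items()}
        for o in objects:
            # Value confidence C(v): sum of A'(S) over the sources providing v.
            conf = {}
            for s, values in claims.items():
                if o in values:
                    conf[values[o]] = conf.get(values[o], 0.0) + score[s]
            # Value probability P(v), normalized over the observed values only.
            norm = sum(math.exp(c) for c in conf.values())
            prob[o] = {v: math.exp(c) / norm for v, c in conf.items()}
        # Source accuracy A(S): average probability of the values S provides,
        # clamped so that the logarithm above stays finite.
        for s, values in claims.items():
            avg = sum(prob[o][v] for o, v in values.items()) / len(values)
            accuracy[s] = min(max(avg, 0.01), 0.99)
    truth = {o: max(p, key=p.get) for o, p in prob.items()}
    return truth, accuracy, prob


claims = {
    "S1": {"Stonebraker": "MIT", "Dewitt": "MSR", "Bernstein": "MSR",
           "Carey": "UCI", "Halevy": "Google"},
    "S2": {"Stonebraker": "Berkeley", "Dewitt": "MSR", "Bernstein": "MSR",
           "Carey": "AT&T", "Halevy": "Google"},
    "S3": {"Stonebraker": "MIT", "Dewitt": "UWisc", "Bernstein": "MSR",
           "Carey": "BEA", "Halevy": "UW"},
}
truth, accuracy, prob = discover_truth(claims)
```

In the first round the three values for Carey are tied; the tie is broken in later rounds because S1 agrees with the majority on every other object and thereby earns the highest accuracy.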
A Dynamic World
• True values can evolve over time
• A subtle third case: out-of-date

Latest (year, value) update of each source:
Object | S1 | S2 | S3
Stonebraker | (03, MIT) | (00, Berkeley) | (06, MIT)
Dewitt | (09, MSR) | (08, MSR) | (01, UWisc)
Bernstein | (00, MSR) | (00, MSR) | (01, MSR)
Carey | (09, UCI) | (05, AT&T) | (06, BEA)
Halevy | (07, Google) | (05, Google) | (06, UW)


A Dynamic World
• True values can evolve over time
• A subtle third case: out-of-date
• Low-quality data can be caused by different reasons

Object | True history | S1 | S2 | S3
Stonebraker | (Ѳ, Berkeley), (02, MIT) | (03, MIT) | (00, Berkeley) OUT-OF-DATE! | (01, Berkeley), (06, MIT)
Dewitt | (Ѳ, UWisc), (08, MSR) | (00, UWisc), (09, MSR) | (00, UW), (01, UWisc), (08, MSR) | (01, UWisc) OUT-OF-DATE!
Bernstein | (Ѳ, MSR) | (00, MSR) | (00, MSR) | (01, MSR)
Carey | (Ѳ, Propell), (02, BEA), (08, UCI) | (04, BEA), (09, UCI) | (05, AT&T) ERR! | (06, BEA) OUT-OF-DATE!
Halevy | (Ѳ, UW), (05, Google) | (00, UW), (07, Google) | (00, UWisc), (02, UW), (05, Google) | (01, UWisc), (06, UW) SLOW!
Refine Accuracy of Sources [Dong et al., 09b]

Accuracy is refined into three measures:
• Coverage: how many transitions are captured
• Exactness: how many transitions are not mis-captured
• Freshness: how quickly transitions are captured

[Figure: timeline for Dewitt (true value UWisc from Ѳ (2000), MSR from 2008) against the updates of a source S in 2003, 2005 and 2007, marking capturable and mis-capturable points and which updates are captured or mis-captured]

Coverage = #Captured / #Capturable (e.g., 1/4 = .25)
Exactness = 1 - #Mis-Captured / #Mis-Capturable (e.g., 1 - 2/5 = .6)
Freshness(Δ) = #(Captured w. length <= Δ) / #Captured (e.g., F(0)=0, F(1)=0, F(2)=1/1=1, …)
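The three measures are simple ratios, so the example numbers on this slide can be checked directly. The sketch below assumes the counts are already known; the function names and the representation of captured transitions as a list of capture delays are hypothetical, and the single delay of 2 is inferred from the slide's F(0)=0, F(1)=0, F(2)=1/1.

```python
def coverage(captured, capturable):
    """Fraction of capturable transitions that the source captured."""
    return captured / capturable


def exactness(mis_captured, mis_capturable):
    """One minus the fraction of mis-capturable points that were mis-captured."""
    return 1 - mis_captured / mis_capturable


def freshness(capture_delays, delta):
    """Fraction of captured transitions captured with delay at most delta."""
    return sum(1 for d in capture_delays if d <= delta) / len(capture_delays)


# Numbers from the slide's example: 1 of 4 capturable transitions captured,
# 2 of 5 mis-capturable points mis-captured, one capture with delay 2.
cov = coverage(1, 4)
exa = exactness(2, 5)
fresh = [freshness([2], delta) for delta in (0, 1, 2)]
```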
Freshness Measures in Other Work
 Other work on data freshness: compare a materialized view with the original source
 [Peralta, Ph.D. Thesis'06]: timeliness, currency
 [Guo et al., 05]: completeness, consistency, currency
 [Olston and Widom, 05]: divergence
 [Labrinidis and Roussopoulos, 04]: QoD (freshness)
 [Theodoratos and Bouzeghoub, 01]: consistency
 [Cho and Garcia-Molina, 00]: freshness, age
Discover Evolving True Values
[Flowchart: decide the initial value v0, then repeatedly decide the next transition (t,v); terminate when there are no more transitions.]
 Decide the initial value: according to Bayes' rule, we need to know
 Pr(Ф(S)|vi) for each value vi
 If S provides vi: E(S)C(S)
 If S does not provide any value: E(S)(1-C(S))
 If S provides another value: (1-E(S))/n
 Pr(Ф(S)|the object does not exist initially)
 If S does not provide any value: E(S)
 If S provides a value: (1-E(S))/(n+1)
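A minimal sketch of how these per-source likelihoods could be combined into a posterior over the initial value. It assumes independent sources and a uniform prior over hypotheses, neither of which the slide states; the names `NO_OBJECT`, `likelihood`, and `initial_value_posterior` and the profile numbers are hypothetical.

```python
NO_OBJECT = "<object does not exist initially>"


def likelihood(observed, hypothesis, E, C, n):
    """Pr(observation of one source | hypothesis), following the cases above.

    observed is the value the source provides, or None if it provides nothing.
    E is exactness, C is coverage, n is the number of wrong values.
    """
    if hypothesis == NO_OBJECT:
        return E if observed is None else (1 - E) / (n + 1)
    if observed is None:
        return E * (1 - C)
    if observed == hypothesis:
        return E * C
    return (1 - E) / n


def initial_value_posterior(observations, profiles, candidates, n):
    """Normalize the product of source likelihoods (uniform prior assumed)."""
    scores = {}
    for hypothesis in list(candidates) + [NO_OBJECT]:
        score = 1.0
        for source, observed in observations.items():
            E, C = profiles[source]
            score *= likelihood(observed, hypothesis, E, C, n)
        scores[hypothesis] = score
    total = sum(scores.values())
    return {h: s / total for h, s in scores.items()}


# Earliest values the three sources give for Halevy; profiles are made up.
observations = {"S1": "UW", "S2": "UWisc", "S3": "UWisc"}
profiles = {s: (0.8, 0.8) for s in observations}  # (exactness, coverage)
posterior = initial_value_posterior(observations, profiles, ["UW", "UWisc"], n=10)
print(max(posterior, key=posterior.get))
```

With identical source profiles the value backed by two sources wins, which is why source profiles have to be refined over several rounds before the result changes.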
[Figure: timelines of the object O (value v' from time t', value v from time t) and of sources S1, S2, S3, …, Sk, each showing a change from v' to v.]
 Decide the next transition (t,v): according to Bayes' rule, we need to know
 Pr(Ф(S)|(ti,vj)) for each time ti and value vj
 If S provides vj at time t: E(S)C(S)F(S, t-ti)
 If S does not update any more: E(S)(1-C(S)F(S, tn-ti))
 If S makes a wrong update: (1-E(S))/(n(tn-t'))
(tn: the last observation point; t': time of the previous update)
 Pr(Ф(S)|no more transition): similarly computed
 If S does not update any more: E(S)
 If S makes an update: (1-E(S))/((n+1)(tn-t'))
[Figure: each case is illustrated by a small timeline of the object O, which changes to vj at ti, and of the source S.]
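The cases above translate directly into code. The sketch below only restates the formulas; the function names, the case labels, and the step-shaped freshness function are assumptions made for illustration, and the denominators are read as n(tn-t') and (n+1)(tn-t').

```python
def likelihood_given_transition(case, E, C, F, n, ti, tn, t=None, t_prev=None):
    """Pr(observation of S | the true transition is (ti, vj)).

    case is "captures", "no_update", or "wrong_update".
    F maps a delay to the freshness of S at that delay.
    """
    if case == "captures":        # S provides vj at time t
        return E * C * F(t - ti)
    if case == "no_update":       # S does not update any more
        return E * (1 - C * F(tn - ti))
    if case == "wrong_update":    # S updates to some other value
        return (1 - E) / (n * (tn - t_prev))
    raise ValueError(f"unknown case: {case}")


def likelihood_given_no_transition(updated, E, n, tn, t_prev):
    """Pr(observation of S | there is no more transition)."""
    if not updated:
        return E
    return (1 - E) / ((n + 1) * (tn - t_prev))


def step_freshness(delay):
    """Hypothetical freshness: every captured transition takes 2 time units."""
    return 1.0 if delay >= 2 else 0.0


print(likelihood_given_transition("captures", 0.8, 0.5, step_freshness, 10,
                                  ti=2005, tn=2009, t=2007))
print(likelihood_given_transition("wrong_update", 0.8, 0.5, step_freshness, 10,
                                  ti=2005, tn=2009, t_prev=2004))
```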
An Example
       | True values           | S1                     | S2                                  | S3
Halevy | (Ѳ, UW), (05, Google) | (00, UW), (07, Google) | (00, UWisc), (02, UW), (05, Google) | (01, UWisc), (06, UW)

Affiliation for Halevy:
Rnd1: UWisc (2000), UW (2002), Google (2004)
Rnd2: UW (2000), Google (2004)
Rnd3: UW (2000), Google (2005)
Advanced Truth-Discovery Techniques
 Data sources are of different quality, and we trust data from accurate sources more => Consider accuracy of sources
 The real world is dynamic, and the true value often evolves over time => Consider freshness of sources
 E.g., person affiliation, business contact phone
 Data sources can copy from each other, and errors can be propagated quickly => Consider dependence between sources
Copied Data Can Change Truth Discovery Results

 Previous methods assume source independence

            | S1     | S2       | S3    | S4    | S5
Stonebraker | MIT    | Berkeley | MIT   | MIT   | MS
Dewitt      | MSR    | MSR      | UWisc | UWisc | UWisc
Bernstein   | MSR    | MSR      | MSR   | MSR   | MSR
Carey       | UCI    | AT&T     | BEA   | BEA   | BEA
Halevy      | Google | Google   | UW    | UW    | UW
Voting for Independent Sources
 10 sources voting for an object
[Figure: value 1 is provided by S4, S5, S6, S7, S8 (Count = 5); value 2 by S1, S2, S3 (Count = 3); value 3 by S9, S10 (Count = 2).]
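Voting w. Knowledge of Copying
 10 sources voting for an object
[Figure: the same 10 sources with copying relationships drawn; counting only independent votes gives Count = 2 for value 1, Count = 1 for value 2, and Count = 1 for value 3.]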
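A small sketch of the difference between the two counts. The votes follow the figure, but the copy edges are hypothetical: the extracted slide does not say who copies from whom, so they were chosen only to reproduce the counts 2, 1, 1. The sketch also assumes a copier always repeats its original's value.

```python
from collections import Counter

# Which value each source votes for (as in the figure).
votes = {
    "S1": 2, "S2": 2, "S3": 2,
    "S4": 1, "S5": 1, "S6": 1, "S7": 1, "S8": 1,
    "S9": 3, "S10": 3,
}

# Hypothetical copy edges, copier -> original.
copies_from = {
    "S1": "S2", "S3": "S2",
    "S6": "S5", "S7": "S5", "S8": "S5",
    "S10": "S9",
}


def naive_counts(votes):
    """Every source counts once, as if all sources were independent."""
    return Counter(votes.values())


def independent_counts(votes, copies_from):
    """Ignore the vote of any source known to copy from another source."""
    return Counter(value for source, value in votes.items()
                   if source not in copies_from)


print(naive_counts(votes))
print(independent_counts(votes, copies_from))
```

When copying is only known with some probability, a hard filter like this no longer applies, which is the question the next slide raises.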
Voting w. Probabilistic Copying
 10 sources voting for an object
 How to compute vote count?
 How to detect copying?
[Figure: the same 10 sources and three values, with copying relationships labeled by probabilities (1, .4, .7); the vote counts are now unknown (Count = ?).]
Considering Dependence
 Opinion pooling: combine probability distributions from multiple experts
 Combination of opinions [Chang, Ph.D. thesis'85]
 Reconciliation of probability distributions [Lindley, 83]
 Updating of belief in the light of someone else's opinion [French, 80]
 Data fusion w. source dependence
 [Dong et al., 09a][Dong et al., 09b]
See tomorrow's talks in "Data Integration I"
Outline

 Data fusion in the integration process


 Foundations of data fusion
 Conflict resolution strategies and functions
 Conflict resolution operators

 Advanced truth-discovery techniques


 Data fusion in existing integration systems
 Open problems

Web Integration: Google Fusion Tables

 Allows discussion of values between users


Commercial DI Tools

Source: Gartner
 Typical ETL tools support rule-based fusion
 IIS (IBM Information Server)
 SSIS (Microsoft's SQL Server Integration Services)
 Etc. See details in survey [Bleiholder and Naumann, 08]
Research DI Systems w. Awareness of Data Conflicts
System         | Conflict types          | Methodology | Strategy                             | Specification
Multibase      | Schematic, data         | Resolution  | Choose, Avg, Min, Max, Sum, …        | Manually, in query
Hermes         | Schematic, data         | Resolution  | MostRecent, Choose                   | Manually, in mediator
Fusionplex     | Schematic, object, data | Resolution  | MostRecent, Min, Max, Avg, …         | Manually, in query
HumMer         | Schematic, object, data | Resolution  | MostAbstract, Vote, Min, ChooseDepen… | Manually, in query
Ajax           | Schematic, object, data | Resolution  | Various                              | Manually, in workflow definition
TSIMMIS        | Schematic, data         | Avoidance   | Choose                               | Manually, rules in mediator
SIMS/Ariadne   | Schematic, data         | Avoidance   | Choose                               | Automatically
Infomix        | Schematic, data         | Avoidance   | onlyConsistentValue                  | Automatically
Hippo          | Schematic, object, data | Avoidance   | onlyConsistentValue                  | Automatically
ConQuer        | Schematic, object, data | Avoidance   | onlyConsistentValue                  | Automatically
Rainbow        | Schematic, object, data | Avoidance   | onlyConsistentValue                  | Automatically
Pegasus        | Schematic, data         | Ignorance   | Escalate                             | Manually
Nimble         | Unknown                 | Ignorance   | Escalate                             | Manually
Carnot         | Schematic               | Ignorance   | Escalate                             | Automatically
InfoSleuth     | Schematic               | Ignorance   | Escalate                             | Unknown
Potter's Wheel | Schematic               | Ignorance   | Escalate                             | Manually, transformation
See details in survey [Bleiholder and Naumann, 08]
Other DI Systems
 Research DI systems
 Trio: including accuracy and lineage into data model
 Information Manifold
 Garlic
 Disco (Distributed Information Search Component)
 etc.
 Peer data management systems
 Orchestra: allowing multiple viewpoints
 Hyper: isolating the minimum amount of data to reach consistency
See details in survey [Bleiholder and Naumann, 08]

Open Problems
 Accuracy of fusion
 Efficiency of fusion
 Usability of fusion
 Interaction with other components of data integration
Accuracy of Fusion (I)
 Challenge 1: Correlated values
 E.g.1, (firstName, lastName) from 4 sources
 S1: (Xin, Dong)
 S2: (Xin Luna, Dong)
 S3: (Dong, Xin)
 S4: (Dong, Xin Luna)
Voting => (Dong, Dong)
 E.g.2, (ISBN, authors) from 3 sources
 S1: (**1, Peter Loshin) (**2, Peter Loshin)
 S2: (**1, Pete Loshin)
 S3: (**1, Pete Loshin)
Voting => (**1, Pete Loshin) (**2, Peter Loshin)
 Current effort: ChooseDepending(val, col)
 Directions: consider correlation at the attribute level and at the instance level.
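The first example can be reproduced with a few lines of Python. This is a sketch of plain per-attribute majority voting, not of ChooseDepending; the function name is hypothetical.

```python
from collections import Counter

# (firstName, lastName) as provided by the four sources in E.g.1.
records = {
    "S1": ("Xin", "Dong"),
    "S2": ("Xin Luna", "Dong"),
    "S3": ("Dong", "Xin"),
    "S4": ("Dong", "Xin Luna"),
}


def vote_per_attribute(records):
    """Pick the most frequent value of each attribute independently."""
    columns = zip(*records.values())
    return tuple(Counter(column).most_common(1)[0][0] for column in columns)


fused = vote_per_attribute(records)
print(fused)  # ('Dong', 'Dong')
```

No source provides the fused tuple: voting on each column separately ignores the fact that first and last name are correlated within a record.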
Accuracy of Fusion (II)
 Challenge 2: Different formatting styles
 E.g., (ISBN, authors) from 4 sources
Src1
(**1, Pete Loshin)
(**2, Dennis Suhanovs)
(**3, Zhigang Xiang, Roy A Plastock)
(**4, Peter Aiken, David M Allen)…
Src2 (skips middle names)
(**1, Pete Loshin)
(**2, Dennis Suhanovs)
(**3, Zhigang Xiang, Roy Plastock)
(**4, Peter Aiken, David Allen)…
Src3
(**1, Pete Loshin)
(**2, Dennis Suhanovs)
(**3, Zhigang Xiang)
(**4, David Allen, Peter Aiken)…
Src4 (only first authors)
(**1, Pete Loshin)
(**2, Dennis uhanovs)
(**3, Zhigang Xiang)
(**4, Peter Aiken)…
 Current effort: consider value similarity
 Directions: consider formatting styles used by each source.
Accuracy of Fusion (III)
 Challenge 3: Source profiling
 Current effort: accuracy (coverage, exactness, freshness)
 Data properties can be different for different categories of data
 Source A is a vertical source on restaurants
 Source B knows very well about NYC
 Data properties can evolve over time
 Source C improves its data over time
 Directions: partition data into different portions and profile on each portion
Efficiency of Fusion (I)
 Challenge 4: Incremental fusion
 When we have more data sources (e.g., Src4) or lose some data sources, shall we do data fusion from scratch?
 When more data come, shall we start from scratch?
 Directions: maintain metadata or statistics, retain data lineage.
(The slide repeats the Src1 to Src4 book listings from Challenge 2 as the running example.)
Efficiency of Fusion (II)
 Challenge 5: Runtime fusion
 In some applications fusing data upfront is infeasible
[Figure: a mediated schema over data sources D1, D2, D3, D4, D5.]
 Directions: maintain source profiles by sampling; emphasize efficiency.
Usability of Fusion (I)
 Challenge 6: Personalized fusion
 Express preference on certain sources
 Emphasize certain property; e.g., up-to-date vs. high coverage
 Use certain formats; e.g., full author list vs. only first author
 Current effort:
 Function choose(src)
 Operator Prioritized-Merge
 Directions:
 A language to express such user preferences
 Algorithms for efficient execution.
Usability of Fusion (II)
 Challenge 7: User feedback
 Correct certain errors
 Directions:
 Critical questions that can best improve the fusion results
 A way for users to browse source data and fusion results, and correct mistakes
 Quickly fixing errors and propagation to related items
Usability of Fusion (III)

 Challenge 8: Data Lineage


 Legal requirement
 Application requirement: e.g.,
fusing two customers
 HCI requirement: HOW did you
merge the data? And WHY?
 Directions:
 Effective representation of
lineage information
 Explanation of merging decisions
 Effective way to find
disappeared data items
 Reversibility and repeatability

Interaction with Other Components of DI (I)
 Challenge 9: Fuse data w. different schemas
 E.g., Contact information from three sources
 S1: (pid = “1”, work phone = “1234”, home phone = “8765”, mobile phone = “4321”)
 S2: (pid = “1”, daytime phone = “1234”, evening phone = “4321”)
 S3: (pid = “1”, phone = “4321”)
 Directions: Combine data fusion w. schema matching
Interaction with Other Components of DI (II)
 Challenge 10: Distinguish wrong values from alternative representations of correct values
 E.g., A quiz
A Quiz
Which type of listing are they?
A: are the same business
B: are different businesses sharing the same phone#
C: are different businesses, only one with correct phone#
 Directions: Combine data fusion w. record linkage
Conclusions
 Foundations
 Strategies and functions
 Operators
 Advanced techniques
 Consider accuracy
 Consider freshness
 Consider dependence
 Open problems
 Accuracy
 Efficiency
 Usability
 Interaction with other components of DI
[Figure: the integration process, from sources through schema mapping, duplicate detection, and data fusion to cleaned data.]
References
 Survey
 [BN08] J. Bleiholder, F. Naumann. Data Fusion. ACM Computing Surveys 2009.
 Foundations of Fusion
 [BBB+05] A. Bilke, J. Bleiholder, C. Böhm, K. Draba, F. Naumann and M. Weis. Automatic Data Fusion with HumMer. VLDB demo
2005.
 [BDN07] Jens Bleiholder, Karsten Draba, and Felix Naumann. FuSem - Exploring Different Semantics of Data Fusion (demo) VLDB
demo 2007.
 [BN05] J. Bleiholder, F. Naumann. Declarative Data Fusion - Syntax, Semantics, and Implementation. ADBIS 2005.
 [CS05] S. Cohen and Y. Sagiv. An incremental algorithm for computing ranked full disjunctions. PODS 2005.
 [DeM89] DeMichiel, L. G. Resolving database incompatibility: An approach to performing relational operations over mismatched
domains. TKDE 1989.
 [FFM05] Ariel Fuxman, Elham Fazli, Renee J. Miller. ConQuer: Efficient Management of Inconsistent Databases. SIGMOD 2005.
 [GL94] C. A. Galindo-Legaria. Outerjoins as disjunctions. SIGMOD 1994.
 [GPZ01] S. Greco, L. Pontieri, and E. Zumpano. Integrating and managing conflicting data. International Andrei Ershov Memorial
Conference on Perspectives of System Informatics, 2001.
 [LSS94] Lim, E.-P., Srivastava, J., and Shekhar, S. Resolving attribute incompatibility in database integration: An evidential
reasoning approach. ICDE 1994.
 [RPZ04] J. Rao, H. Pirahesh, and C. Zuzarte. Canonical abstraction for outerjoin optimization. SIGMOD 2004.
 [RU96] A. Rajaraman and J. D. Ullman. Integrating information by outerjoins and full disjunctions. PODS1996.
 [SD05] Dan Suciu and Nilesh Dalvi. Probabilistic Databases. Tutorial at SIGMOD 2005.
 [YÖ99] L. L. Yan and M. T. Özsu. Conflict tolerant queries in AURORA. CoopIS 1999.

 Advanced truth-discovery techniques
 [ESD+09] L. Berti-Equille, A. D. Sarma, X. L. Dong, A. Marian, and D. Srivastava. Sailing the information ocean with awareness of currents: Discovery and application of source dependence. In CIDR, 2009.
 [BRR+05] A. Borodin, G. Roberts, J. Rosenthal, and P. Tsaparas. Link analysis ranking: algorithms, theory, and experiments. ACM
TOIT, 5:231–297, 2005.
 [BP98] S. Brin and L. Page. The anatomy of a large-scale hypertextual Web search engine. Computer Networks and ISDN Systems, 30(1–7):107–117, 1998.
 [CG00] J. Cho and H. Garcia-Molina. Synchronizing a database to improve freshness. In SIGMOD, 2000.
 [Cha85] K. Chang. Combination of opinions: the expert problem and the group consensus problem. PhD thesis, University of California, Berkeley, 1985.
 [DBS09a] X. L. Dong, L. Berti-Equille, and D. Srivastava. Integrating conflicting data: the role of source dependence. VLDB, 2009.
 [DBS09b] X. L. Dong, L. Berti-Equille, and D. Srivastava. Truth discovery and copying detection in a dynamic world. In VLDB,
2009
 [French80] S. French. Updating of belief in the light of someone else’s opinion. Jour. of Roy. Statist. Soc. Ser. A, 143:43–48,
1980.
 [GLR05] H. Guo, P.- A. Larson, and R. Ramakrishnan. Caching with ’good enough’ currency, consistency, and completeness. In
VLDB, 2005.
 [KSG03] S. Kamvar, M. Schlosser, and H. Garcia-Molina. The Eigentrust algorithm for reputation management in P2P networks. In
Proc. of WWW, 2003.
 [Kle98] J. M. Kleinberg. Authoritative sources in a hyperlinked environment. In SODA, 1998.
 [Lin83] D. Lindley. Reconciliation of probability distributions. Oper. Res., 31:866–880, 1983.
 [LR04] A. Labrinidis and N. Roussopoulos. Exploring the tradeoff between performance and data freshness in database-driven
web servers. VLDB J., 13(3):240–255, 2004.
 [OW05] C. Olston and J. Widom. Efficient monitoring and querying of distributed, dynamic data via approximate replication.
IEEE Data Eng. Bull., 28(1):11–18, 2005.
 [SL03] A. Singh and L. Liu. TrustMe: anonymous management of trust relationships in decentralized P2P systems. In IEEE Intl. Conf.
on Peer-to-Peer Computing, 2003.
 [TB01] D. Theodoratos and M. Bouzeghoub. Data currency quality satisfaction in the design of a data warehouse. Int. J. Cooperative Inf. Syst., 10(3):299–326, 2001.
 [WM07] M. Wu and A. Marian. Corroborating answers from multiple web sources. In Proc. of WebDB, 2007.
 [YHY07] X. Yin, J. Han, and P. S. Yu. Truth discovery with multiple conflicting information providers on the Web. In Proc. of SIGKDD, 2007.