Data Fusion
RESOLVING DATA CONFLICTS IN INTEGRATION
Tutorial at VLDB 2009
Xin Luna Dong – AT&T Labs-Research
Felix Naumann – Hasso Plattner Institute (HPI)
Origins of Data Conflicts
Figure: Original ACM Computing Survey [BN08] vs. Scanned
Data Fusion | VLDB 2009 Tutorial | Luna Dong & Felix Naumann
Integrated data
Origins of Data Conflicts: German Names
Origins of Data Conflicts: Difficult Names
Origins of Intra-Source Conflicts
Origins of Inter-Source Conflicts
Locally consistent but globally inconsistent
Different data types
Local spelling variations and conventions
Addresses
St → Street, Ave → Avenue, etc.
R.-Breitscheid-Str. 72 a → Rudolf-Breitscheid-Str. 72A
128 spellings for Frankfurt am Main
Frankfurt a.M., Frankfurt/M, Frankfurt, Frankfurt a. Main, …
Names
Dr. Ing. h.c. F. Porsche AG
Hewlett-Packard Development Company, L.P.
Numerical data
10.000 € = 10T EURO = 10k EUR = 10.000,00€ = 10,000.- €
Phone numbers, birth dates, etc.
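The address conventions listed on this slide can be illustrated with a small normalization sketch. The abbreviation table and both function names are hypothetical and cover only the examples on the slide; a real table would be country- and language-specific.

```python
import re

# Hypothetical abbreviation table, limited to the slide's examples.
ABBREVIATIONS = {"st": "Street", "ave": "Avenue"}


def expand_abbreviations(address):
    """Expand whole-word abbreviations, dropping an optional trailing dot."""
    def replace(match):
        word = match.group(1)
        # Unknown words are returned unchanged, including their dot.
        return ABBREVIATIONS.get(word.lower(), match.group(0))

    return re.sub(r"\b([A-Za-z]+)\b\.?", replace, address)


def normalize_house_number(address):
    """Rewrite house numbers such as '72 a' to '72A'."""
    return re.sub(
        r"(\d+)\s*([a-zA-Z])\b",
        lambda m: m.group(1) + m.group(2).upper(),
        address,
    )
```

Rules like these only remove representational differences; they do not decide which of two genuinely different values is correct.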
Resolution of Data Conflicts?
tables.googlelabs.com
“… focus is on fusing data management and collaboration: merging multiple data sources, discussion of the data, querying, visualization, and Web publishing.”
“The power of data is truly harnessed when you combine data from multiple sources. Fusion Tables enables you to fuse multiple sets of data when they are about the same entities. In database speak, we call this a join on a primary key but the data originates from multiple independent sources.”
Web Integration—Google Fusion Tables
Error correction
Reference tables
Cities, countries, products ...
Similarity measures
Standardization and transformation
Domain-knowledge (meta data)
Conventions (country/region-specific spelling)
Ontologies
Thesauri, dictionaries for homonyms, synonyms, ...
Outlier detection and elimination
And data fusion…
Overview
Information Integration
Source A
<pub>
  <Titel> Federated Database Systems </Titel>
  <Autoren>
    <Autor> Amit Sheth </Autor>
    <Autor> James Larson </Autor>
  </Autoren>
</pub>
Source B
<publication>
  <title> Federated Database Systems for Managing Distributed, Heterogeneous, and Autonomous Databases </title>
  <author> Scheth & Larson </author>
  <year> 1990 </year>
</publication>
Schema Integration, Schema Mapping
<pub>
  <title> </title>
  <Autoren>
    <author> </author>
    <author> </author>
  </Autoren>
  <year> </year>
</pub>
Source A in the integrated schema
<pub>
  <title> Federated Database Systems </title>
  <Autoren>
    <author> Amit Sheth </author>
    <author> James Larson </author>
  </Autoren>
</pub>
Source B in the integrated schema
<pub>
  <title> Federated Database Systems for Managing Distributed, Heterogeneous, and Autonomous Databases </title>
  <Autoren>
    <author> Scheth & Larson </author>
  </Autoren>
  <year> 1990 </year>
</pub>
Fused result (Preserve lineage)
<pub>
  <title> Federated Database Systems for Managing Distributed, Heterogeneous, and Autonomous Databases </title>
  <Autoren>
    <author> Amit Sheth </author>
    <author> James Larson </author>
  </Autoren>
  <year> 1990 </year>
</pub>
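As a minimal sketch of the last step of this example, the two records can be fused attribute by attribute while recording which source each fused value came from. The resolver choices (longest title, longest author list) and all names below are illustrative assumptions, not the tutorial's method.

```python
# Records after schema mapping; attribute values are lists.
source_a = {
    "title": ["Federated Database Systems"],
    "author": ["Amit Sheth", "James Larson"],
}
source_b = {
    "title": ["Federated Database Systems for Managing Distributed, "
              "Heterogeneous, and Autonomous Databases"],
    "author": ["Scheth & Larson"],
    "year": ["1990"],
}


def longest_title(candidates):
    """Hypothetical resolver: prefer the most detailed (longest) title."""
    return max(candidates, key=lambda c: len(c[1][0]))


def most_authors(candidates):
    """Hypothetical resolver: prefer the most fine-grained author list."""
    return max(candidates, key=lambda c: len(c[1]))


def fuse(records, resolvers):
    """Fuse duplicates per attribute and keep the origin of each value."""
    fused = {}
    attributes = sorted({attr for rec in records.values() for attr in rec})
    for attr in attributes:
        candidates = [(src, rec[attr]) for src, rec in records.items() if rec.get(attr)]
        # Default: take the first source that provides any value.
        resolve = resolvers.get(attr, lambda c: c[0])
        source, values = resolve(candidates)
        fused[attr] = {"values": values, "source": source}
    return fused


fused = fuse({"A": source_a, "B": source_b},
             {"title": longest_title, "author": most_authors})
```

Here the title and year are taken from Source B and the authors from Source A, and the `source` field keeps that lineage.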
Completeness, Conciseness, and Correctness
Schema Matching: Same attribute semantics
Duplicate detection: Same real-world entities
Data Fusion: Resolve uncertainties and contradictions
Figure axes: Intensional completeness, Intensional conciseness, Extensional completeness, Extensional conciseness
Schema Matching
Problem
Given two schemata, find all correspondences between their attributes
Difficulties
Schematic heterogeneity (synonyms & homonyms)
Data heterogeneity
n:m mappings
Transformation functions
User interaction
Then: Derive a schema mapping
Duplicate Detection
Problem
Given one or more data sets, find all sets of objects that represent the same real-world entity.
Difficulties
Duplicates are not identical
Similarity measures – Levenshtein, Soundex, Jaccard, etc.
Large volume, cannot compare all pairs
Partitioning strategies – Sorted neighborhood, Blocking, etc.
Figure: CRM1 x CRM2 → Partitioning → Similarity measure → Duplicates / Non-Duplicates
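The two difficulties above can be sketched together: a Levenshtein-based similarity measure decides whether two records match, and a sorted-neighborhood pass limits comparisons to records that are close in sort order. The window size, the threshold, and the function names are assumptions chosen for illustration.

```python
def levenshtein(a, b):
    """Edit distance between two strings (dynamic programming, two rows)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def similarity(a, b):
    """Normalize the edit distance to a similarity in [0, 1]."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest


def sorted_neighborhood(records, key, window=3, threshold=0.8):
    """Compare each record only with its successors inside a sliding window."""
    ordered = sorted(records, key=key)
    pairs = []
    for i, left in enumerate(ordered):
        for right in ordered[i + 1:i + window]:
            if similarity(key(left), key(right)) >= threshold:
                pairs.append((left, right))
    return pairs


names = ["Felix Naumann", "Luna Dong", "Felix Nauman", "Xin Luna Dong"]
duplicates = sorted_neighborhood(names, key=str.lower)
```

With these settings only the two spellings of the first name pair are reported; "Luna Dong" and "Xin Luna Dong" fall below the threshold, which shows how sensitive the result is to the chosen measure.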
Ironically, “Duplicate Detection” has many Duplicates
Doubles
Household matching
Duplicate detection
Entity resolution
Entity clustering
Approximate match
Identity uncertainty
Reference reconciliation
Merge/purge
Hardening soft databases
Householding
Reference matching
Data Fusion
Problem
Given a duplicate, create a single object representation while resolving conflicting data values.
Difficulties
Null values: Subsumption and complementation
Contradictions in data values
Uncertainty & truth: Discover the true value and model uncertainty in this process
Metadata: Preferences, recency, correctness
Lineage: Keep original values and their origin
Implementation in DBMS: SQL, extended SQL, UDFs, etc.
The Field of Data Fusion
Overview
Uncertainty and Contradiction
Uncertainty
NULL value vs. non-NULL value
“Easy” case
Contradiction
Non-NULL value vs. (different) non-NULL value
Semantics of NULL
“unknown”
There is a value, but I do not know it.
E.g.: Unknown date-of-birth
“not applicable”
There is no meaningful value.
E.g.: Spouse for singles
“withheld”
There is a value, but we are not authorized to see it.
E.g.: Private phone line
Classification of Strategies
conflict resolution strategies
Conflict Resolution Functions
Overview
Data Fusion Goals
Source 1(A,B,C), Source 2(A,B,D)
Case                  Source 1   Source 2   Padded to (A,B,C,D)        Fused result
Identical tuples      a, b, -    a, b, -    a, b, -, - | a, b, -, -    a, b, -, -
Subsumed tuples       a, b, c    a, b, -    a, b, c, - | a, b, -, -    a, b, c, -
Complementing tuples  a, b, c    a, b, d    a, b, c, - | a, b, -, d    a, b, c, d
Conflicting tuples    a, b, c    a, e, d    a, b, c, - | a, e, -, d    a, f(b,e), c, d
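The four cases can be told apart mechanically once both tuples are padded to the common schema. The following sketch uses `None` for NULL; the function name `relate` is hypothetical.

```python
def relate(t1, t2):
    """Classify two tuples over the same schema (None stands for NULL)."""
    if len(t1) != len(t2):
        raise ValueError("tuples must share one schema")
    if t1 == t2:
        return "identical"
    # A contradiction needs two different non-NULL values in one attribute.
    if any(x is not None and y is not None and x != y for x, y in zip(t1, t2)):
        return "conflicting"
    t1_adds = any(x is not None and y is None for x, y in zip(t1, t2))
    t2_adds = any(x is None and y is not None for x, y in zip(t1, t2))
    # Both contribute values: complementing; only one does: subsumption.
    return "complementing" if t1_adds and t2_adds else "subsumed"
```

Applied to the padded tuples of the slide, the function returns the four case names in order; only the conflicting case needs a resolution function such as f(b,e).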
Relational Operators – Overview
Identical tuples
UNION, OUTER UNION
Subsumed tuples (uncertainty)
MINIMUM UNION
Complementing tuples (uncertainty)
COMPLEMENT UNION, MERGE
Conflicting tuples (contradiction)
Relational approaches: Match, Group, Fuse, …
Other approaches
Possible worlds, probabilistic answers, consistent answers
Minimum Union
Union: Elimination of exact duplicates
Minimum Union: Elimination of subsumed tuples
Outerunion, followed by Subsumption
A tuple t1 subsumes a tuple t2 if it has the same schema, has fewer NULL values, and coincides with t2 in all of t2's non-NULL values.
Rewriting in SQL using DWH extensions (Windows) and assuming existence of favorable ordering [RPZ04]
Example (- denotes NULL):
A B C      A B D      Outerunion (A B C D)   Minimum Union R (A B C D)
a b c   +  a b -   =  a b c -                a b c -
e f g      e f h      a b - -                e f g -
m n o      m p -      e f g -                e f - h
                      e f - h                m n o -
                      m n o -                m p - -
                      m p - -
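A direct, quadratic sketch of minimum union follows: outer union first, then removal of every tuple that is subsumed by another one. It is not the windowed SQL rewriting of [RPZ04]. The NULL positions in the second relation are an assumption about the slide's example, and `None` stands for NULL.

```python
def outer_union(r1, schema1, r2, schema2):
    """Pad both relations to the combined schema and drop exact duplicates."""
    schema = list(schema1) + [a for a in schema2 if a not in schema1]

    def pad(row, row_schema):
        values = dict(zip(row_schema, row))
        return tuple(values.get(a) for a in schema)

    rows = []
    for row in [pad(r, schema1) for r in r1] + [pad(r, schema2) for r in r2]:
        if row not in rows:
            rows.append(row)
    return schema, rows


def subsumes(t1, t2):
    """t1 has fewer NULLs and agrees with every non-NULL value of t2."""
    fewer_nulls = t1.count(None) < t2.count(None)
    agrees = all(y is None or x == y for x, y in zip(t1, t2))
    return fewer_nulls and agrees


def minimum_union(r1, schema1, r2, schema2):
    schema, rows = outer_union(r1, schema1, r2, schema2)
    kept = [t for t in rows if not any(subsumes(o, t) for o in rows)]
    return schema, kept


r1 = [("a", "b", "c"), ("e", "f", "g"), ("m", "n", "o")]
r2 = [("a", "b", None), ("e", "f", "h"), ("m", "p", None)]
schema, result = minimum_union(r1, "ABC", r2, "ABD")
```

Only the padded tuple (a, b, -, -) is removed, because (a, b, c, -) subsumes it; the two e-tuples survive since each holds a value the other lacks.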
Full Disjunction
Represents all possible combinations of source tuples
Full outer join on all common attributes
All combinations for more than two sources
Minimum union over results
Combines complementing tuples (only inter-source)
Algorithms: [GL94,RU96,CS05]
Example:
A B C   |⋈|   A B D   =   R (A B C D)
a b c         a b         a b c
e f g         e f h       e f g h
k o           m p         m p
k m           k q r       k o
                          k m
                          k q r
Complement Union – Proposal
Elimination of complementing tuples
Outerunion, followed by Complementation (R⇅)
A tuple t1 complements a tuple t2, if it has same schema and coincides in all non-NULL-values.
No known SQL rewriting
Includes duplicate removal and subsumption
Example (- denotes NULL):
A B C      A B D      Outerunion (A B C D)   Complement Union (A B C D)
a b c   +  a b -   =  a b c -                a b c -
e f g      e f h      a b - -                e f g h
m n o      m p -      e f g -                m n o -
                      e f - h                m p - -
                      m n o -
                      m p - -
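Since no SQL rewriting is known, a procedural sketch may help. The version below greedily merges any two tuples that agree wherever both are non-NULL, which also covers exact duplicates and subsumed tuples. Two points are assumptions of this sketch rather than part of the slide's definition: tuples must share at least one non-NULL attribute to be merged, and the greedy order is arbitrary when one tuple is compatible with two mutually conflicting tuples.

```python
def compatible(t1, t2):
    """Agree on all attributes where both are non-NULL (at least one such)."""
    shared = [(x, y) for x, y in zip(t1, t2) if x is not None and y is not None]
    return bool(shared) and all(x == y for x, y in shared)


def combine(t1, t2):
    """Fill the NULLs of t1 with the values of t2."""
    return tuple(x if x is not None else y for x, y in zip(t1, t2))


def complement_union(rows):
    """Greedily merge compatible tuples until no pair is left."""
    result = list(dict.fromkeys(rows))  # drop exact duplicates, keep order
    merged = True
    while merged:
        merged = False
        for i in range(len(result)):
            for j in range(i + 1, len(result)):
                if compatible(result[i], result[j]):
                    result[i] = combine(result[i], result[j])
                    del result[j]
                    merged = True
                    break
            if merged:
                break
    return result


# Outer union of the slide's example; None stands for NULL.
outer = [
    ("a", "b", "c", None), ("e", "f", "g", None), ("m", "n", "o", None),
    ("a", "b", None, None), ("e", "f", None, "h"), ("m", "p", None, None),
]
fused_rows = complement_union(outer)
```

The two m-tuples stay separate because they contradict each other in B, which is exactly the case that complementation leaves to conflict resolution.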
Merge and Prioritized Merge
Mixes Join and Union to a new operator [GPZ01]
Idea: Build two versions for each common attribute, one “favoring” S1, the other “favoring” S2.
Nulls in a source are replaced using COALESCE.
Fuses complementing tuples, but only for inter-source duplicates
Prioritization possible: Removes conflicting tuples from right relation.
Merge and Prioritized Merge
A is real-world ID
Favoring the left relation (|⋈), - denotes NULL:
A B C      A B D      A B C D               A B C D
a b c      a b -      a COAL(b,b) c -       a b c -
e f g  |⋈  e f h   =  e COAL(f,f) g h   =   e f g h
m n o      m p -      m COAL(n,p) o -       m n o -
m n -                 m COAL(n,p) - -       m n - -
q r s                 q r s -               q r s -
Favoring the right relation (⋈|):
A B C      A B D      A B C D               A B C D
a b c      a b -      a COAL(b,b) c -       a b c -
e f g  ⋈|  e f h   =  e COAL(f,f) g h   =   e f g h
m n o      m p -      m COAL(p,n) o -       m p o -
m n -                 m COAL(p,n) - -       m p - -
q r s
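The two halves of the operator can be sketched as outer joins on the real-world ID A, each resolving the common attribute B with COALESCE in favor of one side. The five-tuple left relation and the NULL positions are a reading of the slide's example, and taking the union of both halves as the merge result is an assumption based on the "mixes Join and Union" description; all function names are hypothetical.

```python
def coalesce(*values):
    """Return the first non-NULL (non-None) argument, like SQL COALESCE."""
    return next((v for v in values if v is not None), None)


def left_favoring(r1, r2):
    """R1 left outer join R2 on A; B = COALESCE(R1.B, R2.B)."""
    out = []
    for l in r1:
        partners = [r for r in r2 if r["A"] == l["A"]] or [{}]
        for r in partners:
            out.append((l["A"], coalesce(l["B"], r.get("B")), l["C"], r.get("D")))
    return out


def right_favoring(r1, r2):
    """R1 right outer join R2 on A; B = COALESCE(R2.B, R1.B)."""
    out = []
    for r in r2:
        partners = [l for l in r1 if l["A"] == r["A"]] or [{}]
        for l in partners:
            out.append((r["A"], coalesce(r["B"], l.get("B")), l.get("C"), r["D"]))
    return out


def merge(r1, r2):
    """Union of both versions without exact duplicates."""
    return list(dict.fromkeys(left_favoring(r1, r2) + right_favoring(r1, r2)))


r1 = [dict(A="a", B="b", C="c"), dict(A="e", B="f", C="g"),
      dict(A="m", B="n", C="o"), dict(A="m", B="n", C=None),
      dict(A="q", B="r", C="s")]
r2 = [dict(A="a", B="b", D=None), dict(A="e", B="f", D="h"),
      dict(A="m", B="p", D=None)]
```

The conflicting B values of the m-tuples show the effect of the priority: the left-favoring half keeps n, the right-favoring half keeps p.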
Match Join
Grouping and Aggregation
FuSem
Tool to query and fuse data from diverse data sources [BDN07]
Based on HumMer project [BBB+05].
http://www.hpi.uni-potsdam.de/naumann/sites/fusem/
Explore data and find interesting subsets
Execute, explore and compare five different data fusion semantics, specified in their respective syntax:
SQL (and extensions, such as Subsumption)
Merge
MatchJoin
FuseBy
ConQuer
What else is there?
Consistent Query Answering
Avoid conflicts and report only certain tuples
Those that appear in every repair [FFM05]
“Possible worlds” models
Build all possible solutions, annotated with likelihood
Yes/No/Maybe [DeM89]
Probability value [LSS94]
Probabilistic databases [SD05]
Extend algebra to produce probabilities
Extend query language to query and export probabilities
Overview
Outline
Basic Strategies
Intuitions
Advanced Truth-Discovery Techniques
             S1      S2        S3
Stonebraker  MIT     Berkeley  MIT
Dewitt       MSR     MSR       UWisc
Bernstein    MSR     MSR       MSR
Carey        UCI     AT&T      BEA
Halevy       Google  Google    UW
Trust Accurate Sources
Authority-hub analysis [Kleinberg, 98]
Find Trustable Sources (II)
Find Trustable Sources (III)
Source accuracy: $A(S) = \mathrm{Avg}_{v \in V(S)} P(v)$
$V(S)$: values provided by $S$; $P(v)$: probability of value $v$ being true
How to compute P(v)?
Apply Source Accuracy in Truth Discovery
[Yin et al., 07] [Dong et al., 09a]
Input:
Object O
Dom(O)={v0,v1,…,vn}
Observation Ф on O
Output: Pr(vi true|Ф) for each i=0,…,n (sum up to 1)
According to the Bayes Rule, we need to know Pr(Ф|vi true)
Assuming independence of sources, we need to know Pr(Ф(S)|vi true)
If S provides vi: Pr(Ф(S)|vi true) = A(S)
If S does not provide vi: Pr(Ф(S)|vi true) = (1-A(S))/n
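The computation on this slide can be sketched as follows. The sketch additionally assumes a uniform prior over the n+1 values in Dom(O), so the posterior is the normalized product of the per-source likelihoods; the function name and argument layout are hypothetical.

```python
from math import prod


def value_probabilities(domain, observations, accuracy):
    """Posterior Pr(v true | observations) for every v in the domain.

    domain: the n+1 candidate values of the object
    observations: source -> value provided by that source
    accuracy: source -> A(S)
    """
    n = len(domain) - 1  # number of false values
    likelihood = {
        candidate: prod(
            # A(S) if S provides the candidate, else (1 - A(S)) / n.
            accuracy[s] if v == candidate else (1 - accuracy[s]) / n
            for s, v in observations.items()
        )
        for candidate in domain
    }
    total = sum(likelihood.values())
    return {v: l / total for v, l in likelihood.items()}


posterior = value_probabilities(
    ["UCI", "AT&T", "BEA"],
    {"S1": "UCI", "S2": "AT&T", "S3": "BEA"},
    {"S1": 0.9, "S2": 0.6, "S3": 0.6},
)
```

With one vote per value, the value backed by the most accurate source wins: the likelihoods are 0.9 · 0.2 · 0.2 = 0.036 for UCI and 0.05 · 0.6 · 0.2 = 0.006 for each alternative, giving a posterior of 0.75 for UCI.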
Model and Algorithm [Dong et al., 09a]
Properties
A value provided by more accurate sources has a higher probability to be true
Assuming uniform accuracy, a value provided by more sources has a higher probability to be true
Source accuracy: $A(S) = \mathrm{Avg}_{v \in V(S)} P(v)$
Value probability: $P(v) = \frac{e^{C(v)}}{\sum_{v_0 \in D(O)} e^{C(v_0)}}$
Source trustworthiness: $A'(S) = \ln \frac{n A(S)}{1 - A(S)}$
          Accuracy             Value confidence (Carey)
          S1    S2    S3       UCI    AT&T   BEA
Round 1   .69   .57   .45      1.61   1.61   1.61
Round 2   .81   .63   .41      2.40   1.89   1.42
Round 3   .87   .65   .40      3.05   2.16   1.26
Round 4   .90   .64   .39      3.51   2.23   1.19
Round 5   .93   .63   .40      3.86   2.20   1.18
Round 6   .95   .62   .40      4.17   2.15   1.19
Round 7   .96   .62   .40      4.47   2.11   1.20
Round 8   .97   .61   .40      4.76   2.09   1.20
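The mutual dependence between source accuracy and value probability suggests a fixed-point iteration, sketched below. Several points are assumptions of this sketch: the confidence C(v) is taken to be the sum of A'(S) over the sources providing v, the initial accuracy is 0.8, n is set to 10, and accuracies are clamped to [0.01, 0.99] to keep the logarithm defined. Because of these choices the numbers it produces are not expected to reproduce the table above.

```python
from math import exp, log


def truth_discovery(claims, n_false, rounds=8, initial_accuracy=0.8):
    """Iterate between value probabilities and source accuracies.

    claims: object -> {source: value}; n_false: number of false values n.
    """
    sources = {s for votes in claims.values() for s in votes}
    accuracy = {s: initial_accuracy for s in sources}
    probability = {}
    for _ in range(rounds):
        # A'(S) = ln(n * A(S) / (1 - A(S)))
        score = {s: log(n_false * a / (1 - a)) for s, a in accuracy.items()}
        for obj, votes in claims.items():
            confidence = {}
            for s, v in votes.items():
                confidence[v] = confidence.get(v, 0.0) + score[s]
            # Values nobody provides have C(v) = 0 and contribute e^0 = 1.
            unobserved = n_false + 1 - len(confidence)
            norm = sum(exp(c) for c in confidence.values()) + unobserved
            probability[obj] = {v: exp(c) / norm for v, c in confidence.items()}
        for s in sources:
            probs = [probability[o][v[s]] for o, v in claims.items() if s in v]
            # A(S) = average probability of the values S provides (clamped).
            accuracy[s] = min(max(sum(probs) / len(probs), 0.01), 0.99)
    return accuracy, probability


claims = {
    "Stonebraker": {"S1": "MIT", "S2": "Berkeley", "S3": "MIT"},
    "Dewitt": {"S1": "MSR", "S2": "MSR", "S3": "UWisc"},
    "Bernstein": {"S1": "MSR", "S2": "MSR", "S3": "MSR"},
    "Carey": {"S1": "UCI", "S2": "AT&T", "S3": "BEA"},
    "Halevy": {"S1": "Google", "S2": "Google", "S3": "UW"},
}
accuracy, probability = truth_discovery(claims, n_false=10)
```

On this input S1 agrees with the majority most often, so its accuracy rises over the rounds and its otherwise unsupported claim for Carey ends up as the most probable value, which mirrors the trend in the table.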
Discover Evolving True Values
Figure: value transitions v' → v for object O (at times t', t) and for sources S1, S2, S3, …, Sk
             S1      S2        S3      S4      S5
Stonebraker  MIT     Berkeley  MIT     MIT     MS
Dewitt       MSR     MSR       UWisc   UWisc   UWisc
Bernstein    MSR     MSR       MSR     MSR     MSR
Carey        UCI     AT&T      BEA     BEA     BEA
Halevy       Google  Google    UW      UW      UW
Voting for Independent Sources
Figure labels: Count=2; S7, S8, S9, S10; 3
Voting w. Knowledge of Copying
Figure labels: Count=1; S7, S8, S9, S10; 3
Voting w. Probabilistic Copying
Figure labels: .7; S9, S10; 3
How to detect copying?
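The effect of copying on vote counts can be illustrated with a simple discounting rule. The rule used here is an assumption made for illustration and not necessarily the one behind the figures: a source that copies another source with probability p, and provides the same value, contributes only 1 - p of a vote.

```python
def vote_counts(votes, copy_probability=None):
    """Count votes per value, discounting likely copies.

    votes: source -> value
    copy_probability: (copier, original) -> probability of copying
    """
    copy_probability = copy_probability or {}
    counts = {}
    for source, value in votes.items():
        weight = 1.0
        for (copier, original), p in copy_probability.items():
            # Only a shared value can be the result of copying.
            if copier == source and votes.get(original) == value:
                weight *= 1.0 - p
        counts[value] = counts.get(value, 0.0) + weight
    return counts


votes = {"S7": "x", "S8": "x"}
independent = vote_counts(votes)
known_copy = vote_counts(votes, {("S8", "S7"): 1.0})
probable_copy = vote_counts(votes, {("S8", "S7"): 0.5})
```

Independent sources give the value two votes, a known copier adds nothing, and a probable copier adds a fraction of a vote.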
Considering Dependence
Outline
Web Integration—Google Fusion Tables
Source: Gartner
Typical ETL tools support rule-based fusion
IIS (IBM Information Server)
SSIS (Microsoft’s SQL Server Integration Services)
Etc. See details in survey [Bleiholder and Naumann, 08]
Research DI Systems w. Awareness of Data Conflicts
Research DI systems
Trio: including accuracy and lineage into data model
Information Manifold
Garlic
Disco (Distributed Information Search Component)
etc.
Peer data management systems
Orchestra: allowing multiple viewpoints
Hyper: isolating the minimum amount of data to reach consistency
See details in survey [Bleiholder and Naumann, 08]
Outline
Open Problems
Accuracy of fusion
Efficiency of fusion
Usability of fusion
Interaction with other components of data integration
Accuracy of Fusion (I)
Src3
(**1, Pete Loshin)
(**2, Dennis Suhanovs)
(**3, Zhigang Xiang)
(**4, David Allen, Peter Aiken)…
Accuracy of Fusion (II)
Src3                                 Src4
(**1, Pete Loshin)                   (**1, Pete Loshin)
(**2, Dennis Suhanovs)               (**2, Dennis uhanovs)
(**3, Zhigang Xiang)                 (**3, Zhigang Xiang)
(**4, David Allen, Peter Aiken)…     (**4, Peter Aiken)…
Only first-authors
Efficiency of Fusion (I)
Challenge 4: Incremental fusion
When we have more data sources (e.g., Src4) or lose some data sources, shall we do data fusion from scratch?
Src1                                     Src2
(**1, Pete Loshin)                       (**1, Pete Loshin)
(**2, Dennis Suhanovs)                   (**2, Dennis Suhanovs)
(**3, Zhigang Xiang, Roy A Plastock)     (**3, Zhigang Xiang, Roy Plastock)
(**4, Peter Aiken, David M Allen)…       (**4, Peter Aiken, David Allen)…
Src3                                     Src4
(**1, Pete Loshin)                       (**1, Pete Loshin)
(**2, Dennis Suhanovs)                   (**2, Dennis uhanovs)
(**3, Zhigang Xiang)                     (**3, Zhigang Xiang)
(**4, David Allen, Peter Aiken)…         (**4, Peter Aiken)…
Figure labels: Mediated Schema; D1, D2, D3, D4, D5
Directions: maintain source profiles by sampling; emphasize efficiency.
Usability of Fusion (I)
Challenge 6: Personalized fusion
Express preference on certain sources
Emphasize certain property; e.g., up-to-date vs. high coverage
Use certain formats; e.g., full author list vs. only first author
Current effort:
Function choose(src)
Operator Prioritized-Merge
Directions:
A language to express such user preferences
Algorithms for efficient execution.
Usability of Fusion (II)
Interaction with Other Components of DI (I)
Interaction with Other Components of DI (II)
A Quiz
Interaction with Other Aspects of DI (II)
Conclusions
Foundations
Strategies and functions
Operators
Advanced techniques
Consider accuracy
Consider freshness
Consider dependence
Open problems
Accuracy
Efficiency
Usability
Interaction with other components of DI
Figure labels: Schema Mapping, Duplicate Detection, Data Fusion, Cleaned Data
References
Survey
[BN08] J. Bleiholder, F. Naumann. Data Fusion. ACM Computing Surveys 2009.
Foundations of Fusion
[BBB+05] A. Bilke, J. Bleiholder, C. Böhm, K. Draba, F. Naumann and M. Weis. Automatic Data Fusion with HumMer. VLDB demo
2005.
[BDN07] Jens Bleiholder, Karsten Draba, and Felix Naumann. FuSem - Exploring Different Semantics of Data Fusion (demo) VLDB
demo 2007.
[BN05] J. Bleiholder, F. Naumann. Declarative Data Fusion - Syntax, Semantics, and Implementation. ADBIS 2005.
[CS05] S. Cohen and Y. Sagiv. An incremental algorithm for computing ranked full disjunctions. PODS 2005.
[DeM89] DeMichiel, L. G. Resolving database incompatibility: An approach to performing relational operations over mismatched
domains. TKDE 1989.
[FFM05] Ariel Fuxman, Elham Fazli, Renee J. Miller. ConQuer: Efficient Management of Inconsistent Databases. SIGMOD 2005.
[GL94] C. A. Galindo-Legaria. Outerjoins as disjunctions. SIGMOD 1994.
[GPZ01] S. Greco, L. Pontieri, and E. Zumpano. Integrating and managing conflicting data. International Andrei Ershov Memorial
Conference on Perspectives of System Informatics, 2001.
[LSS94] Lim, E.-P., Srivastava, J., and Shekhar, S. Resolving attribute incompatibility in database integration: An evidential
reasoning approach. ICDE 1994.
[RPZ04] J. Rao, H. Pirahesh, and C. Zuzarte. Canonical abstraction for outerjoin optimization. SIGMOD 2004.
[RU96] A. Rajaraman and J. D. Ullman. Integrating information by outerjoins and full disjunctions. PODS 1996.
[SD05] Dan Suciu and Nilesh Dalvi. Probabilistic Databases. Tutorial at SIGMOD 2005.
[YÖ99] L. L. Yan and M. T. Özsu. Conflict tolerant queries in AURORA. CoopIS 1999.
References (con’t)
[LR04] A. Labrinidis and N. Roussopoulos. Exploring the tradeoff between performance and data freshness in database-driven
web servers. VLDB J., 13(3):240–255, 2004.
[OW05] C. Olston and J. Widom. Efficient monitoring and querying of distributed, dynamic data via approximate replication.
IEEE Data Eng. Bull., 28(1):11–18, 2005.
[SL03] A. Singh and L. Liu. TrustMe: anonymous management of trust relationships in decentralized P2P systems. In IEEE Intl. Conf.
on Peer-to-Peer Computing, 2003.
[TB01] D. Theodoratos and M. Bouzeghoub. Data currency quality satisfaction in the design of a data warehouse. Int. J.
Cooperative Inf. Syst., 10(3):299–326, 2001.
[WM07] M. Wu and A. Marian. Corroborating answers from multiple web sources. In Proc. of WebDB, 2007.
[YHY07] X. Yin, J. Han, and P. S. Yu. Truth discovery with multiple conflicting information providers on the Web. In Proc. of SIGKDD, 2007.