Bayesian Data Cleaning for Web Data

Hu, Yuheng; De, Sushovan; Chen, Yi; Kambhampati, Subbarao

arXiv:1204.3677 (cs)

[Submitted on 17 Apr 2012]

Abstract: Data cleaning is a long-standing problem that is growing in importance with the mass of uncurated web data. State-of-the-art approaches for handling inconsistent data are systems that learn and use conditional functional dependencies (CFDs) to rectify data. These methods learn data patterns (CFDs) from a clean sample of the data and use them to rectify the dirty or inconsistent data. While getting a clean training sample is feasible in enterprise data scenarios, it is infeasible in web databases, where there is no separate curated data. CFD-based methods are unfortunately particularly sensitive to noise; we empirically demonstrate that the number of CFDs learned falls drastically with even a small amount of noise. To overcome this limitation, we propose a fully probabilistic framework for cleaning data. Our approach involves learning both the generative and error (corruption) models of the data and using them to clean the data. For generative models, we learn Bayes networks from the data. For error models, we consider a maximum entropy framework for combining multiple error processes. The generative and error models are learned directly from the noisy data. We present the details of the framework and demonstrate its effectiveness in rectifying web data.
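The abstract does not spell out how the generative and error models are combined, so the following is only a minimal sketch of one plausible reading: score each candidate clean tuple by the product of a prior over clean tuples and the likelihood of the observed tuple given that candidate, then pick the highest-scoring candidate. Everything here is an assumption made for illustration. The function names (`learn_prior`, `error_likelihood`, `clean`) are hypothetical, an empirical tuple frequency table stands in for the learned Bayes network, a single string-similarity score stands in for the maximum entropy combination of error processes, and the car data is invented.

```python
from collections import Counter
from difflib import SequenceMatcher


def learn_prior(tuples):
    """Estimate a distribution over tuples from the (noisy) data itself.

    Assumption: plain relative frequencies stand in for the Bayes
    network that the abstract says is learned as the generative model.
    """
    counts = Counter(tuples)
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}


def error_likelihood(observed, candidate):
    """Toy error model: product of per-attribute string similarities.

    Assumption: this single score replaces the maximum entropy
    combination of multiple error processes named in the abstract.
    """
    score = 1.0
    for obs_value, cand_value in zip(observed, candidate):
        similarity = SequenceMatcher(None, obs_value, cand_value).ratio()
        score *= max(similarity, 1e-6)  # floor so one mismatch is not fatal
    return score


def clean(observed, prior):
    """Return the candidate maximizing prior * error likelihood."""
    return max(
        prior,
        key=lambda cand: prior[cand] * error_likelihood(observed, cand),
    )


# Invented noisy data: the last row contains a typo ("Civik").
rows = [
    ("Honda", "Civic"),
    ("Honda", "Civic"),
    ("Honda", "Civic"),
    ("Honda", "Accord"),
    ("Honda", "Accord"),
    ("Toyota", "Camry"),
    ("Toyota", "Camry"),
    ("Honda", "Civik"),
]

prior = learn_prior(rows)
cleaned = clean(("Honda", "Civik"), prior)
unchanged = clean(("Toyota", "Camry"), prior)
print(cleaned, unchanged)
```

In this toy setup the misspelled tuple is itself a candidate, since the prior is learned from the noisy data, but the frequent and similar tuple outscores it. A tuple that is already frequent scores highest against itself and is left as it is.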

Comments:	6 pages, 7 figures
Subjects:	Databases (cs.DB); Information Retrieval (cs.IR)
ACM classes:	H.3.3
Cite as:	arXiv:1204.3677 [cs.DB]
	(or arXiv:1204.3677v1 [cs.DB] for this version)
	https://doi.org/10.48550/arXiv.1204.3677

Submission history

From: Sushovan De
[v1] Tue, 17 Apr 2012 00:59:53 UTC (1,078 KB)

