DomainNet: Homograph Detection for Data Lake Disambiguation

Leventidis, Aristotelis; Di Rocco, Laura; Gatterbauer, Wolfgang; Miller, Renée J.; Riedewald, Mirek


arXiv:2103.09940 (cs)

[Submitted on 17 Mar 2021 (v1), last revised 23 Mar 2021 (this version, v2)]


Abstract: Modern data lakes are deeply heterogeneous in the vocabulary that is used to describe data. We study a problem of disambiguation in data lakes: how can we determine if a data value occurring more than once in the lake has different meanings and is therefore a homograph? While word and entity disambiguation have been well studied in computational linguistics, data management and data science, we show that data lakes provide a new opportunity for disambiguation of data values since they represent a massive network of interconnected values. We investigate to what extent this network can be used to disambiguate values. DomainNet uses network-centrality measures on a bipartite graph whose nodes represent values and attributes to determine, without supervision, if a value is a homograph. A thorough experimental evaluation demonstrates that state-of-the-art techniques in domain discovery cannot be re-purposed to compete with our method. Specifically, using a domain discovery method to identify homographs has a precision and a recall of 38% versus 69% with our method on a synthetic benchmark. By applying a network-centrality measure to our graph representation, DomainNet achieves a good separation between homographs and data values with a unique meaning. On a real data lake our top-200 precision is 89%.
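
The idea in the abstract, scoring values by a centrality measure on a value-attribute bipartite graph, can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not say which centrality measure is used, so betweenness centrality (computed with Brandes' algorithm) is assumed here for illustration only, and the column names, data values, and function names are all hypothetical. The intuition is that a homograph such as "Jaguar" bridges otherwise unrelated groups of attributes, so many shortest paths pass through it.

```python
from collections import defaultdict, deque

# Hypothetical toy data lake: attribute (column) name -> values in that column.
columns = {
    "zoo.animal": ["Jaguar", "Puma", "Panther"],
    "sponsors.species": ["Puma", "Panther", "Lemur"],
    "cars.make": ["Jaguar", "Toyota", "Fiat"],
    "dealers.brand": ["Toyota", "Fiat", "Jaguar"],
}


def build_bipartite_graph(columns):
    """Connect each value node to every attribute node it occurs in."""
    graph = defaultdict(set)
    for attribute, values in columns.items():
        attr_node = ("attr", attribute)
        for value in values:
            value_node = ("value", value)
            graph[attr_node].add(value_node)
            graph[value_node].add(attr_node)
    return dict(graph)


def betweenness_centrality(graph):
    """Unnormalized betweenness for an unweighted, undirected graph (Brandes)."""
    score = dict.fromkeys(graph, 0.0)
    for source in graph:
        # Breadth-first search that counts shortest paths from `source`.
        order = []
        preds = {node: [] for node in graph}
        sigma = dict.fromkeys(graph, 0)
        sigma[source] = 1
        dist = {source: 0}
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in graph[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate path dependencies, farthest nodes first.
        delta = dict.fromkeys(graph, 0.0)
        while order:
            w = order.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != source:
                score[w] += delta[w]
    # Each unordered pair was counted once per endpoint, so halve the totals.
    return {node: total / 2.0 for node, total in score.items()}


def rank_values(columns):
    """Return (value, score) pairs for value nodes only, highest score first."""
    scores = betweenness_centrality(build_bipartite_graph(columns))
    values = [(name, s) for (kind, name), s in scores.items() if kind == "value"]
    return sorted(values, key=lambda pair: pair[1], reverse=True)


ranking = rank_values(columns)
for value, score in ranking:
    print(f"{value:10s} {score:.2f}")
```

In this toy lake "Jaguar" ranks first because it is the only node linking the animal columns to the car columns, while "Lemur", which occurs in a single column, scores zero. A real data lake would need a scalable or approximate centrality computation, which this sketch does not attempt.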

Comments:	Full version of paper appearing in EDBT 2021
Subjects:	Databases (cs.DB)
Cite as:	arXiv:2103.09940 [cs.DB]
	(or arXiv:2103.09940v2 [cs.DB] for this version)
	https://doi.org/10.48550/arXiv.2103.09940

Submission history

From: Aristotelis Leventidis
[v1] Wed, 17 Mar 2021 22:43:56 UTC (536 KB)
[v2] Tue, 23 Mar 2021 01:32:33 UTC (514 KB)
