Retrieval-Based Transformer for Table Augmentation

Michael Glass, Xueqing Wu, Ankita Rajaram Naik, Gaetano Rossiello, Alfio Gliozzo


Abstract
Data preparation, also called data wrangling, is considered one of the most expensive and time-consuming steps when performing analytics or building machine learning models. Preparing data typically involves collecting and merging data from complex heterogeneous, and often large-scale data sources, such as data lakes. In this paper, we introduce a novel approach toward automatic data wrangling in an attempt to alleviate the effort of end-users, e.g. data analysts, in structuring dynamic views from data lakes in the form of tabular data. Given a corpus of tables, we propose a retrieval augmented transformer model that is self-trained for the table augmentation tasks of row/column population and data imputation. Our self-learning strategy consists in randomly ablating tables from the corpus and training the retrieval-based model with the objective of reconstructing the partial tables given as input with the original values or headers. We adopt this strategy to first train the dense neural retrieval model encoding portions of tables to vectors, and then the end-to-end model trained to perform table augmentation tasks. We test on EntiTables, the standard benchmark for table augmentation, as well as introduce a new benchmark to advance further research: WebTables. Our model consistently and substantially outperforms both supervised statistical methods and the current state-of-the-art transformer-based models.
Anthology ID:
2023.findings-acl.348
Volume:
Findings of the Association for Computational Linguistics: ACL 2023
Month:
July
Year:
2023
Address:
Toronto, Canada
Editors:
Anna Rogers, Jordan Boyd-Graber, Naoaki Okazaki
Venue:
Findings
SIG:
Publisher:
Association for Computational Linguistics
Note:
Pages:
5635–5648
Language:
URL:
https://aclanthology.org/2023.findings-acl.348
DOI:
10.18653/v1/2023.findings-acl.348
Bibkey:
Cite (ACL):
Michael Glass, Xueqing Wu, Ankita Rajaram Naik, Gaetano Rossiello, and Alfio Gliozzo. 2023. Retrieval-Based Transformer for Table Augmentation. In Findings of the Association for Computational Linguistics: ACL 2023, pages 5635–5648, Toronto, Canada. Association for Computational Linguistics.
Cite (Informal):
Retrieval-Based Transformer for Table Augmentation (Glass et al., Findings 2023)
Copy Citation:
PDF:
https://aclanthology.org/2023.findings-acl.348.pdf