
TURL: Table Understanding through Representation Learning

Xiang Deng∗ Huan Sun∗ Alyssa Lees


The Ohio State University The Ohio State University Google Research
Columbus, Ohio Columbus, Ohio New York, NY
[email protected] [email protected] [email protected]

You Wu Cong Yu
Google Research Google Research
New York, NY New York, NY
[email protected] [email protected]

ABSTRACT
Relational tables on the Web store a vast amount of knowledge. Owing to the wealth of such tables, there has been tremendous progress on a variety of tasks in the area of table understanding. However, existing work generally relies on heavily-engineered task-specific features and model architectures. In this paper, we present TURL, a novel framework that introduces the pre-training/fine-tuning paradigm to relational Web tables. During pre-training, our framework learns deep contextualized representations on relational tables in an unsupervised manner. Its universal model design with pre-trained representations can be applied to a wide range of tasks with minimal task-specific fine-tuning.
Specifically, we propose a structure-aware Transformer encoder to model the row-column structure of relational tables, and present a new Masked Entity Recovery (MER) objective for pre-training to capture the semantics and knowledge in large-scale unlabeled data. We systematically evaluate TURL with a benchmark consisting of 6 different tasks for table understanding (e.g., relation extraction, cell filling). We show that TURL generalizes well to all tasks and substantially outperforms existing methods in almost all instances.

PVLDB Reference Format:
Xiang Deng, Huan Sun, Alyssa Lees, You Wu, and Cong Yu. TURL: Table Understanding through Representation Learning. PVLDB, 14(3): 307 - 319, 2021.
doi:10.14778/3430915.3430921

PVLDB Artifact Availability:
The source code, data, and/or other artifacts have been made available at https://github.com/sunlab-osu/TURL.

* Corresponding authors.
This work is licensed under the Creative Commons BY-NC-ND 4.0 International License. Visit https://creativecommons.org/licenses/by-nc-nd/4.0/ to view a copy of this license. For any use beyond those covered by this license, obtain permission by emailing [email protected]. Copyright is held by the owner/author(s). Publication rights licensed to the VLDB Endowment.
Proceedings of the VLDB Endowment, Vol. 14, No. 3 ISSN 2150-8097.
doi:10.14778/3430915.3430921

1 INTRODUCTION
Relational tables are in abundance on the Web and store a large amount of knowledge, often with key entities in one column and attributes in the others. Over the past decade, various large-scale collections of such tables have been aggregated [4, 6, 7, 25]. For example, Cafarella et al. [6, 7] reported 154M relational tables out of a total of 14.1 billion tables in 2008. More recently, Bhagavatula et al. [4] extracted 1.6M high-quality relational tables from Wikipedia. Owing to the wealth and utility of these datasets, various tasks such as table interpretation [4, 16, 29, 35, 47, 48], table augmentation [1, 6, 12, 42, 45, 46], etc., have made tremendous progress in the past few years.

Figure 1: An example of a relational table from Wikipedia.

However, previous work such as [4, 45, 46] often relies on heavily-engineered task-specific methods such as simple statistical/language features or straightforward string matching. These techniques suffer from several disadvantages. First, simple features only capture shallow patterns and often fail to handle the flexible schema and varied expressions in Web tables. Second, task-specific features and model architectures require effort to design and do not generalize well across tasks.
Recently, the pre-training/fine-tuning paradigm has achieved notable success on unstructured text data. Advanced language models such as BERT [15] can be pre-trained on large-scale unsupervised text and subsequently fine-tuned on downstream tasks using task-specific supervision. In contrast, little effort has been extended to the study of such paradigms on relational tables. Our work fills this research gap.
Promising results in some table based tasks were achieved by the representation learning model of [13]. This work serializes a table into a sequence of words and entities (similar to text data) and learns embedding vectors for words and entities using Word2Vec [28]. However, [13] cannot generate contextualized representations, i.e., it does not consider varied use of words/entities in different contexts and only produces a single fixed embedding vector per word/entity. In addition, shallow neural models like Word2Vec have relatively limited learning capabilities, which hinders the capture of the complex semantic knowledge contained in relational tables.
We propose TURL, a novel framework for learning deep contextualized representations on relational tables via pre-training in an unsupervised manner and task-specific fine-tuning.
There are two main challenges in the development of TURL: (1) Relational table encoding. Existing neural network encoders are designed for linearized sequence input and are a good fit for unstructured texts. However, data in relational tables is organized in a semi-structured format. Moreover, a relational table contains multiple components including the table caption, headers and cell values. The challenge is to develop a means of modeling the row-and-column structure as well as integrating the heterogeneous information from different components of the table. (2) Factual knowledge modeling. Pre-trained language models like BERT [15] and ELMo [31] focus on modeling the syntactic and semantic characteristics of word use in natural sentences. However, relational tables contain a vast amount of factual knowledge about entities, which cannot be captured by existing language models directly. Effectively modelling such knowledge in TURL is a second challenge.
To address the first challenge, we encode information from different table components into separate input embeddings and fuse them together. We then employ a structure-aware Transformer [38] encoder with masked self-attention. The conventional Transformer model is a bi-directional encoder, so each element (i.e., token/entity) can attend to all other elements in the sequence. We explicitly model the row-and-column structure by restraining each element to only aggregate information from other structurally related elements. To achieve this, we build a visibility matrix based on the table structure and use it as an additional mask for the self-attention layer.
For the second challenge, we first learn embeddings for each entity during pre-training. We then model the relation between entities in the same row or column with the assistance of the visibility matrix. Finally, we propose a Masked Entity Recovery (MER) pre-training objective. The technique randomly masks out entities in a table with the objective of recovering the masked items based on other entities and the table context (e.g., caption/header). This encourages the model to learn factual knowledge from the tables and encode it into entity embeddings. In addition, we utilize the entity mention by keeping it as additional information for a certain percentage of masked entities. This helps our model build connections between words and entities. We also adopt the Masked Language Model (MLM) objective from BERT, which aims to model the complex characteristics of word use in table metadata.
We pre-train our model on 570K relational tables from Wikipedia to generate contextualized representations for tokens and entities in the relational tables. We then fine-tune our model for specific downstream tasks using task-specific labeled data. A distinguishing feature of TURL is its universal architecture across different tasks - only minimal modification is needed to cope with each downstream task. To facilitate research in this direction, we compiled a benchmark that consists of 6 diverse tasks, including entity linking, column type annotation, relation extraction, row population, cell filling and schema augmentation. We created new datasets in addition to including results for existing datasets when publicly available. Experimental results show that TURL substantially outperforms existing task-specific and shallow Word2Vec based methods.
Our contributions are summarized as follows:
- To the best of our knowledge, TURL is the first framework that introduces the pre-training/fine-tuning paradigm to relational Web tables. The pre-trained representations along with the universal model design save tremendous effort on engineering task-specific features and architectures.
- We propose a structure-aware Transformer encoder to model the structure information in relational tables. We also present a novel Masked Entity Recovery (MER) pre-training objective to learn the semantics as well as the factual knowledge about entities in relational tables.
- To facilitate research in this direction, we present a benchmark that consists of 6 different tasks for table interpretation and augmentation. We show that TURL generalizes well to various tasks and substantially outperforms existing models. Our source code, benchmark, as well as pre-trained models will be available online.

2 PRELIMINARY
We now present our data model and give a formal task definition.
In this work, we focus on relational Web tables and are most interested in the factual knowledge about entities. Each table T in the corpus is associated with the following: (1) Table caption C, which is a short text description summarizing what the table is about. When the page title or section title of a table is available, we concatenate these with the table caption. (2) Table headers H, which define the table schema. (3) Topic entity e_t, which describes what the table is about and is usually extracted from the table caption or page title. (4) Table cells E containing entities. Each entity cell e in E contains a specific object with a unique identifier. For each cell, we define the entity as e = (e^e, e^m), where e^e is the specific entity linked to the cell and e^m is the entity mention (i.e., the text string).
(C, H, e_t) is also known as the table metadata and E is the actual table content. Notations used in the data model are summarized in Table 1.

Table 1: Summary of notations for our table data.
Symbol | Description
T | A relational table, T = (C, H, E, e_t)
C | Table caption (a sequence of tokens)
H | Table schema, H = {h_0, ..., h_i, ..., h_m}
h_i | A column header (a sequence of tokens)
E | Columns in the table that contain entities
e_t | The topic entity of the table, e_t = (e_t^e, e_t^m)
e | An entity cell, e = (e^e, e^m)

Explicitly, we study unsupervised representation learning on relational Web tables, which is defined as follows.
Definition 2.1. Given a relational Web table corpus, our representation learning task aims to learn, in an unsupervised manner, a task-agnostic contextualized vector representation for each token in all table captions C's and headers H's and for each entity (i.e., all entity cells E's and topic entities e_t's).
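To make the data model concrete, the sketch below encodes the notation of Table 1 as plain Python dataclasses. The class and field names (EntityCell, RelationalTable, etc.) are our own illustration, not part of TURL's released code, and the toy instance loosely mirrors Figure 1.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class EntityCell:
    """An entity cell e = (e^e, e^m): a linked entity id plus its surface mention."""
    entity_id: Optional[str]   # e^e: unique identifier in the entity vocabulary (None if unlinked)
    mention: str               # e^m: the raw cell text

@dataclass
class RelationalTable:
    """A relational table T = (C, H, E, e_t) as defined in Section 2."""
    caption: List[str]                      # C: caption tokens (page/section title prepended when available)
    headers: List[List[str]]                # H: one token sequence per column header
    topic_entity: Optional[EntityCell]      # e_t: what the table is about
    entity_columns: List[List[EntityCell]]  # E: entity cells, grouped by column

# A toy instance; the identifiers and values are illustrative only.
table = RelationalTable(
    caption="national film award for best direction".split(),
    headers=[["year"], ["film"], ["director"]],
    topic_entity=EntityCell("National_Film_Award_for_Best_Direction",
                            "National Film Award for Best Direction"),
    entity_columns=[[EntityCell("15th_National_Film_Awards", "15th")],
                    [EntityCell("Chiriyakhana", "Chiriyakhana")],
                    [EntityCell("Satyajit_Ray", "Satyajit Ray")]],
)
print(len(table.entity_columns), "entity columns")
```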
3 RELATED WORK
Representation Learning. The pre-training/fine-tuning paradigm has drawn tremendous attention in recent years. Extensive effort has been devoted to the development of unsupervised representation learning methods for both unstructured text and structured knowledge bases, which in turn can be utilized for a wide variety of downstream tasks via fine-tuning.
Earlier work, including Word2Vec [28] and GloVe [30], pre-train distributed representations for words on large collections of documents. The resulting representations are widely used as input embeddings and offer significant improvements over randomly initialized parameters. However, pre-trained word embeddings suffer from word polysemy: they cannot model varied word use across linguistic contexts. This complexity motivated the development of contextualized word representations [15, 31, 43]. Instead of learning fixed embeddings per word, these works construct language models that learn the joint probabilities of sentences. Such pre-trained language models have had huge success and yield state-of-the-art results on various NLP tasks [39].
Similarly, unsupervised representation learning has also been adopted in the space of structured data like knowledge bases (KB) and databases. Entities and relations in KB have been embedded into continuous vector spaces that still preserve the inherent structure of the KB [37]. These entity and relation embeddings are utilized by a variety of tasks, such as KB completion [5, 40], relation extraction [34, 41], entity resolution [18], etc. Similarly, [17] learned embeddings for heterogeneous data in databases and used it for data integration tasks.
More recently, there has been a corpus of work incorporating knowledge information into pre-trained language models [32, 49]. ERNIE [49] injects knowledge base information into a pre-trained BERT model by utilizing pre-trained KB embeddings and a denoising entity autoencoder objective. The experimental results demonstrate that knowledge information is extremely helpful for tasks such as entity linking, entity typing and relation extraction.
Despite the success of representation learning on text and KB, few works have thoroughly explored contextualized representation learning on relational Web tables. Pre-trained language models are directly adopted in [26] for entity matching. Two recent papers from the NLP community [20, 44] study pre-training on Web tables to assist in semantic parsing or question answering tasks on tables. In this work, we introduce TURL, a new methodology for learning deep contextualized representations for relational Web tables that preserve both semantic and knowledge information. In addition, we conduct comprehensive experiments on a much wider range of table-related tasks.
Table Interpretation. The Web stores large amounts of knowledge in relational tables. Table interpretation aims to uncover the semantic attributes of the data contained in relational tables, and transform this information into machine understandable knowledge. This task is usually accomplished with help from existing knowledge bases. In turn, the extracted knowledge can be used for KB construction and population.
There are three main tasks for table interpretation: entity linking, column type annotation and relation extraction [4, 47]. Entity linking is the task of detecting and disambiguating specific entities mentioned in a table. Since relational tables are centered around entities, entity linking is a key step for table interpretation, and a fundamental component to many table-related tasks [47]. [4] employed a graphical model, and used a collective classification technique to optimize a global coherence score for a set of entities in a table. [35] presented the T2K framework, which is an iterative matching approach that combines both schema and entity matching. More recently, [16] introduced a hybrid method that combines both entity lookup and entity embeddings, which resulted in superior performance on various benchmarks.
Column type annotation and relation extraction both work with table columns. The former aims to annotate columns with KB types while the latter intends to use KB predicates to interpret relations between column pairs. Prior work has generally coupled these two tasks with entity linking [29, 35, 48]. After linking cells to entities, the types and relations associated with the entities in KB can then be used to annotate columns. In recent work, column annotation without entity linking has been explored [10, 11, 21]. These works modify text classification models to fit relational tables and have shown promising results. Moreover, relation extraction on web tables has also been studied for KB augmentation [8, 14, 36].
Table Augmentation. Tables are a popular data format to organize and present relational information. Users often have to manually compose tables when gathering information. It is desirable to offer some intelligent assistance to the user, which motivates the study of table augmentation [45]. Table augmentation refers to the task of expanding a seed query table with additional data. Specifically, for relational tables this can be divided into three sub-tasks: row population for retrieving entities for the subject column [12, 45], cell filling that fills the cell values for given subject entities [1, 42, 46], and schema augmentation that recommends headers to complete the table schema [6, 45]. For row population tasks, [12] searches for complement tables that are semantically related to seed entities and the top ranked tables are used for population. [45] further incorporates knowledge base information with a table corpus, and develops a generative probabilistic model to rank candidate entities with entity similarity features. For cell filling, [42] uses the query table to search for matching tables, and extracts attribute values from those tables. More recently, [46] proposed the CellAutoComplete framework that makes use of a large table corpus and a knowledge base as data sources, and incorporates preprocessing, candidate value finding, and value ranking components. In terms of schema augmentation, [6] tackles this problem by utilizing an attribute correlation statistics database (ACSDb) collected from a table corpus. [45] utilizes a similar approach to the row population techniques and ranks candidate headers with sets of features.
Existing benchmarks. Several benchmarks have been proposed for table interpretation: (1) T2Dv2 [25] proposed in 2016 contains 779 tables from various websites. It contains 546 relational tables, with 25119 entity annotations, 237 table-to-class annotations and 618 attribute-to-property annotations. (2) Limaye et al. [27] proposed a benchmark in 2010 which contains 296 tables from Wikipedia. It was used in [16] for entity linking, and was also used in [11] for column type annotation. (3) Efthymiou et al. [16] created a benchmark (referred to as "WikiGS" in our experiments) that includes 485,096 tables from Wikipedia. WikiGS was originally
used for entity linking with 4,453,329 entity matches. [11] further annotated a subset of it containing 620 entity columns with 31 DBpedia types and used it for column type annotation. (4) The recent SemTab 2019 [23] challenge also aims at benchmarking systems that match tabular data to KBs, including three tasks, i.e., assigning a semantic type to a column, matching a cell to an entity, and assigning a property to the relationship between two columns. It used sampled tables from T2Dv2 [25] and WikiGS [16] in the first two rounds, and automatically generated tables in later rounds.
In contrast to table interpretation, few benchmarks have been released for table augmentation. Zhang et al. [45] studied row population and schema augmentation with 2000 randomly sampled Wikipedia tables in total for validation and testing. [46] curated a test collection with 200 columns containing 1000 cells from Wikipedia tables for evaluating cell filling.
Although these benchmarks have been used in various recent studies, they still suffer from a few shortcomings: (1) They are typically small sets of sampled tables with limited annotations. (2) SemTab 2019 contains a large number of instances; however, most of them are automatically generated and lack metadata/context of the Web tables. In this work, we compile a larger benchmark covering both table interpretation and table augmentation tasks. We also use some of these existing datasets for more comprehensive evaluation. By leveraging large-scale relational tables on Wikipedia and a curated KB, we ensure both the size and quality of our dataset.

Figure 2: Overview of our TURL framework.

4 METHODOLOGY
In this section, we introduce our TURL framework for unsupervised representation learning on relational tables. TURL is first trained on an unlabeled relational Web table corpus with pre-training objectives carefully designed to learn word semantics as well as relational knowledge between entities. The model architecture is general and can be applied to a wide range of downstream tasks with minimal modifications. Moreover, the pre-training process alleviates the need for large-scale labeled data for each downstream task.

4.1 Model Architecture
Figure 2 presents an overview of TURL, which consists of three modules: (1) an embedding layer to convert different components of an input table into input embeddings, (2) N stacked structure-aware Transformer [38] encoders to capture the textual information and relational knowledge, and (3) a final projection layer for pre-training objectives. Figure 3 shows an input-output example.

4.2 Embedding Layer
Given a table T = (C, H, E, e_t), we first linearize the input into a sequence of tokens and entity cells by concatenating the table metadata and scanning the table content row by row. The embedding layer then converts each token in C and H and each entity in E and e_t into an embedding representation.
Input token representation. For each token w, its vector representation is obtained as follows:
    x^t = w + t + p.    (1)
Here w is the word embedding vector, t is the type embedding vector that differentiates whether token w is in the table caption or a header, and p is the position embedding vector that provides relative position information for a token within the caption or a header.
Input entity representation. For each entity cell e = (e^e, e^m) (same for the topic entity e_t), we fuse the information from the linked entity e^e and the entity mention e^m together, and use an additional type embedding vector t^e to differentiate three types of entity cells (i.e., subject/object/topic entities). Specifically, we calculate the input entity representation x^e as:
    x^e = LINEAR([e^e; e^m]) + t^e;    (2)
    e^m = MEAN(w_1, w_2, ..., w_j, ...).    (3)
Here e^e is the entity embedding learned during pre-training. To represent the entity mention e^m, we use the average of its word embeddings w_j. LINEAR is a linear layer to fuse e^e and e^m.
A sequence of token and entity representations (x^t's and x^e's) is then fed into the next module of TURL, a structure-aware Transformer encoder, which will produce contextualized representations.
4.3 Structure-aware Transformer Encoder
We choose Transformer [38] as our base encoder block, since it has been widely used in pre-trained language models [15, 33] and achieves superior performance on various natural language processing tasks [39]. Due to space constraints, we only briefly introduce the conventional Transformer encoder and refer readers to [38] for more details. Finally, we present a detailed explanation of our proposed visibility matrix for modeling table structure.
Each Transformer block is composed of a multi-head self-attention layer followed by a point-wise, fully connected layer [38]. Specifically, we calculate the multi-head attention as follows:
    MultiHead(h) = [head_1; ...; head_i; ...; head_k] W^O;
    head_i = Attention(h W_i^Q, h W_i^K, h W_i^V);    (4)
    Attention(Q, K, V) = Softmax((Q K^T / sqrt(d)) M) V.
Here h (an n x d_model matrix) is the hidden state output from the previous Transformer layer or the input embedding layer, and n is the input sequence length. 1/sqrt(d) is the scaling factor. W_i^Q, W_i^K, W_i^V (each d_model x d) and W^O (kd x d_intermediate) are parameter matrices. For each head, we have d = d_model / k, where k is the number of attention heads. M (an n x n matrix) is the visibility matrix, which we detail next.
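Below is a small sketch of scaled dot-product attention restricted by a visibility matrix, in the spirit of Eqn. 4. It is not the authors' implementation; in particular, the mask is realized additively (blocked positions get a large negative score before the softmax), which is a standard way to implement such masks, whereas the equation above writes the mask inside the softmax directly.

```python
import math
import torch
import torch.nn.functional as F

def masked_attention(Q, K, V, visibility):
    """Scaled dot-product attention restricted by a visibility matrix (cf. Eqn. 4).
    Q, K, V: [n, d]; visibility: [n, n] binary, visibility[i, j] = 1 iff element j
    is visible to element i. Blocked positions receive -inf before the softmax."""
    scores = Q @ K.T / math.sqrt(Q.size(-1))
    scores = scores.masked_fill(visibility == 0, float("-inf"))
    return F.softmax(scores, dim=-1) @ V

n, d = 5, 8
Q, K, V = torch.randn(n, d), torch.randn(n, d), torch.randn(n, d)
visibility = torch.eye(n)   # toy mask: each element sees only itself...
visibility[0, :] = 1        # ...except element 0 (think: a caption token), which sees everything
print(masked_attention(Q, K, V, visibility).shape)  # torch.Size([5, 8])
```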
Figure 3: Illustration of the model input-output. The input table is first transformed into a sequence of tokens and entity cells and processed by the structure-aware Transformer encoder as described in Section 4.4. We then get contextualized representations for the table and use them for pre-training. Here [15th] (which means 15th National Film Awards), [Satyajit], ... are linked entity cells.

Figure 4: Graphical illustration of the visibility matrix (symmetric).

Figure 5: Graphical illustration of masked self-attention by our visibility matrix. Each token/entity in a table can only attend to its directly connected neighbors (shown as edges here).

Visibility matrix. To interpret relational tables and extract the knowledge embedded in them, it is important to model the row-column structure. For example, in Figure 1, [Satyajit] and [Chiriyakhana] are related because they are in the same row, which implies that [Satyajit] directs [Chiriyakhana]. In contrast, [Satyajit] should not be related to [Pratidwandi]. Similarly, [Hindi] is a "language" and its representation has little to do with the header "Film". We propose a visibility matrix M to model such structure information in a relational table. Figure 4 shows an example of M.
Our visibility matrix acts as an attention mask so that each token (or entity) can only aggregate information from other structurally related tokens/entities during the self-attention calculation. M is a symmetric binary matrix with M_{i,j} = 1 if and only if element_j is visible to element_i. An element here can be a token in the caption or a header, or an entity in a table cell. Specifically, we define M as follows:
- If element_i is the topic entity or a token in the table caption, then M_{i,j} = 1 for all j. The table caption and topic entity are visible to all components of the table.
- If element_i is a token or an entity in the table and element_j is a token or an entity in the same row or the same column, then M_{i,j} = 1. Entities and text content in the same row or the same column are visible to each other.
Example 4.1. Use Figure 5 as an example. "National Film ... recipients ..." are tokens from the caption and [National Film Award for Best Direction] is the topic entity, and hence they can aggregate information from all other elements. "Year" is a token from a column header, and it can attend to all elements except for entity cells not belonging to that column. [Satyajit] is an entity from a table cell, so it can only attend to the caption, topic entity, entity cells in the same row/column, as well as the header of that column.
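The two rules above can be turned into a small function. The flat element bookkeeping below (a list of dicts with kind/row/col fields, with headers sharing a virtual header row) is our own simplification for illustration, not TURL's actual preprocessing code.

```python
import numpy as np

def build_visibility_matrix(elements):
    """Build the binary visibility matrix M from the two rules above.
    `elements` is a flat list of dicts with keys:
      kind: 'caption', 'topic', 'header', or 'cell'
      row:  -1 for headers (they share the header row), >=0 for cells, None otherwise
      col:  column index for headers and cells, None otherwise"""
    n = len(elements)
    M = np.zeros((n, n), dtype=np.int64)
    for i, a in enumerate(elements):
        for j, b in enumerate(elements):
            if i == j or a["kind"] in ("caption", "topic") or b["kind"] in ("caption", "topic"):
                M[i, j] = 1  # caption/topic entity are visible to everything (and vice versa)
            elif a["col"] is not None and a["col"] == b["col"]:
                M[i, j] = 1  # same column (covers header <-> cells of that column)
            elif a["row"] is not None and a["row"] == b["row"]:
                M[i, j] = 1  # same row (headers see each other via the header row)
    return M

elements = [
    {"kind": "caption", "row": None, "col": None},
    {"kind": "header",  "row": -1,   "col": 0},
    {"kind": "header",  "row": -1,   "col": 1},
    {"kind": "cell",    "row": 0,    "col": 0},
    {"kind": "cell",    "row": 0,    "col": 1},
    {"kind": "cell",    "row": 1,    "col": 0},
]
print(build_visibility_matrix(elements))
```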
4.4 Pre-training Objective
In order to pre-train our model on an unlabeled table corpus, we adopt the Masked Language Model (MLM) objective from BERT to learn representations for tokens in table metadata and propose a Masked Entity Recovery (MER) objective to learn entity cell representations.
Masked Language Model. We adopt the same Masked Language Model objective as BERT, which trains the model to capture the lexical, semantic and contextual information described by table metadata. Given an input token sequence including the table caption and table headers, we simply mask some percentage of the tokens at random, and then predict these masked tokens. We adopt the same percentage settings as BERT. The pre-training data processor selects 20% of the token positions at random (note, we use a slightly larger ratio compared with the 15% in [15], as we want to make the pre-training more challenging). For a selected position, (1) 80% of the time we replace it with a special [MASK] token, (2) 10% of the time we replace it with another random token, and (3) 10% of the time we keep it unchanged.
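The token-level corruption just described can be written as a small standalone function. The percentages follow the text (20% of positions selected; 80/10/10 split among them); the [MASK] id and the rest of the glue are illustrative assumptions, not TURL's data processor.

```python
import random

MASK_ID = 103  # [MASK] id in a BERT-style vocabulary (illustrative)

def corrupt_tokens_for_mlm(token_ids, vocab_size, select_prob=0.20, rng=random):
    """Apply the MLM corruption described above: select 20% of positions;
    of those, 80% -> [MASK], 10% -> random token, 10% -> unchanged.
    Returns the corrupted sequence and the selected positions (prediction targets)."""
    corrupted = list(token_ids)
    selected = []
    for i in range(len(corrupted)):
        if rng.random() >= select_prob:
            continue
        selected.append(i)
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = MASK_ID                    # replace with [MASK]
        elif roll < 0.9:
            corrupted[i] = rng.randrange(vocab_size)  # replace with a random token
        # else: keep the original token unchanged
    return corrupted, selected

tokens = [2003, 2143, 2400, 7799, 1997, 1996, 2095]  # toy token ids
print(corrupt_tokens_for_mlm(tokens, vocab_size=30522))
```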
Example 4.2. Figure 3 shows an example of the above random process in MLM, where (1) "film", "award" and "recipient" are chosen randomly, (2) the input word embedding of "film" is further chosen randomly to be replaced with the embedding of [MASK], (3) the input word embedding of "recipient" is replaced with the embedding of a random word "milk", and (4) the input word embedding of "award" remains the same.
Given a token position selected for MLM, which has a contextualized representation h^t output by our encoder, the probability of predicting its original token w in the vocabulary W is then calculated as:
    P(w) = exp(LINEAR(h^t) . w) / sum over w_k in W of exp(LINEAR(h^t) . w_k).    (5)

Table 2: Dataset statistics (per table) in pre-training (min / mean / median / max).
# rows: train 1 / 13 / 8 / 4670; dev 5 / 20 / 12 / 667; test 5 / 21 / 12 / 3143
# entity columns: train 1 / 2 / 2 / 20; dev 3 / 4 / 3 / 15; test 3 / 4 / 3 / 15
# entities: train 3 / 19 / 9 / 3911; dev 8 / 57 / 34 / 2132; test 8 / 60 / 34 / 9215

Masked Entity Recovery. In addition to MLM, we propose a novel Masked Entity Recovery (MER) objective to help the model capture the factual knowledge embedded in the table content as well as the associations between table metadata and table content. Essentially, we mask a certain percentage of input entity cells and then recover the linked entity based on surrounding entity cells and table metadata. This requires the model to be able to infer the relation between entities from table metadata and encode the knowledge in entity embeddings.
In addition, our proposed masking mechanism takes advantage of entity mentions. Specifically, as shown in Eqn. 2, the input entity representation has two parts: the entity embedding e^e and the entity mention representation e^m. For some percentage of masked entity cells, we only mask e^e, and as such the model receives additional entity mention information to help form predictions. This assists the model in building a connection between entity embeddings and entity mentions, and helps downstream tasks where only cell texts are available.
Specifically, we propose the following masking mechanism for MER: the pre-training data processor chooses 60% of entity cells at random. Here we adopt a higher masking ratio for MER compared with MLM, because oftentimes in downstream tasks no or few entities are given. For one chosen entity cell, (1) 10% of the time we keep both e^m and e^e unchanged, (2) 63% (i.e., 70% of the remaining 90%) of the time we mask both e^m and e^e, and (3) 27% (i.e., 30% of the remaining 90%) of the time we keep e^m unchanged and mask e^e (among which we replace e^e with the embedding of a random entity to inject noise 10% of the time). Similar to BERT, in both MLM and MER we keep a certain portion of the selected positions unchanged so that the model can generate good representations for non-masked tokens/entity cells. Trained with random tokens/entities replacing the original ones, the model is robust and utilizes contextual information to make predictions rather than simply copying the input representation.
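Analogously to the MLM sketch above, the entity-cell corruption for MER can be sketched as follows, with the percentages taken from the text (60% of cells selected; then 10% untouched, 63% fully masked, 27% mention kept with the entity embedding masked, a tenth of which receive a random entity instead). Representing a cell as an (entity_id, mention) pair and the ENT_MASK sentinel are our own simplifications.

```python
import random

ENT_MASK = -1  # sentinel standing in for a masked entity embedding (illustrative)

def corrupt_cells_for_mer(cells, n_entities, rng=random):
    """Apply the MER corruption described above to a list of
    (entity_id, mention) pairs. Returns corrupted cells and selected positions."""
    corrupted = list(cells)
    selected = []
    for i, (ent, mention) in enumerate(cells):
        if rng.random() >= 0.60:            # choose 60% of entity cells
            continue
        selected.append(i)
        roll = rng.random()
        if roll < 0.10:                     # 10%: keep both e^m and e^e
            continue
        elif roll < 0.10 + 0.63:            # 63%: mask both e^m and e^e
            corrupted[i] = (ENT_MASK, None)
        else:                               # 27%: keep e^m, mask e^e ...
            if rng.random() < 0.10:         # ... replacing it with a random entity 10% of the time
                corrupted[i] = (rng.randrange(n_entities), mention)
            else:
                corrupted[i] = (ENT_MASK, mention)
    return corrupted, selected

cells = [(15, "15th"), (73, "Satyajit Ray"), (88, "Chiriyakhana")]
print(corrupt_cells_for_mer(cells, n_entities=1000))
```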
Example 4.3. Take Figure 3 as an example. [15th], [Satyajit], [17th] and [Mrinal] are first chosen for MER. Then, (1) the input mention representation and entity embedding of [Satyajit] remain the same. (2) The input mention representation and entity embedding of [15th] are both replaced with the embedding of [MASK]. (3) The input entity embedding of [Mrinal] is replaced with the embedding of [MASK], while the input entity embedding of [17th] is replaced with the embedding of a random entity [10th]. In both cases, the input mention representations are unchanged.
Given an entity cell selected for MER with a contextualized representation h^e output by our encoder, the probability of predicting entity e in the entity vocabulary is then calculated as follows:
    P(e) = exp(LINEAR(h^e) . e^e) / sum over e_k in the entity vocabulary of exp(LINEAR(h^e) . e_k^e).    (6)
In reality, considering that the entity vocabulary is quite large, we only use the above equation to rank entities from a given candidate set. For efficient training, we construct the candidate set with (1) entities in the current table, (2) entities that have co-occurred with those in the current table, and (3) randomly sampled negative entities.
We use a cross-entropy loss function for both MLM and MER objectives, and the final pre-training loss is given as follows:
    loss = sum log(P(w)) + sum log(P(e)),    (7)
where the sums are over all tokens and entity cells selected in MLM and MER, respectively.
Pre-training details. In this work, we denote the number of Transformer blocks as N, the hidden dimension of input embeddings and all Transformer block outputs as d_model, the hidden dimension of the fully connected layer in a Transformer block as d_intermediate, and the number of self-attention heads as k. We take advantage of a pre-trained TinyBERT [22] model, which is a knowledge-distilled version of BERT with a smaller size, and set the hyperparameters as follows: N = 4, d_model = 312, d_intermediate = 1200, k = 12. We initialize our structure-aware Transformer encoder parameters, word embeddings and position embeddings with TinyBERT [22]. Entity embeddings are initialized using the averaged word embeddings of entity names, and type embeddings are randomly initialized. We use the Adam [24] optimizer with a linearly decreasing learning rate. The initial learning rate is 1e-4, chosen from [1e-3, 5e-4, 1e-4, 1e-5] based on our validation set. We pre-trained the model for 80 epochs.
5 DATASET CONSTRUCTION FOR PRE-TRAINING
We construct a dataset for unsupervised representation learning based on the WikiTable corpus [4], which originally contains around 1.65M tables extracted from Wikipedia pages. The corpus contains a large amount of factual knowledge on various topics ranging from sport events (e.g., Olympics) to artistic works (e.g., TV series). The following sections introduce our data construction process as well as the characteristics of the dataset.

5.1 Data Pre-processing and Partitioning
Pre-processing. The corresponding Wikipedia page of a table often provides much contextual information, such as the page title and section title, that can aid in the understanding of the table topic. We concatenate the page title, section title and table caption to obtain a comprehensive description.
In addition, each table in the corpus contains one or more header rows and several rows of table content. For tables with more than one header row, we concatenate headers in the same column to obtain one header for each column. For each cell, we obtain hyperlinks to Wikipedia pages in it and use them to normalize different entity mentions corresponding to the same entity. We treat each Wikipedia page as an individual entity and do not use additional tools to perform entity linking with an external KB. For cells containing multiple hyperlinks, we only keep the first link. We also discard rows that have merged columns in a table.
Identify relational tables. We first locate all columns that contain at least one linked cell after pre-processing. We further filter out noisy columns with empty or illegal headers (e.g., note, comment, reference, digit numbers, etc.). The columns left are entity-centric and are referred to as entity columns. We then identify relational tables by finding tables that have a subject column. A simple heuristic is employed for subject column detection: the subject column must be located in the first two columns of the table and contain unique entities, which we treat as subject entities. We further filter out tables containing fewer than three entities or more than twenty columns. With this process, we obtain 670,171 relational tables.
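The filtering heuristics just described can be sketched as a single function over plain lists and dicts. The thresholds follow the text; the data layout, the abbreviated header blacklist and everything else are our own simplifications rather than the authors' preprocessing code.

```python
def is_relational(headers, columns, noisy_headers=("note", "comment", "reference")):
    """Rough sketch of the relational-table filter described above.
    headers: list of header strings, one per column.
    columns: list of columns, each a list of cell dicts {"entity": id_or_None}."""
    if len(headers) > 20:                                      # drop tables with more than twenty columns
        return False
    entity_cols = [
        i for i, col in enumerate(columns)
        if any(c["entity"] for c in col)                       # at least one linked cell
        and headers[i] and headers[i].lower() not in noisy_headers
        and not headers[i].isdigit()                           # drop empty/illegal headers
    ]
    n_entities = sum(sum(1 for c in columns[i] if c["entity"]) for i in entity_cols)
    if n_entities < 3:                                         # drop tables with fewer than three entities
        return False
    for i in entity_cols:
        if i >= 2:                                             # subject column must be in the first two columns
            continue
        ids = [c["entity"] for c in columns[i] if c["entity"]]
        if ids and len(ids) == len(set(ids)):                  # unique entities -> subject column found
            return True
    return False

cols = [[{"entity": "A"}, {"entity": "B"}], [{"entity": "C"}, {"entity": None}]]
print(is_relational(["Director", "Film"], cols))  # True
```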
Data partitioning. From the above 670,171 tables, we select a high-quality subset for evaluation: from tables that have (1) more than four linked entities in the subject column, (2) at least three entity columns including the subject column, and (3) more than half of the cells in entity columns linked, we randomly select 10,000 to form a held-out set. We further randomly partition this set into validation/test sets via a rough 1:1 ratio for model evaluation. All relational tables not in the evaluation set are used for pre-training. In sum, we have 570,171 / 5,036 / 4,964 tables respectively for the pre-training/validation/test sets.

5.2 Dataset Statistics in Pre-training
Fine-grained statistics of our datasets are summarized in Table 2. We can see that most tables in our pre-training dataset have moderate size, with a median of 8 rows, 2 entity columns and 9 entities per table. We build a token vocabulary using the BERT-based tokenizer [15] (with 30,522 tokens in total). For the entity vocabulary, we construct it based on the training table corpus and obtain 926,135 entities after removing those that appear only once.

Table 3: An overview of our benchmark tasks and strategies to fine-tune TURL.

6 EXPERIMENTS
To systematically evaluate our pre-trained framework as well as facilitate research, we compile a table understanding benchmark consisting of 6 widely studied tasks covering table interpretation (e.g., entity linking, column type annotation, relation extraction) and table augmentation (e.g., row population, cell filling, schema augmentation). We include existing datasets for entity linking. However, due to the lack of large-scale open-sourced datasets, we create new datasets for the other tasks based on our held-out set of relational tables and an existing KB.
Next we introduce the definition, baselines, dataset and results for each task. Our pre-trained framework is general and can be fine-tuned for all the independent tasks.

6.1 General Setup across All Tasks
We use the pre-training tables to create the training set for each task, and always build data for evaluation using the held-out validation/test tables. This way we ensure that there are no overlapping tables between training and validation/test. For fine-tuning, we initialize the parameters with a pre-trained model, and further train all parameters with a task-specific objective. To demonstrate the efficiency of pre-training, we only fine-tune our model for 10 epochs unless otherwise stated.

6.2 Entity Linking
Entity linking is a fundamental task in table interpretation, which is defined as:
Definition 6.1. Given a table T and a knowledge base KB, entity linking aims to link each potential mention in cells of T to its referent entity e in the KB.
Entity linking is usually addressed in two steps: a candidate generation module first proposes a set of potential entities, and an entity disambiguation module then ranks and selects the entity that best matches the surface form and is most consistent with the table context. Following existing work [4, 16, 35], we focus on entity disambiguation and use an existing Wikidata Lookup service for candidate generation.
Baselines. We compare against the most recent methods for table entity linking, T2K [35] and Hybrid II [16], as well as the off-the-shelf Wikidata Lookup service. T2K uses an iterative matching approach that combines both schema and entity matching. Hybrid II [16] combines a lookup method with an entity embedding method. For Wikidata Lookup, we simply use the top-1 returned result as the prediction.
Fine-tuning TURL. Entity disambiguation is essentially matching a table cell with candidate entities. We treat each cell as a potential entity, and input its cell text (entity mention e^m in Eqn. 2) as well as the table metadata to our Transformer encoder and obtain a contextualized representation h^e for each cell. To represent each candidate entity, we utilize the name and description as well as type information from a KB. The intuition is that when the candidate generation module proposes multiple entity candidates with similar names, we will utilize the description and type information to find the candidate that is most consistent with the table context. Specifically, for a KB entity e, given its name N and description D (both are sequences of words) and types T, we get its representation e^kb as follows:
    e^kb = [MEAN over w in N of (w), MEAN over w in D of (w), MEAN over t in T of (t)].    (8)
Here, w is the embedding for word w, which is shared with the embedding layer of the pre-trained model. t is the embedding for entity type t, to be learned during this fine-tuning phase. We then calculate a matching score between e^kb and h^e similarly as in Eqn. 6. We do not use the entity embeddings pre-trained by our model here, as the goal is to link mentions to entities in a target KB, not necessarily those that appear in our pre-training table corpus. The model is fine-tuned with a cross-entropy loss.
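The candidate representation of Eqn. 8 and the dot-product match against h^e can be sketched as below. The embeddings are stand-ins (random tensors) for the fine-tuned model's word, type and cell representations, and the projection shape is our own assumption for how "similarly as in Eqn. 6" could be realized.

```python
import torch
import torch.nn as nn

def candidate_repr(name_vecs, desc_vecs, type_vecs):
    """Eqn. 8: represent a KB candidate by concatenating the mean word embeddings
    of its name and description with the mean of its type embeddings."""
    return torch.cat([name_vecs.mean(0), desc_vecs.mean(0), type_vecs.mean(0)])

def rank_candidates(h_cell, candidates, proj):
    """Score each candidate against the contextualized cell representation h^e
    with a dot product after a linear projection (analogous to Eqn. 6)."""
    scores = torch.stack([proj(h_cell) @ candidate_repr(*c) for c in candidates])
    return torch.softmax(scores, dim=0)

d = 312
proj = nn.Linear(d, 3 * d)        # maps h^e into the concatenated candidate space
h_cell = torch.randn(d)           # contextualized cell representation from the encoder
candidates = [(torch.randn(2, d), torch.randn(8, d), torch.randn(1, d)) for _ in range(3)]
print(rank_candidates(h_cell, candidates, proj))
```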
Task-specific Datasets. We use three datasets to compare different entity linking models: (1) We adopt the Wikipedia gold standards (WikiGS) dataset from [16], which contains 4,453,329 entity mentions extracted from 485,096 Wikipedia tables and links them to DBpedia [3]. (2) Since tables in WikiGS also come from Wikipedia, some of the tables may have already been seen during pre-training, although their entity linking information is mainly used to pre-train entity embeddings (which are not used here). For a better comparison, we also create our own test set from the held-out test tables mentioned in Section 5.1, which contains 297,018 entity mentions from 4,964 tables. (3) To test our model on Web tables (i.e., those from websites other than Wikipedia), we also include the T2D dataset [25], which contains 26,124 entity mentions from 233 Web tables.1 We use names and descriptions returned by Wikidata Lookup, and entity types from DBpedia.
The training set for fine-tuning TURL is based on our pre-training corpus, but with tables in the above WikiGS removed. We also remove duplicate entity mentions and mentions where Wikidata Lookup fails to return the ground truth entity among the candidates, and finally obtain 1,264,217 entity mentions in 192,728 tables to fine-tune our model for the entity linking task.
Results. We set the maximum candidate size for Wikidata Lookup at 50 and also include the result of a Wikidata Lookup (Oracle), which considers an entity linking instance as correct if the ground-truth entity is in the candidate set. Due to the lack of open-sourced implementations, we directly use the results of T2K and Hybrid II in [16]. We use F1, precision (P) and recall (R) measures for evaluation. False positive is the number of mentions where the model links to wrong entities, not including the cases where the model makes no prediction (e.g., Wikidata Lookup returns an empty candidate set).

Table 4: Model evaluation on the entity linking task. All three datasets are evaluated with the same TURL + fine-tuning model (F1 / P / R per dataset).
Method | WikiGS | Our Test Set | T2D
T2K [35] | 34 / 70 / 22 | - | 82 / 90 / 76
Hybrid II [16] | 64 / 69 / 60 | - | 83 / 85 / 81
Wikidata Lookup | 57 / 67 / 49 | 62 / 62 / 60 | 80 / 86 / 75
TURL + fine-tuning | 67 / 79 / 58 | 68 / 71 / 66 | 78 / 83 / 73
  w/o entity desc. | 60 / 70 / 52 | 60 / 63 / 58 | -
  w/o entity type | 66 / 78 / 57 | 67 / 70 / 65 | -
  + reweighting | - | - | 82 / 88 / 77
WikiLookup (Oracle) | 74 / 88 / 64 | 79 / 82 / 76 | 90 / 96 / 84

As shown in Table 4, our model gets the best F1 score and substantially improves precision on WikiGS and our own test set. The disambiguation accuracy on WikiGS is 89.62% (predicting the correct entity when it is in the candidate set). A more advanced candidate generation module can help achieve better results in the future. We also conduct an ablation study on our model by removing the description or type information of a candidate entity from Eqn. 8. From Table 4, we can see that the entity description is very important for disambiguation, while entity type information only results in a minor improvement. This is perhaps due to the incompleteness of DBpedia, where a lot of entities have no types assigned or have missing types.
On the T2D dataset, all models perform much better than on the two Wikipedia datasets, mainly because of its smaller size and limited types of entities. The Wikidata Lookup baseline achieves high performance, and re-ranking using our model does not further improve it. However, we adopt simple reweighting2 to take into account the original result returned by Wikidata Lookup, which brings the F1 score to 0.82. This demonstrates the potential of using features such as entity popularity (used in Wikidata Lookup) and ensembling strong base models. Additionally, we conduct an error analysis on T2D comparing our model (TURL + fine-tuning + reweighting) with Wikidata Lookup. From Table 5, we can see that while in many cases our model can infer the correct entity type based on the context and re-rank the candidate list accordingly, it makes mistakes when there are entities in the KB that look very similar to the mentions. To summarize, Tables 4 and 5 show that there is room for further improvement of our model on entity linking, which we leave as future work.

1 We use the data released by [16] (https://doi.org/10.6084/m9.figshare.5229847).
2 We simply reweight the score of the top-1 prediction by our model with a factor of 0.8 and compare it with the top-1 prediction returned by Wikidata Lookup. The higher one is chosen as the final prediction.
Table 5: Further analysis for entity linking on the T2D corpus.
Mention | Page title | Header | Wikidata Lookup result | TURL + fine-tuning + reweighting result | Improve
philip | List of saints | Saint | Philip, male given name | Philip the Apostle, Christian saint and apostle | Yes
haycock | All 214 Wainwright fells from the pictorial guides - Wainwright Walks | Fell Name | Haycock, family name | Haycock, mountain in United Kingdom | Yes
don't you forget about me | Empty Ochestra - Karaoke | Band Name | Don't You Forget About Me, episode of Supernatural (S11 E12) | Don't You (Forget About Me), original song written and composed by Keith Forsey and Steve Schiff | Yes
bank of nova scotia | The Global 2000 - Forbes.com | Company | Scotiabank, Canadian bank based in Toronto | Bank of Nova Scotia, bank building in Calgary | No
purple finch | The Sea Ranch Association List of Birds | Common Name | Haemorhous purpureus, species of bird | Purple Finch, print in the National Gallery of Art | No

6.3 Column Type Annotation
We define the task of column type annotation as follows:
Definition 6.2. Given a table T and a set of semantic types L, column type annotation refers to the task of annotating a column in T with a type l in L so that all entities in the column have type l. Note that a column can have multiple types.
Column type annotation is a crucial task for table understanding and is a fundamental step for many downstream tasks like data integration and knowledge discovery. Earlier work [29, 35, 48] on column type annotation often coupled the task with entity linking: first, entities in a column are linked to a KB, and then majority voting is employed on the types of the linked entities. More recently, [10, 11, 21] have studied column type annotation based on cell texts only. Here we adopt a similar setting, i.e., we use the available information in a given table directly for column type annotation without performing entity linking first.
Baselines. We compare our results with the state-of-the-art model Sherlock [21] for column type annotation. Sherlock uses 1588 features describing statistical properties, character distributions, word embeddings, and paragraph vectors of the cell values in a column. It was originally designed to predict a single type for a given column. We change its final layer to |L| Sigmoid activation functions, each with a binary cross-entropy loss, to fit our multi-label setting. We also evaluate our model using two datasets in [11], and include the HNN + P2Vec model as a baseline. HNN + P2Vec employs a hybrid neural network to extract cell, row and column features, and combines it with property features retrieved from the KB.
Fine-tuning TURL. To predict the type(s) for a column, we first extract the contextualized representation of the column h_c as follows:
    h_c = [MEAN(h_i^t, ...); MEAN(h_j^e, ...)].    (9)
Here the h_i^t's are representations of tokens in the column header, and the h_j^e's are representations of entity cells in the column. The probability of predicting type l is then given as:
    P(l) = Sigmoid(h_c W_l + b_l).    (10)
Same as with the baselines, we optimize the binary cross-entropy loss, where y is the ground truth label for type l:
    loss = sum of [ y log(P(l)) + (1 - y) log(1 - P(l)) ].    (11)
Task-specific Datasets. We refer to Freebase [19] to obtain the semantic types L because of its richness, diversity, and scale. We only keep those columns in our relational table corpus that have at least three entities linked to Freebase, and for each column, we use the common types of its entities as annotations. We further filter out types with fewer than 100 training instances and keep only the most representative types. In the end, we get a total number of 255 types, 628,254 columns from 397,098 tables for training, and 13,025 (13,391) columns from 4,764 (4,844) tables for test (validation). We also test our model on two existing small-scale datasets, T2D-Te and Efthymiou (a subset of WikiGS annotated with types) from [11], and conduct two auxiliary experiments: (1) We first directly test our trained models and see how they generalize to existing datasets. We manually map 24 out of the 37 types used in [11] to our types, which results in 107 (of the original 133) columns in T2D-Te and 416 (of the original 614) columns in Efthymiou. (2) We follow the setting in [11] and use 70% of T2D as training data, which contains 250 columns.3

Table 6: Model evaluation on the column type annotation task.
Method | F1 | P | R
Sherlock (only entity mention) [21] | 78.47 | 88.40 | 70.55
TURL + fine-tuning (only entity mention) | 88.86 | 90.54 | 87.23
TURL + fine-tuning | 94.75 | 94.95 | 94.56
  w/o table metadata | 93.77 | 94.80 | 92.76
  w/o learned embedding | 92.69 | 92.75 | 92.63
  only table metadata | 90.24 | 89.91 | 90.58
  only learned embedding | 93.33 | 94.72 | 91.97

Table 7: Accuracy on T2D-Te and Efthymiou, where scores for HNN + P2Vec are copied from [11] (trained with 70% of T2D and Efthymiou respectively and tested on the rest). We directly apply our models by type mapping without retraining.
Method | T2D-Te | Efthymiou
HNN + P2Vec (entity mention + KB) [11] | 0.966 | 0.865
TURL + fine-tuning (only entity mention) | 0.888 | 0.745
  + table metadata | 0.860 | 0.904

Table 8: Accuracy on T2D-Te and Efthymiou. Here all models use T2D-Tr (70% of T2D) as the training set, following the setting in [11].
Method | T2D-Te | Efthymiou
HNN + P2Vec (entity mention + KB) [11] | 0.966 | 0.650
TURL + fine-tuning (only entity mention) | 0.940 | 0.516
  + table metadata | 0.962 | 0.746

Results. For the main results on our test set, we use the validation set for early stopping in training the Sherlock model, which takes over 100 epochs. We evaluate model performance using micro F1, Precision (P) and Recall (R) measures. Results are shown in Table 6. Our model substantially outperforms the baseline, even when using the same input information (only entity mention vs. Sherlock). Adding table metadata information and the entity embedding learned during pre-training further boosts the performance to 94.75 F1. In addition, our model achieves such performance using only 10 epochs of fine-tuning, which demonstrates the efficiency of the pre-training/fine-tuning paradigm. More detailed results for several types are shown in Table 9, where we observe that all methods work well for coarse-grained types like person.

3 We use the data released by [11] (https://github.com/alan-turing-institute/SemAIDA). The number of instances is slightly different from the original paper.
Table 9: Further analysis on column type annotation: Model performance for 5 selected types. Results are F1 on validation set.
Method person pro_athlete actor location citytown
Sherlock 96.85 74.39 29.07 91.22 55.72
TURL + fine-tuning 99.71 91.14 74.85 99.32 79.72
only entity mention 98.44 87.11 58.86 96.59 60.13
w/o table metadata 99.63 90.38 74.46 99.01 77.37
w/o learned embedding 99.38 90.56 71.39 98.91 75.55
only table metadata 98.26 88.80 70.86 98.11 72.54
only learned embedding 98.72 91.06 73.62 97.78 75.16

Table 10: Model evaluation on the relation extraction task.
Method | F1 | P | R
BERT-based | 90.94 | 91.18 | 90.69
TURL + fine-tuning (only table metadata) | 92.13 | 91.17 | 93.12
TURL + fine-tuning | 94.91 | 94.57 | 95.25
  w/o table metadata | 93.85 | 93.78 | 93.91
  w/o learned embedding | 93.35 | 92.90 | 93.80

Table 11: Relation extraction results of an entity linking based system, under different agreement ratio thresholds.
Min Ag. Ratio | F1 | P | R
0 | 68.73 | 60.33 | 79.85
0.4 | 82.10 | 94.65 | 72.50
0.5 | 77.68 | 98.33 | 64.20
0.7 | 63.10 | 99.37 | 46.23

However, fine-grained types like actor and pro_athlete are much more difficult to predict. Specifically, it is hard for a model to predict such types for a column based only on the entity mentions in cells. On the other hand, using table metadata works much better than using entity mentions (e.g., 70.86 vs 58.86 for actor). This indicates the importance of table context information for predicting fine-grained column types.
Results of the auxiliary experiments are summarized in Tables 7 and 8. The scores shown are accuracy, i.e., the ratio of correctly labeled columns, given that each column is annotated with one ground truth label. For HNN + P2Vec, the scores are directly copied from the original paper [11]. Note that in Table 7, the numbers from our models are not directly comparable with HNN + P2Vec, due to mapping the types in the original datasets to ours as mentioned earlier. However, taking HNN + P2Vec trained on in-domain data as a reference, we can see that without retraining, our models still obtain high accuracy on both the Web table corpus (T2D-Te) and the Wikipedia table corpus (Efthymiou). We also notice that adding table metadata slightly decreases the performance on T2D while increasing it on Efthymiou, which is possibly due to the distributional differences between Wikipedia tables and general Web tables. From Table 8 we can see that when trained on the same T2D-Tr split, our model with both entity mention and table metadata still outperforms or is on par with the baseline. However, when using only entity mentions, our model does not perform as well as the baseline, especially when generalizing to Efthymiou. This is because: (1) Our model is pre-trained with both table metadata and entity embeddings; removing both creates a big mismatch between pre-training and fine-tuning. (2) With only 250 training instances, it is easy for deep models to overfit. The better performance of the models leveraging table metadata under both settings demonstrates the usefulness of context for table understanding.
subject column with each of its object columns, and annotate the
Relation extraction is the task of mapping column pairs in a table
column pair with relations shared by more than half of the entity
to relations in a KB. A formal definition is given as follows.
pairs in the columns. We only keep relations that have more than
Definition 6.3. Given a table 𝑇 and a set of relations R in KB. For 100 training instances. Finally, we obtain a total number of 121 rela-
a subject-object column pair in 𝑇 , we aim to annotate it with 𝑟 ∈ R tions, 62,954 column pairs from 52,943 tables for training, and 2072
so that 𝑟 holds between all entity pairs in the columns. (2,175) column pairs from 1467 (1,560) tables for test (validation).
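As a concrete illustration of the fine-tuning head in Eqn. 12, the sketch below shows one possible PyTorch implementation. It is a minimal sketch under stated assumptions: the class and variable names, the hidden size of 312, and the choice of nn.BCELoss are illustrative and do not correspond to the released TURL code.

```python
import torch
import torch.nn as nn

class RelationExtractionHead(nn.Module):
    """Sketch of Eqn. 12: concatenate the aggregated representations of the
    subject and object columns, apply a linear layer (W_r, b_r), then a sigmoid."""

    def __init__(self, hidden_size: int, num_relations: int):
        super().__init__()
        # Input is [h_c ; h_c'], hence 2 * hidden_size.
        self.classifier = nn.Linear(2 * hidden_size, num_relations)

    def forward(self, h_c: torch.Tensor, h_c_prime: torch.Tensor) -> torch.Tensor:
        # h_c, h_c_prime: (batch, hidden_size) aggregated column representations.
        pair = torch.cat([h_c, h_c_prime], dim=-1)
        return torch.sigmoid(self.classifier(pair))  # P(r), shape (batch, num_relations)

# Fine-tuning with a binary cross-entropy loss over the 121 relation labels.
head = RelationExtractionHead(hidden_size=312, num_relations=121)
loss_fn = nn.BCELoss()

h_c, h_c_prime = torch.randn(4, 312), torch.randn(4, 312)  # toy column representations
labels = torch.randint(0, 2, (4, 121)).float()             # multi-label relation targets
loss = loss_fn(head(h_c, h_c_prime), labels)
loss.backward()
```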
Table 12: Model evaluation on row population task. Recall is the same for all methods because they share the same candidate generation module.

                     #seed = 0           #seed = 1
Method               MAP      Recall     MAP      Recall
EntiTables [45]      17.90    63.30      42.31    78.13
Table2Vec [13]       -        63.30      20.86    78.13
TURL + fine-tuning   40.92    63.30      48.31    78.13

Table 13: Model evaluation on cell filling task.

Method   P@1      P@3      P@5      P@10
Exact    51.36    70.10    76.80    84.93
H2H      51.90    70.95    77.33    85.44
H2V      52.23    70.82    77.35    85.58
TURL     54.80    76.58    83.66    90.98

Figure 6: Comparison of fine-tuning our model and BERT for relation extraction: Our model converges much faster.
Table 14: Model evaluation on schema augmentation task.

                     #seed column labels
Method               0         1
kNN                  80.16     82.01
TURL + fine-tuning   81.94     77.55

Results. We fine-tune the BERT-based model for 25 epochs. We use micro F1, Precision (P) and Recall (R) measures for evaluation. Results are summarized in Table 10.
From Table 10 we can see that: (1) Both the BERT-based baseline and our model achieve good performance, with F1 scores larger than 0.9. (2) Our model outperforms the BERT-based baseline under all settings, even when using the same information as the BERT-based baseline (i.e., only table metadata). Moreover, we plot the mean average precision (MAP) curve on our validation set during training in Figure 6. As one can see, our model converges much faster in comparison to the BERT-based baseline, demonstrating that our model learns a better initialization through pre-training.
As mentioned earlier, we also experiment with an entity-linking-based system. Results are summarized in Table 11. We can see that it achieves high precision, but suffers from low recall: the upper bound of recall is only 79.85%, achieved at an agreement ratio of 0 (i.e., taking all relations that exist between the linked entity pairs as positive). As seen from Tables 10 and 11, our model also substantially outperforms the system based on a strong entity linker.

6.5 Row Population
Row population is the task of augmenting a given table with more rows or row elements. For relational tables, existing work has tackled this problem by retrieving entities to fill the subject column [45, 47]. A formal definition of the task is given below.

Definition 6.4. Given a partial table 𝑇 and an optional set of seed subject entities, row population aims to retrieve more entities to fill the subject column.

Baselines. We adopt models from [45] and [13] as baselines. [45] uses a generative probabilistic model which ranks candidate entities considering both table metadata and entity co-occurrence statistics. [13] further improves upon [45] by utilizing entity embeddings trained on the table corpus to estimate entity similarity. We use the same candidate generation module from [45] for all methods, which formulates a search query using either the table caption or seed entities and then retrieves tables via the BM25 retrieval algorithm. Subject entities in those retrieved tables are the candidates for row population.

Fine-tuning TURL. We adopt the same candidate generation module used by the baselines. We then append the [MASK] token to the input, and use the hidden representation he of [MASK] to rank these candidates as shown in Table 3. We fine-tune our model with a multi-label soft margin loss as shown below:

𝑃 (𝑒) = Sigmoid (LINEAR(he ) · ee ) ,
𝑙𝑜𝑠𝑠 = − Σ𝑒 ∈E𝐶 [𝑦 log (𝑃 (𝑒)) + (1 − 𝑦) log (1 − 𝑃 (𝑒))] . (13)

Here E𝐶 is the candidate entity set, and 𝑦 is the ground-truth label of whether 𝑒 is a subject entity of the table.

Task-specific Datasets. Tables in our pre-training set with more than 3 subject entities are used for fine-tuning TURL and developing baseline models, while tables in our held-out set with more than 5 subject entities are used for evaluation. In total, we obtain 432,660 tables for fine-tuning with 10 subject entities on average, and 4,132 (4,205) tables for test (validation) with 16 (15) subject entities on average.

Results. The experiments are conducted under two settings: without any seed entity and with one seed entity. For experiments without the seed entity, we only use the table caption for candidate generation. For entity ranking in EntiTables [45], we use the combination of caption and label likelihood when there is no seed entity, and only use entity similarity when seed entities are available. This strategy works best on our validation set. As shown in Table 12, our method outperforms all baselines. In particular, previous methods rely on entity similarity and are not applicable or have poor results when there is no seed entity available. Our method achieves a decent performance even without any seed entity, which demonstrates the effectiveness of TURL for generating contextualized representations based on both table metadata and content.
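For reference, the scoring function and multi-label objective in Eqn. 13 above can be sketched as follows. This is only an illustration under stated assumptions: it presumes the encoder already provides the [MASK] representation he and the candidate entity embeddings ee, and all module names and dimensions here are hypothetical rather than taken from the released TURL code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RowPopulationScorer(nn.Module):
    """Sketch of Eqn. 13: score each candidate entity against the contextualized
    [MASK] representation, then train with a multi-label binary cross-entropy
    (soft margin) objective over the candidate set."""

    def __init__(self, hidden_size: int, entity_dim: int):
        super().__init__()
        # LINEAR(.) in Eqn. 13 maps the [MASK] representation into entity space.
        self.proj = nn.Linear(hidden_size, entity_dim)

    def forward(self, h_mask: torch.Tensor, cand_emb: torch.Tensor) -> torch.Tensor:
        # h_mask: (batch, hidden_size); cand_emb: (batch, num_candidates, entity_dim)
        query = self.proj(h_mask).unsqueeze(-1)          # (batch, entity_dim, 1)
        logits = torch.bmm(cand_emb, query).squeeze(-1)  # LINEAR(h_e) . e_e per candidate
        return logits                                    # P(e) = sigmoid(logits)

scorer = RowPopulationScorer(hidden_size=312, entity_dim=312)
h_mask = torch.randn(2, 312)                   # toy [MASK] representations
cand_emb = torch.randn(2, 50, 312)             # 50 candidate entities per table
labels = torch.randint(0, 2, (2, 50)).float()  # 1 if the candidate is a subject entity
# Equivalent to loss = -sum[y log P(e) + (1 - y) log(1 - P(e))] over candidates.
loss = F.binary_cross_entropy_with_logits(scorer(h_mask, cand_emb), labels)
loss.backward()
```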
Table 15: Case study on schema augmentation. Here we show average precision (AP) for each example. Support Caption is the caption of the source table that kNN found to be most similar to the query table. Our model performs worse when there exist source tables that are very similar to the query table (e.g., comparing support caption vs query caption).

Method | Query Caption | Seed | Target | AP | Predicted | Support Caption
kNN | 2010 santos fc season out | name | moving to | 1.0 | moving to, name, player, pos., moving from, to | 2007 santos fc season out
Ours | 2010 santos fc season out | name | moving to | 0.58 | moving to, fee/notes, destination club, fee, loaned to | -
kNN | first ladies and gentlemen of panama list | name | president | 0.20 | country, runner-up, no. champion, player, team | first ladies of chile list of first ladies
Ours | first ladies and gentlemen of panama list | name | president | 0.14 | year, runner-up, spouse, name, father | -
kNN | list of radio stations in metro manila am stations | name | format, covered location | 1.0 | format, covered location, company, call sign, owner | list of radio stations in metro manila fm stations
Ours | list of radio stations in metro manila am stations | name | format, covered location | 0.83 | format, owner, covered location, city of license, call sign | -

6.6 Cell Filling
We examine the utility of our model in filling other table cells, assuming the subject column is given. This is similar to the setting in [42, 46], which we formally define as follows.

Definition 6.5. Given a partial table 𝑇 with the subject column filled and an object column header, cell filling aims to predict the object entity for each subject entity.

Baselines. We adopt [46] as our base model. It has two main components, candidate value finding and value ranking. The same candidate value finding module is used for all methods: Given a subject entity 𝑒 and object header ℎ for the to-be-filled cells, we find all entities that appear in the same row as 𝑒 in our pre-training table corpus, and only keep entities whose source header ℎ′ is related to ℎ. Here we use the formula from [46] to measure the relevance of two headers 𝑃 (ℎ′ |ℎ):

𝑃 (ℎ′ |ℎ) = 𝑛(ℎ′, ℎ) / Σℎ′′ 𝑛(ℎ′′, ℎ) . (14)

Here 𝑛(ℎ′, ℎ) is the number of table pairs in the table corpus that contain the same entity for a given subject entity in columns ℎ′ and ℎ. The intuition is that if two tables contain the same object entity for a given subject entity 𝑒 in columns with headings ℎ𝑎 and ℎ𝑏 , then ℎ𝑎 and ℎ𝑏 might refer to the same attribute. For value ranking, the key is to match the given header ℎ with the source header ℎ′; we can then get the probability that the candidate entity 𝑒 belongs to the cell, 𝑃 (𝑒 |ℎ), as follows:

𝑃 (𝑒 |ℎ) = MAXℎ′ sim(ℎ′, ℎ) . (15)

Here the ℎ′'s are the source headers associated with the candidate entity in the pre-training table corpus, and sim(ℎ′, ℎ) is the similarity between ℎ′ and ℎ. We develop three baseline methods for sim(ℎ′, ℎ): (1) Exact: predict the entity with an exactly matched header; (2) H2H: use the 𝑃 (ℎ′ |ℎ) described above; (3) H2V: similar to [13], we train header embeddings with Word2Vec on the table corpus and then measure the similarity between headers using cosine similarity.

Fine-tuning TURL. Since cell filling is very similar to the MER pre-training task, we do not fine-tune the model, and directly use [MASK] to select from candidate entities in the same way as MER (Eqn. 6).

Task-specific Datasets. To evaluate different methods on this task, we use the held-out test tables in our pre-training phase and extract from them those subject-object column pairs that have at least three valid entity pairs. Finally we obtain 9,075 column pairs for evaluation.

Results. For candidate value finding, using all entities appearing in the same row with a given subject entity 𝑒 achieves a recall of 62.51% with 165 candidates on average. After filtering with 𝑃 (ℎ′ |ℎ) > 0, the recall drops slightly to 61.45% and the average number of candidates reduces to 86. For value ranking, we only consider those test instances with the target object entity in the candidate set and evaluate them under Precision@K (or, P@K). Results are summarized in Table 13, from which we show: (1) Simple Exact match achieves decent performance, and using H2H or H2V only slightly improves the results. (2) Even though our model directly ranks the candidate entities without explicitly using their source table information, it outperforms other methods. This indicates that our model already encodes the factual knowledge in tables into entity embeddings through pre-training.

6.7 Schema Augmentation
Aside from completing the table content, another direction of table augmentation focuses on augmenting the table schema, i.e., discovering new column headers to extend a table with more columns [6, 13, 42, 45]. Following [13, 42, 45], we formally define the task below.

Definition 6.6. Given a partial table 𝑇 , which has a caption and zero or a few seed headers, and a header vocabulary H , schema augmentation aims to recommend a ranked list of headers ℎ ∈ H to add to 𝑇 .

Baselines. We adopt the method in [45], which searches our pre-training table corpus for related tables and uses headers in those related tables for augmentation. More specifically, we encode the given table caption as a tf-idf vector and then use the K-nearest neighbors algorithm (kNN) [2] with cosine similarity to find the top-10 most related tables. We rank headers from those tables by aggregating the cosine similarities of the tables they belong to. When seed headers are available, we re-weight the tables by the overlap of their schemas with the seed headers, same as [45].

Fine-tuning TURL. We concatenate the table caption, seed headers and a [MASK] token as input to our model. The output for [MASK] is then used to predict the headers in a given header vocabulary H . We fine-tune our model using a binary cross-entropy loss.

Task-specific Datasets. We collect H from the pre-training table corpus. We normalize the headers using simple rules, only keep those that appear in at least 10 different tables, and finally obtain 5,652 unique headers, with 316,858 training tables and 4,646 (4,708) test (validation) tables.
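For illustration, the kNN baseline for schema augmentation described above can be sketched as follows: captions are encoded as tf-idf vectors, the top-10 most similar tables are retrieved, and their headers are ranked by aggregating the table similarities. This scikit-learn-based sketch and its helper names are assumptions for illustration, not the authors' implementation.

```python
from collections import defaultdict
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def knn_schema_augmentation(query_caption, corpus_captions, corpus_schemas, k=10):
    """Rank candidate headers for a query table by caption similarity:
    retrieve the k nearest tables (cosine similarity over tf-idf vectors)
    and sum the similarities of the tables each header appears in."""
    vectorizer = TfidfVectorizer()
    corpus_vecs = vectorizer.fit_transform(corpus_captions)
    query_vec = vectorizer.transform([query_caption])

    sims = cosine_similarity(query_vec, corpus_vecs)[0]
    top_k = sims.argsort()[::-1][:k]

    header_scores = defaultdict(float)
    for idx in top_k:
        for header in corpus_schemas[idx]:
            header_scores[header] += sims[idx]
    return sorted(header_scores.items(), key=lambda x: x[1], reverse=True)

# Toy example with two corpus tables (captions and schemas).
captions = ["2007 santos fc season out", "list of radio stations in metro manila"]
schemas = [["name", "moving to", "fee"], ["name", "format", "covered location"]]
print(knn_schema_augmentation("2010 santos fc season out", captions, schemas))
```

When seed headers are given, the same aggregation can be re-weighted by the overlap between each retrieved table's schema and the seed headers, as in [45].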
Results. We fine-tune our model for 50 epochs for this task, based on the performance on the validation set. We use mean average precision (MAP) for evaluation.
From Table 14, we observe that both the kNN baseline and our model achieve good performance. Our model works better when no seed header is available, but does not perform as well when there is one seed header. We then conduct a further analysis in Table 15 using a few examples: One major reason why kNN works well is that there exist tables in the pre-training table corpus that are very similar to the query table and have almost the same table schema. On the other hand, our model oftentimes suggests plausible, semantically related headers, but misses the ground-truth headers.

6.8 Ablation Study
In this section, we examine the effects of two important designs in TURL: the visibility matrix and MER with different mask ratios. During the pre-training phase, at each training step, we evaluate TURL on the validation set for object entity prediction. We choose this task because it is similar to the cell filling downstream task and it is convenient to conduct during pre-training (e.g., ground truth is readily available and there is no need to modify the model architecture, etc.).
Given a table in our validation set, we predict each object entity by first masking the entity cell (both ee and em ) and obtaining a contextualized representation of the [MASK] (which attends to the table caption, the corresponding header, as well as other entities in the same row/column before the current cell position) and then applying Eqn. 6. We compare the top-1 predicted entity with the ground truth and show the accuracy (ACC) on average. Results are summarized in Figure 7.

Figure 7: Ablation study results. (a) Effect of visibility matrix. (b) Effect of different MER mask ratios.

Figure 7a clearly demonstrates the advantage of our visibility matrix design. Without the visibility matrix (an element can attend to every other element during pre-training), it is hard for the model to capture the most relevant information (e.g., relations between entities) in the table for prediction. From Figure 7b, we observe that at a mask ratio of 0.8, the object entity prediction performance drops in comparison with other lower ratios. This is because this task requires the model to not only understand the table metadata, but also learn the relations between entities. A high mask ratio forces the model to put more emphasis on the table metadata, while a lower mask ratio encourages the model to leverage the relations between entities. Meanwhile, a very low mask ratio such as 0.2 also hurts the pre-training performance, because only a small portion of entity cells are actually used for training in each iteration. A low mask ratio also creates a mismatch between pre-training and fine-tuning, since for many downstream tasks, only a few seed entities are given. Considering both aspects, as well as the fact that the results are not sensitive w.r.t. this parameter, we set the MER mask ratio at 0.6 in pre-training.

7 CONCLUSION
This paper presents a novel pre-training/fine-tuning framework (TURL) for relational table understanding. It consists of a structure-aware Transformer encoder to model the row-column structure as well as a new Masked Entity Recovery objective to capture the semantics and knowledge in relational Web tables during pre-training. On our compiled benchmark, we show that TURL can be applied to a wide range of tasks with minimal fine-tuning and achieves superior performance in most scenarios. Interesting future work includes: (1) Focusing on other types of knowledge such as numerical attributes in relational Web tables, in addition to entity relations. (2) Incorporating the rich information contained in an external KB into pre-training.

ACKNOWLEDGMENTS
We would like to thank the anonymous reviewers for their helpful comments. Authors at the Ohio State University were sponsored in part by Google Faculty Award, the Army Research Office under cooperative agreements W911NF-17-1-0412, NSF Grant IIS1815674, NSF CAREER #1942980, Fujitsu gift grant, and Ohio Supercomputer Center [9]. The views and conclusions contained herein are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Army Research Office or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notice herein.

REFERENCES
[1] Ahmad Ahmadov, Maik Thiele, Julian Eberius, Wolfgang Lehner, and Robert Wrembel. 2015. Towards a hybrid imputation approach using web tables. In 2015 IEEE/ACM 2nd International Symposium on Big Data Computing (BDC). IEEE, 21–30.
[2] Naomi S. Altman. 1992. An Introduction to Kernel and Nearest Neighbor Nonparametric Regression.
[3] Sören Auer, Christian Bizer, Georgi Kobilarov, Jens Lehmann, Richard Cyganiak, and Zachary G. Ives. 2007. DBpedia: A Nucleus for a Web of Open Data. In ISWC/ASWC.
[4] Chandra Sekhar Bhagavatula, Thanapon Noraset, and Doug Downey. 2015. TabEL: Entity Linking in Web Tables. In Proceedings of the 14th International Conference on The Semantic Web - ISWC 2015 - Volume 9366. 425–441.
[5] Antoine Bordes, Nicolas Usunier, Alberto García-Durán, Jason Weston, and Oksana Yakhnenko. 2013. Translating Embeddings for Modeling Multi-relational Data. In NIPS.
[6] Michael J Cafarella, Alon Halevy, Daisy Zhe Wang, Eugene Wu, and Yang Zhang. 2008. Webtables: exploring the power of tables on the web. Proceedings of the VLDB Endowment 1, 1 (2008), 538–549.
[7] Michael J. Cafarella, Alon Y. Halevy, Yang Zhang, Daisy Zhe Wang, and Eugene Wu. 2008. Uncovering the Relational Web. In 11th International Workshop on the Web and Databases, WebDB 2008, Vancouver, BC, Canada, June 13, 2008.
[8] Matteo Cannaviccio, Lorenzo Ariemma, Denilson Barbosa, and Paolo Merialdo. 2018. Leveraging wikipedia table schemas for knowledge graph augmentation. In Proceedings of the 21st International Workshop on the Web and Databases. 1–6.
[9] Ohio Supercomputer Center. 1987. Ohio Supercomputer Center. http://osc.edu/ark:/19495/f5s1ph73
[10] Jiaoyan Chen, Ernesto Jiménez-Ruiz, Ian Horrocks, and Charles A. Sutton. 2018. ColNet: Embedding the Semantics of Web Tables for Column Type Prediction. In AAAI.
[11] Jiaoyan Chen, Ernesto Jiménez-Ruiz, Ian Horrocks, and Charles A. Sutton. 2019. Learning Semantic Annotations for Tabular Data. In IJCAI.
[12] Anish Das Sarma, Lujun Fang, Nitin Gupta, Alon Halevy, Hongrae Lee, Fei Wu, Reynold Xin, and Cong Yu. 2012. Finding Related Tables. In Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data. ACM, 817–828.
[13] Li Deng, Shuo Zhang, and Krisztian Balog. 2019. Table2Vec: Neural Word and Entity Embeddings for Table Population and Retrieval. In SIGIR'19.
[14] Xiang Deng and Huan Sun. 2019. Leveraging 2-hop Distant Supervision from Table Entity Pairs for Relation Extraction. arXiv preprint arXiv:1909.06007 (2019).
[15] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). 4171–4186.
[16] Vasilis Efthymiou, Oktie Hassanzadeh, Mariano Rodriguez-Muro, and Vassilis Christophides. 2017. Matching Web Tables with Knowledge Base Entities: From Entity Lookups to Entity Embeddings. In International Semantic Web Conference.
[17] Raul Castro Fernandez and Samuel Madden. 2019. Termite: a system for tunneling through heterogeneous data. In Proceedings of the Second International Workshop on Exploiting Artificial Intelligence Techniques for Data Management. 1–8.
[18] Xavier Glorot, Antoine Bordes, Jason Weston, and Yoshua Bengio. 2013. A semantic matching energy function for learning with multi-relational data. Machine Learning 94 (2013), 233–259.
[19] Google. 2015. Freebase Data Dumps. https://developers.google.com/freebase/data.
[20] Jonathan Herzig, P. Nowak, Thomas Müller, Francesco Piccinno, and Julian Martin Eisenschlos. 2020. TAPAS: Weakly Supervised Table Parsing via Pre-training. In ACL.
[21] Madelon Hulsebos, Kevin Zeng Hu, Michiel Bakker, Emanuel Zgraggen, Arvind Satyanarayan, Tim Kraska, Çağatay Demiralp, and César A. Hidalgo. 2019. Sherlock: A Deep Learning Approach to Semantic Data Type Detection. Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (2019).
[22] Xiaoqi Jiao, Y. Yin, Lifeng Shang, Xin Jiang, Xusong Chen, Linlin Li, Fang Wang, and Qun Liu. 2019. TinyBERT: Distilling BERT for Natural Language Understanding. ArXiv abs/1909.10351 (2019).
[23] Ernesto Jiménez-Ruiz, Oktie Hassanzadeh, Vasilis Efthymiou, Jiaoyan Chen, and Kavitha Srinivas. 2020. SemTab 2019: Resources to Benchmark Tabular Data to Knowledge Graph Matching Systems. In European Semantic Web Conference. Springer, 514–530.
[24] Diederik P. Kingma and Jimmy Ba. 2015. Adam: A Method for Stochastic Optimization. CoRR abs/1412.6980 (2015).
[25] Oliver Lehmberg, Dominique Ritze, Robert Meusel, and Christian Bizer. 2016. A Large Public Corpus of Web Tables containing Time and Context Metadata. In Proceedings of the 25th International Conference Companion on World Wide Web. 75–76.
[26] Yuliang Li, Jinfeng Li, Yoshihiko Suhara, AnHai Doan, and Wang-Chiew Tan. 2020. Deep Entity Matching with Pre-Trained Language Models. arXiv preprint arXiv:2004.00584 (2020).
[27] Girija Limaye, Sunita Sarawagi, and Soumen Chakrabarti. 2010. Annotating and searching web tables using entities, types and relationships. In VLDB 2010.
[28] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems. 3111–3119.
[29] Varish Mulwad, Timothy W. Finin, Zareen Syed, and Anupam Joshi. 2010. Using Linked Data to Interpret Tables. In COLD.
[30] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. Glove: Global Vectors for Word Representation. In EMNLP.
[31] Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In NAACL-HLT.
[32] Matthew E Peters, Mark Neumann, Robert Logan, Roy Schwartz, Vidur Joshi, Sameer Singh, and Noah A Smith. 2019. Knowledge Enhanced Contextual Word Representations. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). 43–54.
[33] Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language Models are Unsupervised Multitask Learners. (2019).
[34] Sebastian Riedel, Limin Yao, Andrew McCallum, and Benjamin M. Marlin. 2013. Relation Extraction with Matrix Factorization and Universal Schemas. In HLT-NAACL.
[35] Dominique Ritze, Oliver Lehmberg, and Christian Bizer. 2015. Matching HTML Tables to DBpedia. In WIMS '15.
[36] Yoones A Sekhavat, Francesco Di Paolo, Denilson Barbosa, and Paolo Merialdo. [n.d.]. Knowledge Base Augmentation using Tabular Data.
[37] Quan Wang, Zhendong Mao, Bin Wang, and Li Guo. 2017. Knowledge Graph Embedding: A Survey of Approaches and Applications. IEEE Transactions on Knowledge and Data Engineering 29 (2017), 2724–2743.
[38] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in neural information processing systems. 5998–6008.
[39] Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2018. GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP. 353–355.
[40] Zhen Wang, Jianwen Zhang, Jianlin Feng, and Zhigang Chen. 2014. Knowledge Graph Embedding by Translating on Hyperplanes. In AAAI.
[41] Jason Weston, Antoine Bordes, Oksana Yakhnenko, and Nicolas Usunier. 2013. Connecting Language and Knowledge Bases with Embedding Models for Relation Extraction. ArXiv abs/1307.7973 (2013).
[42] Mohamed Yakout, Kris Ganjam, Kaushik Chakrabarti, and Surajit Chaudhuri. 2012. Infogather: entity augmentation and attribute discovery by holistic matching with web tables. In Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data. ACM, 97–108.
[43] Zhilin Yang, Zihang Dai, Yiming Yang, Jaime G. Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. 2019. XLNet: Generalized Autoregressive Pretraining for Language Understanding. In NeurIPS.
[44] Pengcheng Yin, Graham Neubig, Wen-tau Yih, and Sebastian Riedel. 2020. TaBERT: Pretraining for Joint Understanding of Textual and Tabular Data. In ACL.
[45] Shuo Zhang and Krisztian Balog. 2017. Entitables: Smart assistance for entity-focused tables. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 255–264.
[46] Shuo Zhang and Krisztian Balog. 2019. Auto-completion for data cells in relational tables. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management. 761–770.
[47] Shuo Zhang and Krisztian Balog. 2020. Web Table Extraction, Retrieval, and Augmentation: A Survey. ACM Transactions on Intelligent Systems and Technology (TIST) 11 (2020), 1–35.
[48] Ziqi Zhang. 2017. Effective and efficient Semantic Table Interpretation using TableMiner+. Semantic Web 8 (2017), 921–957.
[49] Zhengyan Zhang, Xu Han, Zhiyuan Liu, Xin Jiang, Maosong Sun, and Qun Liu. 2019. ERNIE: Enhanced Language Representation with Informative Entities. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 1441–1451.
