
DTT: An Example-Driven Tabular Transformer by Leveraging Large Language Models

Arash Dargahi Nobari ([email protected]) and Davood Rafiei ([email protected])
University of Alberta, Edmonton, Alberta, Canada

arXiv:2303.06748v1 [cs.DB] 12 Mar 2023
ABSTRACT

Many organizations rely on data from government and third-party sources, and those sources and organizations do not follow the same data formatting. This introduces challenges in integrating data from multiple sources. Commercial database systems do not offer adequate support for integrating data from heterogeneous sources, and manual integration is both time-consuming and inefficient. While state-of-the-art approaches rely on similarity functions and textual transformations, they often fail to handle challenging cases where multiple mappings are required, or the mappings go beyond simple textual transformations.

In this paper, we study the potential of deep neural models for transforming tables for joinability. In particular, we cast the problem as a prediction task and develop a framework that leverages large deep-learning language models to transform tabular data from a source formatting to a desired target representation. Our framework can efficiently learn the pattern for mapping the source formatting into the expected target using just a few examples, which can then be used for table joining, filling in missing values, and error detection. Compared to state-of-the-art mapping and joining approaches, our framework delivers noticeably more accurate and scalable performance on both real-world and synthetic datasets. Our experimental evaluation also shows that the performance of the proposed framework using our fine-tuned model is at par with or better than large language models such as GPT-3, despite the significant difference in size, and that integrating large language models into our framework improves their performance.

1 INTRODUCTION

The drive towards data publishing and sharing by entities and governments over the past couple of years has led many organizations to rely on data from third-party sources. However, gathering data from multiple sources inevitably leads to data mismatches. Converting data from one format to another has long been a challenge in data integration and management [10, 11, 13, 20], with traditional approaches relying on manual development of guidelines and rules for integration or transformation [27, 39]. However, the sheer size and complexity of modern databases make manual transformation and integration impractical, fueling significant research on automatic data transformation [15, 19, 29, 53].

Our focus in this paper is on automated transformation of tabular data such as spreadsheets, web tables, and relational databases, which is widely adopted by both organizations and governments for representing, storing, and sharing data. In particular, given a few examples of matched rows between a source and a target, the goal is to learn a mapping. Those mappings can then be used to transform arbitrary rows in the source formatting into the formatting of the target, with applications in joining data from different sources [29, 53], filling in missing values and auto-completion [10, 40], and error detection and correction [12].

S-id  Source Column          T-id  Target Column
1     Jocelyne Thomas        1     j.thomas
2     Gerard Herbert Little  2     g.h.litt
3     Norm Adams             3     n.adams
4     Julian ,               4     julian
5     Therese Vicki Lee      5     t.v.lee
6     Max Anderson           6     m.anders
7     Julie Lauzon           7     j.lauzon
8     . Kumar                8     kumar

Figure 1: Two columns, representing the name and user id of individuals, where a mapping from name to user id is sought

Example 1. Figure 1 depicts a source and a target, representing the same entities (people) but in different formattings. The source table shows the names of the individuals, while the target table shows their corresponding user ids. Suppose that the target column is incomplete or unavailable, and the objective is to predict the missing values of the target column, based on a few examples of source-target pairs. Or, one may want to join the two columns despite the differences in formatting. The transformation process requires some reformatting rules, and the choice of a rule may be conditional on the input. For example, in the first, third, and seventh rows, the reformatting rule involves concatenating the initial and the last name, converting them to lowercase, and using a period as a separator. However, in the second row there is a middle name, while the last row lacks a first name, and these variations can affect the transformations. In general, detecting such transformations from a set of examples is not straightforward and must account for various challenges.

Challenges. The search space for possible transformations is huge. If each transformation is synthesized into a sequence of edit operations, the search space grows exponentially with the number of edit operations in a transformation and the parameter space of the operations. Also, both the search space and the runtime of many state-of-the-art approaches [10, 29, 40, 53] further grow with the input size, such as the number of rows and the length of a row. Despite some attempts to reduce the search space by limiting the number of operations (e.g. split and substring) within a transformation [29], sampling the input [53], and applying pruning strategies [29], the search space still remains relatively large. Hence the time needed to find a mapping is often much more than what may be considered acceptable, for example, in an online setting. Also, some of these improvements and prunings are lossy, and they can miss transformations that are of a better quality than those found.

Another challenge is the availability of input examples and noise handling. The examples are usually either user-provided or automatically generated from the input. In the former, the examples are (extremely) limited but accurate, whereas in the latter, the examples can be extensive, but less accurate. In real-world settings, noise is usually unavoidable and inconsistencies can exist in data. Also, when examples are automatically generated, some of them can be incorrect. A good model should perform well with only a limited number of examples, and it should also benefit from the abundance of examples, maybe ignoring those that are less useful. A good model should also be robust against any possible noise in data and deal with inaccuracy in the provided examples.

Existing Approaches. A wide array of studies target the problem of matching entity descriptions or records that describe the same real-world entities but differ in terms of formatting or representation [1, 7, 14, 24, 25, 47, 52]. Traditional approaches rely on textual and semantic similarity, whereas more recent approaches incorporate machine learning and deep neural models. While these models provide effective solutions to the problem of formatting mismatch in joining tabular data, their use cases are limited. For example, these models cannot predict or rectify missing values, provide suggestions, or detect outliers in a table. This has led to another line of research where the focus is on finding a mapping between source and target tables and leveraging this mapping to transform the source formatting into that of the target.

The majority of approaches aimed at detecting a mapping between two tables rely heavily on a limited set of string-based transformations [10, 29, 40, 53] and an exhaustive search of the parameter space. While the search space can be bounded by limiting the number of string-based transformations, this can negatively impact the accuracy. Consider the source and target tables in Figure 1, where different rows may require different formatting rules. To transform all rows in the source to their corresponding target values, six distinct textual transformations may be needed, as illustrated in Example 1. Some studies [10, 40, 53] limit their search space and find a single transformation that covers all input rows, which will not be effective in this scenario. Other studies [13, 29] can produce more than one transformation, but the problem of selecting a transformation from the set to apply to an arbitrary row is left unanswered. For instance, Nobari et al. [29] provide a set of transformations that are required for a mapping but do not provide much hint on how to select a transformation for an input row. Furthermore, many state-of-the-art methods [10, 29, 40, 53] exhaustively search the transformation space, and despite their pruning strategies, their runtimes increase dramatically when the input size grows [29].

Our Approach. In this paper, we introduce Deep Tabular Transformer (DTT), a novel framework for transforming tabular data into a joinable format using the power of deep learning for language modeling. Unlike traditional approaches that rely on a limited set of pre-defined string-based transformations and an exhaustive search process, DTT overcomes these limitations by leveraging advanced deep learning techniques. DTT predicts an expected output row in the target table for each row of the source table, enabling easy and efficient data joining. Our experimental results show that DTT outperforms existing state-of-the-art approaches in terms of accuracy, is applicable to a larger set of tables, and maintains an outstanding runtime performance even when dealing with large input sizes. Remarkably, the performance of DTT is at par with or better than large language models such as GPT-3, despite having an order of magnitude fewer parameters and requiring dramatically fewer resources during inference. We are releasing DTT as a pretrained model, which demonstrates exceptional performance across multiple domains without the need for fine-tuning. Our hope is that this release will drive further advancements in the field.

Our contributions can be summarized as follows:
(1) We propose DTT, a novel example-driven approach for tabular data transformation leveraging pretrained language models.
(2) We develop a diverse dataset for training our model, comprising synthetic examples. Our experiments demonstrate that our model performs exceptionally well on real-world data from various domains.
(3) We present an end-to-end framework for tabular data transformation which includes a decomposer, serializer, model, and aggregator. As an application of our framework, we demonstrate its effectiveness in table joining.
(4) We conduct an extensive evaluation on a wide range of datasets from different domains and show that our approach outperforms existing state-of-the-art baselines in terms of both accuracy and runtime.
(5) We make all our resources, including our code, framework, pretrained model, synthetic data generator, and real-world benchmarks, publicly available for the research community (see footnote 1).

1 https://github.com/arashdn/dtt

2 PROBLEM DEFINITION

We want to transform tables from a source formatting to a target formatting using a few provided examples. Let 𝑆 = {𝑠_1, 𝑠_2, . . .} denote a set of values in the source. For a small subset 𝑆′ ⊂ 𝑆, let 𝐸 = {(𝑠_i, 𝑡_i) | 𝑠_i ∈ 𝑆′} denote a set of 𝑘 examples where the target values are given to guide the process of finding a transformation. The aim is to find the target formatting of every value in the source, i.e.

    𝑅 = {(𝑠_i, 𝑓(𝑠_i)) | 𝑠_i ∈ 𝑆 ∧ ∀𝑠_j ∈ 𝑆′ ((𝑠_j, 𝑓(𝑠_j)) ∈ 𝐸)}.    (1)

As an example, suppose we have a source table 𝑆 that lists the recent prime ministers of Canada and an example set 𝐸 that consists of three rows:

    S = {‘Justin Trudeau’, ‘Stephen Harper’, ‘Paul Martin’, ‘Jean Chretien’, ‘Kim Campbell’},
    E = {(‘Justin Trudeau’, ‘jtrudeau’), (‘Stephen Harper’, ‘sharper’), (‘Paul Martin’, ‘pmartin’)}.

Our aim is to find the target formatting for any arbitrary value in 𝑆. For instance, the values ‘Jean Chretien’ and ‘Kim Campbell’ may be mapped to ‘jchretien’ and ‘kcampbell’ respectively.

Tables can be transformed for joinability, for example, allowing a column of a source table to be joined with a column in the target. Tables may also be transformed to fill in missing values in a target column. In both cases, 𝑆 can be the set of all values in the source column. In this study, we assume the source values and examples are provided. This is a common practice to limit the scope of the problem and focus on data transformations [29]. If user-provided examples are not available, an unequal joining method [14, 24, 46, 47] or token-based example generation [29, 53] may be used to provide a set of examples, with the caveat that the automatically generated examples may contain noise and invalid pairs. We will discuss how our approach can deal with such noisy examples.

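To make the notation above concrete, the following Python sketch instantiates 𝑆 and 𝐸 from the prime-minister example; learn_mapping is a hypothetical stand-in for an example-driven model such as DTT, shown here with a hand-written rule that merely happens to be consistent with 𝐸.

# Minimal sketch of the problem setup in Section 2 (illustrative only).
# `learn_mapping` is a hypothetical placeholder, not part of the released DTT code.

S = ["Justin Trudeau", "Stephen Harper", "Paul Martin", "Jean Chretien", "Kim Campbell"]
E = [("Justin Trudeau", "jtrudeau"),
     ("Stephen Harper", "sharper"),
     ("Paul Martin", "pmartin")]

def learn_mapping(examples):
    """Stand-in for a learned transformation f: a hand-written rule
    (first initial + lowercase last name) that is consistent with E."""
    def f(value):
        parts = value.split()
        return (parts[0][0] + parts[-1]).lower()
    return f

f = learn_mapping(E)
assert all(f(s) == t for s, t in E)   # f must agree with the provided examples
R = [(s, f(s)) for s in S]            # target formatting for every source value
# R includes ('Jean Chretien', 'jchretien') and ('Kim Campbell', 'kcampbell')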

3 BACKGROUND AND RELATED WORK

Our work is related to the lines of work on (1) example-driven tabular data transformation, (2) language modeling and text-to-text transformers, and (3) language models applied to tabular data. We review the relevant recent works in these areas while also providing some background (when necessary).

3.1 Example-Driven Tabular Data Transformation

This is probably the closest line of work to ours. There are numerous studies in this area [10, 11, 29, 40, 53], and FlashFill [10] and BlinkFill [40] are among the pioneers, with a focus on spreadsheet data. These two approaches construct an input graph based on a given set of user-provided examples, which is then traversed to generate a sequence of substring-based textual transformations that map source values to their corresponding targets in the input examples. However, FlashFill and BlinkFill heavily rely on the accuracy of the provided examples and are unable to handle noise in the examples. To address this issue, Zhu et al. propose a method called Auto-join [53], which uses a set of pre-defined string-based transformation units, such as substring and split, to describe the transformations. The examples are automatically generated by token matching, and the method creates several subsets of the input examples to handle noise. A recursive backtracking algorithm is then applied to each subset to find the best transformation. While Auto-join is able to handle minor noise in the input and limits the search space by using pre-defined transformation units, it is a backtracking method and needs to search the entire transformation space in the worst case, which can be computationally expensive. Also, it may not perform well if the noise level in the examples is significant.

In a more recent work by Nobari et al., referred to as Common String-based Transformer (CST), the search space for string-based transformations is further constrained by considering common text sequences between source and target examples as textual evidence to form the skeleton of transformations. CST uses the same string-based transformation units as Auto-join, but transformations for each row are generated independently to better handle input noise. The transformations are then ranked based on their coverage to build a final transformation set. While CST offers better noise handling and runtime performance compared to Auto-join, it is still limited to substring-based transformation units and performs well only when long matching sequences exist between source and target examples. While the pruning conditions in Auto-join and CST limit their search space and improve their runtime, they can end up missing some transformations, particularly those that cannot be covered by a small set of pre-defined transformation units. Our aim is to overcome these limitations on the search space by utilizing a Language Model (LM) to transform source values into the desired target representation.

3.2 Language Modeling and Text-to-Text Transformers

With large language models forming an integral component of our framework, we provide a brief background of those models. A vast majority of machine-learned LMs are based on the concept of masked language modeling, where some tokens in a given sequence are masked (or corrupted), and the model is trained to predict those masked tokens. Word2Vec [28] and GloVe [31] are among the earliest models for pretraining, which generate static vectorized embeddings of each word using a shallow neural network. A later work, ELMo [32], uses two layers of bidirectional LSTM [17] to observe the context before and after the word and generates contextualized embeddings of the words, unlike the static embeddings in Word2Vec. In recent years, Vaswani et al. [44] introduced transformers that use self-attention [44], allowing the model to parallelize better than LSTM models and not giving more weight to nearby words. Transformer-based models consist mostly of an encoder, a decoder [41], or both. Encoder-only models, such as BERT [9], aim to learn natural language and generate a latent representation of the input that can be used for tasks requiring an understanding of the input. Decoder-only models, such as GPT-2 [35] and GPT-3 [4], are widely used to generate natural language text given a context. Finally, encoder-decoder models, also referred to as sequence-to-sequence or text-to-text models, such as T5 [36] and BART [23], use an encoder to create a latent representation of the input, which is passed to the decoder to generate a new text for a desired task.

3.3 Language Models Applied to Tabular Data

The rise of pretrained language models has led to their increasing use in various tasks, including those involving tabular data. In particular, these models are being applied to tasks such as entity matching [2, 26], text to SQL [22, 45], question answering [5, 6, 18, 50] and data to text [21, 30, 43]. Since many deep learning and NLP models can only process data as a sequence of tokens, table serialization has become a common module in many of these tasks. Several serialization techniques have been developed to transform tables into sequences of tokens [8, 18, 42, 50] while preserving the structural relationships that may be needed for these tasks. Since the relationships that need to be preserved can be task-dependent, various serialization methods are used in the literature. For example, Iida et al. [18] pass the rows and columns as two separate sequences to two transformer blocks and average the row and column values for each cell to generate a cell representation. In RPT [42], tables are serialized using two special tokens, [A] and [V], to encode attribute names and their corresponding values respectively. While this serialization keeps the structural information about the tables, it is not very efficient, as the attribute names are repeated in each row of the table. Our aim is not to generate a dense representation of the entire table, and this requires a different serialization approach, which we discuss in Section 4.1.

Figure 2: The architecture of the framework. The decomposer and serializer build serialized contexts from source rows and example pairs (e.g., s1<tr>t1<eoe>s2<tr>t2<eoe>s4<tr>), which are tokenized into byte-id sequences and passed to the model; the aggregator combines the candidate targets (t41, t42, ...) into a final prediction for each row.

4 APPROACH

As depicted in Figure 2, our framework consists of a few components: (1) a decomposer and serializer, which decomposes the problem into smaller subtasks and performs an input serialization; (2) a tokenizer, which performs tokenization to obtain a vectorized representation of the input; (3) a sequence-to-sequence model, which predicts an output for each subtask; and (4) an aggregator, which is responsible for combining the predictions of the subtasks to generate a final prediction. In the rest of this section, we discuss the details of those components.

4.1 Decomposer and Serializer

Given a set of rows 𝑆 from a source table and an example set 𝐸 of source-target row pairs, the aim is to transform every row 𝑠_i ∈ 𝑆 to a target based on the examples. Many large language models impose a limit on the length of the input. This limit usually varies from 512 tokens (e.g. BERT) to 2048 tokens (e.g. GPT-3) and is associated with the quadratic memory requirement of the self-attention mechanism with respect to input length [51]. However, the encoding of a table can be much longer for large tables and when there are many examples. To reduce this dependency on input length, we decompose the problem into smaller tasks, with each task being small enough to easily fit the input length requirement of many language models. This decomposition process is discussed in this section, and the aggregation of the results is discussed in Section 4.3.

Suppose the number of examples that describe the context in a sub-problem is set to two. For any arbitrary row in the source table, any subset of 𝐸 of size two can be selected as the context. Let 𝐸_2 denote the set of all subsets of 𝐸 of size two, i.e.

    𝐸_2 = {((𝑠_1, 𝑜_1), (𝑠_2, 𝑜_2)) | (𝑠_1, 𝑜_1) ∈ 𝐸 ∧ (𝑠_2, 𝑜_2) ∈ 𝐸 ∧ 𝑠_1 < 𝑠_2}.    (2)

For each input row 𝑠_i ∈ 𝑆 to be transformed, there are |𝐸_2| possible contexts that can be chosen. As an example, consider the sets 𝑆 and 𝐸 in Section 2. The set 𝐸_2 of all subsets of size two of 𝐸 will be

    E_2 = { <(‘Justin Trudeau’, ‘jtrudeau’), (‘Stephen Harper’, ‘sharper’)>,
            <(‘Justin Trudeau’, ‘jtrudeau’), (‘Paul Martin’, ‘pmartin’)>,
            <(‘Paul Martin’, ‘pmartin’), (‘Stephen Harper’, ‘sharper’)> },

and an encoding of the input ‘Jean Chretien’ ∈ 𝑆 using one of these contexts is <(‘Justin Trudeau’, ‘jtrudeau’), (‘Paul Martin’, ‘pmartin’), (‘Jean Chretien’,>. Each input row 𝑠_i can be fed to the model multiple times, each time with a different context. If the input is passed to the model 𝑛 times, each time with a different context, the model will predict 𝑛 possible targets, which can be aggregated as discussed in Section 4.3.

It is common to use special tokens to mark the beginning and the end of sentences in natural language, as it helps with training a model. The same convention is followed in encoding tabular data to describe the relationships between different input fields. Following this convention, we separate the source and target in an example with a <tr> token and two examples with <eoe>. We also mark the beginning of the input with <sos> and the end of the input with <eos>. With these symbols, our example given earlier can be encoded as: <sos>Justin Trudeau<tr>jtrudeau<eoe>Paul Martin<tr>pmartin<eoe>Jean Chretien<tr><eos>, and the expected label is <sos>jchretien<eos>.

In general, the size of a sub-problem can vary depending on the lengths of the records in the source and target, the length limitation of the large language model being employed, and maybe the complexity of the transformations. In our case, each example consists of two rows, a source and a target. Assuming that the input consists of 𝑘 examples and a source row to be transformed, and using a language model that takes 512 tokens, the length of each row is limited to ⌊512/(2𝑘 + 1)⌋ tokens, ignoring special tokens and separators. Also, more complex transformations generally require more examples to better describe the operations. For instance, consider the example (‘john junior turner’, ‘jturner’) in the context of the example given in Section 2. With only one example, one cannot tell if the letter j in the target is derived from ‘john’ or ‘junior’. However, with two examples, there is less chance of an ambiguity. Unless explicitly stated otherwise, we set the number of examples in our contexts to two.

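As a concrete illustration of the decomposition and serialization just described, the short Python sketch below enumerates the two-example contexts 𝐸_2 and builds the serialized input for a source row using the <sos>, <tr>, <eoe>, and <eos> markers. The function names are ours, not part of the released DTT code.

from itertools import combinations

# Sketch of the decomposer and serializer of Section 4.1 (function names are ours).
SOS, EOS, TR, EOE = "<sos>", "<eos>", "<tr>", "<eoe>"

def contexts(E, size=2):
    """All subsets of the example set E of a given size (the set E_2 when size=2)."""
    return list(combinations(E, size))

def serialize(context, source_row):
    """Serialize a context (a list of (source, target) examples) plus the row to transform."""
    parts = [f"{s}{TR}{t}" for s, t in context]
    return SOS + EOE.join(parts) + EOE + source_row + TR + EOS

E = [("Justin Trudeau", "jtrudeau"),
     ("Stephen Harper", "sharper"),
     ("Paul Martin", "pmartin")]

for ctx in contexts(E, 2):                      # |E_2| = 3 contexts here
    print(serialize(ctx, "Jean Chretien"))
# e.g. <sos>Justin Trudeau<tr>jtrudeau<eoe>Paul Martin<tr>pmartin<eoe>Jean Chretien<tr><eos>
# The expected label for this input would be <sos>jchretien<eos>.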

4.2 Tokenizer and Model

As large language models expect vectors as inputs, the input needs to be tokenized and each token assigned a vector. Even though tokenization is considered the first step in many NLP pipelines, the choice of tokenization is less obvious when working with tabular data. Conventional NLP models (e.g. word2vec, GloVe) use vocabulary words as the unit for tokenization, and out-of-vocabulary tokens are usually all collapsed into an unknown token. Recent deep models utilize a more efficient tokenization that may better handle the morphological structure of words, out-of-vocabulary words, or low-resource languages. Transformer-based language models [23, 34–36] generally use either WordPiece [37] or Byte Pair Encoding (BPE) [38] tokenizations, where more frequent consecutive byte pairs are represented as new symbols in the vocabulary (analogous to text compression techniques). In both cases, words are broken down into subword tokens based on the frequency of each subword in the language. As a result, common words are represented as a single token whereas rare words are broken down into multiple tokens.

However, in our problem setting, a subword-level tokenizer may not be the right choice. Subword tokenizers are mainly optimized to pick subwords of natural language or words in the input domain, while in tabular data the values can be from any domain, including words not from a specific language. Understanding the meaning and semantics of the words or splitting them into smaller meaningful parts is not essentially helpful in predicting the output, as each character may independently contribute to the output value. For instance, consider the pair (“Justin Trudeau”, “J.Trudeau”), where the first character J in Justin produces the letter J in J.Trudeau. In a different example pair, (“Justin Trudeau”, “Trudeau, Justin”), the word Justin is used in the output as a single token. A pretrained subword-level tokenizer may not be the best choice for such input tokenization. A similar problem arises in low-resource languages that lack enough data for training a tokenizer. It has been shown that character- or byte-level tokenizers work better in those settings [48, 49]. On the same basis, we adopt the byte-level tokenizer of ByT5 [48] in our work. Recent work has shown that byte-level models are competitive with their subword-level counterparts [48], especially in tasks dealing with short-length texts.

Generally, table cells store short-length content, and our serialization technique also generates short-length contexts with only two examples. Taking this into account, we use a byte-level UTF-8 encoder as the tokenizer, which benefits from the accuracy of character-level tokenizers and maintains a relatively short input length passed to the model.

With the input represented as a sequence of tokens, the problem becomes a text-to-text transformation, where a suitable choice for the model architecture is a sequence-to-sequence model that comprises an encoder and a decoder block. Recent models stack the same number of transformer [44] layers for both the encoder and the decoder. However, it has been shown that when the input is a sequence of characters, using a deeper encoder containing more layers of transformers, referred to as an unbalanced architecture, performs better than a balanced model [48]. ByT5 [48] is a recent byte-level text-to-text model with an encoder block three times deeper than the decoder block, which we use as a starting point for the training process. Unlike the original model, which masks parts of the output, we mask all characters in the target, and the training objective is to predict the masked bytes. The decoder is an auto-regressive decoder, and only the initial token, <sos>, is passed to the decoder. In the next sections, we delve into the details of passing the input and predicting an output.

4.3 Aggregator

We have decomposed the problem of transforming a table into a set of smaller tasks (see Section 4.1), where each of these tasks is carried out using a sequence-to-sequence model, as discussed in Section 4.2. To exploit all provided examples in the prediction process, each input is fed into the model multiple times, each time with a different context. If we denote the number of trials with 𝑛, the model will predict 𝑛 target candidates, denoted as 𝑂_i = {𝑜_i1, . . . , 𝑜_in}, for each row 𝑠_i ∈ 𝑆 in the source.

In an ideal setting where there is no noise or inconsistency in the data and the model performs with no error, all of the predicted values for a specific source 𝑠_i should be the same, i.e. 𝑜_i1 = 𝑜_i2 = . . . = 𝑜_in. However, inaccurate examples, noisy input rows, and inconsistencies among multiple rows can lead to different predictions for a particular source row. It should be noted that due to the limitations in the model's input length, it is not feasible to pass the entire example set to the model; instead, we create various subsets, each of which is treated as an independent problem. While noise in the examples may affect the output in some subsets, we ensemble the outputs generated under different contexts to obtain the best possible output. Consequently, the predicted target 𝑡_i for the source 𝑠_i can be estimated as

    𝑡_i = argmax_{𝑜_ij ∈ 𝑂_i} 𝑃(𝐶_i | 𝑜_ij),    (3)

where 𝐶_i ⊆ 𝐶 is a subset of contexts that may include example sets that are relevant to source 𝑠_i, and 𝐶_i may also be limited in size, for example to 𝑛. By applying Bayes' theorem, we have

    𝑃(𝐶_i | 𝑜_ij) = 𝑃(𝑜_ij | 𝐶_i) 𝑃(𝐶_i) / 𝑃(𝑜_ij).

Assuming a uniform prior probability 𝑃(𝑜_ij) for the predictions and treating 𝑃(𝐶_i) the same for all predictions, these terms can be ignored, and 𝑃(𝐶_i | 𝑜_ij) ∝ 𝑃(𝑜_ij | 𝐶_i) can be used as a proxy for finding the argmax. Also, assuming independence among predictions, it is possible to use the maximum likelihood estimate of 𝑃(𝑜_ij | 𝐶_i), i.e.

    𝑡_i = argmax_{𝑜_ij ∈ 𝑂_i} 𝑃(𝑜_ij | 𝐶_i) ∝ |𝑜_ij| / |𝑂_i|,    (4)

where |𝑜_ij| is the frequency of 𝑜_ij in 𝑂_i and |𝑂_i| is the number of possible predictions.

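In practice, the maximum-likelihood aggregation of Equation (4) reduces to a frequency (majority) vote over the candidates predicted under different contexts. A minimal sketch, assuming the per-context predictions for one source row have already been collected:

from collections import Counter

def aggregate(candidates):
    """Pick the candidate target with the highest frequency among the n trials,
    i.e. the argmax of |o_ij| / |O_i| as in Equation (4)."""
    counts = Counter(candidates)
    best, freq = counts.most_common(1)[0]
    return best, freq / len(candidates)

# Five trials for one source row, each produced with a different two-example context.
O_i = ["jchretien", "jchretien", "j.chretien", "jchretien", "chretien"]
t_i, support = aggregate(O_i)
print(t_i, support)   # jchretien 0.6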

4.4 Downstream Tasks

Given a source row and a set of source-target example pairs, the proposed model generates a target row following the examples. This framework can be useful in many downstream tasks such as auto-completion and auto-filling of spreadsheets [10, 40], predicting missing values, error correction [12], and joining [29, 53]. In this section, we review two particular tasks: (1) filling missing values, and (2) joining heterogeneous tables.

4.4.1 Filling missing values. It is common to have missing values in real-world tables [33], and sometimes those missing values can be predicted from the values of other columns [16]. Consider a scenario where two columns 𝑠 and 𝑡 are given, and column 𝑡 has some missing or incorrect values. Those columns can be from the same or different tables. If there exists a mapping relationship from 𝑠 to 𝑡, our approach may be used to fill in the missing values. In this case, the given samples in 𝑠 and 𝑡 where the values of both columns are present may serve as examples from which a mapping is learned. The examples can then be utilized by the model as context to find the missing or incorrect values in 𝑡.

4.4.2 Joining heterogeneous tables. Consider a join scenario where a source table 𝑆 and a target table 𝑇 must be joined, based on some columns that are not formatted the same but for which there is a mapping from source to target. Examples of mappings may be provided by the user or obtained automatically [29]. The model can be invoked to generate a target 𝑓(𝑠_i) for each source row 𝑠_i, utilizing the examples as discussed earlier. Unlike the case of filling missing values, where an exact prediction of the missing value is needed, an exact prediction is not necessary for a join. Instead, the goal is to use 𝑓(𝑠_i) as a bridge to establish a connection between rows in 𝑆 and 𝑇. For instance, under the setting where each value in the source is matched with a single value in the target (e.g. a primary-foreign key relationship), one needs to find the closest match in 𝑇 for 𝑓(𝑠_i). This allows small discrepancies between predicted and target values without affecting the join. There is a great chance that a significant string similarity exists between a model-predicted value 𝑓(𝑠_i) and the corresponding value in 𝑇. In many cases, this similarity is enough to perform the join. Therefore, for each (𝑠_i, 𝑓(𝑠_i)) pair, we can select 𝑡_j ∈ 𝑇 such that it yields the minimum edit distance between the two strings. This can be formalized as follows:

    𝑚_i = argmin_{𝑡_j ∈ 𝑇} edit_dist(𝑓(𝑠_i), 𝑡_j)    (5)

where 𝑚_i is considered a match in the target for 𝑠_i. The approach can be generalized to cases where a value in the source is matched with either no values or multiple values in the target. To allow such many-to-many joins, one may set lower and upper bounds for the edit distance instead of aiming for the minimum distance.

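The matching step in Equation (5) can be sketched as follows: for each model prediction 𝑓(𝑠_i), the closest value in the target column is selected by edit (Levenshtein) distance. The helper below is a plain dynamic-programming implementation added here for illustration; any edit-distance library could be used instead.

def edit_dist(a, b):
    """Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def closest_match(prediction, target_column):
    """Equation (5): the target value with minimum edit distance to the prediction."""
    return min(target_column, key=lambda t: edit_dist(prediction, t))

T = ["jchretien", "kcampbell", "jtrudeau"]
print(closest_match("jchret1en", T))   # 'jchretien', despite one wrong character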

5 EXPERIMENTS AND ANALYSIS

In this section, we evaluate our proposed model and analyze its performance under different settings. We also discuss our training data generation and the process of training our model.

5.1 Dataset for Training DTT

Pretrained language models are generally trained on large text corpora, and a common approach for training them involves masking out a word and predicting it, as seen in popular models such as T5 [36], BERT [9] and GPT-2 [35]. By using this approach, large amounts of unlabeled data can be utilized to train the model. Nonetheless, our particular task requires a vast set of source and target examples, grouped according to the transformations that map source examples to their corresponding targets. To the best of our knowledge, such a dataset is not currently available, and our experiments have shown that even advanced generative models pretrained on natural language text, such as T5 and GPT-2, are not capable of performing this task without extensive fine-tuning and training. This is because entries in real-world tables are typically short and have little relevance to other entries in the same column, aside from sharing the same domain (e.g. individual names). As a result, the prior language knowledge of these general models is less likely to be useful for this task. To address this challenge, we propose generating synthetic data to train the model. Before delving into the details of data generation, however, it is important to first review the desired features of the training data.

5.1.1 Training data features. The training data must possess several key features. First, it should be organized as source-target pairs, categorized by their corresponding transformations, as previously discussed. It is worth noting that the mapping function can be general, as the model does not need to know the function itself; rather, it only requires the output of the function for the source examples, and that the examples generated by the same mapping are grouped together. Second, the dataset must be sufficiently large to train a model with hundreds of millions of parameters, which is typical for many language models. Third, the dataset should cover a broad range of textual transformation patterns and various input lengths. Finally, the generated data should not be limited to the words of any specific language, since many terms in table cells are not dictionary words and may not be limited to a particular language.

Overall, the primary purpose of the training data in our case is guiding the model to understand the mapping corresponding to a set of source-target example pairs. In this context, different combinations of edit operations can be performed on the source examples to generate the target outputs. Unlike NLP models that rely on understanding the syntax and semantics of the input, our model primarily focuses on discovering textual patterns and string operations. Hence, character-level tokenization is preferred in our case. In the rest of this section, we delve into the process of generating a synthetic dataset to train our model.

5.1.2 Training data generation. To generate our synthetic dataset, we first build a set of textual transformations, denoted as 𝑇, each consisting of a sequence of basic transformation units. We use the basic transformation units in Nobari et al. [29], which include substring, split, lowercase, uppercase, and literal. These units have their natural meanings: substring selects a portion of the input based on start and end parameters, split breaks the input by a given character and selects one part, literal returns a constant, and lowercase and uppercase return the lowercase and uppercase forms of the input respectively. Each unit copies either parts of the input or a literal to the output, and the output of a transformation is the concatenation of the outputs of its units. We randomly choose the units, their parameters, and the length of each transformation in terms of the number of units, to provide a diverse set of examples. While the aforementioned transformations are expected to cover many cases of transformations in real-world settings [29, 53], our aim is not to limit the model to learning a fixed set of transformations. Our findings indicate that, with sufficient independent training examples, the model can learn to find any necessary transformation even with a limited set of pre-defined transformations. The construction of transformations mainly helps us group input examples that follow the same mapping, but the model is not aware of the transformations and uses regular cross-entropy loss at the character level to learn a mapping that transforms the source into the target.

The transformations in Nobari et al. [29] and Zhu et al. [53] do not allow stacking of the units, where one unit is applied on top of another unit. For the same reason, they introduce complex transformation units such as splitsubstring, which stacks substring on top of split, with the output of one operation fed to the other. Instead of introducing many such new units, we allow random stacking of up to three transformation units. The stacking here refers to passing the output of one transformation unit to another one. Since our units include lowercase and uppercase transformations, the case of the input may change in some transformations and not others.

For each transformation 𝑡𝑟 ∈ 𝑇, a set of examples is generated. To create these examples, a source text is randomly generated, consisting of a mix of alphabetic and numeric characters, symbols, and special characters. The length of the input is selected at random. The transformation 𝑡𝑟 is then applied to the source texts to generate a set of examples, denoted as 𝐼_tr = {(𝑠_i, 𝑡_i)}, 1 ≤ i ≤ u. Using random text instead of dictionary words avoids any potential bias towards natural language words and grammatical structures. To form example sets, subsets of size 3 are selected from 𝐼_tr. Each example set is then serialized, as discussed in Section 4.1, with the target of the last example masked and labeled as the target, for use in forming context sets for model training.

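The following sketch mirrors this data-generation procedure in simplified form: random source strings are produced, a transformation is assembled from randomly parameterized units, and each transformation yields a group of source-target pairs. The alphabet, parameter ranges, and helper names are illustrative choices, not the exact settings used to train DTT.

import random, string

ALPHABET = string.ascii_letters + string.digits + "-_/., "

def random_source(min_len=8, max_len=35):
    return "".join(random.choices(ALPHABET, k=random.randint(min_len, max_len)))

def random_unit():
    """One basic transformation unit: substring, split, lowercase, uppercase, or literal."""
    kind = random.choice(["substring", "split", "lowercase", "uppercase", "literal"])
    if kind == "substring":
        start, end = sorted(random.sample(range(8), 2))
        return lambda s: s[start:end]
    if kind == "split":
        sep, part = random.choice("-_/., "), random.randint(0, 2)
        return lambda s: (s.split(sep) + ["", "", ""])[part]
    if kind == "lowercase":
        return lambda s: s.lower()
    if kind == "uppercase":
        return lambda s: s.upper()
    literal = "".join(random.choices(string.ascii_lowercase, k=3))
    return lambda s: literal

def random_transformation(max_units=4):
    units = [random_unit() for _ in range(random.randint(1, max_units))]
    return lambda s: "".join(u(s) for u in units)   # concatenate the units' outputs

tr = random_transformation()
group = [(src, tr(src)) for src in (random_source() for _ in range(10))]
# `group` is one grouping of examples that all follow the same (hidden) mapping.

Stacking of units (feeding one unit's output into another, up to three deep) and the exact parameter distributions are omitted here for brevity.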

Table 1: Performance compared to heterogeneous join baselines

           Our Approach                           CST                      AFJ
Dataset    P      R      F      AED     ANED      P      R      F         P      R      F
WT         0.951  0.950  0.950  6.155   0.232     0.879  0.726  0.713     0.935  0.672  0.708
SS         0.954  0.952  0.953  2.399   0.135     0.995  0.792  0.812     0.943  0.662  0.691
Syn        0.934  0.934  0.934  6.986   0.150     0.990  0.259  0.324     0.993  0.490  0.511
Syn-RP     1.000  1.000  1.000  0.816   0.027     1.000  0.816  0.897     1.000  1.000  1.000
Syn-ST     0.880  0.880  0.880  5.032   0.316     1.000  1.000  1.000     1.000  1.000  1.000
Syn-RV     0.632  0.632  0.632  33.600  0.852     1.000  0.000  0.000     0.990  0.020  0.037

5.2 Dataset for Evaluation

To evaluate the effectiveness of our approach and compare its performance with state-of-the-art baselines, we use two real-world datasets as well as four synthetic datasets. In what follows, we provide a detailed explanation of each dataset.

Web Tables Dataset (WT). This benchmark was initially introduced by Zhu et al. [53] and was also used as a benchmark in Nobari et al. [29]. The dataset includes 31 pairs of tables from 17 distinct topics, with an average of 92.13 rows per table and an average length of 31 characters per input source. The tables were sampled from Google Fusion tables by identifying tables that appear in the results of the same queries but are formatted differently. This benchmark contains natural noise and inconsistencies, and not all entities can be transformed using traditional string-based transformations, which makes this dataset a relatively challenging benchmark [29].

Spreadsheet Dataset (SS). This dataset includes 108 pairs of tables, sourced from the Microsoft Excel product team and user help forums, specifically focused on users' data cleaning issues. The tables are comprised of spreadsheet pages that present the same information in different formats. The dataset encompasses the public benchmarks presented in FlashFill [10] and BlinkFill [40], and was published in the 2016 Syntax-Guided Synthesis Competition (SyGuS-Comp) [3]. On average, each table in the dataset contains 34.43 rows and 19 characters per input source. Compared to web tables, this dataset features considerably less noise and inconsistency.

General Synthetic Dataset (Syn). This is a synthetic dataset that contains 10 table pairs. Each pair is generated by applying a randomly generated textual transformation to a set of random input sources to create the output table. The transformations are constructed by putting together a random sequence of 3 to 6 units, the same as those discussed in Section 5.1, with random parameter sets. Unless stated differently, the dataset contains 10 tables, each of which contains 100 rows. Input length is randomly chosen in the range of 8 to 35, and no artificial noise is added to the dataset. While the model has been exposed to the units during training, the transformations, the parameter sets of the units, and the inputs are unseen during the training process.

Easy Synthetic Dataset (Syn-RP). This is a synthetic dataset containing 5 pairs of tables. Each pair is formed by randomly replacing one character with another (for example, the character '/' might be replaced with '-' for all rows). This dataset resembles simple formatting changes such as replacing a slash in a phone number with a hyphen. This replacement operation is not a transformation unit that exists in the model's training data and is thus unseen by the trained model. Each table contains 50 rows, and the length of input sources is randomly selected from a range of 8 to 35, unless stated otherwise. Since our model generates the output character by character, we measure the difficulty of datasets based on the number of required edit operations. Accordingly, this is an easy dataset, considering that only a few characters in the input need to be changed to generate the desired output.

Medium Synthetic Dataset (Syn-ST). This synthetic dataset is similar to the previous one in terms of the number of table pairs and the input length. Each table pair is constructed by applying a single substring transformation unit to the input, with the start and end parameters selected randomly. Substring is one of the units included in the model's training data. In terms of difficulty, this dataset is considered to be medium-level based on the number of edit operations required.

Difficult Synthetic Dataset (Syn-RV). This synthetic dataset consists of 5 tables, each containing 50 rows with input sources randomly selected to have a length between 8 and 35 characters. In this dataset, the target output is obtained by reversing all characters in the source (for instance, "Hello" is changed to "olleH"). This benchmark is considered difficult since almost all characters in the input source must be changed to generate the expected target.

5.3 Experimental Setup

Our model, DTT, was trained on a synthetic dataset containing 2000 groupings, each corresponding to a transformation, as discussed in Section 5.1. For each grouping, we generated 10 pairs of source-target samples with randomly chosen input lengths ranging from 8 to 35. 80% of the samples were used for training and the other 20% formed the validation set. We also conducted experiments with other sample sizes and input lengths for training the model, and the results are discussed in Section 5.7.

To evaluate the performance of our model, we divided the rows of each input table in our datasets into two equal-sized sets, denoted as 𝑆_e and 𝑆_t. The former provided context examples to be passed to the model, while the latter was used for testing. Since DTT is an example-driven method, the selection of these examples is critical to the model's performance. To ensure the robustness of our predictions, we employ a technique where each input is fed to the model five times, and each time a distinct set of randomly chosen examples from 𝑆_e is given as context. The results of those trials are aggregated, as discussed in Section 4.3, to produce a final prediction.

5.4 Evaluation Metrics

We evaluate the performance of our models based on precision, recall, and F1-score. This evaluation is in the context of heterogeneous join, as discussed in Section 4.4, where for a given source-target sample (𝑠, 𝑡), we consider a model prediction correct if it has the minimum edit distance with the target 𝑡. In our case, precision represents the fraction of correct predictions that join with the target, recall measures the fraction of source rows that are correctly mapped, and F1-score is the harmonic mean of the two. It is important to note that not all source rows may be mapped, due to various reasons (see footnote 2). In addition to the above metrics, we also report the Average Edit Distance (AED) and Average Normalized Edit Distance (ANED), which indicate the extent to which a prediction may differ from the given target. The normalization is performed based on the target length, enabling comparability across different datasets and lengths. All reported metrics for each dataset are the average over all tables in the dataset.

2 For AFJ, a threshold for similarity distance is set, and based on that threshold, some source rows will not have a match. In CST, a match may still not be found after applying the detected transformations to all input rows. The language models may just return <eos> with no prediction.

5.5 Performance Compared to Heterogeneous Join Baselines

In this section, we evaluate the performance of our model on the end-to-end task of heterogeneous or unequal table join. The task simulates the scenario where source and target columns are in two different tables that need to be joined. To provide a point of reference, we compare the performance of our model to two current state-of-the-art baselines: Common String-based Transformer (CST) [29] and Auto-FuzzyJoin (AFJ) [24]. CST finds a set of textual transformations given a set of examples to transform tables for joinability, and AFJ uses a set of similarity functions to detect the most probable rows to be joined.

Table 1 summarizes the performance of DTT and the baselines in terms of precision, recall, and F1-score (denoted as P, R, and F respectively). The results show that DTT outperforms the baselines on all real-world datasets in terms of F1-score and recall. On the synthetic datasets, our approach outperforms the baselines on three out of four datasets. On the Syn-RP and Syn-ST datasets, our approach is either comparable to or slightly worse than the baselines. The reason is that these datasets are relatively easy, with a significant textual similarity between the source and target. CST exhaustively searches the space for substring transformations, which is the only transformation used in the Syn-ST dataset. Moreover, AFJ is based on textual similarity, and every target in Syn-ST is a substring of the source, leading to a significant similarity between source and target. Therefore, these datasets favor the baselines. Nevertheless, DTT still achieves an F1-score of 88% on the Syn-ST dataset and a perfect F1-score of 100% on Syn-RP, which is equal to AFJ and better than CST.

There are significant differences between DTT and the baselines. CST is limited in its ability to extract transformations and cannot perform a join when there is no clear copying relationship between the source and target, as is the case with the Syn-RV dataset, where the target is obtained by reversing the input. As a result, CST achieves a 0% F1-score on this dataset. AFJ, on the other hand, employs similarity functions to determine if source and target values can be joined. However, this method struggles when there is not much similarity between the source and target, as demonstrated by its performance on the Syn-RV dataset. Such challenges are common in real-world data. DTT, in contrast, leverages the provided examples to generate the desired output without relying on textual similarity or being bounded by the length of transformations. Hence, DTT performs significantly better than the baselines on more challenging datasets, such as the real-world WT dataset and the synthetic Syn and Syn-RV datasets. For instance, DTT outperforms the baselines by a large margin on Syn-RV, where the target is obtained by reversing the order of characters in the input.

Two more interesting observations can be made here. Firstly, to achieve a good performance on the join, it is not necessary to predict every single character correctly. Our framework can tolerate inaccuracies by aggregating results from multiple examples and using an edit-distance-based metric to form join candidates. For example, on the Syn-RV dataset, while the average normalized edit distance is more than 80%, the F1-score for join prediction is 63%. Secondly, our model performs very well on all real-world datasets and on the two synthetic datasets Syn-RP and Syn-RV, despite the fact that our training data did not include any operation that simulates reversing the input or replacing a character, and no real-world examples or transformations were included in the training data. This highlights that the model is not limited to a given set of transformations, but rather focuses on extracting character-level patterns from the given set of input examples.

Finally, in terms of comparing the runtime of DTT and our baselines, a direct comparison is not possible since DTT requires a GPU architecture whereas our baselines require CPU and memory. Nevertheless, some observations can be made on the scalability of the models. The time required to predict a mapping for each row in DTT is independent of the number of rows and grows linearly with the length of the rows, whereas this time grows quadratically with the number of rows and polynomially with the length in CST. While the edit distance calculation in the joining process depends on the number of rows, our experiments suggest the growth in the runtime of DTT is noticeably less than that of CST when the input length increases. For instance, with our machine setup (see footnote 3), processing a table with row length set to 5 characters from our synthetic dataset takes 5 seconds for DTT and 3 seconds for CST. However, when the input length increases to 50 characters, DTT needs less than 17 seconds, while CST takes around 90 seconds to complete the join. It should be noted that the runtimes reported for DTT are the summation of decomposition, all five trials, and the aggregation time. For scalability in terms of the number of rows, we compared their performance on two tables from our spreadsheet dataset, “phone-10-short” and “phone-10-long”, both with an average of 17 characters per row. The former has 7 rows, while the latter has 100. DTT takes 3 and 22 seconds respectively for the short and long tables, while the same experiments require 4 and 366 seconds for CST, and 4 and 38 seconds for AFJ. This indicates how our framework scales better in terms of runtime when the input grows either horizontally or vertically.

3 Our experiments were conducted on a machine with an Nvidia RTX 3090 GPU and an AMD EPYC 7601 CPU with 64GB RAM.

Figure 3: Performance compared to GPT-3 as well as the performance of the combined model (F1-Score per dataset: WT, SS, Syn, Syn-RP, Syn-ST, Syn-RV; series: DTT-2e, GPT3-1e, GPT3-DTT-1e, GPT3-2e, GPT3-DTT-2e)

increases. For instance, with our machine setup 3 , processing a be extremely powerful, fast, and capable of many advanced tasks6 .
table with row length set to 5 characters from our synthetic dataset Nevertheless, the model specification and the number of parameters
takes 5 seconds for DTT and 3 seconds for CST. However, when are not publicly announced. Comparing the general performance
the input length increases to 50 characters, DTT needs less than of the Curie model with the performance reported for various sizes
17 seconds, while CST takes around 90 seconds to complete the join. It should be noted that the runtimes reported for DTT are the summation of the decomposition, all 5 trials, and the aggregation time. For scalability in terms of the number of rows, we compared their performance on two tables from our spreadsheet dataset, "phone-10-short" and "phone-10-long", both with an average of 17 characters per row. The former has 7 rows, while the latter has 100. DTT takes 3 and 22 seconds respectively for the short and long tables, while the same experiments require 4 and 366 seconds for CST, and 4 and 38 seconds for AFJ. This indicates how our framework scales better in terms of runtime when the input grows either horizontally or vertically.

5.6 Performance Compared to Large Language Model Baselines

Large Language Models (LLMs) can be employed in many downstream tasks, including joining heterogeneous tables. It has been shown that recent models perform relatively well under zero-shot or few-shot settings [4, 5], hence they set a strong baseline. In this section, we compare the performance of our model to GPT-3 [4], a state-of-the-art LLM with exceptional performance on many tasks. Compared to our ByT5-base [48] model, which is fine-tuned on only 20,000 synthetically generated samples and contains nearly 582M parameters, GPT-3 models are trained on billions of documents and resources such as web tables and have at least one to two orders of magnitude more parameters. Our experiment with GPT-3 is under a few-shot setting with 1, 2, 3, 5, and 10 randomly selected samples from each table given as examples. A zero-shot setting is not applicable in our case since an input source row can be mapped to an unlimited number of targets without any examples.

At the time of writing, GPT-3 models are not published publicly and are only accessible through the OpenAI commercial API⁴. We use the Curie⁵ model of GPT-3 from the API. Curie is claimed to [...]⁶ of GPT-3 [4]; it can be assumed that the Curie model has about 7B parameters and is trained on a huge amount of text data from different datasets.

We run two sets of experiments to analyze the performance of GPT-3 for unequal join. First, as the common method of using LLMs, we pass our examples as an input sequence to GPT-3 and consider the model output as the expected target. The serialization used for GPT-3 is the same as in DTT, as discussed in Section 4.1. In the second experiment, we use GPT-3 as a replacement for our fine-tuned ByT5 model (and byte-level tokenizer) inside our framework, keeping the serializer and aggregator from DTT. Figure 3 depicts the F1-score of the model with 1 and 2 examples under both experimental settings for all datasets compared to DTT, and Table 2 reports the F1-score and ANED of the GPT-3 model for 1, 2, 3, and 5 input examples.

As shown in Figure 3, GPT-3 struggles to perform well on the task with just one example, despite some recent work suggesting that LLMs are capable of one-shot table reasoning [5]. However, providing two examples significantly boosts its performance, especially on real-world data, bringing it on par with DTT. On our synthetic datasets, however, DTT performs significantly better than GPT-3.

The lack of publicly available data on the GPT-3 Curie model's size and specification makes it challenging to conduct a more in-depth comparison. It can be noted that GPT-3 is trained on numerous web resources, including web tables, which increases the likelihood that it has encountered various representations of common entities and tables on the web. Since our real-world datasets are gathered from tables on web pages, this could explain why the model performs significantly better on the WT and SS datasets than on the synthetic datasets. Conversely, the synthetic datasets consist of sequences of random characters that may not be tokens from natural language, and GPT-3 may not have encountered them during its training. Consequently, its performance on most of the synthetic datasets is weak and, in some cases, significantly inferior to DTT, especially on the Syn-RV dataset, where the target and source are substantially different. Our ByT5-based model, however, is trained to extract patterns among a sequence of characters, allowing it to perform better on the more challenging synthetic datasets.

³ Our experiments were conducted on a machine with an Nvidia RTX 3090 GPU and an AMD EPYC 7601 CPU with 64GB RAM.
⁴ https://openai.com
⁵ The complete model name is "text-curie-001".
⁶ Based on platform documentation on openai.com.
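To make the first GPT-3 setup of Section 5.6 concrete, the sketch below serializes a few source-target examples into a single prompt and reads the completion as the predicted target, using the legacy (pre-1.0) OpenAI Python client. The "=>" pairing and the sample rows are illustrative assumptions rather than the exact serializer described in Section 4.1.

    # Sketch of the first GPT-3 baseline: few-shot prompting with text-curie-001.
    # The prompt layout below is illustrative; DTT's own serializer differs.
    import openai  # legacy (pre-1.0) client

    openai.api_key = "YOUR_API_KEY"  # placeholder

    def transform_with_gpt3(examples, source, model="text-curie-001"):
        """examples: list of (source, target) pairs; source: the row to transform."""
        lines = [f"{s} => {t}" for s, t in examples]
        lines.append(f"{source} =>")
        response = openai.Completion.create(
            model=model,
            prompt="\n".join(lines),
            max_tokens=64,
            temperature=0.0,   # deterministic output helps exact-match joins
            stop=["\n"],       # stop after the predicted target
        )
        return response["choices"][0]["text"].strip()

    # Hypothetical usage: a name-to-username style mapping
    # transform_with_gpt3([("Jane Doe", "j.doe"), ("Alan Turing", "a.turing")],
    #                     "Grace Hopper")   # expected: "g.hopper"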
Table 2: Performance of GPT-3 as well as that of the combined model (GPT-3 used inside the DTT framework). Each cell reports the F1-score (F) and the average normalized edit distance (ANED); the suffix "-Ne" gives the number of in-context examples.

Dataset    GPT3-1e        GPT3-2e        GPT3-3e        GPT3-5e        GPT3-DTT-1e    GPT3-DTT-2e    GPT3-DTT-3e    GPT3-DTT-5e
           F      ANED    F      ANED    F      ANED    F      ANED    F      ANED    F      ANED    F      ANED    F      ANED
WT         0.625  0.499   0.933  0.151   0.954  0.108   0.966  0.088   0.759  0.341   0.979  0.072   0.985  0.074   0.987  0.073
SS         0.724  0.533   0.949  0.128   0.973  0.094   0.968  0.079   0.760  0.483   0.960  0.113   0.973  0.079   0.982  0.056
Syn        0.372  0.889   0.502  0.619   0.528  0.522   0.614  0.418   0.380  0.902   0.506  0.567   0.552  0.495   0.720  0.387
Syn-RP     0.264  0.824   0.920  0.195   0.976  0.127   0.984  0.111   0.352  0.748   0.968  0.125   1.000  0.098   1.000  0.095
Syn-ST     0.152  0.941   0.328  0.812   0.464  0.726   0.728  0.527   0.176  0.923   0.488  0.717   0.736  0.589   0.728  0.510
Syn-RV     0.120  0.947   0.112  0.944   0.112  0.944   0.144  0.940   0.152  0.944   0.104  0.948   0.120  0.944   0.146  0.939
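As a quick sanity check on Table 2, macro-averaging the per-dataset F1 scores reproduces the summary averages quoted in this section (0.624 versus 0.667 with two examples, and 0.734 versus 0.760 with five). The short Python sketch below simply copies the F columns from the table; the dictionary keys mirror the column labels.

    # Macro-average of the per-dataset F1 scores copied from Table 2.
    f1 = {
        "GPT3-2e":     [0.933, 0.949, 0.502, 0.920, 0.328, 0.112],
        "GPT3-DTT-2e": [0.979, 0.960, 0.506, 0.968, 0.488, 0.104],
        "GPT3-5e":     [0.966, 0.968, 0.614, 0.984, 0.728, 0.144],
        "GPT3-DTT-5e": [0.987, 0.982, 0.720, 1.000, 0.728, 0.146],
    }
    for setting, scores in f1.items():
        print(setting, sum(scores) / len(scores))
    # prints roughly 0.624, 0.667, 0.734, and 0.760 (up to rounding)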

In our second set of experiments with GPT-3, we used our framework and replaced the LLM module with GPT-3. By default, our framework employs two context example pairs because ByT5 has a maximum limit of 512 character-level tokens, and if a longer sequence is given to the model, it will be truncated. However, the limit in the GPT-3 Curie model is 2048 subword-level tokens. This allows us to increase the number of example pairs that are given to the model. In our experiment with GPT-3 integrated into the DTT framework, we varied the number of examples from one to five. As demonstrated in Figure 3 and Table 2, using GPT-3 within our framework boosts its performance, in terms of both the F1-score and ANED, on nearly all datasets when the same number of examples is provided. For instance, the average F1-score of the GPT-3 model across all datasets increased from 0.624 to 0.667 with two examples and from 0.734 to 0.760 with five examples when integrated into the DTT framework. This demonstrates how the model inside the DTT framework can be substituted with other, larger models and gain a performance boost.

5.7 Performance Varying the Number and Length of Training Samples

Our trained model has two important parameters: the number of training samples and their length. To gain a deeper insight into the relationship between these parameters and the model's performance, we conducted an experiment where we varied the number of training samples from 0 to 10,000. Each sample here is a grouping of transformations that consists of 10 source-target pairs, and we kept the sequence length consistent with our other experiments, ranging between 8 and 35. When the number of samples was set to zero, the ByT5 model did not undergo any fine-tuning.

As shown in the top left panel of Figure 4, the F1-score of the model is typically less than 0.5 when no fine-tuning is performed. Also, on all datasets, over 80% of characters are predicted incorrectly (i.e., ANED > 0.8) when the model is not fine-tuned, as indicated by the 0 training samples in the figure. For example, on the Syn-ST dataset, over 84% of output characters are predicted incorrectly by the ByT5 model without fine-tuning. However, this error is reduced to 27% after a proper fine-tuning of the model. This finding suggests that, unlike GPT-3, the ByT5 model without fine-tuning struggles to perform well for unequal join. Nevertheless, our fine-tuning plays a crucial role in significantly improving the performance of the model.

The general expectation is for the model to perform better when more training samples are provided, and the trend in our experiments is not much different. However, some observations should be taken into account. As shown in Figure 4, when the number of training samples surpasses 2,000⁷, the model performance does not significantly change, and it reaches its optimal performance level on our datasets. Beyond this point, a slight decrease in performance can be observed on real-world data and on synthetic datasets that contain transformations not covered in the training data. This behavior can be attributed to the bias that the model acquires from seeing more transformations of the same type, which hinders its ability to effectively use its prior knowledge on real-world data. Our extensive experiments show that even with a significantly larger training dataset, the decrease in performance is not significant. Thus, the model performance converges when 2,000 or more training samples are provided.

To examine how the length of the input affects the training process of the model, we conducted another experiment where we changed the length range of the training samples by randomly selecting values between 5 and 60 characters. The right panels of Figure 4 show the performance when the model is trained with sequences that are generally longer and have an extended range. Increasing the length range of the input sample pairs does not lead to any noticeable improvement in the performance of the model. That being said, increasing the length is expected to have an impact on how the model performs on longer inputs, which is discussed next.

5.8 Performance Varying the Input Length

In this section, we explore how the model performs under different input lengths. We also investigate how the length of the input data during training affects the model's ability to handle longer inputs during inference time.

To conduct our experiments, we regenerated the synthetic datasets Syn-RP, Syn-ST, and Syn-RV, this time with the input lengths varying from 5 to 50 characters. We utilized two versions of the model in our analysis. The first version was trained on input examples with lengths randomly sampled between 8 and 35, while the second version was trained on examples with extended lengths selected randomly between 5 and 60 characters.

⁷ This number refers to the number of transformation groupings, and it translates to 20,000 source-target examples, of which 16,000 are used for training and the remaining 4,000 are kept as the validation set.
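Since ANED drives much of the analysis in Sections 5.6 through 5.8, a brief sketch of the metric may help. The paper's exact definition appears earlier in the text; the snippet below assumes the common form, namely the Levenshtein distance divided by the length of the longer of the two strings, averaged over all predicted/expected pairs, so that 0 means a perfect match and values near 1 mean almost every character is wrong.

    # Hedged sketch of ANED (average normalized edit distance), assuming the
    # common definition: edit distance / length of the longer string, averaged.
    def levenshtein(a: str, b: str) -> int:
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            curr = [i]
            for j, cb in enumerate(b, 1):
                curr.append(min(prev[j] + 1,                 # deletion
                                curr[j - 1] + 1,             # insertion
                                prev[j - 1] + (ca != cb)))   # substitution
            prev = curr
        return prev[-1]

    def aned(predictions, targets):
        dists = [levenshtein(p, t) / max(len(p), len(t), 1)
                 for p, t in zip(predictions, targets)]
        return sum(dists) / len(dists)

    # e.g. aned(["j.smth", "a.turing"], ["j.smith", "a.turing"]) is about 0.07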
[Figure 4: Performance of the model varying the number of training data samples. Four panels: (a) F1 score for models trained on shorter-length data; (b) F1 score for models trained on longer-length data; (c) normalized edit distance for models trained on shorter-length data; (d) normalized edit distance for models trained on longer-length data. The x-axis is the number of training samples (0 to 10,000), with one curve per dataset (WT, SS, Syn, Syn-RP, Syn-ST, Syn-RV).]

As shown in Figure 5, when the benchmark dataset is easy in terms of the edit distance between source and target, such as Syn-RP, the performance of the model is not significantly influenced by the length of the input. Both the model trained on short input examples and the model trained on long input examples deliver the highest F1-score and near-zero edit distance across almost all input lengths. On the medium-difficulty dataset (i.e., Syn-ST), the models start with almost perfect performance, and this performance is sustained for input lengths that are shorter than the length of the majority of samples used in training. However, the performance begins to decrease once the input length surpasses this threshold. Nonetheless, even with an increase in ANED, the model still manages to predict a reasonable portion of the expected characters in the output. Interestingly, the drop in performance does not occur when the model is trained on longer input samples. It should be noted that if the input is too short (not a typical real-world scenario), such as when it contains only 5 characters, there may be a slight decrease in performance, as the model may not fully comprehend the relationship between the source and target with such limited information.

On the other hand, on more challenging datasets, such as Syn-RV, the performance drops even for input lengths that are shorter than the majority of training samples. This behavior is not unexpected for auto-regressive models, since a single incorrect prediction can influence the prediction of subsequent characters. The results of our experiment suggest that the extent of the decrease in performance is influenced by the training data of the model. When trained on shorter-length data, the performance degrades significantly in terms of both F1-score and ANED as the input length increases. However, when trained on lengthier data, the degradation is relatively minimal. Overall, such cases are not very common in real-world datasets, and our experiments demonstrate that our model can perform well under various input lengths in real-world settings. Based on our experiments, we can assume that the model can accurately detect transformation patterns for various input lengths when the difficulty level of the transformation is reasonable.
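To make the length-controlled benchmarks of Section 5.8 concrete, the sketch below shows one way to generate source-target pairs whose lengths are drawn from a given range. The actual Syn-RP, Syn-ST, and Syn-RV generators are defined earlier in the paper and are not reproduced here; string reversal is only a stand-in transformation, and the alphabet and counts are assumptions.

    # Hedged sketch: generating synthetic pairs with input lengths drawn from a
    # range, as in Section 5.8. Reversal stands in for a Syn-RV-style mapping.
    import random
    import string

    def make_pairs(n_rows, min_len, max_len, transform, seed=0):
        rng = random.Random(seed)
        alphabet = string.ascii_lowercase + string.digits
        pairs = []
        for _ in range(n_rows):
            length = rng.randint(min_len, max_len)
            source = "".join(rng.choice(alphabet) for _ in range(length))
            pairs.append((source, transform(source)))
        return pairs

    # Evaluation data with input lengths varying from 5 to 50 characters:
    eval_pairs = make_pairs(100, 5, 50, lambda s: s[::-1])
    # Training data for the two model variants compared in Figure 5:
    short_train = make_pairs(20000, 8, 35, lambda s: s[::-1])   # lengths 8-35
    long_train = make_pairs(20000, 5, 60, lambda s: s[::-1])    # lengths 5-60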
[Figure 5: Performance of the model varying the input length. Four panels: (a) F1 score for models trained on shorter-length data; (b) F1 score for models trained on longer-length data; (c) normalized edit distance for models trained on shorter-length data; (d) normalized edit distance for models trained on longer-length data. The x-axis is the input length (5 to 50 characters), with one curve per dataset (Syn-RP, Syn-RV, Syn-ST).]

6 CONCLUSION AND FUTURE WORK

We have studied the problem of mapping tabular data from a source formatting to a desired target formatting using a set of few examples. Tables may be transformed to enable joining heterogeneous tables, filling missing values, data corrections, and other data integration tasks. To address this challenge, we proposed a framework that leverages the power of large language models. We generated the required training data and fine-tuned a character-level LLM based on ByT5 for this task. Our extensive experiments demonstrate that our model achieves impressive performance on a wide range of real-world and synthetic datasets, outperforming state-of-the-art models in the field.

Our work suggests several possible avenues for future research. One potential direction is to explore the use of synthetic data generation to enhance model training for a variety of data integration tasks. Additionally, there is value in investigating the challenges and limitations of synthetic data in model training, as well as strategies for addressing those challenges. Furthermore, given concerns around privacy, federated learning may be a preferred approach for table transformation tasks. As such, an exploration of federated learning methods for this purpose is yet another promising direction for future research.

REFERENCES
[1] Mehdi Akbarian Rastaghi, Ehsan Kamalloo, and Davood Rafiei. 2022. Probing the Robustness of Pre-trained Language Models for Entity Matching. In Proceedings of the 31st ACM International Conference on Information & Knowledge Management. 3786–3790.
[2] Mehdi Akbarian Rastaghi, Ehsan Kamalloo, and Davood Rafiei. 2022. Probing the Robustness of Pre-Trained Language Models for Entity Matching. In Proceedings of the 31st ACM International Conference on Information & Knowledge Management (Atlanta, GA, USA) (CIKM '22). Association for Computing Machinery, New York, NY, USA, 3786–3790. https://doi.org/10.1145/3511808.3557673
[3] Rajeev Alur, Dana Fisman, Rishabh Singh, and Armando Solar-Lezama. 2016. SyGuS-Comp 2016: Results and Analysis. Electronic Proceedings in Theoretical Computer Science 229 (Nov 2016), 178–202.
[4] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language Models are Few-Shot Learners. In Advances in Neural Information Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin (Eds.), Vol. 33. Curran Associates, Inc., 1877–1901.
[5] Wenhu Chen. 2022. Large language models are few(1)-shot table reasoners. arXiv preprint arXiv:2210.06710 (2022).
[6] Wenhu Chen, Ming-Wei Chang, Eva Schlinger, William Wang, and William W Cohen. 2020. Open question answering over tables and text. arXiv preprint arXiv:2010.10439 (2020).
[7] Dong Deng, Guoliang Li, Shuang Hao, Jiannan Wang, and Jianhua Feng. 2014. MassJoin: A MapReduce-based method for scalable string similarity joins. In 2014 IEEE 30th International Conference on Data Engineering. 340–351. https://doi.org/10.1109/ICDE.2014.6816663
[8] Xiang Deng, Huan Sun, Alyssa Lees, You Wu, and Cong Yu. 2020. TURL: Table Understanding through Representation Learning. Proc. VLDB Endow. 14, 3 (2020), 307–319.
[9] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT. Association for Computational Linguistics, 4171–4186.
[10] Sumit Gulwani. 2011. Automating String Processing in Spreadsheets Using Input-Output Examples. In Proceedings of the 38th Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (Austin, Texas, USA) (POPL '11). Association for Computing Machinery, New York, NY, USA, 317–330.
[11] Sumit Gulwani, William R. Harris, and Rishabh Singh. 2012. Spreadsheet Data Manipulation Using Examples. Commun. ACM 55, 8 (Aug. 2012), 97–105.
[12] Jian He, Enzo Veltri, Donatello Santoro, Guoliang Li, Giansalvatore Mecca, Paolo Papotti, and Nan Tang. 2016. Interactive and deterministic data cleaning. In Proceedings of the 2016 International Conference on Management of Data. 893–907.
[13] Yeye He, Xu Chu, Kris Ganjam, Yudian Zheng, Vivek Narasayya, and Surajit Chaudhuri. 2018. Transform-data-by-example (TDE): an extensible search engine for data transformations. Proceedings of the VLDB Endowment 11, 10 (2018), 1165–1177.
[14] Yeye He, Kris Ganjam, and Xu Chu. 2015. SEMA-JOIN: Joining Semantically-Related Tables Using Big Table Corpora. Proc. VLDB Endow. 8, 12 (Aug. 2015), 1358–1369.
[15] Yeye He, Kris Ganjam, Kukjin Lee, Yue Wang, Vivek Narasayya, Surajit Chaudhuri, Xu Chu, and Yudian Zheng. 2018. Transform-Data-by-Example (TDE): Extensible Data Transformation in Excel. In Proceedings of the 2018 International Conference on Management of Data (Houston, TX, USA) (SIGMOD '18). Association for Computing Machinery, New York, NY, USA, 1785–1788.
[16] Alireza Heidari, Joshua McGrath, Ihab F Ilyas, and Theodoros Rekatsinas. 2019. HoloDetect: Few-shot learning for error detection. In Proceedings of the 2019 International Conference on Management of Data. 829–846.
[17] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long Short-Term Memory. Neural Computation 9, 8 (1997), 1735–1780.
[18] Hiroshi Iida, Dung Thai, Varun Manjunatha, and Mohit Iyyer. 2021. TABBIE: Pretrained Representations of Tabular Data. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, Online, 3446–3456.
[19] Zhongjun Jin, Michael R. Anderson, Michael Cafarella, and H. V. Jagadish. 2017. Foofah: Transforming Data By Example. In Proceedings of the 2017 ACM International Conference on Management of Data (Chicago, Illinois, USA) (SIGMOD '17). Association for Computing Machinery, New York, NY, USA, 683–698.
[20] Zhongjun Jin, Yeye He, and Surajit Chauduri. 2020. Auto-Transform: Learning-to-Transform by Patterns. Proc. VLDB Endow. 13, 12 (July 2020), 2368–2381.
[21] Mihir Kale and Abhinav Rastogi. 2020. Text-to-Text Pre-Training for Data-to-Text Tasks. In Proceedings of the 13th International Conference on Natural Language Generation. Association for Computational Linguistics, Dublin, Ireland, 97–102. https://aclanthology.org/2020.inlg-1.14
[22] George Katsogiannis-Meimarakis and Georgia Koutrika. 2023. A survey on deep learning approaches for text-to-SQL. The VLDB Journal (2023), 1–32.
[23] Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2020. BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Online, 7871–7880.
[24] Peng Li, Xiang Cheng, Xu Chu, Yeye He, and Surajit Chaudhuri. 2021. Auto-FuzzyJoin: Auto-Program Fuzzy Similarity Joins Without Labeled Examples. In Proceedings of the 2021 International Conference on Management of Data (Virtual Event, China) (SIGMOD/PODS '21). Association for Computing Machinery, New York, NY, USA, 1064–1076.
[25] Yuliang Li, Jinfeng Li, Yoshihiko Suhara, AnHai Doan, and Wang-Chiew Tan. 2020. Deep Entity Matching with Pre-Trained Language Models. Proc. VLDB Endow. 14, 1 (Sep 2020), 50–60.
[26] Yuliang Li, Jinfeng Li, Yoshihiko Suhara, AnHai Doan, and Wang-Chiew Tan. 2020. Deep Entity Matching with Pre-Trained Language Models. Proc. VLDB Endow. 14, 1 (Oct 2020), 50–60. https://doi.org/10.14778/3421424.3421431
[27] P. McBrien and A. Poulovassilis. 2003. Data integration by bi-directional schema transformation rules. In Proceedings 19th International Conference on Data Engineering (Cat. No.03CH37405). 227–238.
[28] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed Representations of Words and Phrases and their Compositionality. In Advances in Neural Information Processing Systems, C.J. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K.Q. Weinberger (Eds.), Vol. 26. Curran Associates, Inc.
[29] Arash Dargahi Nobari and Davood Rafiei. 2022. Efficiently Transforming Tables for Joinability. In 2022 IEEE 38th International Conference on Data Engineering (ICDE). IEEE, 1649–1661.
[30] Ankur Parikh, Xuezhi Wang, Sebastian Gehrmann, Manaal Faruqui, Bhuwan Dhingra, Diyi Yang, and Dipanjan Das. 2020. ToTTo: A Controlled Table-To-Text Generation Dataset. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, Online, 1173–1186. https://doi.org/10.18653/v1/2020.emnlp-main.89
[31] Jeffrey Pennington, Richard Socher, and Christopher D Manning. 2014. GloVe: Global Vectors for Word Representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). 1532–1543.
[32] Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep Contextualized Word Representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers). Association for Computational Linguistics, New Orleans, Louisiana, 2227–2237.
[33] Abdulhakim A Qahtan, Ahmed Elmagarmid, Raul Castro Fernandez, Mourad Ouzzani, and Nan Tang. 2018. FAHES: A robust disguised missing values detector. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2100–2109.
[34] Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever, et al. 2018. Improving language understanding by generative pre-training. (2018).
[35] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. 2019. Language models are unsupervised multitask learners. OpenAI blog 1, 8 (2019), 9.
[36] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. Journal of Machine Learning Research 21 (2020), 140:1–140:67.
[37] Mike Schuster and Kaisuke Nakajima. 2012. Japanese and Korean voice search. In 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 5149–5152.
[38] Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural Machine Translation of Rare Words with Subword Units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 1715–1725.
[39] A. Simitsis, P. Vassiliadis, and T. Sellis. 2005. Optimizing ETL processes in data warehouses. In 21st International Conference on Data Engineering (ICDE'05). 564–575.
[40] Rishabh Singh. 2016. BlinkFill: Semi-Supervised Programming by Example for Syntactic String Transformations. Proc. VLDB Endow. 9, 10 (June 2016), 816–827.
[41] Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequence to Sequence Learning with Neural Networks. In Advances in Neural Information Processing Systems, Vol. 27. Curran Associates, Inc.
[42] Nan Tang, Ju Fan, Fangyi Li, Jianhong Tu, Xiaoyong Du, Guoliang Li, Sam Madden, and Mourad Ouzzani. 2021. RPT: Relational Pre-Trained Transformer is Almost All You Need towards Democratizing Data Preparation. Proc. VLDB Endow. 14, 8 (2021), 1254–1261.
[43] James Thorne, Majid Yazdani, Marzieh Saeidi, Fabrizio Silvestri, Sebastian Riedel, and Alon Halevy. 2021. From natural language processing to neural databases. In Proceedings of the VLDB Endowment, Vol. 14. VLDB Endowment, 1033–1039.
[44] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is All you Need. In Advances in Neural Information Processing Systems, Vol. 30. Curran Associates, Inc.
[45] Bailin Wang, Richard Shin, Xiaodong Liu, Oleksandr Polozov, and Matthew Richardson. 2020. RAT-SQL: Relation-Aware Schema Encoding and Linking for Text-to-SQL Parsers. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Online, 7567–7578. https://doi.org/10.18653/v1/2020.acl-main.677
[46] Jin Wang, Chunbin Lin, and Carlo Zaniolo. 2019. MF-Join: Efficient Fuzzy String Similarity Join with Multi-level Filtering. In 2019 IEEE 35th International Conference on Data Engineering (ICDE). 386–397.
[47] Yue Wang and Yeye He. 2017. Synthesizing Mapping Relationships Using Table Corpus. In Proceedings of the 2017 ACM International Conference on Management of Data (Chicago, Illinois, USA) (SIGMOD '17). Association for Computing Machinery, New York, NY, USA, 1117–1132.
[48] Linting Xue, Aditya Barua, Noah Constant, Rami Al-Rfou, Sharan Narang, Mihir Kale, Adam Roberts, and Colin Raffel. 2022. ByT5: Towards a Token-Free Future with Pre-trained Byte-to-Byte Models. Transactions of the Association for Computational Linguistics 10 (2022), 291–306.
[49] Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, and Colin Raffel. 2021. mT5: A Massively Multilingual Pre-trained Text-to-Text Transformer. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 483–498.
[50] Pengcheng Yin, Graham Neubig, Wen-tau Yih, and Sebastian Riedel. 2020. TaBERT: Pretraining for Joint Understanding of Textual and Tabular Data. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Online, 8413–8426.
[51] Manzil Zaheer, Guru Guruganesh, Kumar Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, et al. 2020. Big Bird: Transformers for longer sequences. Advances in Neural Information Processing Systems 33 (2020), 17283–17297.
[52] Chen Zhao and Yeye He. 2019. Auto-EM: End-to-End Fuzzy Entity-Matching Using Pre-Trained Deep Models and Transfer Learning. In The World Wide Web Conference (San Francisco, CA, USA) (WWW '19). Association for Computing Machinery, New York, NY, USA, 2413–2424.
[53] Erkang Zhu, Yeye He, and Surajit Chaudhuri. 2017. Auto-join: Joining tables by leveraging transformations. Proceedings of the VLDB Endowment 10, 10 (2017), 1034–1045.
