A Survey On Text-to-SQL Parsing: Concepts, Methods, and Future Directions
arXiv:2208.13629v1 [cs.CL] 29 Aug 2022
Abstract—Text-to-SQL parsing is an essential and challenging task. The goal of text-to-SQL parsing is to convert a natural language (NL) question to its corresponding structured query language (SQL) based on the evidence provided by relational databases. Early text-to-SQL parsing systems from the database community achieved noticeable progress at the cost of heavy human engineering
and user interactions with the systems. In recent years, deep neural networks have significantly advanced this task by neural
arXiv:2208.13629v1 [cs.CL] 29 Aug 2022
generation models, which automatically learn a mapping function from an input NL question to an output SQL query. Subsequently, the
large pre-trained language models have taken the state-of-the-art of the text-to-SQL parsing task to a new level. In this survey, we
present a comprehensive review of deep learning approaches for text-to-SQL parsing. First, we introduce the text-to-SQL parsing corpora, which can be categorized as single-turn and multi-turn. Second, we provide a systematic overview of pre-trained language
models and existing methods for text-to-SQL parsing. Third, we present readers with the challenges faced by text-to-SQL parsing and
explore some potential future directions in this field.
Index Terms—Text-to-SQL Parsing, Semantic Parsing, Natural Language Understanding, Table Understanding, Deep Learning
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING

• B. Qin, L. Wang and M. Yang are with Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China, 518055. B. Qin and L. Wang are also with University of Chinese Academy of Sciences, Beijing, China, 101408. E-mail: {lh.wang1, bw.qin, min.yang}@siat.ac.cn
• B. Hui, B. Li, R. Geng, R. Gao, J. Sun, L. Si, F. Huang and Y. Li are with Alibaba Group, Beijing, China. E-mail: {binyuan.hby, binhua.lbh, ruiying.gry, caorongyu.cry, jian.sun, luo.si, f.huang, shuide.lyb}@alibaba-inc.com
• J. Li is with The University of Hong Kong, Hong Kong. E-mail: [email protected]
Min Yang and Yongbin Li are corresponding authors.

1 INTRODUCTION

With the popularity of electronic devices, tables have become the mainstream way to store large structured data from various resources (e.g., webpages, databases and spreadsheets). Tables represent the data in a grid-like format of rows and columns so that users can easily query the patterns and discover insights from the data. Although tables can be efficiently accessed by skilled professionals via handcrafted structured query language (SQL), a natural language (NL) interface can make the ubiquitous relational data accessible to a wider range of non-technical users [1]. Therefore, text-to-SQL parsing, which aims to translate NL questions into machine-executable SQL, has attracted noticeable attention from both industrial and academic communities. It can empower non-expert users to effortlessly query tables, and it plays a central role in various real-life applications such as intelligent customer service, question answering, and robotic navigation.

Early text-to-SQL parsing work [2] from the database community made noticeable progress at the cost of heavy human engineering and user interactions with the systems. It is difficult, if not impossible, to design SQL templates in advance for various scenarios or domains. In recent years, advances in deep learning and the availability of large-scale training data have significantly improved text-to-SQL parsing through neural generation models. A typical neural generation method is the sequence-to-sequence (Seq2Seq) [3] model, which automatically learns a mapping function from the input NL question to the output SQL under encoder-decoder schemes. The key idea is to construct an encoder to understand the input NL question together with the related table schema, and to leverage a grammar-based neural decoder to predict the target SQL. The Seq2Seq-based approaches have become the mainstream for text-to-SQL parsing mainly because they can be trained in an end-to-end way and reduce the need for specialized domain knowledge.

So far, various neural generation models have been developed to improve the encoder and the decoder respectively. On the encoder side, several general neural networks are widely used to globally reason over the natural language query and the database schema. IRNet [4] encoded the question and the table schema separately with a bi-directional LSTM [5] and a self-attention mechanism [6]. RYANSQL [7] employed a convolutional neural network [8] with dense connection [9] for question/schema encoding. With the advance of pre-trained language models (PLMs), SQLova [10] first proposed to leverage PLMs such as BERT [11] as the base encoder. RATSQL [12], SADGA [13] and LGESQL [14] adopted graph neural networks to encode the relational structure between the database schema and a given question. On the decoder side, there are two categories of SQL generation approaches: the sketch-based methods and the generation-based methods. Specifically, the sketch-based methods [15, 10, 16] decompose the SQL generation procedure into sub-modules, where each sub-module corresponds to the type of the prediction slot to be filled. These sub-modules are later gathered together to generate the final SQL query. To enhance the performance of the generated SQL logic form,
the generation-based methods [4, 12, 14, 17] usually decode the SQL query as an abstract syntax tree in depth-first traversal order by employing an LSTM [5] decoder.

In parallel, PLMs have proved to be powerful in enhancing text-to-SQL parsing and yield state-of-the-art performances, benefiting from the rich linguistic knowledge in large-scale corpora. However, as revealed in previous works, there are intrinsic differences between the distribution of tables and plain texts. Directly fine-tuning PLMs trained on large-scale plain texts on downstream text-to-SQL parsing hinders the models from effectively modeling the relational structure in the question/schema, and thus leads to sub-optimal performances. Current studies that alleviate the above limitation attempt to build Tabular Language Models (TaLMs) by directly encoding tables and texts together, which show improved results on downstream text-to-SQL parsing tasks. For example, TaBERT [18] jointly encoded texts and tables with masked language modeling (MLM) and masked column prediction (MCP) respectively, and was trained on a large corpus of tables and their corresponding English contexts. TaPas [19] extended BERT [11] by using additional positional embeddings to encode tables. In addition, two classification layers are applied to choose table cells and aggregation operators which operate on the table cells. Grappa [20] introduced a grammar-augmented pre-training framework for table semantic parsing, which explores schema linking by encouraging the model to capture table schema items that can be grounded to logical form constituents. Grappa achieved state-of-the-art performances for text-to-SQL parsing.

Contributions of this survey. This manuscript aims at providing a comprehensive review of the literature on text-to-SQL parsing, as shown in Fig. 1. By providing this survey, we hope to provide a useful resource for both academic and industrial communities. First, we introduce the experimental datasets and present a taxonomy that classifies the representative text-to-SQL approaches. Moreover, we present readers with the challenges faced by text-to-SQL parsing and explore some potential future directions in this field.

This manuscript is organized as follows. In Section 2, we define text-to-SQL parsing formally and introduce the official evaluation metrics. Section 3 presents the main scenarios (single-turn and multi-turn utterances) and the corresponding datasets for text-to-SQL parsing. We introduce the representative pre-training, encoding and decoding techniques for text-to-SQL parsing in Section 4 and Section 5 respectively. Section 6 concludes this manuscript and outlines the future directions, followed by the references.

2 BACKGROUND

In this section, we first provide a formal problem definition of text-to-SQL parsing. Then, we describe the official evaluation metrics for verifying text-to-SQL parsers. Finally, we introduce the benchmark corpora used for training neural text-to-SQL parsers.

2.1 Task Formulation

Text-to-SQL (T2S) parsing aims to convert a natural language (NL) question over database items to its corresponding structured query language (SQL) query that can be executed against a relational database. As shown in Table 1, we provide formal notation to normalise the task definitions. Generally, existing T2S parsing approaches can be categorized into single-turn (context-independent) and multi-turn (context-dependent) settings. Formally, for the single-turn T2S parsing setting, given an NL question $Q$ and the corresponding database schema $S = \langle T, C \rangle$, our goal is to generate a SQL query $Y$. To be specific, the question $Q = q_1, q_2, \cdots, q_{|Q|}$ is a sequence of $|Q|$ tokens. The database schema consists of $|T|$ tables $T = t_1, t_2, \cdots, t_{|T|}$ and $|C|$ columns $C = c_1, c_2, \cdots, c_{|C|}$. Each table $t_i$ is described by its name, which contains multiple words $[t_{i,1}, t_{i,2}, \cdots, t_{i,|t_i|}]$. Each column $c_j^{t_i}$ in table $t_i$ is represented by words (a phrase) $[c_{j,1}^{t_i}, c_{j,2}^{t_i}, \cdots, c_{j,|c_j|}^{t_i}]$. We denote the whole input as $X = \langle Q, S \rangle$.

For the multi-turn T2S parsing setting, we aim to convert a sequence of NL questions to the corresponding SQL queries, where the NL questions may contain ellipsis and anaphora that refer to items in the previous NL questions. Formally, let $U = \{U_1, \ldots, U_T\}$ denote a sequence of utterances with $T$ turns, where $U_t = (X_t, Y_t)$ represents the $t$-th utterance, which is the combination of an NL question $X_t$ and a SQL query $Y_t$. In addition, there is a corresponding database schema $S$. At the $t$-th turn, the goal of multi-turn T2S parsing is to produce the SQL query $Y_t$ conditioned on the current NL question $X_t$, the historical utterances $\{U_i\}_{i=1}^{t-1}$, and the database schema $S$.

2.2 Evaluation Metrics

Text-to-SQL (T2S) parsers are generally evaluated by comparing the generated SQL queries against the ground-truth SQL answers. Concretely, two types of evaluation metrics are used for the single-turn T2S setting: exact set match accuracy (EM) and execution accuracy (EX) [23]. For the multi-turn T2S setting, question match accuracy (QM) and interaction match accuracy (IM) [31] are commonly employed.

2.2.1 Single-turn T2S Evaluation

Exact Set Match Accuracy (EM) The exact set match accuracy (without values) is calculated by comparing the ground-truth SQL query and the predicted SQL query. Both ground-truth and predicted queries are parsed into normalized data structures that contain the following SQL clauses: SELECT, GROUP BY, WHERE, ORDER BY, and KEYWORDS (all SQL keywords other than column names and operators).

We treat the predicted SQL query as correct only if all of the SQL clauses are correct under a set comparison:

$$\mathrm{score}(\hat{Y}, Y) = \begin{cases} 1, & \hat{Y} = Y \\ 0, & \hat{Y} \neq Y \end{cases} \qquad (1)$$

where $\hat{Y} = \{(\hat{k}^i, \hat{v}^i), i \in (1, m)\}$ and $Y = \{(k^i, v^i), i \in (1, m)\}$ denote the component sets of the predicted SQL
Fig. 1. A comprehensive overview of the text-to-SQL parsing datasets, the pre-training tabular language models, and the downstream text-to-SQL parsing approaches.
TABLE 1
The notations used in this manuscript.
Symbol Description
S Sequence of database schema tokens, which consists of tables and columns.
T Sequence of table tokens.
C Sequence of column tokens.
Q Sequence of question tokens.
q Question token.
t Table token.
c Column token.
|·| Length of a token sequence.
X Input of text-to-SQL model, which consists of question and schema.
Y Output of text-to-SQL model, referring to SQL query.
I Input sequence of encoder, which consists of special token, question token and schema tokens.
[CLS], [SEP] Special token of PLMs.
u Graph node embedding vector.
r Relation embedding vector.
WK Weight matrix of key vectors, which are used to calculate the attention score.
WQ Weight matrix of query vectors, which are used to calculate the attention score.
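To make the notation in Table 1 concrete, a single-turn instance $X = \langle Q, S \rangle$ with target query $Y$ can be represented with a minimal data container. This is only an illustrative sketch: the class and field names below are hypothetical and do not come from any released text-to-SQL codebase.

```python
from dataclasses import dataclass
from typing import List

# Illustrative containers for a single-turn text-to-SQL instance
# X = <Q, S> with target SQL query Y (all names are hypothetical).

@dataclass
class Schema:
    tables: List[str]    # T = t_1, ..., t_|T|
    columns: List[str]   # C = c_1, ..., c_|C|

@dataclass
class T2SInstance:
    question: List[str]  # Q = q_1, ..., q_|Q|, a sequence of |Q| tokens
    schema: Schema       # S = <T, C>
    sql: str             # Y, the target SQL query

example = T2SInstance(
    question=["which", "model", "has", "the", "largest", "horsepower"],
    schema=Schema(tables=["cars_data"], columns=["id", "model", "horsepower"]),
    sql="SELECT model FROM cars_data ORDER BY horsepower DESC LIMIT 1",
)
```

In the multi-turn setting, an utterance $U_t$ would simply pair such a question with its SQL query, and an interaction would be a list of utterances sharing one schema.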
query and the ground-truth query, respectively. Here, k stands for a SQL clause and v is the corresponding value of the clause, and m is the number of components. Formally, the
TABLE 2
The statistics of the T2S datasets. “#” denotes the number of the corresponding units; “✓” marks the properties of each dataset.

Dataset         | Single-Turn | Multi-Turn | Cross-domain | Robustness | Languages | #Question | #SQL  | #DB   | #Domain | #Table
GenQuery [2]    | ✓           |            |              |            | en        | 880       | 247   | 1     | 1       | 6
Scholar [21]    | ✓           |            |              |            | en        | 817       | 193   | 1     | 1       | 7
WikiSQL [22]    | ✓           |            |              |            | en        | 80654     | 77840 | 26521 | -       | 1
Spider [23]     | ✓           |            | ✓            |            | en        | 10181     | 5693  | 200   | 138     | 1020
Spider-SYN [24] | ✓           |            | ✓            | ✓          | en        | 7990      | 4525  | 166   | -       | 876
Spider-DK [25]  | ✓           |            | ✓            | ✓          | en        | 535       | 283   | 10    | -       | 48
Spider-SSP [26] | ✓           |            |              |            | en        | -         | -     | -     | -       | -
CSpider [27]    | ✓           |            | ✓            |            | zh        | 10181     | 5693  | 200   | 138     | 1020
SQUALL [36]     | ✓           |            | ✓            |            | en        | 15620     | 11276 | 2108  | -       | 2108
DuSQL [28]      | ✓           |            | ✓            |            | zh        | 23797     | 23797 | 200   | -       | 820
ATIS [29, 30]   |             | ✓          |              |            | en        | 5418      | 947   | 1     | 1       | 27
SParC [31]      |             | ✓          | ✓            |            | en        | 4298      | 12726 | 200   | 138     | 1020
CoSQL [32]      |             | ✓          | ✓            |            | en        | 3007      | 15598 | 200   | 138     | 1020
CHASE [33]      |             | ✓          | ✓            |            | zh        | 5489      | 17940 | 280   | -       | 1280
exact set match accuracy is calculated by:

$$\mathrm{EM} = \frac{\sum_{n=1}^{N} \mathrm{score}(\hat{Y}_n, Y_n)}{N}, \qquad (2)$$

where N denotes the total number of samples. EM evaluates the model performance by strictly comparing differences in SQL, but human SQL annotations are often biased since an NL question may correspond to multiple SQL queries.

Execution Accuracy (EX) Execution accuracy (with values) is calculated by comparing the output results of executing the ground-truth SQL query and the predicted SQL query on the database contents shipped with the test set. We treat the predicted query as correct only if the results of executing the predicted SQL query $\hat{V}$ and the ground-truth SQL query $V$ are the same:

$$\mathrm{score}(\hat{V}, V) = \begin{cases} 1, & \hat{V} = V \\ 0, & \hat{V} \neq V \end{cases} \qquad (3)$$

Similarly to EM, EX is calculated by:

$$\mathrm{EX} = \frac{\sum_{n=1}^{N} \mathrm{score}(\hat{V}_n, V_n)}{N}, \qquad (4)$$

To avoid false positives and false negatives caused by SQL execution on finite-size databases, test-suite execution accuracy [34] extends the execution to multiple database instances per schema. Concretely, the test suite distills a small database from randomly generated databases to achieve high code coverage. In this way, it provides the best approximation of semantic accuracy.

2.2.2 Multi-turn T2S Evaluation

Given a multi-turn setting, there are a total of P question sequences, where each sequence contains O rounds and a total of M = P × O questions.

Question Match Accuracy (QM) The question match accuracy is calculated as the EM score over all questions. Its value is 1 for each question only if all predicted SQL clauses are correct. We first calculate the EM score for each question as follows:

$$\mathrm{score}(\hat{Y}, Y) = \begin{cases} 1, & \hat{Y} = Y \\ 0, & \hat{Y} \neq Y \end{cases} \qquad (5)$$

where $\hat{Y}$ and $Y$ are the predicted and ground-truth SQL queries, respectively. Then, the question match accuracy is calculated by:

$$\mathrm{QM} = \frac{\sum_{m=1}^{M} \mathrm{score}(\hat{Y}_m, Y_m)}{M}, \qquad (6)$$

where M denotes the total number of questions.

Interaction Match Accuracy (IM) The interaction match accuracy is calculated as the EM score over all interactions (question sequences). The score of each interaction is 1 only if all the questions within the interaction are correct. Formally, the score for each interaction is defined as follows:

$$\mathrm{interaction} = \begin{cases} 1, & \prod_{i=1}^{o} \mathrm{score}(\hat{Y}_i, Y_i) = 1 \\ 0, & \prod_{i=1}^{o} \mathrm{score}(\hat{Y}_i, Y_i) = 0 \end{cases} \qquad (7)$$

where o is the number of turns in each interaction. Then, the IM score is calculated by:

$$\mathrm{IM} = \frac{\sum_{p=1}^{P} \mathrm{interaction}_p}{P}, \qquad (8)$$

where P is the total number of interactions.

2.3 Datasets

High-quality corpora are essential for learning and evaluating text-to-SQL (T2S) parsing systems. In the following, we summarize extensively-used datasets into two primary categories: the single-turn T2S corpora with single-turn (stand-alone) questions and the multi-turn T2S corpora with multi-turn sequential questions.

2.3.1 Single-Turn T2S Corpora

GenQuery The GenQuery [2] dataset is a collection of 880 NL questions for querying a database of US geographical facts (denoted as Geobase). A relational database schema and SQL queries are constructed over Geobase for 700 questions. Afterwards, the remaining NL questions are further annotated by [21], following the widely used 600/280 training/test split [35].
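The set-comparison metrics defined by Eqs. (1)-(8) can be sketched in a few lines of Python. The clause-set representation below is a simplification for illustration, not the official Spider evaluation script: each SQL query is abstracted as a dict mapping a clause name to its set of components.

```python
from typing import Dict, List

# Hypothetical query representation: clause name -> set of components,
# e.g. {"SELECT": frozenset({"model"}), "ORDER BY": frozenset({"horsepower DESC"})}.
Query = Dict[str, frozenset]

def score(pred: Query, gold: Query) -> int:
    """Exact set match (Eqs. 1 and 5): 1 iff every clause's component set matches."""
    return int(pred == gold)

def exact_match_accuracy(preds: List[Query], golds: List[Query]) -> float:
    """EM/QM (Eqs. 2 and 6): mean exact-set-match score over all questions."""
    return sum(score(p, g) for p, g in zip(preds, golds)) / len(golds)

def interaction_match_accuracy(pred_seqs: List[List[Query]],
                               gold_seqs: List[List[Query]]) -> float:
    """IM (Eqs. 7-8): an interaction counts only if every turn in it matches."""
    interactions = [
        int(all(score(p, g) for p, g in zip(ps, gs)))
        for ps, gs in zip(pred_seqs, gold_seqs)
    ]
    return sum(interactions) / len(interactions)
```

Execution accuracy (EX) follows the same averaging shape as `exact_match_accuracy`, but compares the result sets returned by actually running the two queries against the database rather than their clause sets.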
Scholar The Scholar [21] dataset is a collection of 816 NL questions annotated with SQL queries, where 600 NL questions are used for training and 216 questions are used for testing. An academic database consisting of academic papers is provided to execute the SQL queries.

WikiSQL The original WikiSQL [22] dataset is a collection of 80,654 hand-crafted NL question and SQL query pairs along with the corresponding SQL tables extracted from 24,241 HTML tables on Wikipedia. In particular, for each selected table six SQL queries are generated following the SQL templates and rules. Then, for each SQL query, a crude NL question is annotated using templates via crowd-sourcing on Amazon Mechanical Turk. The WikiSQL dataset contains many more instances and tables than ATIS [29, 30], GenQuery [2] and Scholar [21]. In addition, the WikiSQL dataset is more challenging than previous T2S corpora since WikiSQL spans a large number of tables, and text-to-SQL parsers should generalize not only to new queries but also to new table schemas. The instances in WikiSQL are randomly split into training/validation/testing sets, such that each table is involved in exactly one set.

Spider The Spider [23] dataset is a large-scale benchmark for cross-domain text-to-SQL parsing. Spider contains 10,181 NL questions and 5,693 unique SQL queries over 200 databases belonging to 138 different domains. Different from the prior T2S datasets that contain tables from the same domain, the Spider dataset contains complex NL questions and SQL queries spanning multiple databases and domains. In addition, the SQL queries in the Spider dataset can be divided into four levels based on their difficulty: easy, medium, hard, and extra hard, which can be used to better evaluate the model performance on different queries. Finally, the Spider dataset is randomly split into 7,000 instances for training, 1,034 instances for development, and 2,147 instances for testing.

Spider-Syn The Spider-Syn [24] dataset is another challenging variant of Spider [23], which modifies the NL questions from Spider [23] by replacing the schema-related words with the corresponding synonyms. In this way, the explicit alignments between the words in NL questions and the tokens in table schemas are eliminated, which makes schema linking more challenging for text-to-SQL parsing. Spider-Syn [24] is composed of 7,000 training instances and 1,034 development instances. Note that Spider-Syn [24] has no test set since the original Spider [23] does not release its test set publicly.

Spider-DK The Spider-DK [25] dataset is a challenging variant of the Spider [23] development set, which can be used to better investigate the generalization ability of existing text-to-SQL models in understanding domain knowledge. Spider-DK is constructed by adding domain knowledge that reflects real-world question paraphrases to some NL questions from the Spider development set. Concretely, it consists of 535 NL-SQL pairs, where 265 NL-SQL pairs are modified by adding domain knowledge while the remaining 270 NL-SQL pairs are the same as in the original Spider dataset. It is noteworthy that Spider-DK is smaller than the original Spider development set since not every instance can be easily modified to add domain knowledge.

Spider-SSP The Spider-SSP [26] dataset refers to the compositional generalization version of the Spider [23] dataset, where the ability to generalize to novel combinations of the elements observed during training is referred to as compositional generalization. A new train and test split of the Spider dataset is proposed based on Target Maximum Compound Divergence (TMCD) [26]. Spider-SSP consists of 3,282 training instances and 1,094 testing instances, and the databases are shared between the training and testing instances.

CSpider The CSpider [27] dataset is a Chinese variant of Spider [23] created by translating the English NL questions in Spider into Chinese. Similar to Spider, CSpider consists of the same question-SQL pairs as in the Spider dataset.

SQUALL The SQUALL [36] dataset is an extension of WIKITABLEQUESTIONS [37], which enriches the 11,276 samples from the training set of WIKITABLEQUESTIONS by providing hand-crafted annotations including both SQL queries and the labeled alignments between NL question tokens and the corresponding SQL fragments. In total, SQUALL contains 15,620 instances, which are split into 9,030 instances for training, 2,246 instances for validation, and 4,344 instances for testing.

DuSQL The DuSQL [28] dataset is a large-scale Chinese corpus for cross-domain text-to-SQL parsing, which consists of 23,797 NL-SQL pairs along with 200 databases and 813 tables belonging to more than 160 domains. Different from most previous corpora that are manually annotated, the SQL queries in DuSQL are automatically generated via production rules from the grammar.

2.3.2 Multi-Turn T2S Corpora

ATIS The original ATIS [29, 30] dataset is a collection of user questions asking for flight information on airline travel inquiry systems, along with a relational database that contains information about cities, airports, flights, and so on. Most of the posted questions can be answered by querying the database with SQL queries. Since the original SQL queries are inefficient to execute owing to their use of IN clauses, the SQL queries were further modified by [21] while keeping the output of the SQL queries unchanged. In total, there are 5,418 NL utterances with corresponding executable SQL queries, where 4,473 utterances are used for training, 497 for development and 448 for testing.

SParC The SParC [31] dataset is a large-scale cross-domain context-dependent text-to-SQL corpus, which contains about 4.3k question sequences including 12k+ question-SQL pairs along with 200 complex databases belonging to 138 domains. SParC is built on Spider [23], where each question sequence is based on a question from Spider by asking inter-related questions. After obtaining the sequential questions, a SQL query is manually annotated for each question. Following Spider, SParC is split into training, development and test sets with a ratio of 7:1:2, such that each database appears in only one set.

CoSQL The CoSQL [32] dataset is the first large-scale cross-domain conversational text-to-SQL dataset created under
the WOZ setting, which consists of about 3k dialogues including 30k+ turns and 10k+ corresponding SQL queries along with 200 complex databases belonging to 138 domains. In particular, each conversation simulates a DB query scenario where the annotators, working as DB users, issue NL questions to retrieve answers with SQL queries. The sequential NL questions can be used to clarify historical ambiguous questions or notify users of unanswerable questions. Similar to Spider [23] and SParC [31], CoSQL is also split into training, development and test sets with a ratio of 7:1:2, such that each database appears in only one set.

CHASE The CHASE [33] dataset is a large-scale context-dependent Chinese text-to-SQL corpus, which is composed of 5,459 coherent question sequences including 17,940 questions with their SQL queries. The context-dependent question-SQL pairs span 280 relational databases. CHASE has two variants: CHASE-C and CHASE-T. Specifically, CHASE-C collects 120 Chinese relational databases from DuSQL [28] and creates 2,003 question sequences as well as their SQL queries from scratch. CHASE-T is created by translating the 3,456 English question sequences and 160 databases from SParC [31] into Chinese. The CHASE dataset is split into 3,949/755/755 samples for training, validation and testing, such that a database appears in solely one set.

3 SINGLE-TURN T2S PARSING APPROACHES

Deep learning has long been dominant in the field of text-to-SQL parsing, yielding state-of-the-art performances. In this manuscript, we provide a comprehensive review of recent neural network-based approaches for text-to-SQL parsing. A typical neural text-to-SQL method is usually based on the sequence-to-sequence (Seq2Seq) model [3], in which an encoder is devised to capture the semantics of the NL question with a real-valued vector, and a decoder is proposed to generate the SQL query token by token based on the encoded question representation. As illustrated in Table 3, we divide the downstream text-to-SQL parsing methods into several primary categories based on the encoder and the decoder. Next, we describe each category of the text-to-SQL parsing methods in detail.

3.1 Encoder

The first goal of the encoder is to learn an input representation, jointly representing the NL question and the table schema. The second goal of the encoder is to perform structure modelling, since the text-to-SQL parsing task is in principle a highly structured task.

3.1.1 Input Representation

As stated in Section 2.1, there are two types of input information to be considered for text-to-SQL parsing: the NL question and the table schemas, which are jointly represented by the encoder. Generally, the input representation learning methods can be divided into two primary categories: LSTM-based [5, 49] and Transformer-based [6] methods.

LSTM-based Methods Motivated by their significant success in text representation learning, LSTM-based methods [5, 49] are widely used to learn contextualized representations of the input NL question and table schema, which are then passed into the decoder for generating the SQL query. TypeSQL [38], Seq2SQL [22] and SyntaxSQLNet [39] work on stand-alone question-SQL pairs and adopt the bidirectional LSTM (Bi-LSTM) to learn semantic representations of the input sequence, which is the concatenation of the NL question and the column names. IRNet [4] encodes the NL question and the table schema by using two separate Bi-LSTM encoders. In particular, the two Bi-LSTM encoders take as input the word embeddings and the corresponding schema linking type embeddings, where the schema linking type embeddings are obtained by applying n-gram string matching to identify the table and column names mentioned in the NL question.

Transformer-based Methods Recently, Transformer-based [6] models have shown state-of-the-art performances on text representation learning for multiple natural language processing (NLP) tasks. There are also several text-to-SQL parsing methods such as SQLova [10] and SLSQL [50] that extend BERT [11] and RoBERTa [51] to encode the NL question together with the table and column headers. Generally, the Transformer-based encoders follow a three-step procedure.

First, the NL question and the database schema are concatenated and taken as the integrated input sequence of the encoder. Formally, the input sequence can be formulated as $I = ([\mathrm{CLS}]; q_1; \ldots; q_{|Q|}; [\mathrm{SEP}]; s_1; [\mathrm{SEP}]; \ldots; [\mathrm{SEP}]; s_{|T|+|C|})$, where [CLS] and [SEP] indicate the pre-defined special tokens as in [11]. The input sequence can be extended to the multi-turn setting by sequentially concatenating the current question, the dialog history and the schema items [52].

Second, pre-trained language models (PLMs) such as BERT [11] and RoBERTa [51] can significantly boost parsing accuracy by enhancing the generalization of the encoder and capturing long-term token dependencies. In general, for question tokens, the output hidden states from the final layer of the Transformer block in the BERT [11] or RoBERTa models are considered as the contextualized representations of the question tokens. For each database schema item, the output hidden state of its preceding special token [SEP] is regarded as the table or column header representation.

Third, leveraging a flexible neural architecture on top of the PLMs can further enhance the encoder's output representations with strong expressive ability. For example, SQLova [10] and SDSQL [16] further stack two Bi-LSTM layers on top of the output representations of BERT [11]. GAZP [53] proposes an additional self-attention layer [6] on top of a Bi-LSTM layer to compute the intermediate representations. RYANSQL [7] sequentially employs a convolutional neural network [54] with dense connection [9] and a scaled dot-product attention layer [6] on top of BERT [11] to align question tokens with columns. It is noteworthy that the parameters of the convolutional neural network are shared across the NL question and columns. BRIDGE [43] encodes the input sequence with BERT [11] and lightweight subsequent layers (i.e., two Bi-LSTM layers). In addition, dense look-up features are applied to represent meta-data information of the table schema such as the primary key, foreign key and
TABLE 3
The representative downstream text-to-SQL parsing approaches. EM denotes the exact match accuracy on the Spider [23] data for the latest submissions. The “✓” marks denote the most important architecture decisions, drawn from the encoder choices (SL - Schema Linking, LSTM, Transformer, GNN) and decoder choices (LSTM, Transformer, Grammar, Sketch, DSL - Domain Specific Language, Constrained Decoding, Re-Ranking).

Model | Architecture decisions | EM Dev | EM Test | EX Dev | EX Test
Seq2Seq baseline [23] | ✓ ✓ | 1.8 | 4.8 | - | -
TypeSQL [38] | ✓ ✓ ✓ | 8.9 | 8.2 | - | -
SyntaxSQLNet [39] | ✓ ✓ ✓ | 25.0 | - | - | -
GNN [40] | ✓ ✓ ✓ ✓ | 51.3 | - | - | -
EditSQL [41] | ✓ ✓ ✓ | 57.6 | 53.4 | - | -
Bertrand-DR [42] | ✓ ✓ ✓ ✓ | 58.5 | - | - | -
IRNet [4] | ✓ ✓ ✓ ✓ | 61.9 | 54.7 | - | -
RYANSQL [7] | ✓ | 66.6 | 58.2 | - | -
BRIDGE [43] | ✓ ✓ ✓ ✓ | 70.0 | 65.0 | 70.3 | 68.3
RATSQL [12] | ✓ ✓ ✓ ✓ | 69.7 | 65.6 | - | -
SMBOP [44] | ✓ ✓ ✓ ✓ | 69.5 | 71.1 | 75.0 | 71.1
ShadowGNN [45] | ✓ ✓ ✓ ✓ ✓ | 72.3 | 66.1 | - | -
RaSaP [17] | ✓ ✓ ✓ ✓ | 74.7 | 69.0 | - | 70.0
SADGA [13] | ✓ ✓ ✓ ✓ | 73.1 | 70.1 | - | -
DT-Fixup [46] | ✓ ✓ ✓ | 75.0 | 70.9 | - | -
T5-Picard [47] | ✓ ✓ ✓ ✓ | 75.5 | 71.9 | 79.3 | 75.1
LGESQL [14] | ✓ ✓ ✓ ✓ | 75.1 | 72.0 | - | -
S2SQL [48] | ✓ ✓ ✓ ✓ | 76.4 | 72.1 | - | -
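The first step of the Transformer-based encoding procedure described in Section 3.1.1 (building the integrated input sequence $I$) can be sketched as follows. This is a simplified stand-in that works on plain word tokens; a real system would run a subword tokenizer such as BERT's WordPiece over the same layout.

```python
from typing import List

# Sketch of the input construction I = ([CLS]; q1; ...; q|Q|; [SEP]; s1;
# [SEP]; ...; [SEP]; s|T|+|C|), where each schema item s is a table or
# column name. Plain word tokens stand in for real subword tokens.

def build_input_sequence(question: List[str],
                         tables: List[str],
                         columns: List[str]) -> List[str]:
    """Concatenate question tokens and schema items with special tokens."""
    seq = ["[CLS]"] + question
    for item in tables + columns:
        seq += ["[SEP]", item]
    return seq

def schema_item_positions(seq: List[str]) -> List[int]:
    """Indices of the [SEP] tokens; in SQLova-style encoders, the hidden
    state at each such position serves as the representation of the
    table/column header that follows it."""
    return [i for i, tok in enumerate(seq) if tok == "[SEP]"]
```

For example, `build_input_sequence(["show", "models"], ["cars_data"], ["model"])` yields `["[CLS]", "show", "models", "[SEP]", "cars_data", "[SEP]", "model"]`; extending this to the multi-turn setting simply prepends the dialog history before the schema items.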
datatype. These meta-data features are further fused with the BERT [11] encoding of the schema component via a feed-forward layer.

3.1.2 Structure Modelling
The development of large cross-domain datasets such as WikiSQL [22] and Spider [23] results in a realistic generalization challenge: dealing with unseen table schemas. Each NL question corresponds to a multi-table database schema, and the training and testing sets do not share overlapped databases. This generalization challenge requires text-to-SQL parsing methods to encode the NL question and the table schema into representations with powerful expressive ability, from three aspects.
• First, the encoder should be able to recognize NL tokens used to refer to tables and columns either implicitly or explicitly, which is called the schema linking structure – aligning entity mentions in the NL question to the mentioned schema tables or columns.
• Second, the encoded representations should be aware of the schema structure information such as the primary keys, the foreign keys, and the column types.
• Third, the encoder should be able to perceive complex variations in the NL question, i.e., the question structure.
Graphs are a natural form for expressing such complex structure in the text-to-SQL parsing task. Recently, several graph-based methods [40, 12, 14] have been proposed to reason over the NL question tokens and schema entities, and model the complex input representation. These methods consider the NL question tokens and schema items as multi-typed nodes, and the structural relations (edges) among the nodes can be pre-defined to express diverse intra-schema relations, question-schema relations and intra-question relations.

[Fig. 2. Example of the schema linking structure used in [12]. Natural language question: "For the cars with 4 cylinders, which model has the largest horsepower?" Desired SQL: SELECT T1.model FROM car_names AS T1 JOIN cars_data AS T2 ON T1.make_id = T2.id WHERE T2.cylinders = 4 ORDER BY T2.horsepower DESC LIMIT 1. Schema: cars_data(id, mpg, cylinders, edispl, horsepower, weight, year), car_names(make_id, model, make), model_list(make_id, maker, model), car_makers(id, maker, full_name).]

Linking Structure As illustrated in Figure 2, schema linking aims at identifying references of columns, tables and condition values in NL questions [55]. Text-to-SQL parsers should learn to detect table or column names mentioned in NL questions by matching question tokens with the schema, and the identified tables or columns are then utilized to generate SQL queries. Intuitively, schema linking facilitates both cross-domain generalizability and complex SQL generation, which have been regarded as the current bottleneck of text-to-SQL parsing. [50] demonstrates that more accurate schema linking conclusively leads to better text-to-SQL parsing performance. Conventional schema linking can be extracted by means of rules or string matching. For example, IRNET [4] takes the extracted linking information directly as input. SDSQL [16] leverages the extracted linking information as a label for multi-task learning. IESQL [56] employs a conditional random field (CRF) layer [57] for question segmentation, which transforms the linking task into a sequence tagging task. However, the aforementioned schema linking methods cannot capture the comprehensive semantic relationship between the NL question and the table schema. Another popular approach [41] proposes cross-attention to implicitly learn the relationships between the NL question and schema representations.
Recently, graph-based linking approaches [12, 44, 17, 45, 14] have been proposed to reason over the NL question tokens and schema entities, and model the complex input representation. These methods consider the NL question tokens and schema items as multi-typed nodes, and the structural relations (edges) among the nodes can be pre-defined to express diverse intra-schema relations, question-schema relations and intra-question relations. In particular, in RATSQL [12], the graph is constructed based on two kinds of relations (i.e., name-based linking and value-based linking), where the name-based linking refers to partial or exact occurrences of table/column names in the NL question and the value-based linking refers
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING 8
to the question-schema alignment that occurs when the NL question mentions any values appearing in the schema and the desired SQL. The relation-aware self-attention mechanism [58] is then proposed for graph representation learning over the constructed graph. Specifically, given a sequence of token representations, the relation-aware self-attention computes a scalar similarity score between each pair of token representations, e_ij ∝ (u_i W_Q)(u_j W_K + r_ij^K)^T, where u_i and u_j denote the graph nodes, and the term r_ij^K denotes an embedding that represents a relation between u_i and u_j from a closed set of possible relations. To enhance the model generalization capability for unseen or rare schemas, ShadowGNN [45] alleviates the impact of the domain information by abstracting the representations of the NL question and the SQL query before applying the relation-aware graph computation [12]. In addition, several works have been devoted to tackling the challenge of heterogeneous graph encoding for text-to-SQL parsing. LGESQL [14] constructs an edge-centric graph from the node-centric graph as in RATSQL [12], which explicitly considers the topological structure of edges. The information propagates more efficiently by considering both the connections between nodes and the topology of directed edges. Two relational graph attention networks (RGANs) [59] are devised to model the structure of the node-centric graph and the edge-centric graph respectively, mapping the input heterogeneous graph into token representations.

Schema Structure It is intuitive to leverage a relational graph neural network (GNN) to model the relations in the relational database, propagating the node information to its neighbouring nodes. The GNN-based methods help to aggregate feature information of neighboring nodes, making the obtained input representation more powerful. Schema-GNN [40] first converts the database schemas to a graph by adding three types of edges: the foreign-primary key relation, the column-in-table relation and the table-own-column relation. The constructed graph is softly pruned conditioned on the input question, which is then fed into gated GNNs [60] to learn schema representations that are aware of the global schema structure. Furthermore, Global-GNN [61] proposes a similar approach by employing a graph convolutional network (GCN) to learn schema representations, where a relevance probability conditioned on the question is computed for every schema node. Some advanced studies, such as RAT-SQL [12] and LGESQL [14], also learn the structure of the schema as a unique edge in the graph, demonstrating the indispensability of the schema structure for text-to-SQL parsing.

Question Structure S2SQL [48] investigates the importance of syntax in the text-to-SQL encoder, and proposes a flexible and robust injection method. It leverages three induced dependency types, i.e., Forward, Backward and NONE, and stacks multi-layer transformers to implicitly model the complex question structure. In addition, a decoupling constraint is employed to induce diverse relation embeddings. SADGA [13] constructs the question graph conditioned on the dependency structure and the contextual structure of the NL question sequence, and builds the schema graph conditioned on the schema structure. Specifically, three different types of links are defined for question tokens to construct the graphs: the 1-order word dependency (i.e., the relation between two consecutive words), the 2-order word dependency, and the parsing-based dependency that captures syntactic relations among the NL question words. Then, a structure-aware aggregation approach is proposed to capture the alignment between the constructed graphs through two-stage linking. The unified representations are learned by aggregating the information via a gate-based mechanism.

[Fig. 3. Example of the SQL sketch used in [15]. (a) The sketch-based method fills the slots of the sketch SELECT $AGG $COLUMN WHERE $COLUMN $OP $VALUE (AND $COLUMN $OP $VALUE)*. (b) The generation-based method decodes an abstract syntax tree via grammar actions such as START -> ROOT, ROOT -> SELECT ORDER, AGG -> none COL TAB, COL -> Horsepower, TAB -> CARDS_DATA, yielding SELECT Id FROM CARDS_DATA ORDER BY Horsepower DESC LIMIT 1.]

3.2 Decoder
The decoder used in existing text-to-SQL parsing models can be divided into two categories: sketch-based methods and generation-based methods. In this section, we provide a comprehensive overview of the two types of decoder architectures.

3.2.1 Sketch-based Methods
The sketch-based methods decompose the SQL generation procedure into sub-modules, e.g., SELECT column, AGG function, WHERE value. For example, SQLNet [15] employs the SQL sketch. The tokens SELECT, WHERE and AND indicate the SQL keywords, and the following components indicate the types of prediction slots to be filled. For example, the AGG slot indicates the slot to be filled with either an empty token or one of the aggregation operators such as SUM and MAX; the VALUE slot needs to be filled with a sub-string of the question; the COLUMN slot needs to be filled with a column name; the OP slot needs to be filled with operations such as >, <, =. These slots are later gathered together and interpreted to generate the final SQL query. Each slot has a separate model which does not share its trainable parameters with the others and is responsible for independently predicting a part of the final SQL. Specifically, for the COLUMN slot, a column attention mechanism [15] is applied to reflect the most relevant information in NL questions when a prediction is made on a particular column. For the OP slot, predicting its value is a 3-way classification task (>, <, =). For the VALUE slot, [15] employs a Seq2Seq structure to generate the sub-string of the NL question.
SQLova [10] and SDSQL [16] modify the syntax-guided sketch used in [15]. The proposed sketch-based decoder consists of six prediction modules, including WHERE-NUMBER, WHERE-COLUMN, SELECT-COLUMN,
SELECT-AGGREGATION, WHERE-OPERATOR, and WHERE-VALUE. The specific role of each module is described as follows:
• SELECT-COLUMN identifies the column in the “SELECT” clause from the given NL question.
• SELECT-AGGREGATION identifies the aggregation operator for the given select-column prediction.
• WHERE-NUMBER predicts the number of “WHERE” conditions in SQL queries.
• WHERE-COLUMN calculates the probability of generating each column for the given NL question.
• WHERE-OPERATOR identifies the most probable operator given the where-column prediction among three possible choices (>, =, <).
• WHERE-VALUE identifies which tokens of an NL question correspond to condition values for the given “WHERE” columns.
In the SQL query generation stage, an execution-guided decoding strategy [62] is utilized to exclude the non-executable partial SQL queries from the output candidates. TypeSQL [38] further improves the above approach by reducing the number of modules. TypeSQL chooses to combine the select-column module and the where-column module into a single module since their prediction procedures are similar, and the where-column module depends on the output of the select-column module. In addition, the where-operator and where-value modules are combined together because the predictions of these two modules depend on the outputs of the where-column module. Generally, the sketch-based approaches are fast and guaranteed to conform to correct SQL syntax rules. However, it is difficult for these approaches to handle complex SQL statements such as multi-table JOINs, nested queries, and so on. Thus, the sketch-based approaches are popular on the WikiSQL [22] dataset, but are difficult to apply to the Spider [23] dataset, which involves complex SQL. Only RYANSQL [7] implements complex SQL generation by recursively applying the sketch method.

3.2.2 Generation-based Methods
On the other hand, the generation-based approaches are based on the Seq2Seq model to decode SQL, which is more preferable for complex SQL scenarios than the sketch-based approaches. For example, Bridge [43] uses an LSTM-based pointer-generator [63] with multi-head attention and a copy mechanism as the decoder, which is initialized with the final state of the encoder. At each decoding step, the decoder performs one of the following actions: generating a token from the vocabulary V, copying a token from the question Q, or copying a schema component from the database schema S.
Since the above generation-based approaches may not generate SQL queries with correct grammar, some advanced methods [4, 12] generate the SQL as an abstract syntax tree (AST) [64] in the depth-first traversal order [65]. In particular, these methods employ an LSTM decoder to perform a sequence of three types of actions that either expand the last generated node into a grammar rule, called the APPLY-RULE action, or choose a column/table from the schema when completing a leaf node, called the SELECT-COLUMN action and the SELECT-TABLE action respectively. These actions can construct the corresponding AST of the target SQL query. Specifically, APPLY-RULE applies a production rule to the current derivation tree of a SQL query and expands the last generated node with the grammar rule. The probability distribution is computed by a softmax classification layer over the pre-defined abstract syntax description language (ASDL) rules. SELECT-COLUMN and SELECT-TABLE complete a leaf node by selecting a column c or a table t from the database schema respectively, directly copying the table and column names from the database schema via the copy mechanism [63].
There are also several works [46, 66, 67] which neglect the SQL grammar during the decoding process, by leveraging a powerful large-scale pre-trained language model like T5 [68] fine-tuned on the text-to-SQL training set for SQL query generation. Formally, the transformer-based decoder follows the standard text generation process and produces the hidden state at step t for generating the t-th token as described in [6]. An affine transformation is then applied on the learned hidden state to obtain the prediction probability over the target vocabulary V for each word. In addition, as revealed in [42], in some cases, although the best generated SQL is in the candidate list of beam search, it is not at the top of the candidate list. Therefore, a discriminative re-ranking strategy is introduced to extract the best SQL query from the candidate list predicted by the text-to-SQL parser. The re-ranker is constructed as a fine-tuned BERT classifier that is independent of the schema, and the probability of the classifier is utilized as the score for re-ranking the query. Some studies [4, 45] introduce a domain-specific language (DSL) serving as an intermediate representation to bridge the NL question and the SQL query. These methods reveal that there is an unavoidable mismatch between the intentions conveyed in natural language and the implementation details in SQL when the resulting SQL is processed into tree-structured form. Hence, the key idea of the DSL is to omit the implementation details in the intermediate representations. IRNet [4] presents a grammar-based neural model to generate a SemQL query as the intermediate representation bridging the NL question and the SQL query, and a SQL query is then inferred from the generated SemQL query with domain knowledge. Furthermore, a constrained decoding strategy is proposed in Picard [47] via incremental parsing, which facilitates the parser to identify valid target sequences by rejecting inadmissible tokens at each decoding step.

4 MULTI-TURN T2S PARSING APPROACHES
Compared to the single-turn T2S setting, the multi-turn T2S setting emphasizes the usage of contextual information (historical information), which can be incorporated in both the encoder and the decoder. Next, we describe how the contextual information is leveraged.

4.1 Encoder
The encoder processes the contextual information for input representation learning. In addition, the linking structure and the schema structure are considered during the encoding phase.
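As a minimal, hypothetical illustration of how contextual information can enter the encoder (the serialization format, separator token, and helper below are assumptions for illustration, not the exact input format of any cited model), the dialogue history, the current question, and the schema can be flattened into a single encoder input sequence:

```python
def serialize_context(history, question, schema, sep="[SEP]"):
    """Flatten dialogue history, the current question and the schema into one
    encoder input string. A toy sketch of simply concatenating history turns;
    real models use special tokens and learned segment embeddings instead.
    """
    # Linearize the schema as "table : col1 , col2 , ..." segments.
    schema_str = " ; ".join(
        f"{table} : {' , '.join(cols)}" for table, cols in schema.items()
    )
    parts = history + [question, schema_str]
    return f" {sep} ".join(parts)

inp = serialize_context(
    history=["show all models", "only cars with 4 cylinders"],
    question="which has the largest horsepower ?",
    schema={"cars_data": ["id", "cylinders", "horsepower"]},
)
```

A real encoder would additionally mark turn boundaries so that attention (or a gate) can weight historical turns, as discussed for the Concat/Turn/Gate strategies below.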
[Fig. 4. Different contextual NL question encoding strategies for multi-turn T2S parsing [69]: (a) Concat, (b) Turn, (c) Gate.]

4.1.1 Multi-turn Input Representation
With regard to multi-turn representation learning, previous works mainly focus on two aspects, including (i) how to learn high-quality contextual question and schema representations and (ii) how to effectively encode the historical SQL queries.
For the contextual question and schema representation learning, as shown in Fig. 4, [69] investigates the impact of different contextual information encoding methods on the multi-turn T2S parsing performance, including (a) concatenating all the NL questions within each question sequence as input, (b) using a turn-level encoder to deal with each question, and (c) devising a gate mechanism to balance the importance of each historical question. Typically, EditSQL [41] utilizes two separate Bi-LSTMs [5] for encoding the NL questions and the table schema respectively. Specifically, for the question at each turn, EditSQL first utilizes a Bi-LSTM to encode the question tokens. The output hidden states are then fed into a dot-product attention layer [70] over the column header embeddings. The relationships among the multi-turn questions are captured by the turn attention mechanism. At the current turn, the dot-product attention between the current question and previous questions in the history is computed, and the weighted average of previous question representations is then added to the current question representation to form the context-aware question representation. For each column header, the table name and the column name are concatenated and passed into a Bi-LSTM layer. The output hidden states are then fed into a self-attention layer [6] to better capture the internal structure of the table schemas such as foreign keys. In addition, the self-attention vector and the question attention vector are concatenated and fed into a second Bi-LSTM layer to obtain the final column header representation. [71] learns a context memory controller to maintain the memory by keeping the cumulative meaning of the sequential NL questions and using an external memory to represent contextual information. [72] decouples the multi-turn T2S parsing into two pipeline tasks: question rewriting and single-turn T2S parsing. The question rewriting (QR) module aims to generate a semantically complete question based on the dialogue context, which concatenates the dialogue history and the latest question as the input of the QR module. The goal of the QR module is to generate a simplified expression based on the latest question and the dialogue history. The single-turn T2S parser predicts the corresponding SQL query with encoders to capture the semantic connection between the SQL query and the NL question. It produces comprehensive representations of the previously predicted SQL queries to improve the contextual representation.

4.1.2 Multi-turn Structure Modelling
Different from the single-turn setting, the multi-turn setting requires the integration of contextual inductive bias into the structural modelling.

Linking Structure R2SQL [74] focuses on the uniqueness of contextual linking structures, introducing a novel dynamic graph framework to efficiently model contextual questions, database schemas, and the complex linking structures between them. A decay mechanism is applied to mitigate the impact of historical links on the current turn.

Schema Structure [41] exploits the conversation history by editing the previously predicted SQL to improve the generation quality. It focuses on taking advantage of previous question texts and previously predicted SQL queries to predict the SQL query at the current turn. Concretely, the previous question is first encoded into a series of tokens, and the decoder then predicts a switch to alter it at the token level. This sequence editing approach simulates the changes at the token level and is hence resistant to error propagation. IGSQL [75] points out that a model should not only take historical user inputs and previously predicted SQL queries into consideration but also utilize the historical information of database schema items. Therefore, IGSQL proposes a database schema interaction graph encoder to learn database schema items together with historical items, keeping context consistency for context-dependent text-to-SQL parsing. The cross-turn schema interaction graph layer and the intra-turn schema graph layer update the schema item representations by using the previous turn and the current turn respectively. IST-SQL [76] deals with the multi-turn text-to-SQL parsing task inspired by the task-oriented dialogue generation task. IST-SQL defines, tracks and utilizes interaction states for multi-turn text-to-SQL parsing, where each interaction state is updated by a state update mechanism based on the previously predicted SQL query.

4.2 Decoder
For the multi-turn setting, most previous methods [41, 77, 74] employ an LSTM decoder with attention mechanisms to produce SQL queries conditioned on the historical NL questions, the current NL question, and the table schema. The decoder takes the encoded representations of the current NL question, SQL-states, schema-states, and the last predicted SQL query as input and applies the query editing mechanism [41] in the decoding process to edit the previously generated SQL query while incorporating the context of the NL questions and schemas. To further alleviate the challenge that the tokens from the vocabulary may be completely irrelevant
to the SQL query, separate layers are used to predict SQL keywords, table names and question tokens. A softmax operation is finally used to generate the output probability distribution.

5 PRE-TRAINING FOR TEXT-TO-SQL PARSING
Pre-trained language models (PLMs) have proved to be powerful in enhancing text-to-SQL parsing and yield impressive performances, benefiting from the rich knowledge in large-scale corpora. However, as revealed in previous works [18, 20], there are intrinsic differences between the distribution of tables and plain texts, leading to sub-optimal performances of general PLMs such as BERT [11] in text-to-SQL parsing. Recently, several studies have been proposed to alleviate the above limitation and build tabular language models (TaLMs) by simultaneously encoding tables and texts, which show improved results on downstream text-to-SQL parsing tasks. In this section, we provide a comprehensive review of existing studies on pre-training for text-to-SQL parsing from the perspectives of pre-training data construction, input feature learning, pre-training objectives and the backbone model architectures.

5.1 Pre-training Data Construction
Insufficient training data is an important challenge for learning powerful pre-trained tabular language models. The quality, quantity and diversity of the pre-training data have significant influence on the general performance of the pre-trained language models when applied to downstream text-to-SQL parsing tasks. Although it is easy to collect a large number of tables from the Web (e.g., Wikipedia), obtaining high-quality NL questions and their corresponding SQL queries over the collected tables is a labor-intensive and time-consuming process. Recently, there have been plenty of studies to generate pre-training data for text-to-SQL parsing manually or automatically. Next, we discuss the previous pre-training data construction methods from three perspectives: table collection, NL question generation, and logic form (SQL) generation.

5.1.1 Table Collection
We briefly introduce several sources that have been extensively used for table collection. WikiTableQuestions [37] is a representative corpus which is composed of 22,033 question-answer pairs on 2,108 tables, where the tables are randomly collected from Wikipedia with at least five columns and eight rows. The tables are pre-processed by omitting all the non-textual information, and each merged cell is duplicated to keep the table valid. In total, there are 3,929 distinct column headers (relations) among the 13,396 columns. WikiSQL [22] is a collection of 80,654 hand-crafted question-SQL pairs along with 24,241 HTML tables collected from Wikipedia. The tables are collected from [78], and the small tables that have less than five columns or five rows are filtered. WDC WebTables [79] is a large-scale table collection, which contains over 233 million tables extracted from the July 2015 version of the CommonCrawl. Those tables are classified as either relational (90 million), entity (139 million), or matrix (3 million). WikiTables [80] contains 1.6 million high-quality relational Wikipedia tables constructed by extracting all HTML tables from Wikipedia which had the class attribute “wikitable” (used to easily identify data tables) from the November 2013 XML dump of English Wikipedia. ToTTo [81] is an open-domain English table-to-text dataset with over 120,000 training examples which contains 83,141 corresponding web tables automatically collected from Wikipedia using heuristics, with schemas spanning various topical categories. WebQueryTable [82] is composed of 273,816 tables obtained by using 21,113 web queries, where each query is used to search the Web pages and the relevant tables are obtained from the top ranked Web pages. Spider [23] is a large-scale text-to-SQL dataset, which contains 200 databases belonging to 138 different domains. In particular, the databases come from three resources: (i) 70 complex databases are collected from SQL tutorials, college database courses, and textbook examples, (ii) 40 databases are collected from the DatabaseAnswers1, (iii) 90 databases are selected from WikiSQL [22].

5.1.2 Natural Language Question Annotation
So far, different question annotation methods have been introduced to annotate natural language questions based on the collected databases. Generally, the methods can be divided into three categories: sampling-based methods, template-based methods, and generation-based methods.

Sampling-based Methods Many works produce the NL questions in pre-training data by extracting text-table pairs from Wikipedia. Concretely, TAPAS [19] creates the pre-training corpus by collecting text-table pairs from Wikipedia, where there are about 6.2M tables and 21.3M text snippets. In particular, the table captions, segment titles, article descriptions, article titles, and textual table segments are extracted as the text snippets of the corresponding tables. STRUG [83] directly collects about 120k NL web tables and corresponding text descriptions from the ToTTo dataset [81], which extracts the text-table pairs from Wikipedia by employing three heuristics: number matching, cell matching, and hyperlinks.

Template-based Methods There are several works that generate the NL questions automatically by using templates or rules. GRAPPA [20] constructs question-SQL templates by extracting entity mentions of SQL operations and database schemas. By leveraging the created templates on randomly sampled tables, a large amount of question-SQL pairs can be synthesized automatically. SCORE [84] leverages only 500 samples from the development set of SPARC [31] to derive utterance-SQL generation grammars consisting of a list of synchronous question-SQL templates and follow-up question templates. Finally, about 435k text-to-SQL conversations are synthesized for context-dependent text-to-SQL pre-training.

Generation-based Methods Several works have been introduced to generate NL questions from entity sequences automatically with text generation models. For example, [85] proposes a cross-domain neural model, which accepts a table schema and samples a sequence of entities to appear

1. http://www.databaseanswers.org/
TABLE 4
The pre-training data construction.

Table sources: WikiTableQuestions [37], WikiSQL [22], WikiTable, ToTTo, Spider [23], WebTable, Wikipedia, GitHub. Question sources: ToTTo, Wikipedia, Spider [23].

TaBERT [18]    ✓ ✓
TAPAS [19]     ✓ ✓
GRAPPA [20]    ✓ ✓ ✓
GAP            ✓ ✓
STRUG          ✓ ✓
MATE           ✓ ✓
TAPEX          ✓
TableFormer    ✓ ✓
SCORE          ✓ ✓ ✓ ✓
in the NL question, to transform the entity sequence to the NL question. Specifically, the T5 model [68] is first fine-tuned on a small corpus containing entity-question pairs and then applied to generate NL questions given entity sequences. For example, given the input entity sequence “department management : head name text — head age number — head born state text”, the T5 model is likely to output the NL question “List the name, born state and age of the heads of departments ordered by age.”. In addition, [86] introduces a generative model to generate utterance-SQL pairs, which leverages a probabilistic context-free grammar (PCFG) to model the SQL queries and then employs a BART-based translation model to transform the logical forms to NL questions. For example, for a given input SQL sequence “select area where state name = ‘texas’”, the generative model outputs the NL question “what is the area of Texas?”. GAZP [53] also generates utterances corresponding to these logical forms using the generative model. In addition, the input and output consistency of the synthesized utterances is verified. Specifically, the generated queries are parsed into logical forms, and only the queries whose parses are equivalent to the corresponding original logical forms are kept.
There are also some studies that invite annotators to manually create natural questions and corresponding SQL queries without leveraging templates or rules, such that the generated NL questions are natural and diverse. In particular, Spider [23] generates 10,181 NL questions and 5,693 unique SQL queries over 200 databases by considering three primary aspects: SQL pattern coverage ensuring that enough SQL samples are obtained to cover all common SQL patterns, SQL consistency ensuring that semantically equivalent NL questions share the same SQL query, and question clarity ensuring that vague or too ambiguous questions are not included.

5.1.3 SQL Annotation
Generally, the SQL annotation methods can be divided into three primary categories: logic perturbation, SQL template instantiation and hierarchical synthesis pipeline.

Logic Perturbation Due to the expensive process of obtaining SQL queries, logic perturbation-based approaches have been proposed to augment the SQL queries by performing random logic perturbation according to hand-tuned rules [87]. In particular, [87] generally enumerates the perturbations of each given SQL query based on hand-tuned rules that follow three kinds of logic inconsistencies: (i) logic shift aiming to generate questions and logical forms that are logically distinct from the original ones, (ii) phrase and number changes aiming to modify the appointed numerical values and phrases in logical forms, and (iii) entity insertion, swapping and deletion that ignores the entity mention in logical forms, inserts new entities into logical forms, or swaps any two entities within a logical form. There are three reasons for automatically generating more SQL queries by perturbing the logical forms. First, the regular structures of logical forms make the procedure of logical corruption controllable. Second, we can easily validate the perturbed logical forms with the corresponding grammar checker and parser. Third, it is easy to obtain the corresponding questions of the generated SQL queries with minor modification given the original question-SQL pair.

SQL Template Instantiation There are several studies that apply SQL template instantiation methods to automatically generate SQL queries based on existing templates [88, 53] or a self-defined synchronous context-free grammar (SCFG) [20]. [88] utilizes the production rules of the SQL grammar defined in the SQUALL dataset [36] for SQL annotation. Given a SQL template, the headers and cell values of the tables are uniformly selected to fill the template. GAZP [53] samples logical forms by leveraging a grammar, such as the SQL grammar over the database schema. First, GAZP creates coarse templates by pre-processing the SQLs via the SQL grammar and further replacing the mentions of columns with typed slots. Then, the slots of each coarse template are filled in with new database contents. Instead of completely relying on existing templates, GRAPPA [20] learns from the examples in Spider [23] and designs a new SCFG which is then applied on a large number of existing tables to produce new SQL queries. The key idea behind this method is to define a set of non-terminal types for operations, table names, cell values and column names, and then substitute the entities with the corresponding non-terminal symbols in the SQL query to form a SQL production rule. The SQL template instantiation methods often heavily depend on limited templates, and it is hard for them to generate diverse SQL queries with new compositions.

Hierarchical Synthesis Pipeline Different from the above mentioned approaches that synthesize new SQL queries based on hand-crafted or induced rules and templates, the hierarchical synthesis pipeline approaches are based on large-scale pre-trained language models (PLMs) [68], motivated by the fact that PLMs can improve model generalization by incorporating additional diverse samples into the training corpus without labour-intensive manual
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING 13
works. Concretely, [85] proposes a neural approach that requires no grammar engineering yet achieves high semantic parsing performance. To this end, a pre-trained text generation model such as T5 [68], fine-tuned on the text-to-SQL data, is used to map entity sequences sampled from the table schema to NL questions [85]. The learned semantic parsers are then applied to the generated NL questions to produce the corresponding SQL queries. The overall data synthesis pipeline is easy to implement and achieves great diversity and coverage due to the usage of large PLMs.

5.2 Input Encoding

In the text-to-SQL parsing task, the input often involves two parts, NL questions and table schemas, and the output is the SQL query. However, the textual data, tabular data and SQL queries are heterogeneous, with different structures and formats. To be specific, the tabular data is generally distributed in two-dimensional structures with numerical values and words, while the SQL queries are usually composed of SQL keywords (e.g., “SELECT”, “UPDATE”, “DELETE”, “INSERT INTO”) and schema entities. Hence, it is non-trivial to develop a joint reasoning framework over the three types of data. In this section, we review the recent studies on heterogeneous input encoding of pre-training for text-to-SQL parsing.

5.2.1 Textual Data Encoding

Text encoding can be divided into dynamic and static types based on the word encoding in natural language processing. Some methods, such as RATSQL [12] and LGESQL [14], applied GloVe [89] to initialize the word embedding of each input item by looking up an embedding dictionary without the context. However, the static embedding methods are limited: they cannot tackle the polysemy problem, and the learned features are restricted by the pre-defined window size. With the development of pre-trained language models, some studies attempt to encode textual data with PLMs instead of static word embeddings. In particular, plenty of methods (e.g., TaBERT [18], TaPas [19], MATE [90], STRUG [83]) utilize the pre-trained BERT [11] as the encoder to obtain contextualized word-level representations, where the parameters of BERT [11] are updated along with the training process. GRAPPA [20] uses RoBERTa [51] as the encoder. TAPEX [88] leverages both the BART [91] encoder and decoder, while GAP [92] merely utilizes the BART encoder.

5.2.2 Tabular Data Encoding

Different from textual data, the tabular data is distributed in two-dimensional (2-D) structures. The table pre-training approaches need to first convert the 2-D table data into a linearized 1-D sequence before feeding the tabular data into language models. A common serialization method, used by TaPas [19], MATE [90] and TABLEFORMER [93], is to flatten the table data into a sequence of tokens in a row-by-row manner and then concatenate the question tokens before the table tokens for tabular pre-training. TaBERT [18] proposes content snapshots to encode a subset of the table content that is most relevant to the input utterance. This strategy is then combined with a vertical attention mechanism, sharing information among the cell representations in different rows. There are also some studies (e.g., STRUG [83], GRAPPA [20] and UnifiedSKG [94]) which only take the headers of tables as input, without considering the data cells.

Since NLP models generally take 1-D sequences as input, positional encoding becomes crucial for tabular data to help the neural models better capture the structure information. Most previous pre-training methods, such as TaBERT [18], GRAPPA [20] and TAPEX [88], explored a global positional encoding strategy on the flattened tabular sequences. Nevertheless, in addition to the 1-D sequential positions, tables have structured columns and rows which carry critical two-dimensional and hierarchical information. Works such as TaPas [19] and MATE [90] encode the row and column content based on column/row IDs. TABLEFORMER [93] instead considers whether two cells are in the same column/row and share the column header, rather than the absolute order of columns and rows in the tables.

5.3 Pre-training Objectives

Most existing pre-training models for text-to-SQL parsing employ either a single Transformer or a Transformer-based encoder-decoder framework as the backbone, and adopt different kinds of pre-training objectives to capture the characteristics of the text-to-SQL parsing task. As illustrated in Table 5, the pre-training objectives can be divided into five primary categories, including masked language modelling (MLM) [18, 19, 20, 92, 90, 93, 84], schema linking [20, 92, 83, 84], SQL executor [88], text generation [92] and context modelling [84]. Next, we introduce the implementation details of each primary pre-training objective.

5.3.1 Masked Language Modelling

Existing works often explore different variants of masked language modelling (MLM) to guide the language models to learn better representations of both natural language and tabular data. Concretely, the MLM objectives can be divided into three primary categories: reconstructing corrupted NL sentences, reconstructing corrupted table headers or cell values, and reconstructing tokens from corrupted NL sentences and tables.

In particular, most pre-training models [18, 19, 20, 92, 90, 93, 84] adopt masked language modelling by randomly masking a part of the input tokens from the NL sentences or table headers and then predicting the masked tokens. The MLM loss is calculated as the cross-entropy loss between the original masked tokens and the predicted tokens. In addition, TaBERT [18] also proposed a Masked Column Prediction (MCP) objective and a Cell Value Recovery (CVR) objective to learn the column representations of tables, where the MCP objective predicts the names and data types of masked columns, and the CVR objective attempts to predict the original value of each cell in a masked column given its cell vector. GAP [92] devised a Column Recovery (CRec) objective to recover the corresponding column name conditioned on a sampled cell value.
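The row-by-row serialization and the token-masking scheme described above can be sketched in a few lines. This is a minimal illustration rather than any particular model's implementation: it assumes whitespace tokenization and uses ad-hoc [SEP]/[ROW]/[MASK] markers, whereas the actual pre-training models operate on subword vocabularies.

```python
import random

def linearize(question, header, rows):
    """Flatten a table row by row and prepend the question tokens,
    mirroring the common serialization for tabular pre-training."""
    tokens = question.split() + ["[SEP]"] + list(header)
    for row in rows:
        tokens += ["[ROW]"] + [str(cell) for cell in row]
    return tokens

def mask_tokens(tokens, mask_rate=0.15, seed=0):
    """Randomly replace tokens with [MASK]; return the corrupted
    sequence and the (position -> original token) MLM targets."""
    rng = random.Random(seed)
    corrupted, targets = list(tokens), {}
    for i, tok in enumerate(tokens):
        if tok not in ("[SEP]", "[ROW]") and rng.random() < mask_rate:
            corrupted[i] = "[MASK]"
            targets[i] = tok
    return corrupted, targets
```

For example, linearize("how many singers", ["name", "age"], [["Joe", 23], ["Ann", 31]]) yields the sequence ['how', 'many', 'singers', '[SEP]', 'name', 'age', '[ROW]', 'Joe', '23', '[ROW]', 'Ann', '31'], and mask_tokens corrupts a random fraction of it while recording the reconstruction targets on which the cross-entropy loss is computed.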
TABLE 5
The pre-training objectives for text-to-SQL parsing.

Models           | Masked Language Modelling | Schema Linking | SQL Executor | Text Generation | Context Modelling
TaBERT [18]      | ✓ (MLM, MCP, CVR)         |                |              |                 |
TaPas [19]       | ✓ (MLM)                   |                |              |                 |
GraPPa [20]      | ✓ (MLM)                   | ✓ (SSP)        |              |                 |
GAP [92]         | ✓ (MLM, CRec)             | ✓ (CPred)      |              | ✓ (GenSQL)      |
STRUG [83]       |                           | ✓ (CG, VG)     |              |                 |
MATE [90]        | ✓ (MLM)                   |                |              |                 |
TAPEX [88]       |                           |                | ✓            |                 |
TableFormer [93] | ✓ (MLM)                   |                |              |                 |
SCORE [84]       | ✓ (MLM)                   | ✓ (CCS)        |              |                 | ✓ (TCS)
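The SQL executor objective listed in Table 5 (detailed in Section 5.3.3) trains a neural model to reproduce the output of a real SQL engine on a table. The ground-truth supervision for such pre-training pairs can be derived with an off-the-shelf engine; the following sketch uses Python's built-in SQLite, where the single-table layout and the table name `t` are illustrative assumptions.

```python
import sqlite3

def execute_on_table(header, rows, sql):
    """Run a SQL query against a single in-memory table named `t`,
    returning the flattened result -- the answer a neural SQL
    executor would be trained to reproduce."""
    conn = sqlite3.connect(":memory:")
    cols = ", ".join(f'"{h}"' for h in header)
    conn.execute(f"CREATE TABLE t ({cols})")
    marks = ", ".join("?" for _ in header)
    conn.executemany(f"INSERT INTO t VALUES ({marks})", rows)
    result = [cell for row in conn.execute(sql) for cell in row]
    conn.close()
    return result
```

For instance, execute_on_table(["name", "age"], [("Joe", 23), ("Ann", 31)], "SELECT name FROM t WHERE age > 30") returns ["Ann"], giving a (SQL, table, result) triple that the pre-trained model learns to map end-to-end.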
5.3.2 Schema Linking

Schema linking is a key component in text-to-SQL parsing, which learns the alignment between NL questions and the given tables. In particular, schema linking aims at identifying the references to columns, tables and condition values in NL questions. It is especially important for complex SQL generation and cross-domain generalization, where the text-to-SQL parser should be aware of which tables and columns are involved in the NL question, even when faced with tables from arbitrary domains and modelling complex semantic dependencies between NL questions and SQL queries.

Recently, several pre-training objectives [20, 92, 83, 84] have been devised to model the schema linking information by learning the correlations between NL questions and tables. GRAPPA [20] proposed a SQL Semantic Prediction (SSP) objective, which aims at predicting whether a column name appears in the SQL query and which SQL operation is triggered, conditioned on the NL question and the given table headers. The SSP objective is implemented by converting sequence labeling into an operation classification for each column, which results in 254 possible operation classes. Similar to GRAPPA [20], SCORE [84] proposed a Column Contextual Semantics (CCS) objective, aiming to predict what operation should be performed on the given column. STRUG [83] proposed three structure-grounded objectives to learn the text-table alignment, including Column Grounding (CG), Value Grounding (VG) and Column-Value mapping (CV). Concretely, the CG objective is a binary classification task, aiming at predicting whether a column is mentioned in the NL question or not. The VG objective is also transformed into binary classification, which aims at predicting whether a token is part of a grounded value, conditioned on the NL question and the table schema. To further align the grounded columns and values, the CV objective is devised to match the tokens in the NL question with the columns. Similar to the CG objective, GAP [92] also developed a Column Prediction (CPred) objective to predict whether a column is used in the NL question or not.

5.3.3 SQL Executor

Modelling structured tables plays a crucial role in text-to-SQL pre-training. TAPEX [88] proposed a SQL executor objective by pre-training a neural model to mimic a SQL executor on tables. Specifically, the neural SQL executor learns to execute the SQL query and output the corresponding correct result, which requires the model to have a deep understanding of the SQL queries and tables.

5.3.4 SQL Generation

The goal of text-to-SQL parsing is to translate NL questions into SQL queries that can be executed on the given tables. Therefore, it can be beneficial to incorporate a SQL generation objective into the pre-training methods so as to further enhance the downstream tasks. GAP [92] proposes a SQL generation objective to generate specific SQL keywords or column names in the appropriate positions, rather than merely predicting whether a column is mentioned or not.

5.3.5 Turn Contextual Switch

The above pre-training objectives primarily model stand-alone NL questions without considering context-dependent interactions, which results in sub-optimal performance for context-dependent text-to-SQL parsing. SCORE [84] is the first representative pre-training method for context-dependent text-to-SQL parsing. In particular, SCORE designs a turn contextual switch (TCS) objective to model the context flow by predicting the context switch label (from 26 possible operations) between two consecutive user utterances. Despite its effectiveness, the TCS objective ignores the complex interactions among context utterances, and it is difficult for it to track the dependence between distant utterances.

6 FUTURE DIRECTIONS

Despite the remarkable progress of previous methods, there remain several challenges for developing high-quality text-to-SQL parsers. Based on the works reviewed in this manuscript, we discuss several directions for future exploration in the field of text-to-SQL parsing.

6.1 Effective High-quality Training Data Generation

The current benchmark datasets for text-to-SQL parsing are still limited in the quality, quantity and diversity of their training data. For instance, WikiSQL [22] contains a large number of simple question-SQL pairs over single tables, which neglects the quality and diversity of the training instances. In particular, WikiSQL [22] has several limitations, described as follows. First, the training, development and testing sets in WikiSQL [22] share the same domains, without concerning the cross-domain generalization ability of
the text-to-SQL parsers. Second, the SQL queries in WikiSQL [22] are simple and do not contain complex operations such as “ORDER BY”, “GROUP BY”, “NESTED” and “HAVING”. Third, each database in WikiSQL [22] contains only one table, which simplifies the text-to-SQL parsing. Spider [23] is a cross-domain and complex benchmark dataset for text-to-SQL parsing, which contains complicated SQL queries and databases with multiple tables in different domains. Spider [23] is proposed to investigate the ability of a text-to-SQL parser to generalize not only to new SQL queries and database schemas but also to new domains. However, Spider [23] merely has about ten thousand samples, which is not large enough to build a high-quality text-to-SQL parser. There are also many data generation methods that apply rule-based methods to produce a large number of question-SQL pairs. However, these automatically generated samples are of inferior quality and usually lack diversity. Therefore, how to construct text-to-SQL corpora with high quality, large-scale quantity and high diversity is an important future exploration direction.

6.2 Handling Large-scale Tables/Databases

The tables used in current benchmark corpora usually contain fewer than ten rows and columns, owing to the input length limitation of current neural text-to-SQL models. However, in many real-world applications, the involved tables usually consist of thousands of rows and columns, which poses a big challenge to the memory and computational efficiency of existing neural text-to-SQL models. In particular, when the number or size of the involved tables becomes too large, how to encode the table schemas and retrieve appropriate knowledge from large tables is challenging. Thus, more future efforts should be made on how to (i) develop effective text-to-SQL models that can encode a long sequence of table schemas, and (ii) improve the execution efficiency of the generated SQL when a large-scale database is involved.

6.3 Structured Tabular Data Encoding

Different from textual data, the tabular data involved in text-to-SQL parsing is distributed in 2-D structures. Most text-to-SQL parsing methods first convert the 2-D table data into a linearized 1-D sequence, which is then fed into the input encoder. Such linearization methods cannot capture the structural information of tables. In addition, most previous works primarily focus on web tables, while other kinds of tabular data that contain hyperlinks, visual data, spreadsheet formulas and quantities are not considered. It is non-trivial to learn high-quality representations from such diverse data types by directly encoding a linearized input sequence. It is worth exploring text-to-SQL methods for effectively encoding the structural information of such 2-D tabular data, so that more comprehensive input representations can be learned to facilitate SQL generation.

6.4 Heterogeneous Information Modeling

Existing text-to-SQL datasets, such as Spider [23], mainly contain textual and numeric data from NL questions and tables. However, many real-life applications contain more types of data (e.g., images) in heterogeneous forms. Using only a homogeneous information source cannot satisfy the demands of some real-life applications, such as E-commerce and the metaverse. For example, [96] takes a first step towards this direction by proposing a challenging question answering dataset that requires joint reasoning over texts, tables and images. When applying text-to-SQL models in the E-commerce and metaverse domains, it is necessary to process multimodal data and aggregate information from heterogeneous information sources to obtain the correct returned results.

6.5 Cross-domain Text-to-SQL Parsing

Most existing works have focused on in-domain text-to-SQL parsing, where the training and testing sets share the same domains. However, no matter how much data is collected and applied to train a text-to-SQL parser, it is difficult to cover all possible domains of databases. Thus, when deployed in practice, a well-trained text-to-SQL parser cannot generalize to new domains and often performs unsatisfactorily. Although some cross-domain text-to-SQL datasets (e.g., Spider [23], Spider-DK [25] and Spider-SYN [24]) have been constructed for the challenging cross-domain settings, there are no text-to-SQL methods that design tailored algorithms to deal with the out-of-distribution (OOD) generalization problem, where the testing data distribution is unknown and different from the training data distribution. The supervised learning method is fragile when exposed to data with different distributions. Therefore, how to explore the OOD generalization of text-to-SQL parsers is a promising future direction for both the academic and industry communities.

6.6 Robustness of Text-to-SQL Parsing Models

The robustness of text-to-SQL parsing models poses a prominent challenge when they are deployed in real-life applications. Small perturbations in the input may significantly reduce the performance of text-to-SQL parsing models, so high accuracy requires robust behaviour on noisy inputs. Recently, Spider-SYN [24] investigated the robustness of text-to-SQL parsing models to synonym substitution by removing the explicit schema-linking correspondence between NL questions and table schemas. Spider-DK [25] investigates the generalization of text-to-SQL parsing models by injecting rarely observed domain knowledge into the NL questions so as to evaluate the models' understanding of domain knowledge. The experimental results from both [24] and [25] demonstrate that the performance of text-to-SQL models is inferior when facing such small perturbations (synonym substitution and domain knowledge injection) in the input, even though the training and test data share similar distributions. Hence, it is necessary to stabilize the neural text-to-SQL parsing models, making them more robust to different perturbations. There is very little exploration of improving the robustness of text-to-SQL models, and thus more effort should be devoted to this research direction.

6.7 Zero-shot and Few-shot Text-to-SQL Parsing

The “pre-training+fine-tuning” paradigm has been widely used for text-to-SQL parsing, yielding state-of-the-art performances. Although the text-to-SQL parsing models that
are trained via the “pre-training+fine-tuning” paradigm can obtain impressive performance, they still require a large-scale annotated dataset for fine-tuning on the downstream tasks. These neural models could show poor performance in the zero-shot setting, without any task-specific annotated training data. One possible solution is to adopt pre-trained language models (e.g., GPT-3 [97] and Codex [98]) for zero-shot transfer to downstream tasks without the fine-tuning phase. [99] and [94] revealed that large pre-trained language models can achieve competitive performance on text-to-SQL parsing without fine-tuning. For example, Codex [98] achieved execution accuracy of up to 67% on the Spider [23] development set. Therefore, zero-shot and few-shot text-to-SQL parsing is a promising direction for future exploration.

6.8 Pre-Training for Context-dependent Text-to-SQL Parsing

6.10 Data Privacy-preserving

Many text-to-SQL models have been deployed on the cloud by small businesses to process user data. Due to the sensitivity of user data, data privacy preservation could be an essential but challenging task in text-to-SQL parsing. In particular, most text-to-SQL parsing models require identifying patterns as well as relations from large-scale corpora and related database schemas, gathering all information into a central site for feature representation learning. However, the data privacy issue may prevent the neural models from building a centralized architecture, given that the relations or patterns among different database schemas may be distributed among several custodians, none of which are allowed to transfer their user data to other sites. How to effectively transform the NL questions as well as the database schema into privatized input representations on local devices, and then upload the transformed representations to the service provider, would also be a promising exploration direction.
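In the zero-shot setting discussed in Section 6.7, inference with models such as Codex typically reduces to serializing the database schema and the NL question into a prompt that the language model completes into a SQL query. A minimal sketch of the prompt construction is shown below; the comment-style format is an illustrative assumption, not a prescribed interface, and practical prompts vary by model.

```python
def build_prompt(schema, question):
    """Assemble a schema-aware prompt for zero-shot text-to-SQL.

    schema: dict mapping table name -> list of column names.
    The trailing "SELECT" cues the language model to continue
    the text with the body of the SQL query.
    """
    lines = ["-- SQLite tables:"]
    for table, columns in schema.items():
        lines.append(f"-- {table}({', '.join(columns)})")
    lines.append(f"-- Question: {question}")
    lines.append("SELECT")
    return "\n".join(lines)
```

For example, build_prompt({"singer": ["name", "age", "country"]}, "How many singers are from France?") produces a comment-prefixed prompt ending in "SELECT", which a code-oriented language model can complete into an executable query; the model call itself is omitted here.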
[3] I. Sutskever, O. Vinyals, and Q. V. Le, “Sequence ating structured queries from natural language using
to sequence learning with neural networks,” Proc. of reinforcement learning,” ArXiv preprint, 2017.
NeurIPS, 2014. [23] T. Yu, R. Zhang, K. Yang, M. Yasunaga, D. Wang, Z. Li,
[4] J. Guo, Z. Zhan, Y. Gao, Y. Xiao, J.-G. Lou, T. Liu, and J. Ma, I. Li, Q. Yao, S. Roman, Z. Zhang, and D. Radev,
D. Zhang, “Towards complex text-to-SQL in cross- “Spider: A large-scale human-labeled dataset for com-
domain database with intermediate representation,” plex and cross-domain semantic parsing and text-to-
in Proc. of ACL, 2019. SQL task,” in Proc. of EMNLP, 2018.
[5] S. Hochreiter and J. Schmidhuber, “Long short-term [24] Y. Gan, X. Chen, Q. Huang, M. Purver, J. R. Wood-
memory,” Neural computation, 1997. ward, J. Xie, and P. Huang, “Towards robustness of
[6] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, text-to-SQL models against synonym substitution,” in
L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, Proc. of ACL, 2021.
“Attention is all you need,” in Proc. of NeurIPS, 2017. [25] Y. Gan, X. Chen, and M. Purver, “Exploring underex-
[7] D. Choi, M. C. Shin, E. Kim, and D. R. Shin, “RYAN- plored limitations of cross-domain text-to-SQL gener-
SQL: Recursively applying sketch-based slot fillings alization,” in Proc. of EMNLP, 2021.
for complex text-to-SQL in cross-domain databases,” [26] P. Shaw, M.-W. Chang, P. Pasupat, and K. Toutanova,
Computational Linguistics, 2021. “Compositional generalization and natural language
[8] K. O’Shea and R. Nash, “An introduction to convolu- variation: Can a semantic parsing approach handle
tional neural networks,” ArXiv preprint, 2015. both?” in Proc. of ACL, 2020.
[9] D. Yoon, D. Lee, and S. Lee, “Dynamic self-attention: [27] Q. Min, Y. Shi, and Y. Zhang, “A pilot study for
Computing attention over words dynamically for sen- Chinese SQL semantic parsing,” in Proc. of EMNLP,
tence embedding,” ArXiv preprint, 2018. 2019.
[10] W. Hwang, J. Yim, S. Park, and M. Seo, “A compre- [28] L. Wang, A. Zhang, K. Wu, K. Sun, Z. Li, H. Wu,
hensive exploration on wikisql with table-aware word M. Zhang, and H. Wang, “DuSQL: A large-scale and
contextualization,” ArXiv preprint, 2019. pragmatic Chinese text-to-SQL dataset,” in Proc. of
[11] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, EMNLP, 2020.
“BERT: Pre-training of deep bidirectional transform- [29] P. J. Price, “Evaluation of spoken language systems:
ers for language understanding,” in Proc. of AACL, the ATIS domain,” in Speech and Natural Language:
2019. Proceedings of a Workshop Held at Hidden Valley, Penn-
[12] B. Wang, R. Shin, X. Liu, O. Polozov, and M. Richard- sylvania, June 24-27,1990, 1990.
son, “RAT-SQL: Relation-aware schema encoding and [30] D. A. Dahl, M. Bates, M. Brown, W. Fisher,
linking for text-to-SQL parsers,” in Proc. of ACL, 2020. K. Hunicke-Smith, D. Pallett, C. Pao, A. Rudnicky, and
[13] R. Cai, J. Yuan, B. Xu, and Z. Hao, “Sadga: Structure- E. Shriberg, “Expanding the scope of the ATIS task:
aware dual graph aggregation network for text-to- The ATIS-3 corpus,” in Human Language Technology:
sql,” Proc. of NeurIPS, 2021. Proceedings of a Workshop held at Plainsboro, New Jersey,
[14] R. Cao, L. Chen, Z. Chen, Y. Zhao, S. Zhu, and K. Yu, March 8-11, 1994, 1994.
“LGESQL: Line graph enhanced text-to-SQL model [31] T. Yu, R. Zhang, M. Yasunaga, Y. C. Tan, X. V. Lin, S. Li,
with mixed local and non-local relations,” in Proc. of H. Er, I. Li, B. Pang, T. Chen, E. Ji, S. Dixit, D. Proctor,
ACL, 2021. S. Shim, J. Kraft, V. Zhang, C. Xiong, R. Socher, and
[15] X. Xu, C. Liu, and D. Song, “Sqlnet: Generating struc- D. Radev, “SParC: Cross-domain semantic parsing in
tured queries from natural language without rein- context,” in Proc. of ACL, 2019.
forcement learning,” ArXiv preprint, 2017. [32] T. Yu, R. Zhang, H. Er, S. Li, E. Xue, B. Pang, X. V.
[16] B. Hui, X. Shi, R. Geng, B. Li, Y. Li, J. Sun, and Lin, Y. C. Tan, T. Shi, Z. Li, Y. Jiang, M. Yasunaga,
X. Zhu, “Improving text-to-sql with schema depen- S. Shim, T. Chen, A. Fabbri, Z. Li, L. Chen, Y. Zhang,
dency learning,” ArXiv preprint, 2021. S. Dixit, V. Zhang, C. Xiong, R. Socher, W. Lasecki,
[17] J. Huang, Y. Wang, Y. Wang, Y. Dong, and Y. Xiao, and D. Radev, “CoSQL: A conversational text-to-SQL
“Relation aware semi-autoregressive semantic parsing challenge towards cross-domain natural language in-
for nl2sql,” ArXiv preprint, 2021. terfaces to databases,” in Proc. of EMNLP, 2019.
[18] P. Yin, G. Neubig, W.-t. Yih, and S. Riedel, “TaBERT: [33] J. Guo, Z. Si, Y. Wang, Q. Liu, M. Fan, J.-G. Lou,
Pretraining for joint understanding of textual and Z. Yang, and T. Liu, “Chase: A large-scale and prag-
tabular data,” in Proc. of ACL, 2020. matic Chinese dataset for cross-database context-
[19] J. Herzig, P. K. Nowak, T. Müller, F. Piccinno, and dependent text-to-SQL,” in Proc. of ACL, 2021.
J. Eisenschlos, “TaPas: Weakly supervised table pars- [34] R. Zhong, T. Yu, and D. Klein, “Semantic evaluation
ing via pre-training,” in Proc. of ACL, 2020. for text-to-sql with distilled test suite,” in Proc. of
[20] T. Yu, C. Wu, X. V. Lin, B. Wang, Y. C. Tan, X. Yang, EMNLP, 2020.
D. R. Radev, R. Socher, and C. Xiong, “Grappa: [35] L. S. Zettlemoyer and M. Collins, “Learning to map
Grammar-augmented pre-training for table semantic sentences to logical form: Structured classification
parsing,” in Proc. of ICLR, 2021. with probabilistic categorial grammars,” Proceedings
[21] S. Iyer, I. Konstas, A. Cheung, J. Krishnamurthy, and of the 21st Conference in Uncertainty in Artificial Intel-
L. Zettlemoyer, “Learning a neural semantic parser ligence, 2005.
from user feedback,” in Proc. of ACL, 2017. [36] T. Shi, C. Zhao, J. Boyd-Graber, H. Daumé III, and
[22] V. Zhong, C. Xiong, and R. Socher, “Seq2sql: Gener- L. Lee, “On the potential of lexico-logical alignments
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING 18
for semantic parsing to SQL queries,” in Proc. of standing of a convolutional neural network,” in 2017
EMNLP Findings, 2020. international conference on engineering and technology
[37] P. Pasupat and P. Liang, “Compositional semantic (ICET), 2017.
parsing on semi-structured tables,” in Proc. of ACL, [55] Q. Liu, D. Yang, J. Zhang, J. Guo, B. Zhou, and J.-G.
2015. Lou, “Awakening latent grounding from pretrained
[38] T. Yu, Z. Li, Z. Zhang, R. Zhang, and D. Radev, language models for semantic parsing,” arXiv preprint
“TypeSQL: Knowledge-based type-aware neural text- arXiv:2109.10540, 2021.
to-SQL generation,” in Proc. of AACL, 2018. [56] J. Ma, Z. Yan, S. Pang, Y. Zhang, and J. Shen, “Mention
[39] T. Yu, M. Yasunaga, K. Yang, R. Zhang, D. Wang, Z. Li, extraction and linking for SQL query generation,” in
and D. Radev, “SyntaxSQLNet: Syntax tree networks Proc. of EMNLP, 2020.
for complex and cross-domain text-to-SQL task,” in [57] J. D. Lafferty, A. McCallum, and F. C. N. Pereira,
Proc. of EMNLP, 2018. “Conditional random fields: Probabilistic models for
[40] B. Bogin, J. Berant, and M. Gardner, “Representing segmenting and labeling sequence data,” in Proc. of
schema structure with graph neural networks for text- ICML, 2001.
to-SQL parsing,” in Proc. of ACL, 2019. [58] P. Shaw, J. Uszkoreit, and A. Vaswani, “Self-attention
[41] R. Zhang, T. Yu, H. Er, S. Shim, E. Xue, X. V. Lin, T. Shi, with relative position representations,” in Proc. of
C. Xiong, R. Socher, and D. Radev, “Editing-based AACL, 2018.
SQL query generation for cross-domain context- [59] K. Wang, W. Shen, Y. Yang, X. Quan, and R. Wang,
dependent questions,” in Proc. of EMNLP, 2019. “Relational graph attention network for aspect-based
[42] A. Kelkar, R. Relan, V. Bhardwaj, S. Vaichal, C. Kha- sentiment analysis,” in Proc. of ACL, 2020.
tri, and P. Relan, “Bertrand-dr: Improving text-to-sql [60] Y. Li, D. Tarlow, M. Brockschmidt, and R. S. Zemel,
using a discriminative re-ranker,” ArXiv preprint, 2020. “Gated graph sequence neural networks,” in Proc. of
[43] X. V. Lin, R. Socher, and C. Xiong, “Bridging textual ICLR, 2016.
and tabular data for cross-domain text-to-SQL seman- [61] B. Bogin, M. Gardner, and J. Berant, “Global reasoning
tic parsing,” in Proc. of EMNLP Findings, 2020. over database structures for text-to-SQL parsing,” in
[44] O. Rubin and J. Berant, “SmBoP: Semi-autoregressive Proc. of EMNLP, 2019.
bottom-up semantic parsing,” in Proc. of AACL, 2021. [62] C. Wang, P.-S. Huang, A. Polozov, M. Brockschmidt,
[45] Z. Chen, L. Chen, Y. Zhao, R. Cao, Z. Xu, S. Zhu, and R. Singh, “Execution-guided neural program de-
and K. Yu, “ShadowGNN: Graph projection neural coding,” in ICML workshop on Neural Abstract Machines
network for text-to-SQL parser,” in Proc. of AACL, and Program Induction v2 (NAMPI), 2018.
2021. [63] A. See, P. J. Liu, and C. D. Manning, “Get to the point:
[46] P. Xu, D. Kumar, W. Yang, W. Zi, K. Tang, C. Huang, Summarization with pointer-generator networks,” in
J. C. K. Cheung, S. J. Prince, and Y. Cao, “Optimizing Proc. of ACL, 2017.
deeper transformers on small datasets,” in Proc. of [64] D. C. Wang, A. W. Appel, J. L. Korn, and C. S. Serra,
ACL, 2021. “The zephyr abstract syntax description language.” in
[47] T. Scholak, N. Schucher, and D. Bahdanau, “PICARD: DSL, 1997.
Parsing incrementally for constrained auto-regressive decoding from language models," in Proc. of EMNLP, 2021.
[48] B. Hui, R. Geng, L. Wang, B. Qin, B. Li, J. Sun, and Y. Li, "S2SQL: Injecting syntax to question-schema interaction graph encoder for text-to-SQL parsers," ArXiv preprint, 2022.
[49] M. Schuster and K. K. Paliwal, "Bidirectional recurrent neural networks," IEEE Transactions on Signal Processing, 1997.
[50] W. Lei, W. Wang, Z. Ma, T. Gan, W. Lu, M.-Y. Kan, and T.-S. Chua, "Re-examining the role of schema linking in text-to-SQL," in Proc. of EMNLP, 2020.
[51] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov, "RoBERTa: A robustly optimized BERT pretraining approach," ArXiv preprint, 2019.
[52] Y. Gou, Y. Lei, L. Liu, Y. Dai, and C. Shen, "Contextualize knowledge bases with transformer for end-to-end task-oriented dialogue systems," in Proc. of EMNLP, 2021.
[53] V. Zhong, M. Lewis, S. I. Wang, and L. Zettlemoyer, "Grounded adaptation for zero-shot executable semantic parsing," in Proc. of EMNLP, 2020.
[54] S. Albawi, T. A. Mohammed, and S. Al-Zawi, "Under-
[65] R. Tarjan, "Depth-first search and linear graph algorithms," SIAM Journal on Computing, 1972.
[66] P. Shaw, M.-W. Chang, P. Pasupat, and K. Toutanova, "Compositional generalization and natural language variation: Can a semantic parsing approach handle both?" in Proc. of ACL, 2021.
[67] K. Xuan, Y. Wang, Y. Wang, Z. Wen, and Y. Dong, "SeaD: End-to-end text-to-SQL generation with schema-aware denoising," ArXiv preprint, 2021.
[68] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu, "Exploring the limits of transfer learning with a unified text-to-text transformer," ArXiv preprint, 2019.
[69] Q. Liu, B. Chen, J. Guo, J.-G. Lou, B. Zhou, and D. Zhang, "How far are we from effective context modeling? An exploratory study on semantic parsing in context," in Proc. of IJCAI, 2020.
[70] T. Luong, H. Pham, and C. D. Manning, "Effective approaches to attention-based neural machine translation," in Proc. of EMNLP, 2015.
[71] P. Jain and M. Lapata, "Memory-based semantic parsing," Transactions of the Association for Computational Linguistics, 2021.
[72] Z. Chen, L. Chen, H. Li, R. Cao, D. Ma, M. Wu, and K. Yu, "Decoupled dialogue modeling and semantic parsing for multi-turn text-to-SQL," ArXiv preprint arXiv:2106.02282, 2021.
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING 19
[73] Y. Zheng, H. Wang, B. Dong, X. Wang, and C. Li, "HIE-SQL: History information enhanced network for context-dependent text-to-SQL semantic parsing," ArXiv preprint arXiv:2203.07376, 2022.
[74] B. Hui, R. Geng, Q. Ren, B. Li, Y. Li, J. Sun, F. Huang, L. Si, P. Zhu, and X. Zhu, "Dynamic hybrid relation network for cross-domain context-dependent semantic parsing," ArXiv preprint, 2021.
[75] Y. Cai and X. Wan, "IGSQL: Database schema interaction graph based neural model for context-dependent text-to-SQL generation," in Proc. of EMNLP, 2020.
[76] R.-Z. Wang, Z.-H. Ling, J. Zhou, and Y. Hu, "Tracking interaction states for multi-turn text-to-SQL semantic parsing," in Proc. of AAAI, 2021.
[77] R.-Z. Wang, Z.-H. Ling, J.-B. Zhou, and Y. Hu, "Tracking interaction states for multi-turn text-to-SQL semantic parsing," in Proc. of AAAI, 2021.
[78] C. S. Bhagavatula, T. Noraset, and D. Downey, "Methods for exploring and mining tables on Wikipedia," in Proc. of KDD, 2013.
[79] O. Lehmberg, D. Ritze, R. Meusel, and C. Bizer, "A large public corpus of web tables containing time and context metadata," in Proc. of WWW Companion, 2016.
[80] C. S. Bhagavatula, T. Noraset, and D. Downey, "TabEL: Entity linking in web tables," in Proc. of ISWC, 2015.
[81] A. Parikh, X. Wang, S. Gehrmann, M. Faruqui, B. Dhingra, D. Yang, and D. Das, "ToTTo: A controlled table-to-text generation dataset," in Proc. of EMNLP, 2020.
[82] Y. Sun, Z. Yan, D. Tang, N. Duan, and B. Qin, "Content-based table retrieval for web queries," Neurocomputing, 2019.
[83] X. Deng, A. H. Awadallah, C. Meek, O. Polozov, H. Sun, and M. Richardson, "Structure-grounded pre-training for text-to-SQL," in Proc. of NAACL, 2021.
[84] T. Yu, R. Zhang, A. Polozov, C. Meek, and A. H. Awadallah, "SCoRe: Pre-training for context representation in conversational semantic parsing," in Proc. of ICLR, 2021.
[85] W. Yang, P. Xu, and Y. Cao, "Hierarchical neural data synthesis for semantic parsing," ArXiv preprint, 2021.
[86] B. Wang, W. Yin, X. V. Lin, and C. Xiong, "Learning to synthesize data for semantic parsing," in Proc. of NAACL, 2021.
[87] C. Shu, Y. Zhang, X. Dong, P. Shi, T. Yu, and R. Zhang, "Logic-consistency text generation from semantic parses," in Proc. of ACL Findings, 2021.
[88] Q. Liu, B. Chen, J. Guo, M. Ziyadi, Z. Lin, W. Chen, and J.-G. Lou, "TAPEX: Table pre-training via learning a neural SQL executor," ArXiv preprint, 2021.
[89] J. Pennington, R. Socher, and C. D. Manning, "GloVe: Global vectors for word representation," in Proc. of EMNLP, 2014.
[90] J. Eisenschlos, M. Gor, T. Müller, and W. Cohen, "MATE: Multi-view attention for table transformer efficiency," in Proc. of EMNLP, 2021.
[91] M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, V. Stoyanov, and L. Zettlemoyer, "BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension," in Proc. of ACL, 2020.
[92] P. Shi, P. Ng, Z. Wang, H. Zhu, A. H. Li, J. Wang, C. N. dos Santos, and B. Xiang, "Learning contextual representations for semantic parsing with generation-augmented pre-training," in Proc. of AAAI, 2021.
[93] J. Yang, A. Gupta, S. Upadhyay, L. He, R. Goel, and S. Paul, "TableFormer: Robust transformer modeling for table-text encoding," ArXiv preprint, 2022.
[94] T. Xie, C. H. Wu, P. Shi, R. Zhong, T. Scholak, M. Yasunaga, C.-S. Wu, M. Zhong, P. Yin, S. I. Wang et al., "UnifiedSKG: Unifying and multi-tasking structured knowledge grounding with text-to-text language models," ArXiv preprint, 2022.
[95] B. Wang, M. Lapata, and I. Titov, "Meta-learning for domain generalization in semantic parsing," in Proc. of NAACL, 2021.
[96] A. Talmor, O. Yoran, A. Catav, D. Lahav, Y. Wang, A. Asai, G. Ilharco, H. Hajishirzi, and J. Berant, "MultiModalQA: Complex question answering over text, tables and images," in Proc. of ICLR, 2021.
[97] L. Floridi and M. Chiriatti, "GPT-3: Its nature, scope, limits, and consequences," Minds and Machines, 2020.
[98] M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. d. O. Pinto, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman et al., "Evaluating large language models trained on code," ArXiv preprint arXiv:2107.03374, 2021.
[99] N. Rajkumar, R. Li, and D. Bahdanau, "Evaluating the text-to-SQL capabilities of large language models," ArXiv preprint, 2022.
[100] W. He, Y. Dai, M. Yang, J. Sun, F. Huang, L. Si, and Y. Li, "Unified dialog model pre-training for task-oriented dialog understanding and generation," in Proc. of SIGIR, 2022.
[101] W. He, Y. Dai, Y. Zheng, Y. Wu, Z. Cao, D. Liu, P. Jiang, M. Yang, F. Huang, L. Si et al., "GALAXY: A generative pre-trained model for task-oriented dialog with semi-supervised learning and explicit policy injection," in Proc. of AAAI, 2022.