
The VLDB Journal (2023) 32:905–936

https://doi.org/10.1007/s00778-022-00776-8

REGULAR PAPER

A survey on deep learning approaches for text-to-SQL


George Katsogiannis-Meimarakis1 · Georgia Koutrika1

Received: 27 May 2022 / Revised: 31 October 2022 / Accepted: 10 December 2022 / Published online: 23 January 2023
© The Author(s) 2023

Abstract
To bridge the gap between users and data, numerous text-to-SQL systems have been developed that allow users to pose natural
language questions over relational databases. Recently, novel text-to-SQL systems are adopting deep learning methods with
very promising results. At the same time, several challenges remain open, making this area an active and flourishing field
of research and development. To make real progress in building text-to-SQL systems, we need to de-mystify what has been
done, understand how and when each approach can be used, and, finally, identify the research challenges ahead of us. The
purpose of this survey is to present a detailed taxonomy of neural text-to-SQL systems that will enable a deeper study of all
the parts of such a system. This taxonomy will allow us to make a better comparison between different approaches, as well
as highlight specific challenges in each step of the process, thus enabling researchers to better strategise their quest towards
the “holy grail” of database accessibility.

Keywords Text-to-SQL · Deep learning · Natural language processing · Natural language interface for databases

Corresponding author: George Katsogiannis-Meimarakis ([email protected])
Georgia Koutrika ([email protected])
1 Athena Research Center, Athens, Greece

1 Introduction

In the age of the Digital Revolution, data is now an indispensable commodity that drives almost all human activities, from business operations to scientific research. Nevertheless, its explosive volume and increasing complexity make data querying and exploration challenging even for experts. Existing data query interfaces are either form-based, which are easy to use but offer limited query capabilities, or low-level tools that allow users to synthesise queries in the underlying database query language (e.g. SQL) but are intended for the few (e.g. SQL experts). To empower everyone to access, use, understand, and derive value from data, we need to lift the technical barriers that impede access to data and eliminate dependency on IT experts. Expressing queries in natural language can open up data access to everyone. In the words of E. F. Codd: "If we are to satisfy the needs of casual users of databases we must break the barriers that presently prevent these users from freely employing their native language" [14].

Towards this direction, there has been an increasing research focus on Natural Language (NL) Interfaces for Databases (NLIDBs) that allow users to pose queries in natural language and translate these queries to the underlying database query language. In particular, text-to-SQL (or NL-to-SQL) systems translate queries from NL to SQL. As the text-to-SQL problem is notoriously hard, these systems have been the holy grail of the database community for several decades [5]. Early efforts [36,37,60,110] rely primarily on the database schema and data indexes to build the respective SQL query from a NL query. A query answer is defined as a graph where nodes are the relations that contain the query keywords and edges represent the joins between them. Parsing-based approaches parse the input question to understand its grammatical structure, which is then mapped to the structure of the desired SQL query [42,53,69,90,98]. Recently, there has been a growing interest in neural machine translation (NMT) approaches [33,89,112] that formulate the text-to-SQL problem as a language translation problem, and train a neural network on a large amount of {NL query/SQL} pairs. These approaches have bloomed due to the recent advances in deep learning and natural language processing (NLP), along with the creation of two large datasets (WikiSQL [112] and Spider [107]) for training text-to-SQL systems.

As neural text-to-SQL systems are popping up "like mushrooms after a rain" with promising results, an exciting, but, at the same time, highly competitive and fast-paced research field is opening up. While a growing interest on the subject is shown by various tutorials [44,45,54] and literature reviews [1,2,5,17,40,47,55,72] presented at top conferences and journals, an in-depth, systematic study and taxonomy of neural approaches for text-to-SQL is missing. We believe that in order to make real progress in building text-to-SQL systems, we need to de-mystify what has been done, understand how and when each model and approach can be used, and recognise the research challenges ahead of us. Two earlier works [2,55] study rule-based approaches that originated from the database community; our work has a different scope, focusing entirely on deep learning systems. Additionally, two studies consider both rule-based and neural text-to-SQL systems: [47] provides a taxonomy of both types of systems and an experimental evaluation based on a new accuracy metric proposed by the authors, while [72] provides a large-scale overview of rule-based, neural and conversational NLIDBs. The biggest difference with these works is that we present an in-depth taxonomy tailored to neural systems and their peculiarities (while also covering more and newer efforts). Finally, three studies focus on neural text-to-SQL systems: [1] provides an overview of the neural text-to-SQL landscape, but in a more bare-bones manner compared to our work, and [17,40], which are the closest to our work, since they both attempt to organise the existing neural text-to-SQL approaches. However, our work goes in greater depth than these works, both by presenting a taxonomy with additional dimensions, but also by using this taxonomy to analyse and compare different systems and design choices. We also point the interested reader to recent surveys on semantic parsing [43] and context-dependent semantic parsing [56], two broader domains that the text-to-SQL problem is a part of.

In a nutshell, this survey aims at catching up with recent advances in deep learning text-to-SQL systems and systematically organising all the different techniques that have been proposed for each step of the translation process. Our objective is to (a) put different neural text-to-SQL works in perspective, (b) create a fine-grained taxonomy that covers each step of the neural text-to-SQL pipeline, (c) explain and organise all the techniques used for each dimension of the taxonomy, (d) use the taxonomy to compare and highlight the strengths and weaknesses of different systems and techniques, and (e) highlight open challenges and research opportunities for the database and the machine learning communities. Our study is also relevant to other areas, including the broader area of data exploration (e.g. natural language explanations, recommendations), entity resolution, and query optimisation, where the methods presented here may be transferred to or inspire the development of new methods.

In particular, our contributions are the following:

– We present the current state of the deep learning text-to-SQL landscape, the particularities of the problem, the benchmarks and evaluation methods that are most commonly used, and a wide spectrum of the most recent efforts that leverage the latest and most sophisticated deep learning approaches
– We provide a taxonomy that not only enables a side-by-side comparison of the systems but also allows decomposing the text-to-SQL problem in a number of sub-problems and categorising existing techniques accordingly
– We provide a detailed discussion of methods used in these systems, taking advantage of our taxonomy to highlight the advantages and shortcomings of different design choices
– We discuss in detail open challenges that are highlighted from our study and provide directions for critical future research

The rest of this paper is organised as follows: Sect. 2 provides a definition and explanation of the text-to-SQL problem, including an analysis of the challenges that make the problem so hard. In Sect. 3, we present the datasets that are currently fuelling the creation of deep learning systems. We also touch on the problem of evaluating system performance based on these benchmarks. Section 4 presents a fine-grained taxonomy for deep learning text-to-SQL systems, analysing the most important steps followed by all systems and presenting current work, open problems and hints for future research for each step. Section 5 gives an overview of the main neural building blocks used for text-to-SQL systems, as well as their most common usage. Having established a concrete set of axes for comparing and classifying text-to-SQL systems, in Sect. 6, a multitude of neural systems are presented and compared based on the aforementioned taxonomy, allowing the reader to grasp the progress that has been made in this domain and the differences between key approaches. In Sect. 7, we take advantage of the taxonomy to compare different design choices and provide useful insights for researchers and practitioners that are interested in implementing a novel text-to-SQL system. Finally, Sect. 8 aims at inspiring practitioners and researchers in the fields of database systems, natural language processing and deep learning, by shedding light on open problems that need to be addressed, as well as closely related areas that could both give and receive benefit from research done in the text-to-SQL problem.

2 The text-to-SQL problem

The text-to-SQL problem can be described as follows:


Given a natural language query (NLQ) on a Relational Database (RDB) with a specific schema, produce a SQL query equivalent in meaning, which is valid for the said RDB and that when executed will return results that match the user's intent.

Fig. 1 The text-to-SQL problem

A NLQ may be expressed as a complete and fluent utterance (e.g. "What movies has Spielberg directed since 2012?") or it may be just a few keywords (e.g. "Italian Restaurants in Vienna"). A text-to-SQL example can be seen in Fig. 1. Translating a NLQ to SQL hides challenges related to the understanding of the input NL query as well as related to building the correct (syntactically and semantically) SQL query based on the underlying database schema.

2.1 NL challenges

Ambiguity Natural language is inherently ambiguous, which means that it allows the formulation of expressions that are open to more than one interpretation. There are several types of ambiguity [3,66]. We describe the most common ones below.
Lexical ambiguity (or polysemy) refers to a single word having multiple meanings. For example, "Paris" can be a city or a person.
Syntactic ambiguity refers to a sentence having multiple interpretations based on its syntactic structure. For example, the question "Find all German movie directors" can be parsed into "directors that have directed German movies" or "directors from Germany that have directed a movie".
Semantic ambiguity refers to a sentence with multiple semantic interpretations. For instance, "Are Brad and Angelina married?" may mean they are married to each other or separately.
Context-dependent ambiguity refers to a term having different meanings depending on the query context, the data domain, and the user goals. The most common example terms are "top" and "best". Based on the query context, for the query "Who was the best runner of the marathon?", the one who completed the race faster (min operation) should be returned, but when asking "Which was the best nation of the 2004 Olympics?" the one with the most medals (max operation) is expected. Based on the domain, for the query "Return the top movie" on a movie database, "top" may mean based on the number of ratings collected. On the other hand, for the query "Return the top scorer" on a football database, "top" refers to the number of goals scored. Based on the user, for a business analyst, the query "Return the top product" should return the most profitable products, whereas for a consumer it should return the top-rated products.
Paraphrasing In natural language, two sentences can have the exact same meaning but be expressed in two completely different ways. For instance, "How many people live in Texas?" and "What is the population of Texas?". Both translate to the same SQL query, but the second one may actually be easier for a system because it is likely that a "population" attribute exists in the database schema, and thus, the user intent can be inferred with high confidence. Paraphrasing includes synonymy where multiple words have the same meaning (e.g. "movies" and "films").
Inference A query may not contain all information needed for a system to fully understand it. The system has to infer the missing information based on the given context. We distinguish two main types of inference:
Elliptical queries are sentences from which one or more words are omitted but can still be understood in the context of the sentence.¹ An example is "Who was the president before Obama". The fact that the query refers to US presidents needs to be inferred.
Follow-up questions are common in conversations between humans. We ask a question, receive an answer, and then ask a follow-up question assuming that the context of the first question is known. For example, "Q: Which is the capital of Germany?", "A: Berlin", "Q: What about France?". In the absence of the first question, the second one does not make sense, but given the query context, it is obvious that it is asking about the capital city of France.
User mistakes Spelling errors as well as syntactical or grammatical errors make the translation problem even more challenging.

2.2 SQL challenges

SQL syntax SQL has a strict syntax, which leads to limited expressivity compared to natural language. There are queries that are easy to express in natural language, but the respective SQL query may be complex. For example, the query "Return the movie with the best rating" maps to a nested SQL query.
Furthermore, while a sentence in natural language may contain some mistakes, and still be understood by a human, SQL is not that forgiving. An SQL query translated from a NL query needs to be syntactically and semantically correct in order to be executable over the underlying data.

¹ https://en.wikipedia.org/wiki/Ellipsis_(linguistics).


Database structure The user's conceptual model of the data, i.e. the entities, their attributes and relationships that are described in the data, may not match the database schema, and that poses several challenges.
The vocabulary gap refers to the differences between the vocabulary used by the database and the one used by the user. For example, in the query "Who was the best actress in 2011?", "actress" should map to the Actor.name attribute in the database.
Schema ambiguity is when a part of the query may map to more than one database element. For example, "model" could refer to car.model or engine.model.
Implicit join operations occur when parts of a query are translated into joins across multiple relations. For example, "Find the director of the movie "A Beautiful Mind"" entails joins due to database normalisation.
Entity modelling is the problem where a set of entities may be modelled differently, e.g. as different tables or as rows (or values) in a single table. For example, in a university database, every person is either a Student or a Faculty member, so these two relations suffice. On the other hand, movies have several genres that cannot be stored as different tables. They are stored in a Genre relation and are connected with movies through a many-to-many relationship. As a result, similar queries, such as "Find comedies released in 2018" and "Find students enrolled in 2018", need in fact to be handled differently. The system maps "comedies" to a value in the Genre table and joins it with the Movie table, whereas it maps "students" to the Student relation.

3 Datasets and evaluation

To build a neural text-to-SQL system, it is necessary to consider the available datasets for training and evaluation, as well as the evaluation methodology for testing and comparing its performance to other systems. A text-to-SQL dataset (or benchmark) refers to a set of NL/SQL query pairs defined over one or more databases.
Early system evaluations did not rely on common datasets; they rather employed a variety of datasets that combined different databases and query sets of varying size and complexity. In general, the query sets were small and designed in an ad-hoc way by the system developers, and as a result it was hard to reach meaningful conclusions about the translation capabilities of a system. Often, the query sets were proprietary and hence not available to reproduce the experiments. The lack of a common dataset to be used by different system evaluations and the poor cross-system evaluations impeded a fair system comparison and a clear view of the text-to-SQL landscape. In addition to these shortcomings, training deep learning text-to-SQL systems requires a substantial query set. As a result, for a long time, the lack of appropriate datasets delayed the adoption of deep learning techniques for the text-to-SQL problem.
This situation drastically changes with the emergence of WikiSQL [112] and Spider [107], in 2017 and 2018, respectively. These are the first large-scale, multi-domain benchmarks that made it possible to train and evaluate neural text-to-SQL systems and provided a common tool to compare different systems easily. While other benchmarks have followed, these two remain the most popular ones. Table 1 summarises and compares the two benchmarks.
This section provides an overview of various text-to-SQL datasets (summarised in Table 2), covering either a single or multiple domains, as well as the evaluation methodologies for comparing the system predictions to the ground truth.

Table 1 A comparison of the two most popular text-to-SQL benchmarks: WikiSQL and Spider

  WikiSQL                      | Spider
  -----------------------------|----------------------------
  Crowd-sourced                | Created by experts
  25K Wikipedia tables         | 200 databases, 138 domains
  80K NL questions             | 10K NL questions
  Single-table, simple queries | Complex queries
  Contains errors              | Higher quality
  No query categorisation      | 4 hardness categories

Table 2 An overview of text-to-SQL benchmarks and their size in queries and databases

  Year | Dataset              | Queries | Databases
  -----|----------------------|---------|----------
  1994 | ATIS [15,71]         | 275     | 1
  1996 | GeoQuery [109]       | 525     | 1
  2003 | Restaurants [70,84]  | 39      | 1
  2014 | Academic [53]        | 179     | 1
  2017 | IMDb [98]            | 111     | 1
       | Yelp [98]            | 68      | 1
       | Scholar [42]         | 396     | 1
       | WikiSQL [112]        | 80,654  | 24,241
  2018 | Advising [27]        | 281     | 1
       | Spider [107]         | 10,181  | 200
  2020 | MIMICSQL [92]        | 10,000  | 1
       | SQUALL [80]          | 11,276  | 1679
       | FIBEN [77]           | 300     | 1
  2021 | Spider-Syn [28]      | 8034    | 160
       | Spider-DK [29]       | 535     | 10
       | KaggleDBQA [51]      | 272     | 8
       | SEDE [34]            | 12,023  | 1
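Concretely, a benchmark entry couples a question with its SQL translation and a database identifier. The sketch below mimics the JSON record shape used by cross-domain datasets such as Spider (the field names follow Spider's released files, but the grouping logic is only an assumption about how a training pipeline might organise the data):

```python
# A benchmark entry in the style of Spider's released JSON files;
# the record content is illustrative.
example = {
    "db_id": "concert_singer",
    "question": "How many singers do we have?",
    "query": "SELECT count(*) FROM singer",
}

# A dataset is a list of such records; grouping them by database lets
# train/dev/test splits keep databases disjoint, which is what makes
# the setting "cross-domain".
dataset = [example]
by_db = {}
for ex in dataset:
    by_db.setdefault(ex["db_id"], []).append(ex)

print(sorted(by_db))  # → ['concert_singer']
```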


3.1 Domain-specific text-to-SQL datasets

Domain-specific text-to-SQL datasets focus on one domain and typically include a single database, such as: movies and television series (IMDb [98]), restaurant and shop reviews (Yelp [98] and Restaurants [70,84]), academic research (Scholar [42] and Academic [53]), financial data (Advising [27] and FIBEN [77]), medical data (MIMICSQL [92]), and questions and answers from Stack Exchange (SEDE [34]).
Interestingly, these datasets have not seen the same widespread use as WikiSQL or Spider for a number of reasons. First, since they focus on a single domain, it is not possible to argue that a proposed system can be considered a "universal solution" even if it performs well on a specific domain. Second, their size is relatively small compared to Spider and WikiSQL, usually not surpassing a thousand examples. Third, most of these datasets do not have a pre-defined train/dev/test split, so systems trained and evaluated on them cannot be compared fairly to one another.
Even though the generalisation capability of a text-to-SQL model is an important challenge, a realistic application would most likely require a text-to-SQL system to work with a single database of a specific domain, or with a few related databases. In such a scenario, a high performance on a single domain may be even more important than a cross-domain generalisation capability, and achieving it is very challenging [34].
Furthermore, datasets such as SEDE [34] are made specifically to reflect that SQL queries in real-life scenarios can be very complex and long, having numerical computations, variable declarations, date manipulations, and other elements that are not present in the Spider and WikiSQL datasets. SEDE's authors demonstrate that the state-of-the-art systems which achieve high scores on Spider do not perform as well on SEDE, proving the necessity for new and more advanced benchmarks.

3.2 Cross-domain text-to-SQL datasets

WikiSQL WikiSQL [112] is a large crowd-sourced dataset for developing natural language interfaces for relational databases, released along with the Seq2SQL text-to-SQL system. It contains over 25,000 Wikipedia tables and over 80,000 natural language and SQL question pairs created by crowd-sourcing. Each entry in the dataset consists of a table with its columns, a Natural Language Question (NLQ) and a SQL query. Figure 2 shows an example from the dataset.
The complexity of the SQL queries found in WikiSQL is low because each query is directed to a single table and not to a relational database, and they do not use any complex SQL clause such as JOIN, GROUP BY, ORDER BY, UNION, and INTERSECTION. Additionally, WikiSQL does not allow the selection of multiple columns in a single query or the use of the asterisk (*) operator. Consequently, the proposed task is much simpler than the ultimate goal of creating a natural language interface for relational databases.
We must also note that WikiSQL contains multiple errors and ambiguities, which might hinder the performance of a model trained on it. Figure 3 demonstrates an example of a table incorrectly copied from Wikipedia that was nevertheless used to generate a pair of a NLQ and a SQL query that, ultimately, make no sense. Research even suggests that the state-of-the-art systems have reached the upper barrier of accuracy on the task [39]. This is also demonstrated by evaluating human performance on a small proportion of the dataset.
Spider Spider [107] is a large-scale complex and cross-domain semantic parsing and text-to-SQL dataset annotated by 11 Yale students. It contains 200 relational databases from 138 different domains along with over 10,000 natural language questions and over 5000 SQL queries. Its queries range from simple to hard, using all the common SQL elements, including nesting. These characteristics of the dataset, along with its high quality, since it was hand-crafted and re-checked, have led researchers to widely rely on it for building systems that can generate quite complex SQL queries.
Other cross-domain datasets Recent cross-domain datasets focus on particular aspects of the text-to-SQL problem. Spider-DK [29] extends Spider to explore system capabilities at cross-domain generalisation (i.e. robustness to domain-specific vocabulary across different domains), while Spider-Syn [28] focuses on robustness to synonyms and different vocabulary. Both datasets highlight very interesting and important requirements for a text-to-SQL system, and can be used as supplementary benchmarks.
SQUALL [80] is based on a previous dataset named WikiTableQuestions [67], consisting of NL questions posed on Wikipedia tables along with the expected answers. In contrast to WikiSQL, there are no structured queries in the WikiTableQuestions dataset. The authors of SQUALL have created the corresponding SQL queries for most of the examples in the WikiTableQuestions dataset, while also providing an alignment between words in the NLQ and the parts of the SQL query that they refer to. This additional feature could steer more thorough research on the schema linking and schema ambiguity problems (briefly mentioned in Sect. 2 and more thoroughly examined in Sect. 4).
Finally, KaggleDBQA [51] is another cross-domain dataset, although of much smaller size, that has been extracted from Kaggle and features real-world databases taken from the Web, having all the peculiarities of a DB that are missing from Spider, whose DBs were created specifically for benchmarking text-to-SQL systems. KaggleDBQA also includes documentation and metadata for its DBs, posing an interesting research question of how this additional information could be used to improve the system performance.

3.3 Evaluation metrics

Having a ground truth SQL query for each NLQ enables us to train and evaluate a deep learning text-to-SQL system on it.


Fig. 2 An example from the WikiSQL dataset

Fig. 3 An incoherent example from the WikiSQL dataset

In this section, we will present metrics used to evaluate a text-to-SQL system's predictions.
String matching (introduced as Logical Form Accuracy [112]) is the simplest accuracy metric for text-to-SQL. It considers the ground truth and predicted queries as simple strings and checks whether they are identical. A match is only found when the predicted query is written exactly as the ground truth, without taking into account that many parts of a SQL query can be written in a different order or even in a different but still equivalent way.
Execution accuracy [107,112] (or Query Accuracy [11]) is another simple approach for comparing SQL queries. For each NLQ, both the ground truth and the predicted queries are executed against the corresponding database (or table) and their results are compared. If the results are the same, then the prediction is considered correct. False positives can occur when both queries return the same results but are different on a semantic level (e.g. when they return empty results or when an aggregation function is applied to different columns that happen to return the same result).
Component matching [107] is proposed in order to obtain a better understanding of which parts of the SQL query are predicted correctly. For example, we might consider the SELECT column accuracy, i.e. the percentage of the predicted queries that have the same columns in the SELECT clause as the corresponding ground truth queries. For some parts, a more sophisticated approach might be necessary to avoid incorrect classifications. For instance, when comparing the conditions of the WHERE clause, their order should not be taken into account.
Exact set matching [107] (or Query Match Accuracy [96]) considers all the possible component matches and classifies a prediction as correct if all component matches are correct (e.g. aggregation function, condition operators, SELECT columns, etc.).
Exact set match without values is a category in the Spider [107] dataset that works in the same way as exact set matching, but does not take into account if the values that appear in the predicted query are the same as the ones that appear in the gold query. The reason for this simplification is that predicting the correct values can be very challenging, especially when these values appear in the NLQ differently to the way they are stored in the DB (e.g. the word "Greek" might imply a condition such as country="Greece"). Although this metric might be considered as common practice in the Spider benchmark, as research shows [27], disregarding values during evaluation removes an important challenge of the text-to-SQL problem.
Sub-tree elements matching (or Partial Component Match F1, PCMF1) [34] is a metric proposed to avoid a score of zero by the exact set match metric when some parts of the predicted query are correct. It considers parts of the query such as the SELECT, WHERE and FROM clauses, and it calculates the F1 score of each clause based on the precision and recall of the predicted attributes in the clause. The final PCMF1 score of a predicted query is the average F1 score of all the considered query parts. For example, in large queries, the system might predict a large part of the query correctly and make some errors in the WHERE clause. While the exact match metric would assign a score of zero even for a small mistake, the PCMF1 metric would assign a score relatively close to one, thus providing a better assessment of the system performance.
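As a rough sketch, execution accuracy can be implemented over SQLite as follows (the table and queries are invented for illustration; a real evaluation runs against each benchmark's databases). The example also reproduces the false-positive problem described above:

```python
import sqlite3

def execution_match(gold_sql: str, pred_sql: str, conn) -> bool:
    """Execution accuracy: run both queries and compare their results.

    Rows are compared as sorted multisets, since SQL result order is
    unspecified in the absence of an ORDER BY clause.
    """
    gold = sorted(map(repr, conn.execute(gold_sql).fetchall()))
    pred = sorted(map(repr, conn.execute(pred_sql).fetchall()))
    return gold == pred

# Invented toy data for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movie (title TEXT, year INT, rating REAL)")
conn.executemany("INSERT INTO movie VALUES (?, ?, ?)",
                 [("Heat", 1995, 8.3), ("Casablanca", 1942, 8.5)])

gold = "SELECT title FROM movie WHERE year < 1950"
pred = "SELECT title FROM movie WHERE rating = 8.5"
# A false positive: both queries happen to return ('Casablanca',) on
# this database, although they are not semantically equivalent.
print(execution_match(gold, pred, conn))  # → True
```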

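The PCMF1 idea can be sketched as follows, under the simplifying assumption that each clause has already been parsed into the set of its elements (a real implementation operates on query parse trees):

```python
def clause_f1(pred: set, gold: set) -> float:
    """F1 score of the predicted vs. gold elements of a single clause."""
    if not pred and not gold:
        return 1.0
    tp = len(pred & gold)  # true positives: shared elements
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(gold)
    return 2 * precision * recall / (precision + recall)

def pcm_f1(pred_clauses: dict, gold_clauses: dict) -> float:
    """Average the per-clause F1 over all clauses in either query."""
    keys = pred_clauses.keys() | gold_clauses.keys()
    return sum(clause_f1(pred_clauses.get(k, set()),
                         gold_clauses.get(k, set())) for k in keys) / len(keys)

# Invented example: the prediction misses one WHERE condition.
gold = {"SELECT": {"name"}, "FROM": {"singer"},
        "WHERE": {"age > 30", "country = 'France'"}}
pred = {"SELECT": {"name"}, "FROM": {"singer"},
        "WHERE": {"age > 30"}}

# SELECT and FROM match exactly (F1 = 1.0); WHERE has precision 1.0
# and recall 0.5 (F1 = 2/3). Exact set matching scores 0 here, while
# PCMF1 gives (1 + 1 + 2/3) / 3 ≈ 0.89.
print(round(pcm_f1(pred, gold), 2))  # → 0.89
```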

A more thorough methodology for evaluating the semantic equivalence of two SQL queries has been proposed by [47], but has yet to be adopted by any deep learning systems. This approach starts by comparing the execution result of the two queries, as well as their results on additional generated data, in case the original database contains a small amount of data. Furthermore, a prover is used to provide a proof of equivalence between the queries or a counterexample in the case of non-equivalence. If the prover cannot work for the given queries, then a query re-writer is applied on both queries and the re-written queries’ parse trees are compared. If the re-written parse trees are structurally identical then the queries are semantically equivalent; otherwise, the queries are manually evaluated by an expert. While this approach could detect matches even if queries are expressed in fundamentally different ways, the requirement of manual labour, as well as the extra processing it entails, are some of the reasons why it has not seen widespread use yet.

The metric each system uses greatly depends on the dataset that the system is created for and whose leaderboard it aims to enter.2,3 Specifically, systems that are built for the WikiSQL dataset use Logical Form Accuracy and Execution Accuracy, while systems built for the Spider dataset use Exact Set Matching without Values and Execution Accuracy. This strongly indicates the influence that benchmark creators have on the evaluation strategy of text-to-SQL systems. It also highlights the responsibility of the next benchmark creators to address the problems of current metrics and include more thorough evaluation metrics.

2 https://yale-lily.github.io/spider.
3 https://github.com/salesforce/WikiSQL.

4 Taxonomy

Despite the fact that deep learning approaches have only recently become popular for the text-to-SQL problem, numerous systems have already been proposed that bring a wide variety of novelties and employ different approaches. Nevertheless, there are key parts that serve common purposes across almost all systems, which allow us to build a general model that can help us better understand them. Hence, the goal of this section is to present an overview of the most important parts of neural text-to-SQL systems as well as a taxonomy of the possible choices in each part.

Figure 4 shows an overview of a neural text-to-SQL system. The main input of a text-to-SQL system is a NL query (NLQ) and the database (DB) that the NLQ is posed on. The first step, whenever employed, is schema linking, which aims at the discovery of possible mentions of database elements (tables, columns and values) in the NLQ. These discovered schema links, along with the rest of the inputs, will be fed into the neural network that is responsible for the translation.

The core of this neural network consists of two main parts: the encoder and the decoder. The encoder takes one or more inputs of variable shapes and transforms them into one or more internal representations with fixed shapes that are consumed by the decoder. Additionally, the encoder usually infuses the representation of each input with information from the rest of the inputs, so as to create a more informed representation that better captures the instance of the problem at hand. The decoder uses the representations calculated by the encoder and makes predictions on the most probable SQL query (or parts of it).

Given that the inputs (NLQ, DB, schema links) are mainly textual, natural language representation is responsible for creating an efficient numerical representation that can be accepted by the encoder. Input encoding is the process of further structuring the inputs in a format that can be accepted by the encoder, as well as the choice of an appropriate encoder network for processing them and producing an internal hidden representation. Output decoding consists of designing the structure of the predictions that the network will make, as well as choosing the appropriate network for making such predictions (e.g. a SQL query can be viewed as a simple string, or as a structured program which follows a certain grammar). While some systems perform the NL representation and encoding steps separately (e.g. a representation based on word embeddings which is then encoded by a LSTM), in some cases, they can be almost indistinguishable (e.g. when using BERT [19]). It is even possible for all three steps to be merged into one (e.g. when using the T5 encoder–decoder pre-trained language model [74]). Finally, the neural training refers to the procedure followed for training the neural network.

The last dimension of the taxonomy is the output refinement, which can be applied during the decoding phase in order to reduce the possibility of errors and to achieve better results. Note that even though output refinement is closely related to output decoding and even interacts with the decoder, it is not a part of the neural network. As such, in most cases, it is possible to add or remove an output refinement technique once the system has been created and trained.

4.1 Schema linking

To better grasp the concept of schema linking, let us think of how a human, asked to write a SQL query from a NLQ, would start by looking at the underlying database and by trying to identify how the entities mentioned in the NL are stored in the database. In other words, they would attempt to link parts of the NLQ to the database elements they are referring to. Intuitively, a text-to-SQL system could benefit by doing the same when translating a NLQ.
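As a toy illustration of this intuition, the sketch below generates n-gram candidates from the NLQ and links them to table and column names via exact or partial string matching. It loosely mirrors the strategies surveyed in the following subsections (e.g. longest-first n-grams); the matching rules and names are our own simplifications, not the implementation of any particular system.

```python
# Toy schema linker: n-gram query candidates are matched against table
# and column names with exact or partial string matching. Illustrative
# only; real systems use far richer discovery and matching techniques.

def ngrams(tokens, max_n=6):
    for n in range(max_n, 0, -1):               # longest n-grams first
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])

def link(nlq, schema):
    """schema maps a candidate kind ('table'/'column') to a list of names."""
    links, used = [], set()
    for cand in ngrams(nlq.lower().split()):
        # discard n-grams contained in an already matched, longer one
        if any(cand in u for u in used):
            continue
        for kind, names in schema.items():
            for name in names:
                if cand == name:
                    links.append((cand, kind, name, "exact"))
                    used.add(cand)
                elif cand in name.split():      # candidate is a word of the name
                    links.append((cand, kind, name, "partial"))
                    used.add(cand)
    return links

schema = {"table": ["singer", "concert"], "column": ["singer name", "age"]}
print(link("what is the age of each singer", schema))
```

Running it on the question above links “age” to the column age and “singer” both to the singer table (exact) and to the singer name column (partial), showing how a single candidate can yield several competing links.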
912 G. Katsogiannis-Meimarakis, G. Koutrika
Fig. 4 Overview of a neural text-to-SQL system, based on the proposed taxonomy

More formally, schema linking is the process of discovering which parts of the NLQ refer to which database elements. The NLQ parts that could possibly refer to a database element are called query candidates, while the database elements that could occur in the NLQ are called database candidates. Query candidates can be words or phrases, while database candidates can be tables, columns, and values in the database. A connection between a query candidate and a database candidate is called a schema link, which can be further categorised as a table link or column link, when the query candidate maps to a table name or column name, respectively, and value link, when it matches a value of a column.

Schema linking is very challenging for a variety of reasons. Query and database candidates may not use the same vocabulary nor appear in the exact same phrasing. For example, the phrase “sang by” in the NLQ might refer to the database column “singer” (same word stem, phrased differently) or “artist” (vocabulary mismatch). This problem is even more challenging when the NLQ expresses a condition (i.e. a reference to a DB value) in a different way than how the value is stored in the DB. This is an issue because, in contrast to the table and column names of the DB, the sheer volume of data stored in a DB prohibits using all DB values as inputs to the system, making it very challenging for the system to build the correct SQL condition. For example, the word “female” might imply a condition such as “gender=F”. In this case, besides a schema link between “female” and the column “gender”, the system must also be given the value as it is stored in the DB (“F”) as part of the input, in order to use it when constructing the SQL prediction. Otherwise, it will most likely produce a condition like “gender=female”, which would return no rows. Due to the volume of a DB, finding value links is not only hard but can also be computationally expensive.

The schema linking process has two parts. Candidate discovery is the process of extracting query candidates from the NLQ and database candidates from the underlying database. Candidate matching is the process of comparing a set of query candidates and a set of database candidates and establishing the links.

Schema linking enhances the input, and a system can operate without it. Hence, performing no schema linking is possible too. In fact, while most recent systems incorporate some form of schema linking in their workflow, earlier ones (e.g. Seq2SQL [112], SQLNet [96]) and even some recent ones (e.g. HydraNet [62], T5+PICARD [76], SeaD [97]) simply rely on their neural components to make predictions.

4.1.1 Query candidate discovery

We first walk through the techniques used for discovering query candidates.

Single tokens A simple approach for finding query candidates is to consider all the single words of the NLQ as query candidates. This is obviously prone to errors as it is likely that a query candidate spans over multiple tokens (e.g. “New York”, “Iggy Pop and the Stooges”).

Multi-word candidates To find all possible query candidates, even multi-word ones, it is necessary to consider n-grams of varying length. For example, IRNet [33] uses all n-grams of length from 1 to 6 in the user question as query candidates. It processes them in descending order of length and if an n-gram is marked as a schema link, the system discards all the smaller n-grams that are contained in it, to avoid generating duplicate links. Furthermore, IRNet [33] assumes that any phrase (n-gram) appearing inside quotes must be a reference to a value stored inside the database. Note that in this case, the system not only discovers a query candidate, but also asserts that the database candidate that will be linked to it must be a value.

Named entities ValueNet [10] adds an extra step for intelligent candidate discovery, by performing Named Entity Recognition (NER) on the user’s NLQ to discover possible query candidates. This technique is very effective in discovering candidates that refer to a widely known entity such as a place or a person but might not generalise to entities that are specific to a certain domain. ValueNet asserts that candidates discovered through NER refer to a DB value, i.e. the DB candidate they will be matched to must be a value. TypeSQL
[103] uses the Freebase4 Knowledge Graph to perform NER. It searches for five types of entities, namely: Person, Place, Country, Organization and Sport. However, the query candidates that are found to be Named Entities are not matched to a DB candidate, but simply marked with the entity type that describes them.

4 https://developers.google.com/freebase.

Additional candidates As mentioned earlier, creating correct conditions can be even more challenging when the value is not expressed in the NLQ exactly as it is stored in the DB. ValueNet [10] proposes an improved pipeline for generating additional candidates for value links that consists of: (a) identifying possible query candidates using NER, (b) generating additional candidates by looking up similar values in the database and by using string manipulation, and (c) validating all the generated candidates by confirming they appear in the database. The validated candidates are then given to the system, to aid it in generating correct conditions. Let us consider the following example, where the NLQ contains the phrase “New York”, but the DB contains the value “NY”. ValueNet would recognise “New York” as a named entity, it would generate additional similar candidates (e.g. “N. York”, “N.Y.” and “NY”) and it would look them up in the DB. Doing so, it would discover that only “NY” appears in the DB, and would only add this value in the input to help the system create a correct condition (e.g. “state=NY”).

4.1.2 Database candidate discovery

Table and column names The first and most obvious source for database candidates are the names of the tables and columns of the database. Given that most databases contain a relatively small number of tables and columns, all of them can be database candidates.

Values via lookup Values stored in the database comprise another large pool of database candidates. However, due to the volume of data, iterating over all the DB values is not feasible performance-wise. Indexes have been widely used in earlier text-to-SQL systems, which do not rely on deep learning [36,53], to accelerate the search. ValueNet [10] also uses indexes and computationally cheap methods for retrieving values from the DB. It is necessary to note that a database lookup requires the use of an already discovered query candidate. In order to avoid greedily looking up all the query candidates, the system might only look up certain query candidates that seem more likely to refer to a value (e.g. because they are found inside quotes or based on heuristics).

Values via knowledge graphs IRNet [33] assumes that access to the database contents is not possible and employs the knowledge graph ConceptNet [82] for recognising value links. As a first step, IRNet considers that all n-grams beginning and ending with single quotes are query candidates referring to values. In order to discover the DB column or table that could contain a value such as the discovered query candidate, the system searches each candidate in the knowledge graph and only keeps two types of results: is-type-of and related-terms. For example, when searching for “New York” in ConceptNet, one of the returned results is is-type-of “state”. This result helps IRNet link “New York” to a column named “state” or similarly. Note that this approach stands out from what has been discussed so far, in the way that a value link is discovered using an intermediate candidate (knowledge graph result) and the column names.

4.1.3 Candidate matching

Having discovered the query and database candidates, an efficient method is needed for comparing them to identify possible links. As discussed earlier, candidates are not always expressed in the same way on both sides, so identifying links is not straightforward. Techniques that can recognise semantic similarities between candidates are required.

Exact and partial matching The simplest approach is to look for exact and partial matches, as is done by IRNet [33]. An exact match requires that the candidates are identical, while a partial match occurs when one candidate is a substring of the other. Admittedly, this approach is bare-bones and while it can discover more obvious links, it can also result in false positive matches when candidates share the same words (e.g. “residence” would be considered a partial match with “former residence”).

Fuzzy/approximate string matching Another useful technique for identifying matches when the linked candidates are written differently is approximate string matching. An example of such an approach is the Damerau–Levenshtein distance [16], used by ValueNet [10]. While such techniques aid in identifying matches with different spelling or spelling mistakes (e.g. “color”–“colour”), they cannot handle synonyms and thus are not robust to the use of different vocabulary.

Learned embeddings To calculate the similarity between words of the NLQ and schema entities, an earlier work in the area of semantic parsing [49] proposes the use of learned word embeddings. The system learns word embeddings using the words of the text-to-SQL training corpus and combines them with additional features that are calculated using NER, edit distance and indicators for exact token and lemma match. These embeddings are then used to calculate the similarity of query candidates to DB candidates. While this approach is more expensive than previous matching techniques, it allows for much more flexible and intelligent matching. This approach has also been adopted by text-to-SQL systems [8,9].
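The string-based matchers discussed above can be sketched in a few lines of Python. Here the standard library’s difflib similarity ratio stands in for the Damerau–Levenshtein distance, and the 0.8 threshold is an arbitrary illustrative choice, not taken from any published system.

```python
# Minimal string-based candidate matcher combining exact, partial
# (substring) and fuzzy matching. difflib's ratio() is used as a stand-in
# for an edit-distance measure; threshold and rules are illustrative only.
from difflib import SequenceMatcher

def match(query_candidates, db_candidates, threshold=0.8):
    links = []
    for qc in query_candidates:
        for dc in db_candidates:
            if qc == dc:
                links.append((qc, dc, "exact"))
            elif qc in dc or dc in qc:
                links.append((qc, dc, "partial"))
            elif SequenceMatcher(None, qc, dc).ratio() >= threshold:
                links.append((qc, dc, "fuzzy"))
    return links

print(match(["colour", "singer"], ["color", "singer name"]))
```

As the survey notes, such matchers catch spelling variants like “colour”/“color” but would silently miss a synonym pair like “artist”/“singer”, which is what motivates the learned approaches.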
Classifiers Given the complexity of schema linking, it may be possible to achieve better results by training a model to perform schema linking.

A Conditional Random Field (CRF) model [50] can be trained on a small group of hand-labelled samples to recognise column links, table links and value links for numerical and textual values [11]. The predictions of this model can then be passed to the main neural network of the text-to-SQL system along with the rest of the inputs. DBTagger [86] uses a similar approach to solve the schema linking problem as a sequence tagging problem. It employs CRFs on every token of the NLQ to identify: (a) its Part of Speech (POS), (b) the schema link type (e.g. table link, value link, etc.), and (c) the specific schema element that it refers to. The authors argue that learning these three tasks in a multi-learning paradigm helps the system achieve better performance than it would if it only learned to identify the schema element each token refers to.

The SDSQL [38] system is simultaneously trained on two tasks: (a) the text-to-SQL task, similarly to all systems, and (b) the Schema Dependency Learning task. For this additional learning task, the system is essentially trained to discover schema links in the form of dependencies between the words of the NLQ and the parts of the SQL query. Namely, the possible dependencies are: select-column (S-Col), select-aggregation (S-Agg), where-column (W-Col), where-operator (W-Op) and where-value (W-Val). For example, a select-column (S-Col) label is assigned to the dependency between the column appearing in the SELECT clause and the word of the NLQ that refers to it. A deep biaffine network [23,25] is trained along with the rest of the system to detect the existence and type of these dependencies. Training data for this task is created from the already available NL and SQL pairs, by assigning dependency labels between the NLQ tokens and table columns. Although the schema links discovered by the system are not directly used for predicting the SQL query, training for both tasks simultaneously has a positive effect on the system performance. This task goes beyond the schema linking task, as some of the aforementioned dependencies include query candidates that might refer to query parts (e.g. aggregation functions and condition operations). It should also be noted that this approach has been applied to WikiSQL, but it has not yet been extended to the more challenging Spider dataset.

Neural attention While attention layers do not directly determine a match, we mention them briefly because of their capability to highlight connections between query and DB candidates, which can improve the system’s internal representation and boost its performance. SQLNet [96] was the first system to introduce such a mechanism, named Column Attention, that processes the NLQ and column names and finds relevant columns for each word of the NLQ. The Transformer [87] neural architecture, which is based on an attention mechanism, has been instrumental in the widespread use of PLMs that have become the go-to solution for input encoding, greatly benefiting the accuracy of text-to-SQL systems. Finally, RAT-SQL [89] proposed a modified Transformer layer, called Relation-Aware Transformer (RAT), that biases the attention mechanism of the Transformer towards already-known relations from the DB schema and discovered schema links.

4.2 Natural language representation

An essential step for text-to-SQL systems is creating and processing numerical representations of their NL inputs. Until recently, the most popular technique for NL representation has been pre-trained word embeddings. Recent advances in NLP, such as the introduction of the Transformer architecture [87] followed by its use to create large Pre-trained Language Models (PLMs), have tipped the scales greatly in the latter’s favour. Additionally, as new PLMs are emerging, a new research path is being paved focusing on the design of better PLMs or PLMs created specifically for certain problems (such as the text-to-SQL problem).

4.2.1 Word embeddings

Word embeddings aim at mapping each word to a unique numerical vector. While there are simplistic approaches for creating such vectors (e.g. one-hot embeddings), more advanced algorithms [65,68] aim at making the value of each vector meaningful. These vectors are usually trained from a large text corpus (e.g. Wikipedia or Twitter) using a self-supervised algorithm that is mainly based on word co-occurrences. The set of pre-trained vectors can then be used to build a model that benefits from the inherent knowledge that is present in the vectors due to their training.

For example, the GloVe [68] embeddings, which capture interesting word relationships, were frequently used by the first text-to-SQL systems. Such word relationships include words with similar meaning being near neighbours and linear substructures that indicate similar relationships between words (e.g. the distances between the word pairs Paris-France and Athens-Greece will be similar because these words share a capital-country relation). A pre-trained set of GloVe embeddings can be used to create numerical representations for the NL inputs of a model, which can then be encoded using a RNN (such as a LSTM).

4.2.2 Pre-trained language models

The introduction of the Transformer architecture [87] and its use in PLMs such as BERT [19] has led to a great performance boost in many NLP problems. The text-to-SQL problem is no exception, as the use of PLMs has quickly become the go-to
solution for NL representation. In order to understand how a PLM can be used in a text-to-SQL system, it is first necessary to highlight the difference between two main categories of PLMs: (a) encoder-only and (b) encoder–decoder models.

Encoder-only models, like BERT [19], RoBERTa [59], and TaBERT [101], take a sequential input and produce a contextualised numerical representation for each input token. The term “contextualized” marks a notable difference to word embedding techniques, which map each word to a fixed vector, while the representations given by PLMs are computed taking all tokens of the input into account. This representation can then be used by additional neural layers to make a prediction for the downstream task at hand. While these PLM representations can be seen as improved word embeddings and can be used in similar fashion (e.g. encoded using an LSTM), this is not necessary. In fact, due to the robustness of PLMs, it is possible to process their outputs using very simple and small neural networks and still achieve better results than complex networks using word embeddings.

Encoder–decoder models, like T5 [74] and BART [52], are full end-to-end models that take a sequential text input and return a sequential text output (seq-to-seq). These models produce the final output on their own, without the need for any extra neural layers, and can be used on any downstream task as long as the expected output can be modelled as a text sequence.

Furthermore, as such models are gaining more attention, the creation of task-specific PLMs is becoming a new research area of its own. Such models can be customised to work with different types of inputs and perform better on less generic tasks, such as the text-to-SQL task. There are multiple PLMs, such as GraPPa [104] and TaBERT [101], that have been designed to work with structured and tabular data as well as to better generalise in tasks that use SQL, and they can improve the performance of a text-to-SQL system when used in place of a generic PLM. It must also be noted that while most text-to-SQL systems are originally proposed with BERT [19] or another general-purpose PLM, they often manage to achieve higher scores by replacing it with a PLM, such as TaBERT [101], that was specifically pre-trained for a task that uses structured data, like the text-to-SQL task.

4.3 Input encoding

The dimension of input encoding examines how the input is structured and fed to the neural encoder of the system, so that it can be processed effectively. There are different inputs that are useful for translating a NLQ to SQL. The NLQ and the names of the DB columns and tables could be considered the minimum required input. Other features that could improve the network performance include: (a) the relationships present in the DB schema, including primary-to-foreign key relationships and relationships between columns and tables, and (b) links and additional values that have been discovered during the schema linking process.

The use of neural networks mandates the transformation of all inputs into a form that can be accepted by the network. This can be very restrictive, given how heterogeneous these types of inputs are and how difficult it is to represent them all in a single type of input. In this section, we examine the most representative choices for input encoding, while also taking into account the additional features that each choice can incorporate. We distinguish four encoding schemes: (a) separate NLQ and column encodings, (b) input serialisation, (c) encoding the NLQ with each column separately, and (d) schema graph encoding. A schematic overview of the possible encoding choices can be seen in Fig. 5.

4.3.1 Separate NLQ and column encodings

A first approach, used mostly by earlier systems (e.g. Seq2SQL [112], SQLNet [96]), is to encode the NLQ separately from the table columns. The main reason for encoding the two inputs separately is the shape mismatch between them; while the NLQ is a simple sentence (i.e. a sequence of words), the table header is a list of column names, where each name can contain multiple words, i.e. it is a sequence of sequences of words.

In Seq2SQL [112], SQLNet [96] and IncSQL [79], each word (embedding) of the NLQ is fed into a bi-directional LSTM (bi-LSTM) that produces a hidden state representation for each word. For column headers, since each column name can have multiple words, a bi-LSTM is used for each column name, and the final hidden state of each column is used as the initial representation for the column. Notice that by keeping only the last state of each column name, the representation of the header becomes a simple sequence and not a nested sequence. Since the two inputs are encoded separately, they must be combined at some point so that the output is influenced by both of them. This can be done by using cross-serial dot-product attention [61], concatenating the two representations, summing them, or using a combination of the above.

None of the studied systems that follow this encoding approach use any extra features besides the NLQ and DB columns. This may be attributed to the fact that these are some of the earliest neural text-to-SQL proposals, which did not perform schema linking and focused on the simpler WikiSQL dataset.

4.3.2 Input serialisation

A different approach is to serialise all the inputs into a single sequence and encode it all at once. This is a very common practice when using PLMs (e.g. BERT [19], T5 [74]) that create a contextualised representation of their input, because if
Fig. 5 An overview of the possible encoding choices. Pink tokens represent words of the NLQ, blue tokens represent database elements and grey tokens are auxiliary tokens (color figure online)

each input were to be encoded separately, the system would not benefit from the PLM’s contextualisation ability. This approach simplifies the encoding process and benefits from the robustness of PLMs. However, it also carries disadvantages, such as losing schema structure information and being unable to easily represent relationships between the inputs (e.g. primary-foreign key relationships, schema links, etc.). As we go through some different serialisation approaches, we will also examine how much information can be retained in each case.

It should also be noted that PLMs usually employ a few special tokens that are added to the serialised sequence. For example, BERT [19] uses the classification [CLS] and the separating [SEP] special tokens. The [CLS] token is added at the start of the sequence. Its contextualised output, which gathers information from all the tokens in the sequence thanks to the underlying attention networks, can be used to make classification predictions that concern the entire sequence. The [SEP] special token can be used to separate different sentences in the same sequence. These tokens are also useful for the text-to-SQL problem.

The simplest serialisation technique, used by several systems [35,39,63] that work on the WikiSQL dataset, creates a single input sequence that only contains the NLQ and all the table headers. The serialised sequence starts with the [CLS] token, as is common for BERT, then the NLQ tokens are appended, followed by a [SEP] token marking the end of the NLQ, and then each column name is added followed by a [SEP] token. This input is processed by BERT, which creates a contextualised representation that has the same length as the input, and that can be processed by the rest of the network to make predictions. Since these systems only work with single tables, there is not a lot of information that needs to be preserved, but it could be argued that this approach separates the column names much less strictly compared to the separate encoding approach.

IRNet [33] (when using BERT) creates an input that starts with a [CLS] token, then continues with the NLQ’s tokens followed by a [SEP] token, the name of each column of the database followed by a [SEP] token, and finally the table names of the schema, each separated with a [SEP] token as well. In order to encode discovered schema links along with the rest of the input, IRNet uses three extra tokens, namely [Column], [Table], [Value], that can be appended before a NLQ token or phrase, to mark that it was linked to a database candidate. Still, using this serialisation format, there is a lot of schema information not captured. For example, it is not possible to extract any primary-foreign key relationships, or to which table each column belongs.

Finally, BRIDGE [57] constructs an input for a PLM that starts with a [CLS] token, followed by the NLQ and a [SEP] token, as well as the tables and columns of the DB, where a [T] and [C] token is added before each table and column name, respectively, so as to better preserve each attribute’s role. The difference between IRNet’s and BRIDGE’s use of the special [C]/[Column] and [T]/[Table] tokens is that the former uses them in the NLQ part to indicate a schema link to a column or table, while the latter uses them to indicate that the tokens after a [C] or [T] token are a column or table name, respectively. BRIDGE also uses an extra third token [V] along with a value, after a column name, to mark that this value appears under the column at hand and was discovered as a possible value link to some NLQ candidate. In this case, BRIDGE uses the [V] token in the DB schema part of the input while also appending a value after it, while IRNet uses the [Value] token in the NLQ part, without providing the actual value. Additionally, all the columns belonging to a certain table are added right after the table’s name in the sequence so as to better preserve the schema structure in this serialised representation. Nevertheless, all relationships between attributes (e.g. primary/foreign keys) are still lost when following this representation.

4.3.3 Encoding NLQ with each column separately

HydraNet [62] employs a unique approach: it processes the NLQ with each column separately and makes predictions for each column independently. For each table column, a different input is constructed by concatenating the NLQ with the column name and type and the table name. Using this input, the system predicts the probability of the column at hand appearing in the SELECT clause, the probability of the
column appearing in the WHERE clause, the operation that


will be used if this column appears in the WHERE clause,
and so on. It could be argued that this approach does not
allow the system to have a complete view of the problem
instance, because the neural network makes predictions for
each column separately, without being aware of the rest of the
table columns. Nevertheless, HydraNet achieves exceptional
performance on the WikiSQL benchmark.
This approach does not utilise any additional features (e.g.
schema links). However, given that it also serialises its inputs
(albeit, only keeping a single column each time), it could
draw inspiration from the serialisation techniques described
in Sect. 4.3.2 to encode information about schema links. For
example, it could append values similarly to BRIDGE [57],
or use [Table] and [Column] tokens to explicitly mark column
and table names in the input NLQ.
It should be noted, however, that generalising this approach
to a complete relational DB would not be an easy task. First of
all, a DB usually has multiple tables, each containing mul-
tiple columns, which means that the network would have
to make predictions for a much larger number of columns,
greatly increasing time complexity for predicting a single
SQL query. Furthermore, queries posed on complete DBs
often contain JOIN clauses and other operations that depend
on more than one entity; as such, processing each column sep-
arately becomes very counter-intuitive. Finally, this approach
is based on a sketch-based decoder (more in Sect. 4.4), which
is hard to extend for complete DBs.
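To make the sketch-based decoding mentioned here concrete, the toy example below assembles a WikiSQL-style query from predicted slot values (the `SELECT $AGG $COLUMN WHERE $COLUMN $OP $VALUE` layout follows the single-table WikiSQL setting; the hard-coded "predictions" stand in for the outputs of dedicated slot classifiers):

```python
# Toy illustration of sketch-based slot filling for a WikiSQL-style sketch:
# SELECT $AGG $COLUMN WHERE $COLUMN $OP $VALUE (AND ...). The fixed slot
# vocabularies and the "predicted" indices stand in for trained classifiers.

AGGS = ["", "MAX", "MIN", "COUNT", "SUM", "AVG"]
OPS = ["=", ">", "<"]

def fill_sketch(agg_idx, sel_col, conditions, columns):
    """Assemble a SQL query from predicted slot values."""
    agg = AGGS[agg_idx]
    select = f"{agg}({columns[sel_col]})" if agg else columns[sel_col]
    query = f"SELECT {select} FROM t"
    if conditions:
        preds = [f"{columns[c]} {OPS[op]} {val!r}" for c, op, val in conditions]
        query += " WHERE " + " AND ".join(preds)
    return query

columns = ["name", "age", "country"]
# Pretend the slot classifiers predicted: no aggregation, SELECT column "age",
# one condition: country = 'France'.
print(fill_sketch(0, 1, [(2, 0, "France")], columns))
# SELECT age FROM t WHERE country = 'France'
```

Because every slot is a classification over a fixed vocabulary (or over the table's columns), the assembled query is syntactically valid by construction.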

4.3.4 Schema graph encoding

A graph is the most effective way of representing the DB
elements and their relationships. Representing and encoding the input using a graph is adopted only by a handful of systems [8,9,89]. Each node in the graph represents a database table
or a column, while their relationships can be represented by
edges that connect the respective nodes. It is also possible to
add the NLQ words as nodes in the graph, and add edges that
connect the query candidates with their equivalent database
candidates for representing all the discovered schema links.
Additionally, the graph representation used may allow for different classes of nodes and edges, leading to even higher expressivity. There can be different classes of nodes to distinguish between tables, columns and NLQ words, and different classes of edges to distinguish between edges that represent foreign-primary key relations, edges that indicate a column belonging to a table, and edges that represent schema links.
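As a minimal sketch, such a typed graph could be assembled as follows (the node and edge class names are illustrative; a real system would learn embeddings for each class):

```python
# Sketch of a question-contextualised schema graph with typed nodes and
# edges. Class names ("table", "has_column", ...) are illustrative.

def build_graph(nlq_tokens, tables, columns, fks, schema_links):
    """tables: list of table names; columns: list of (table, column) pairs;
    fks: list of ((t1, c1), (t2, c2)); schema_links: (token, (table, column))."""
    nodes = (
        [("table", t) for t in tables]
        + [("column", c) for c in columns]
        + [("word", w) for w in nlq_tokens]
    )
    edges = []
    for t, c in columns:                      # column belongs to table
        edges.append((("table", t), ("column", (t, c)), "has_column"))
    for src, dst in fks:                      # primary-foreign key relation
        edges.append((("column", src), ("column", dst), "foreign_key"))
    for word, col in schema_links:            # NLQ word mentions a column
        edges.append((("word", word), ("column", col), "schema_link"))
    return nodes, edges

nodes, edges = build_graph(
    ["singers", "from", "France"],
    ["singer"],
    [("singer", "name"), ("singer", "country")],
    fks=[],
    schema_links=[("France", ("singer", "country"))],
)
print(len(nodes), len(edges))  # 6 nodes, 3 edges
```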
Fig. 6 System categorisation on the taxonomy dimensions of natural language representation, input encoding, output decoding, neural training and output refinement

Even though representing the system input as a graph allows for minimal loss of information and can include many types of additional inputs, processing a graph with a neural network is far more difficult than processing a sequence. This
is the main reason why graphs have yet to see widespread
use in the text-to-SQL problem. However, recent advances
in graph neural networks and the clever use of Transformers [87] proposed by RAT-SQL [89] and [78], are showing very promising results and might be a good choice for future research.

4.4 Output decoding

Text-to-SQL systems following the encoder–decoder architecture can be divided into three categories based on how their decoder generates the output [13]: (a) sequence-based, (b) grammar-based, and (c) sketch-based slot-filling approaches.

4.4.1 Sequence-based approaches

This category includes systems that generate the predicted SQL, or a large part of it, as a sequence of words (comprising SQL tokens and schema elements) [11,57,112]. This decoding technique is the simplest, and was adopted by Seq2SQL [112], one of the first deep-learning text-to-SQL systems. Later systems steered away from sequence decoding because it is prone to errors.

The main drawback of sequence decoding is that it treats the SQL query as a sequence that needs to be learnt and, at prediction time, there are no measures to safeguard against producing syntactically incorrect queries. When generating a query, it does not take into account the strict SQL grammatical rules, nor does it actively prevent generating incorrect column and table names that do not exist in the DB.

Nevertheless, sequence-based approaches are starting to be used again and are proving to be very efficient thanks to two advances: (a) the introduction of large pre-trained seq-to-seq Transformer [87] models (e.g. T5 [74], BART [52]) and (b) the use of smarter decoding techniques that constrain the predictions of the decoder and prevent it from producing invalid queries (e.g. PICARD [76]).

4.4.2 Sketch-based slot-filling approaches

Systems in this category [35,39,62,63,96,103] aim at simplifying the difficult task of generating a SQL query to the easier task of predicting certain parts of the query, such as predicting the table columns that appear in the SELECT clause. In this way, the SQL generation task is transformed into a classification task. In particular, we consider a query sketch with a number of empty slots that must be filled in, and develop neural networks that predict the most probable elements for each slot. A basic prerequisite for such approaches is to have a query sketch that, when completed, will be able to capture the NLQ's intention.

While dividing the text-to-SQL problem into small sub-tasks makes it easier to generate syntactically correct queries, sketch-based approaches may have two drawbacks. Firstly, the resulting neural network architecture may end up being quite complex, since dedicated networks may be used for each slot or part of the query. Furthermore, it is hard to extend to complex SQL queries, because generating sketches for any type of SQL query is not trivial.

4.4.3 Grammar-based approaches

Systems using a grammar-based decoder [13,22,33,79,89] are an evolution of sequence-to-sequence approaches, and produce a sequence of grammar rules instead of simple tokens in their output. These grammar rules are instructions that, when applied, can create a structured query.

The grammar-based decoders most often used by text-to-SQL systems have been previously proposed for code generation as an Abstract Syntax Tree (AST) [99,100]. These models take into account the grammar of the target code language (in our case, the SQL grammar) and consider the target program to be an AST, whose nodes are expanded at every tree level using the grammar rules, until all branches reach a terminal rule. When it reaches a terminal rule, the model might generate a token, for example, a table name, an operator or a condition value, in the case of text-to-SQL. The decoder uses an LSTM-based architecture that predicts a sequence of actions, where each action is the next rule to apply to the program AST. Because the available predictions are based both on the given grammar and the current state of the AST, the possibility of generating a grammatically incorrect query is greatly reduced.

Grammar-based approaches are considered the most advantageous option for generating complex SQL queries, as sequence-based approaches were too prone to errors and sketch-based approaches are difficult to extend to complex queries. While their status has recently been challenged by the advances of sequence-based decoders discussed earlier, the quest for the most effective decoding technique is far from over.

4.5 Neural training

Another dimension that must be examined when considering a neural text-to-SQL system is the methodology that is followed to train it.

Even though the description of a system is usually focused around its architecture and neural layers, as well as the way it encodes the inputs and decodes the output, the dimension of neural training is important, because it is the process that enables the neural network to learn how to perform the task at hand.

Earlier systems adopted the simple paradigm of training the network exclusively on a text-to-SQL dataset; however, recent systems have proposed more sophisticated approaches that can greatly benefit the network performance and its generalisation capabilities.

Fresh start The most common approach is to train the network from scratch, i.e. initialise all the weights with a random initialisation algorithm and train them on a downstream task. However, recent developments in the domain of NLP are showing that pre-trained networks and self-supervised learning are able to achieve much better performance.

Transfer learning The use of transfer learning is quickly gaining ground in the NLP community, due to the introduction of Transformers [87], which greatly reduce training time compared to RNNs. Transfer learning refers to when a model trained on a different, usually more generic, task and a different dataset is incorporated into a new model and further trained on a downstream task (e.g. text-to-SQL). Language models, i.e. networks that have been trained to predict missing words or phrases on huge text corpora, are becoming the standard approach for most NLP tasks, given the performance boost they provide in almost all cases.

Some systems, such as HydraNet [62], rely on language models almost completely, only using linear output layers to produce predictions. Most systems, however, incorporate language models as an alternative or an enhancement to word embeddings and RNNs.

Additional objectives Another interesting approach that follows the success of language models and self-supervised learning is that of using additional self-supervised tasks while training for the text-to-SQL problem. Recent research [12,38,97] suggests that training neural models for more generic tasks, besides the downstream task of text-to-SQL that the model is designed to solve, can improve performance on the downstream task. When using additional objectives, one must decide whether the model should be trained on all the auxiliary objectives along with the downstream task or whether it should be first trained on the auxiliary tasks and then fine-tuned on the downstream task.

– Erosion The erosion task, proposed by [97], consists of randomly permuting, removing and adding columns to the input schema and training the model to produce the correct SQL query using the eroded schema. Additionally, the system must learn to produce an unknown token when it has to use a column that has been removed from the given schema.
– Shuffling The shuffle task, proposed by [97], randomly changes the order of schema entities and condition values in the input SQL query and NLQ, training the model to correctly re-order them.
– Graph pruning The graph pruning task, proposed by [12], trains the model to prune all the nodes of the input graph representation that are irrelevant to the given NLQ.
– Schema dependency learning SDSQL [38] proposes an additional task to the text-to-SQL task that closely resonates with the schema linking problem. SDSQL is designed for the WikiSQL dataset. Schema Dependency Learning consists of predicting which words or phrases of the NLQ have a dependency to which columns of the table, and the type of the dependency that connects them. The goal is to learn which parts of the NLQ signify that a specific column will appear in the SQL query and the role that the column will have in it (e.g. if it appears in the SELECT clause, if it implies the use of the MAX aggregation function, etc.).

Pre-training specific components Another approach is to train specific parts of the network so that they can better adjust to the peculiarities of the task. For example, GP [111] proposes a framework that pre-trains the system decoder, before training the entire system, in order to better train it on the context-free parts of the SQL grammar, e.g. SQL queries always start with SELECT, the FROM clause is second, and so forth. For this purpose, the encoder's semantic information is replaced by zero vectors so that the decoder is pre-trained without any information about the particular NLQ.

4.6 Output refinement

Once trained, a neural model can be used for inference. There is one last dimension to consider: that of output refinement, i.e. additional techniques that can be applied on a trained model to produce even better results, or to avoid producing incorrect SQL queries.

None An obvious approach is to use the trained model as is, without output refinement. The most important reason for this approach concerns time and resource availability; in some applications, it might be crucial to achieve low-latency responses or to run on everyday machines. For example, PICARD [76] increases inference time by 0.6s when running on a machine with a very high-end GPU, and arguably even more so on a personal computer. It must be noted, however, that almost all leader-board entries that achieve high results use some refinement technique.

Execution-guided decoding This is a mechanism [91] that helps prevent text-to-SQL systems from predicting SQL queries that return execution errors. Even though sketch-based approaches are designed to avoid syntactic errors, the possibility of semantic errors is ever-present. Some examples of such errors include aggregation function mismatches (e.g. using AVERAGE on a string type column), condition type mismatches (e.g. comparing a float type column with a string type value), and so forth. To avoid these types of errors, execution-guided decoding can execute partially complete SQL queries at prediction time and decide to avoid a certain prediction if the execution fails or if it returns an empty output. Execution-guided decoding is system-agnostic and can be applied to most sketch-based systems (e.g. HydraNet, IE-SQL), increasing their accuracy in almost all cases. Let us note that even though some systems presented in this work
might not be proposed using execution-guided decoding in their original paper, they are subsequently shown to perform better in the WikiSQL leaderboard when using it. For this reason, they are shown to use execution-guided decoding in Fig. 6 and Table 3.

Constrained decoding While generative models with sequence-based outputs are becoming more powerful for NL generation, they are clearly prone to errors when it comes to generating structured language like SQL. PICARD [76] proposes a novel method for incrementally parsing and constraining auto-regressive decoders, to prevent them from producing grammatical or syntactical errors. For each token prediction, PICARD examines the generated sequence so far along with the k most probable next tokens and discards all tokens that would produce a grammatically incorrect SQL query, use an attribute that is not present in the DB at hand, or use a table column without having its table in the query scope (i.e. not having the appropriate table in the FROM clause). Using PICARD, a seq-to-seq pre-trained transformer model (T5-3B [74]) has managed to reach the top of the SPIDER leader-board, lifting the barriers of using sequence-based decoders for text-to-SQL. It should be noted that while PICARD could be considered the most sophisticated constrained decoding technique, other systems with sequence-based decoders have proposed similar decoding techniques to avoid errors. Some examples of such systems are SeaD [97] and BRIDGE [57].

Discriminative re-ranking The Global-GNN parser [9] proposes an additional network that re-ranks the top-k predictions of the main text-to-SQL network and is trained separately from it. The discriminative re-ranker network takes into account the words of the NLQ and the database elements used by each of the k highest-confidence SQL predictions of the text-to-SQL network, and re-ranks them based on how relevant it believes they are. Its authors argue that while the text-to-SQL network usually predicts the correct structure for the target SQL query, it might not always predict the correct columns, tables and aggregation functions, because each of them is predicted only knowing already predicted elements and not future predictions. On the other hand, the re-ranker can look at the completed predictions and judge the use of each database element in hindsight, thus improving the prediction quality.

5 Neural architecture

Neural architecture refers to the building blocks used to create all neural parts of the system. This section examines the types of neural layers used by text-to-SQL systems and analyses the roles and functions that each one of them is often used for.

Linear networks Linear (or Dense) Neural Networks are often used as output layers for sketch-based decoders or to process an internal representation. Given that this type of neural layer is not suited for processing data in a sequence format, they are not effective at processing input such as a NLQ, or producing output in a sequence format (e.g. in a sequence or grammar-based decoder). In sketch-based decoders, however, where the network must predict the correct choice for a certain slot, linear layers are the best suited option to perform this classification task (i.e. choose the best option for filling a slot out of all the available options).

Recurrent neural networks Recurrent neural networks (RNNs) have long been considered the go-to solution for NLP, only to be recently dethroned by the powerful Transformers. The main advantage of RNNs is their ability (a) to effectively process series inputs, such as a NLQ, which is a series of words, and (b) to generate a series output, such as the condition value of a WHERE clause, or a series of grammar rules that can generate a SQL query. Well-known RNN architectures include the LSTM (Long Short-Term Memory) and the GRU (Gated Recurrent Unit). The LSTM is popular for NLP tasks and most often used in text-to-SQL systems. Early systems, such as Seq2SQL [112] and SQLNet [96], relied on LSTMs for input encoding (along with pre-trained word embeddings), but this type of use is now outperformed by pre-trained Language Models. Even though the recent success of Transformers and Language Models has greatly reduced the use of RNNs in the input encoding phase, RNNs are still being used to assist LMs in input encoding and to generate non-NL series outputs. For example, IRNet [33] uses BERT to encode the input NLQ and schema but also employs LSTMs to create single-token representations for columns and tables with more than one word in their name (and more than one token to represent them).

RNNs are also often used for generating a series output. For example, Seq2SQL [112] and SQLNet [96] employ pointer networks [88] comprised of LSTM layers that generate the entire WHERE clause or the condition value of the WHERE clause, respectively. Another case of RNNs for output generation is seen in systems (e.g. IRNet [33], RAT-SQL [89]) that employ a grammar-based decoder that generates an SQL query as an abstract syntax tree, leveraging work in semantic parsing [99] that uses LSTMs.

Transformers In text-to-SQL systems, Transformers are commonly used in Transformer-based Pre-trained Language Models for input encoding, to create a contextualised representation of the input text. Pre-trained Language Models offer more robust representations and greatly improve model performance almost all of the time, making them preferable to pre-trained word embeddings. To use them for input encoding, one can simply replace the input encoder (e.g. word embeddings and LSTM) with a model like BERT.
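The "drop-in replacement" view described here can be sketched as follows (the encoder classes are stubs; a real implementation would wrap e.g. GloVe embeddings plus an LSTM on one side and a pre-trained BERT on the other, whose hidden sizes are the only detail kept here):

```python
# Sketch of swapping the input encoder: the rest of the system only depends
# on an encode() interface, so a word-embedding+LSTM encoder can be replaced
# by a pre-trained language model. Both encoders are stubs for illustration.

class LSTMEncoder:
    def encode(self, tokens):
        # would look up word embeddings and run an LSTM; stubbed here
        return [[0.0] * 300 for _ in tokens]   # e.g. GloVe-sized vectors

class PLMEncoder:
    def encode(self, tokens):
        # would tokenise and run e.g. BERT; stubbed here
        return [[0.0] * 768 for _ in tokens]   # BERT-base hidden size

def run_system(encoder, nlq):
    """Downstream decoding stays unchanged; only the encoder is swapped."""
    states = encoder.encode(nlq.split())
    return len(states), len(states[0])

print(run_system(LSTMEncoder(), "How old is Smith?"))  # (4, 300)
print(run_system(PLMEncoder(), "How old is Smith?"))   # (4, 768)
```

Note that a real PLM would use sub-word tokenisation, so the number of output states would not generally match the number of whitespace-separated words as it does in this stub.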


There have also been other, rarer uses of Transformers in text-to-SQL systems. For example, HydraNet is a system completely reliant on a pre-trained language model. In this case, the text-to-SQL problem is formulated so that it matches the pre-training logic of a language model, and only very simple linear networks are used to make predictions using the contextualised representations created by the Language Model.

Another unique example is RAT-SQL [89], which uses specifically modified Relation Aware Transformers (RAT) to encode its input. What is special about RAT is that they also accept pre-defined relations about the elements of the input series, which essentially allows biasing the encoder towards already known relations in the database schema and the user question. A similar approach is used by [78] in order to extend the Transformer architecture to support relations between elements of the inputs, in the form of a GNN Sub-layer. This extension of the Transformer allows encoding the input as a graph, where the edges can have different layers, similarly to RAT-SQL; however, its performance is much lower on the Spider benchmark.

Conditional random fields (CRFs) CRFs [50] are a type of discriminative machine learning model that excels at modelling relations and dependencies. Because of this capability, CRFs are often used in NLP for labelling tasks such as Part-of-Speech (POS) tagging and Named Entity Recognition (NER). Even though CRFs are rarely used in text-to-SQL, there is a notable mention of a system integrating them in its neural architecture for a specific sub-task. Namely, IE-SQL [63] employs CRFs tasked with two schema-linking tasks of recognising: (a) which words in the NLQ are slot mentions of SQL elements, such as the SELECT column and the WHERE columns, and (b) slot relations, i.e. grouping each of the WHERE column mentions with the mentions of operations and values that correspond to them. Both tasks are modelled as labelling tasks, which is why CRFs are a good choice.

Convolutional neural networks (CNNs) Convolutional networks are very rarely used for the text-to-SQL task, since they are best suited for processing visual data. One example of a system using CNNs is RYANSQL [13], which uses CNNs with Dense Connections [102] in order to encode the inputs. However, the authors of RYANSQL demonstrate that replacing this CNN-based encoder with a PLM can greatly improve the model's performance, making the choice to steer away from CNNs all the more obvious.

6 Systems

Having established a taxonomy for deep learning text-to-SQL systems, let us now zoom in on key systems that have introduced novel and interesting ideas and have shaped the area. This section provides insights and explanations on these systems while also grouping them based on important milestones of this research area. Figure 7 presents a chronological view on deep learning text-to-SQL systems, along with important datasets and language representation advancements that have had a great impact on the domain. While certain systems could obviously fit in multiple sections, this specific categorisation is based on the novelty introduced by each system at the time of its publishing, its influence on later systems, as well as the possible importance of each novelty given its capability to address future and open research problems.

6.1 The dawn of an era

As mentioned before, the era of deep learning text-to-SQL systems essentially starts with the release of the first large annotated text-to-SQL dataset. WikiSQL was released along with Seq2SQL [112], which was one of the first neural networks for the text-to-SQL task and was based on previous work focusing on generating logical forms using neural networks [21]. The system predicts the aggregation function and the column for the SELECT clause as classification tasks and generates the WHERE clause using a seq-to-seq pointer network. The latter part of the system is burdened with generating parts of the query that can lead to syntactic errors, which is its major drawback.

A big difference from almost all other systems is that Seq2SQL is partly trained using reinforcement learning. While the aggregation function and SELECT column predictors are trained using cross-entropy loss, the WHERE clause predictor is trained using a reward function that returns a positive reward if the produced query returns the same results as the ground truth query, and a negative reward if the query returns different results or if it cannot be executed due to errors. The reasoning behind using reinforcement learning, even though it generally performs worse than supervised learning, is that the WHERE clause can be expressed in multiple ways and still be correct.

To address these problems, i.e. that sequence decoders can produce errors and that reinforcement learning is not ideal, SQLNet [96] proposed using a query sketch with fixed slots that, when filled, form a SQL query. This sketch can be seen in Fig. 8, and it covers all the queries present in the WikiSQL dataset. Using a sketch allowed the problem to be formulated almost entirely as a classification problem, since the network has to predict: (a) the aggregation function between a fixed number of choices, (b) the SELECT column among a number of columns present in the table, (c) the number of conditions (between 0 and 4 in the WikiSQL dataset), (d) the columns present in the WHERE clause (as multi-label classification, since they can be more than one), (e) the operation of each condition among a fixed number of operations


Fig. 7 A timeline of deep learning text-to-SQL systems, datasets and language representation techniques

Fig. 8 Query Sketch proposed by SQLNet

(≤, =, ≥) and (f) the value of each condition. Predicting the value is achieved using a sequence generator network, which in this case is only responsible for the value and not for the SQL syntax or grammar, so syntactic mistakes are avoided.

Another improvement in SQLNet is the introduction of a column attention neural architecture to the network. Given that SQLNet encodes the NLQ and table columns separately, the encoded representation of the NLQ does not have any information on the available columns and thus cannot inform the system on which words in the NLQ are important for generating the correct SQL query. Column attention is an attention mechanism that infuses the NLQ representation with information about the table columns, so as to emphasise the words that might be more related to the table. Other than that, both systems are similar to each other, using GloVe [68] embeddings for text representation and LSTM networks for encoding them.

6.2 Sketch generation

While the use of a sketch greatly simplifies the text-to-SQL problem and makes predictions simpler for neural networks, the complexity of the SQL queries the system can generate using a single sketch is restricted. Systems such as Coarse2Fine [22] and RYANSQL [13] have tried to generalise sketch-based decoding, by attempting not only to fill in the slots of a sketch but also to generate the appropriate sketch for a given NLQ.

Coarse2Fine [22] is a semantic parser that can generate various types of programs, one of which is SQL. Its main highlight is that it decomposes the decoding process into two steps: first, it generates a rough (coarse) sketch of the target program without low-level details, and then it fills this sketch with the missing (fine) details. Its authors argue that a great advantage of this approach is that the network can disentangle high-level from low-level knowledge and learn each one of them more effectively. Unfortunately, this system is only used on the WikiSQL dataset and is not extended to more complex SQL queries, which is not trivial work. In fact, because Coarse2Fine is designed for the WikiSQL dataset, the sketches it generates only differ between them in the number of conditions that appear in the WHERE clause and the operations in each condition. As such, while the idea it proposes might be very interesting, in practice, it essentially achieves generating SQL queries of no greater complexity than what simple sketch-based systems do.

RYANSQL [13] is another system that generates the appropriate sketch before filling it, but in contrast to the previous one, it manages to produce much more complex SQL queries, such as the ones present in the Spider dataset. This is achieved by breaking down each SQL query into a non-nested form that consists of multiple, simpler sub-queries. The authors propose 7 types of sub-queries, each with its own sketch, that can be combined to produce more complex queries. The network then learns to recursively predict the type of each sub-query and to subsequently fill in its sketch. RYANSQL achieved the first position in the Spider benchmark at the time of its publication, but has since been surpassed by other systems, while no other similar approach has been able to achieve comparable performance.

SyntaxSQLNet [105] follows a similar approach, but instead of generating the query sketch, it follows a pre-defined SQL grammar that determines which of its 9 slot-filling modules needs to be called to make a prediction. This allows the system to produce grammatically correct complex queries while enjoying the benefits of a sketch-based decoder. At each prediction step, the grammar and the prediction history from the previous steps are used to determine the module (e.g. COLUMN module, AGGREGATOR module, OPERATOR module, HAVING module, etc.) that needs to make a prediction in order to build the SQL query. Although this is a hybrid approach, the architecture of the decoder modules classifies SyntaxSQLNet as a sketch-based decoding system. The main difference is that most sketch-based decoders call all their slot-filling modules simultaneously to fill the sketch, whereas SyntaxSQLNet calls specific modules recursively because the grammar defines what needs to be filled in at each prediction step. SyntaxSQLNet was one of the first systems proposed for Spider. Since then, many systems have achieved better performance scores while steering away from
this methodology, hinting at its weaknesses. For example, one of the main challenges is to effectively pass all the information of the prediction history and the current state of the generated SQL to each module, at every prediction step.

6.3 Graph representations

The use of graphs for input encoding has only recently seen increased use, despite their powerful capability to represent the DB schema. This section explores key systems that have shown new perspectives on how graphs can be represented and used in the text-to-SQL task.

A natural option for processing graphs is Graph Neural Networks (GNNs). However, while being a good option for tasks such as node classification, node clustering and edge prediction, they are not as suitable for generative tasks like the text-to-SQL problem. Two systems manage to leverage GNNs to encode the database schema and its elements: the GNN parser [8] and its successor, the Global-GNN parser [9]. To achieve this, the database schema is represented as a graph, where tables and columns are represented as nodes, and different types of edges represent the relationships between them (e.g. which columns appear in which table and which columns and tables are connected with a primary-foreign key relationship). For NLQ encoding, both systems use word embeddings and LSTM networks, while node encodings calculated by the GNNs are concatenated to each word embedding, based on the discovered schema links. For decoding, both systems use a grammar-based decoder [99] that generates a SQL query as an Abstract Syntax Tree (AST), which is often used by grammar-based systems [10,33,89]. Global-GNN [9] introduces the use of a re-ranker that, given k SQL predictions from the network, chooses the best interpretation based on the database elements used and the graph representation calculated.

In order to avoid the disadvantages of GNNs, other efforts modify architectures that have already shown their power in the text-to-SQL task, such as the Transformer [87], so that they can accept edge information and process a graph. RAT-SQL [89] uses a graph representation of the input, but instead of using GNNs, it proposes a modified Transformer architecture named Relation Aware Transformer (RAT). Firstly, it creates a question-contextualised schema graph, i.e. a graph representing the database tables and columns, as well as the words of the NLQ, as nodes and the relationships between them as edges. An edge can appear either between two database nodes, similarly to the previous systems, or between a database node and a word node. In this graph, schema linking is performed to discover connections between a database node and a word node that might refer to it. The names of all the nodes in the graph are first encoded using BERT [19]

relation aware self-attention on its inputs, which essentially biases the network towards the given relations (edges). This allows the system to use Transformers and even pre-trained language models to process the graph as a series, while also utilising the information present in the graph edges. Finally, it generates a SQL query as an AST using the method mentioned above [99].

All systems discussed in this section have grammar-based decoders. This happens mainly because they aim to produce complex queries such as the ones in the Spider dataset, and at the time of their publication, grammar-based decoders were the most common option. It would be possible for a system using a graph representation of the input to use a different decoder with its own advantages and drawbacks.

6.4 Using intermediate languages

Following the success of grammar-based methods in generating complex SQL queries over multi-table DBs, researchers also examined the use of languages during the decoding phase that align better with NL than SQL, making it easier for the system to make predictions, while at the same time being deterministically translatable into SQL. We examine key systems that use an Intermediate Language, either a pre-existing language or one created specifically for this task, as the target language for the neural decoder.

IRNet [33] is a grammar-based system capable of generating complex SQL queries, such as the ones in the Spider dataset. It uses the same AST decoding method [99] for code generation used in other grammar-based text-to-SQL systems (e.g. RAT-SQL [89] and the GNN parser [8]). The main difference is that it predicts an AST of a SemQL program, which is an Intermediate Language created specifically for this system. Its authors argue that it is easier to generate queries in this language and then transform them to SQL. Furthermore, IRNet performs schema linking by considering all n-grams of length 1 to 6 as query candidates and all column and table names as DB candidates, and uses exact and partial matches to discover links between them. It also searches for all query candidates that appear inside quotes in the ConceptNet knowledge graph [82] in order to link them to a database column or table. Input encoding uses BERT followed by linear and recurrent neural networks.

SmBoP [75] is a grammar-based system that introduces various novelties in the decoding phase. The use of relational algebra as an Intermediate Language is one of them. Its authors argue that, along with being better aligned with NL, relational algebra is a language that is already used by DB engines, unlike SemQL. Additionally, in order to decode ASTs of queries in relational algebra, SmBoP uses a bottom-up parser, in contrast to the usual approach of generating
and then processed by the RAT network, along with the edge ASTs by performing top-down depth-first traversal, followed
information of each node. The RAT neural block performs by almost all text-to-SQL systems. The bottom-up decoder
924 G. Katsogiannis-Meimarakis, G. Koutrika

generates at time step t, the top-k sub-trees of height ≤ t, where k is a given parameter that represents the number of beams used during the decoding search. The main advantage of the bottom-up parsing is that at any given time-step, the generated sub-trees are meaningful and executable sub-programs, while in the top-down parsing, intermediate states are partial programs without a clear meaning.

6.5 The age of BERT

Much like in other NLP problems, replacing a conventional encoder with a pre-trained language model such as BERT [19] has been shown to improve the performance of a text-to-SQL system.

SQLova [39] is a sketch-based approach focused on the WikiSQL dataset. It employs a large and complex network almost identical to the one used by SQLNet, with its main difference being that instead of GloVe embeddings, it uses BERT to create a contextualised representation of the NLQ and table headers. The representations are then passed to six networks, each responsible for a different part of the query sketch, that are very similar to the sub-networks used by SQLNet. The result is a staggering, almost 20%, increase in execution accuracy on the test set of WikiSQL, indicating BERT's power in the text-to-SQL task.

HydraNet [62] is another sketch-based approach on the WikiSQL benchmark taking advantage of the BERT language model. Its main difference from SQLova is that HydraNet aligns itself better to the way that BERT has been pre-trained, and only uses a simple linear network after receiving the contextualised representations from BERT, instead of large networks with LSTMs and attention modules like SQLova. Furthermore, HydraNet processes each table header separately instead of jointly encoding them, an approach that is unique to this system. As a result, it can only make predictions for each column on its own, i.e. it decides if the column at hand will appear in the SELECT clause, if it will appear in the WHERE clause, what its operation will be if it appears in the WHERE clause, and so on. HydraNet, with its simpler architecture leveraging BERT, achieves better accuracy on WikiSQL than SQLova, which employs a larger and more complex network.

X-SQL [35] is a sketch-based system using the MT-DNN pre-trained language model [58] that was built for the WikiSQL benchmark. Similarly to HydraNet, it uses much simpler networks than SQLova for filling the slots of the query sketch. However, it encodes all table headers simultaneously, along with the user question. Additionally, instead of using segment embeddings, which originally indicate the span of different sentences in the language model's input, X-SQL uses type embeddings. These embeddings differentiate between the different types of elements in the input, such as the user's question, categorical columns and numerical columns. Furthermore, it uses an attention layer to create a single token representation for columns that have more than one token (i.e. more than one word in their name). X-SQL also outperforms the much more complex SQLova, achieving slightly lower scores than HydraNet.

6.6 Schema linking focus

As discussed earlier, schema linking is a major part of creating a SQL query from a NLQ. This section looks into systems that have put extra effort on schema linking, or even based their entire workflow on this process.

TypeSQL [103] is one of the first systems to introduce a process similar to schema linking in its workflow, and one of the few systems working on WikiSQL that uses schema linking. Its methodology is described as Type Recognition, but closely resembles the concept of schema linking. The goal of this methodology is to assign a "type" to every token of the NLQ. It considers all n-grams in the NLQ of length from 2 to 6 and tries to assign them one of the following "types": (a) Column, if it matches the name of a column or a value that appears under a column, (b) Integer, Float, Date or Year, if it is a numerical n-gram, (c) Person, Place, Country, Organization or Sport, by performing NER using the Freebase knowledge graph. Even though this process is unilateral, as its main goal is to classify the query candidates into a type category and not to explicitly link them to a DB candidate, it is one of the first attempts towards schema linking.

ValueNet [10] builds on the grammar-based system IRNet [33], focusing on schema linking and condition value discovery. The main motivation of the system is that despite the constant improvement of text-to-SQL systems, even the state of the art is falling behind at predicting the correct values in the SQL conditions. Similarly to IRNet, ValueNet decodes a SQL query in a SemQL 2.0 AST. SemQL 2.0 extends the SemQL grammar with values. Additionally, since condition values might not be written by the user in the exact same way they appear in the DB, ValueNet employs an extended value discovery workflow of five steps:

• value extraction: to recognise possible value mentions in the NLQ, it uses NER and heuristics;
• value candidate generation: to create additional candidate values, it uses string similarity, hand-crafted heuristics and n-grams;
• value candidate validation: to reduce the number of candidate values, it keeps only the candidates that appear in the DB;
• value candidate encoding: it appends each candidate to the input along with the table and the column it was found under, and


• neural processing: the encoded representations are processed by the neural network, which eventually decides if and where they will be used.

The authors also provide a classification of the Spider queries based on the difficulty of discovering the values. This is another important aspect of the text-to-SQL problem usually overlooked by other works.

SDSQL [38] is a sketch-based system designed for the WikiSQL task. What is special about this system is that it can be viewed as two neural networks tackling two tasks at the same time. The first network predicts SQL queries using the same architecture used by SQLova [39], while the second network performs schema dependency predictions. The schema dependency network uses bi-affine networks [24] to predict dependencies between the words of the NLQ and the table headers. Such dependencies include: (a) the select-column dependency that connects a query candidate that maps to a column that will appear in the SELECT clause with the corresponding column of the table, and (b) the where-value dependency that connects the query candidate that refers to a value that will appear in the WHERE clause to the table column it belongs to. It must be noted that even though the second network performs schema linking, its predictions are not directly used by the first network to construct the SQL query. Instead, a combined loss from the predictions of both tasks is used to train the weights of the networks, which allows the schema dependency learning to improve the first network's performance indirectly.

IE-SQL [63] proposes a unique approach to the text-to-SQL problem almost completely based on schema linking. It uses two instances of BERT [19] to perform two different tasks: a mention extractor and a linker. The mention extractor recognises which query candidates are mentions of columns that will be used in the SELECT and WHERE clauses of the SQL query, as well as mentions of aggregation functions, condition operators and condition values. Additionally, the mention extractor recognises mentions that should be grouped together. For example, the mentions of the column, the operator and the value that belong to the same condition are grouped together. Having extracted the mentions, the linker maps the mentions of column names to the actual columns of the table they are referring to. The linker also maps value mentions without a grouped column to the appropriate table column. By using the predictions of the mention extractor and the linker, IE-SQL can predict a SQL query without any additional neural component. Even though this approach may not be a clear match with any of the three decoding categories, we classify it as a sketch-based system because its methodology is heavily based on the existence of a query sketch similar to the one used by SQLNet [96]. IE-SQL can better learn the dependencies between the slots and uses a more robust approach. Still, the mention types it recognises are a direct match to the slots of the query sketch. Therefore, extending it to queries beyond the sketch is not trivial.

6.7 The return of the sequence

Generating SQL queries using a sequence-based decoder was initially avoided as it could produce syntax and grammar errors, as discussed in Section 4.4. Grammar-based decoders were instead regarded as the best choice for a system to effectively generate complex SQL queries. However, recent works [57,76,97] have changed the landscape by introducing a series of techniques that minimise the possibility of errors by sequence-based decoders. These techniques have made the use of very powerful pre-trained encoder–decoder models [52,74] a viable and high-performing option, allowing the systems that use them to achieve top performance in both the Spider and WikiSQL benchmarks.

SeaD [97] is a system based on the BART [52] encoder–decoder pre-trained language model, designed for WikiSQL. To overcome the drawbacks of its sequence-based decoder, SeaD employs two techniques: (a) it introduces two additional tasks on which the model is trained at the same time as the text-to-SQL task, and (b) it uses execution-guided decoding [91], slightly modified to work with its sequence-based decoder. Its main contribution is the use of the two additional training tasks named erosion and shuffle (see Sect. 4.5), which are designed specifically to help the model better understand the nature of the text-to-SQL problem and the tables used by the WikiSQL dataset. The use of additional training tasks is also closely aligned with how language models are pre-trained to understand the more general notion of natural language before being fine-tuned to a specific task. Nevertheless, while SeaD has managed to overcome the limitations of sequence-based decoders and achieve the best performance on the WikiSQL benchmark, both the decoding technique and the additional objectives it employs are designed with the WikiSQL dataset in mind. Extending them to full relational databases would not be a trivial matter.

BRIDGE [57] is another recent system with a sequence-based decoder that works on Spider, although it does not use an encoder–decoder language model. Instead, it uses BERT [19] and LSTM networks for input encoding and enriches the input representation using linear networks that use metadata such as foreign and primary key relationships, as well as column type information. Additionally, the system performs schema linking using fuzzy string matching between query candidates and the values of columns that only take values from a pre-defined list (i.e. picklist attributes). The discovered values are added in the input sequence to help the network create better SQL queries. Finally, the sequence-based decoder used by BRIDGE is a pointer generator network using schema-consistency guided


decoding, a constraining strategy to avoid the aforementioned drawbacks of sequence-based decoders. In order to use schema-consistency guided decoding, BRIDGE is trained (and makes predictions) on SQL queries written in execution order, i.e. all queries start with the FROM clause, followed by the WHERE, GROUP BY, HAVING, SELECT, ORDER BY and LIMIT clauses, strictly in that order. This means that all columns that appear in the query must appear after the table that they belong to has been generated. Based on this, BRIDGE can limit the search space of columns and avoid using columns that will produce invalid SQL queries.

PICARD [76] is a constraining technique for auto-regressive decoders of language models that is specifically created to improve their performance on the text-to-SQL task. Essentially, at each prediction step, it constrains the model's set of possible predictions by removing tokens that could produce syntactically and grammatically incorrect SQL queries. It is used at inference time, by looking at the confidence scores of the model's prediction and the schema of the underlying DB, and it operates at three levels:

• it rejects misspelled attributes and keywords, as well as tables and columns that are invalid for the given schema,
• it parses the output as an AST to reject grammatical errors, such as an incorrect order of keywords and clauses or an incorrect query structure,
• it checks that all used tables have been brought into scope by being included in the FROM clause and that all used columns belong to exactly one table that has been brought into scope.

When PICARD is used with the T5 [74] pre-trained language model (the 3B parameters version), it ranked first on the Spider leaderboard for execution with values. This of course does not come without any drawbacks, such as the increased prediction time due to the constrained decoding, as well as the tremendous computational and memory requirements for training and running such a large model as T5-3B.

7 Discussion and higher-level comparison

In what follows, we make several observations regarding how the landscape is shaped along the dimensions of our taxonomy, presented in Sect. 4. Table 3 provides an overview of the design choices of each system studied in this survey. Additionally, we provide some higher-level insights that can be useful for practitioners interested in introducing a deep learning text-to-SQL system in a real-world use case. These insights include remarks concerning: adaptability to new databases, difficulty of implementation, technical demands, and other advantages and drawbacks of certain design choices. A summary of these insights can be seen in Table 4.

Output decoding There is a connection between the decoding approach used by a system and the benchmark on which it operates. Systems that operate on Spider do not use a sketch-based decoder. This is due to the fact that sketch-based approaches are more cumbersome to adapt for generating complex SQL queries. RYANSQL [13] attempted extending the sketch-based approach to Spider, but later systems steered away from this choice. Furthermore, while until recently grammar-based decoders dominated the Spider benchmark and sketch-based decoders dominated WikiSQL, recent improvements in sequence-based decoders have turned the tables, bringing sequence-based decoders to the top of both benchmarks (i.e. T5-3B+PICARD [76] for Spider and SeaD [97] for WikiSQL).

The output decoder is what defines the system's SQL expressiveness and the effort needed to implement and extend the system to new types of SQL queries. For example, grammar-based decoders are harder to implement, since an extensive grammar is required in order for the system to cover all the possible SQL queries that the use case in question might require. Additionally, extending a system to use mathematical operations (e.g. WHERE end_year - start_year < 4) will require varying degrees of effort depending on the type of decoder. In the case of a sketch-based or grammar-based decoder, an extension of the sketch or grammar is necessary to cover the new query type. On the other hand, sequence-based decoders can effectively generate everything (which is usually a drawback), as long as there are training examples to learn from.

NL representation There is a clear tendency by the latest models to use PLMs for NL representation. Besides the systems that use GNNs for input encoding [8,9], the only systems that use word embeddings for NL representation were published before PLMs were widely available. In almost all cases, the use of a PLM instead of word embeddings leads to a boost in performance. This is also shown in some systems that were originally designed to work with word embeddings, but are also tested with a PLM during ablation studies (e.g. RAT-SQL [89], RYANSQL [13]). In fact, with the constant introduction of new PLMs, the question of which PLM is more suitable becomes all the more relevant. However, a major, typically overlooked, drawback of PLMs is their computational cost and hardware requirements. Even though the cost of pre-training can be alleviated because it is very easy to find a pre-trained model online, there is still the cost of training for the text-to-SQL downstream task, as well as during inference. Running a model with a PLM will also require an additional amount of computational resources (usually memory and/or a GPU) due to the size of these models. For example, BERT-base [19] has 110M parameters, BERT-large has 340M parameters, and T5 [74] has variations of similar

Table 3 Systems examined in this work

Year | System        | Benchmark | Schema linking | Natural language | Input encoding | Output decoding | Neural training | Output refinement
2017 | Seq2SQL       | WikiSQL   | ×              | WE               | Separate       | Sequence        | FS              | ×
2017 | SQLNet        | WikiSQL   | ×              | WE               | Separate       | Sketch          | FS              | ×
2018 | IncSQL        | WikiSQL   | ×              | WE               | Separate       | Grammar         | FS              | ×
2018 | TypeSQL       | WikiSQL   | ✓              | WE               | Separate       | Sketch          | FS              | ×
2018 | Coarse2Fine   | WikiSQL   | ×              | WE               | Separate       | Sketch          | FS              | ×
2018 | SyntaxSQLNet  | Spider    | ×              | WE               | Separate       | Sketch          | FS              | ×
2019 | SQLova        | WikiSQL   | ×              | E-PLM            | Serialise      | Sketch          | TL              | EG decoding
2019 | IRNet         | Spider    | ✓              | WE or E-PLM      | Serialise      | Grammar         | TL              | ×
2019 | X-SQL         | WikiSQL   | ×              | E-PLM            | Serialise      | Sketch          | TL              | EG decoding
2019 | RAT-SQL       | Spider    | ✓              | WE or E-PLM      | Graph          | Grammar         | TL              | ×
2019 | GNN           | Spider    | ✓              | WE               | Graph          | Grammar         | FS              | ×
2019 | Global-GNN    | Spider    | ✓              | WE               | Graph          | Grammar         | FS              | Re-ranking
2020 | ValueNet      | Spider    | ✓              | E-PLM            | Serialise      | Grammar         | TL              | ×
2020 | BRIDGE        | Spider    | ✓              | E-PLM            | Serialise      | Sequence        | TL              | Constr. decoding
2020 | HydraNet      | WikiSQL   | ×              | E-PLM            | Per column     | Sketch          | TL              | EG decoding
2020 | IE-SQL        | WikiSQL   | ✓              | E-PLM            | Serialise      | Sketch          | TL              | EG decoding
2020 | RYANSQL       | Spider    | ×              | WE or E-PLM      | Serialise      | Sketch          | TL              | ×
2020 | SmBoP         | Spider    | ✓              | E-PLM            | Graph          | Grammar         | TL              | ×
2021 | SDSQL         | WikiSQL   | ✓              | E-PLM            | Serialise      | Sketch          | TL + AO         | EG decoding
2021 | SeaD          | WikiSQL   | ×              | ED-PLM           | Serialise      | Sequence        | TL + AO         | Constr. decoding
2021 | T5-3B+PICARD  | Spider    | ×              | ED-PLM           | Serialise      | Sequence        | TL              | Constr. decoding

In the natural language column, WE, E-PLM and ED-PLM stand for word embeddings, encoder-only PLM and encoder–decoder PLM, respectively. In the neural training column, FS, TL and AO stand for fresh start, transfer learning and additional objective, respectively. In the output refinement column, EG decoding and constr. decoding stand for execution-guided and constrained decoding, respectively.
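To make the "Serialise" input-encoding option of Table 3 concrete, the sketch below flattens a question and a single-table schema into one input sequence for a PLM encoder. The question, the schema and the BERT-style [CLS]/[SEP] separators are illustrative assumptions, not the exact input format of any particular system; real systems typically add segment or type embeddings on top of such a sequence.

```python
def serialise_input(question: str, tables: dict) -> str:
    """Flatten a NLQ and a DB schema into a single PLM input sequence.

    `tables` maps table names to lists of column names. The separator
    convention mimics BERT-style inputs and is only one possible choice.
    """
    parts = ["[CLS]", question, "[SEP]"]
    for table, columns in tables.items():
        parts.append(table)
        for column in columns:
            parts.extend(["[SEP]", column])
        parts.append("[SEP]")
    return " ".join(parts)

# Hypothetical single-table schema in the style of WikiSQL
schema = {"matches": ["player", "opponent", "score"]}
print(serialise_input("Who beat Nadal?", schema))
```

As the discussion of input encoding below notes, this flattening works well for single tables but produces very long sequences for large multi-table schemas, which is precisely where PLM input-length limits start to bite.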
Table 4 Higher-level comparison of taxonomy dimensions on various practical dimensions (➚ signifies good performance, ➘ signifies poor performance, and ➙ signifies average performance)

[Table 4 rates the options of each taxonomy dimension (natural language: WE, E-PLM, ED-PLM; input encoding: separate, serialise, graph, per column; output decoding: sketch, sequence, grammar; neural training: fresh start, transfer learning, additional objectives; output refinement: EG decoding, constr. decoding, re-ranking) along five practical rows: ease of implementation, use with full RDBs, extension to new SQL types, computational costs, and handling large schemas.]
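The "EG decoding" refinement option that Tables 3 and 4 refer to can be sketched in a few lines: candidate queries, assumed ordered by model confidence, are executed against the DB and the first one that runs without error is kept. The toy table and the candidate predictions below are invented for illustration; the actual technique [91] is richer, filtering partial programs during beam search and also discarding candidates with empty or invalid results.

```python
import sqlite3

def execution_guided_pick(candidates, connection):
    """Return the first candidate SQL query that executes without error.

    Candidates are assumed to be ordered by model confidence; queries
    with syntax errors or invalid schema elements are skipped instead
    of being returned to the user.
    """
    for sql in candidates:
        try:
            connection.execute(sql)
            return sql
        except sqlite3.Error:
            continue  # e.g. syntax error, unknown column or table
    return None

# Toy database and model predictions (invented for illustration)
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE players (name TEXT, ranking INTEGER)")
predictions = [
    "SELECT nam FROM players",                     # unknown column: skipped
    "SELECT name FROM players WHERE",              # syntax error: skipped
    "SELECT name FROM players WHERE ranking = 1",  # executes: returned
]
```

Constrained decoding techniques such as PICARD pursue the same goal earlier in the pipeline, by pruning invalid tokens at every decoding step rather than rejecting complete queries after the fact.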

sizes that reach up to 11B parameters, while the one presented with PICARD [76] has 3B parameters. This must be considered, especially when building applications that must support heavy workloads or have low latency requirements.

Input encoding Regarding input encoding, there are two main observations to point out: (a) while earlier systems performed separate encoding, later systems use serialised or graph encoding, and (b) newer systems working on WikiSQL all use serialised encoding. The clear tendency to use serialised encoding can be easily attributed to the extensive use of PLMs, which offer much better performance with a serialised input. This is even more true in the case of WikiSQL, because single tables can easily be serialised along with the NLQ, making the combination of PLMs and serialised encoding an easy and powerful choice. However, when it comes to DBs with several tables and relationships among them and their columns, a more flexible and informative representation is required. Some systems have examined the more innovative approach of graph encoding, which so far seems promising, offering a lot of ground for future research. Another practical limitation that must be taken into account is how flexible each encoding option is when it comes to DBs with large schemas. For example, the SDSS database, which stores data from astronomical surveys, has 87 tables, with some tables containing up to a hundred columns. Serialising such a schema would result in a very long sequence that cannot be processed by a PLM due to their limitation in input length. Similarly, performing separate encoding might create a bottleneck in the schema encoder side. Graph encoding might be more efficient for handling larger schemas, since GNNs can encode each schema element as a single node. However, this approach is also prone to poorer performance as the schema gets larger.

Schema linking Table 5 displays the schema linking techniques used by each system studied in this survey. While the first text-to-SQL systems did not perform any kind of schema linking, later systems have proposed various intricate schema linking pipelines. On the query side, we observe that almost all systems consider single-word and multi-word tokens, while ValueNet [10] also performs NER to find possible candidates. On the DB side, using the table and column names is the baseline for most systems, while some systems also look up the values that are present in the DB. Finally, to match the candidates, some systems use simple text matching (either exact or partial), while newer systems have experimented with the use of classifiers instead of string operations to find matches. It becomes quickly apparent that schema linking is mostly explored by systems operating on the Spider dataset, accompanied by very few systems using the WikiSQL benchmark. This is somewhat expected, given that as the SQL complexity and the volume of tables, columns and data increase, researchers seek to aid the neural network by providing auxiliary information. However, what is very peculiar is that some high-performing recent systems (i.e. T5-3B+PICARD [76] and SeaD [97]) do not perform any schema linking at all. This is an open research question. Can powerful neural architectures, pre-trained on vast amounts of data, defy the need for schema linking? Or, can they achieve even higher scores if combined with schema linking? One important observation is that very little effort has been put into testing how fast and scalable these approaches are, especially for very large databases. In fact, to the best of our knowledge, only a single work [86] provides experimental evaluations concerning the time and memory used for schema linking. Hence, extra caution is necessary when using these methods in a real-world system, as most of them are not adequately optimised.

Neural training The neural training dimension is closely connected to the NL representation adopted by each system. This happens because using a PLM means that the model adopts the Transfer Learning paradigm, because it further trains an already pre-trained neural component on a new downstream task. There are no cases of systems performing transfer learning on other parts of the model besides the


NL representation part. This is mostly due to the fact that PLMs perform exceptionally well, and making an improvement through a different transfer learning technique would be very difficult. Furthermore, there are only two models that use additional objectives during training [57,97]. This relatively novel approach follows the success of PLMs using various auxiliary tasks during pre-training and seems to be very promising in training a model that achieves better generalisation. It must be noted that the time and computing resources needed to train a model using each training approach are usually not taken into account when presenting new models, in favour of better performance metrics. It is however necessary to address them in order to make the use of such models feasible in a real-world application. For example, the pre-training part of transfer learning is very costly, unless the pre-trained model is made available by its creators. Similarly, using additional objectives will greatly increase the computations that must be performed, thus increasing the cost of training.

Output refinement The output refinement heavily depends on the approach used for output decoding, as well as the dataset that the system operates on. A system designed for WikiSQL can use execution-guided decoding [91], no matter the type of its decoder, because of the simplicity of the WikiSQL queries. Systems with sequence-based decoders can use constrained decoding techniques to improve their predictions and reduce the possibilities of errors. In fact, this output refinement technique is one of the main reasons why they can be so effective. The re-ranking technique could be used by any system that can produce more than one prediction for a single input, but in practice it has not been adopted by any other system after being proposed by Global-GNN [9]. Furthermore, each refinement technique adds an additional burden to the system that translates to extra computational cost and more time needed to make a prediction. When used in a real-time application, it is necessary to consider if the performance boost gained from the refinement step is worth the extra time and resources required.

8 Research challenges

While a lot of progress has been made on the text-to-SQL problem, several important issues need to be tackled. In this section, we outline some of the most challenging problems and highlight interesting research opportunities for the database and the machine learning communities that could greatly impact the state of the art in text-to-SQL research and beyond.

8.1 Benchmarks

As mentioned earlier, WikiSQL and Spider are large-scale query benchmarks that provide a common way to evaluate and compare different systems. They have simplified system evaluation, and they are often seen as the panacea for text-to-SQL evaluation. Researchers tend to over-rely on these benchmarks to argue that their systems are advancing the state of the art, and they do not spend time performing additional experiments on other benchmarks. However, given the progress in system-building, new standards are necessary for benchmarking text-to-SQL systems, in order to make these systems applicable to real-world scenarios, and to continue pushing the state of the art.

First of all, datasets such as WikiSQL, that contain single-table databases and very simple SQL queries, cannot be seen as realistic benchmarks for real-world applications. Given that the SQL queries in WikiSQL can be covered by a very simple sketch, as the one shown in Fig. 8, and current systems have reached very high accuracy scores on this dataset, there is a need for more challenging benchmarks. These benchmarks were a good start for the neural text-to-SQL field and have allowed a lot of novel ideas to be implemented in a "sandbox" environment, but the state of the art is now able to achieve much more.

Similarly, Spider contains DBs and queries that were specifically created for text-to-SQL evaluation, but they are rather simplistic and do not reflect the characteristics of real-world DBs. For example, the Spider DBs have a simple schema or too little data stored. In fact, the 166 DBs of Spider that are available to the public (i.e. train and dev set, since the test set is held out by the authors) sum up to less than 1GB. Ideally, new benchmarks should aspire to introduce real cases of DBs taken from the industry and academia, accompanied with real logs of SQL queries performed on them by their users. The NLQ part could be obtained either by asking the users to specify what their intention was when running these queries, by asking SQL experts to explain them, or by employing a SQL-to-text system.

Another important drawback of current benchmarks is their relatively small number of examples (i.e. NL-SQL pairs), especially compared to datasets used by deep neural networks in other problems (e.g. the SQuAD Question Answering dataset contains more than 100K examples). Besides the obvious contribution of creating a new large-scale dataset from scratch, there are a few other paths that could be considered. For example, it would be possible to create a novel benchmark suite containing multiple previous benchmarks. This would not be trivial work, since a lot of consideration is needed concerning how to split the datasets in a way that the train and test sets could help developers understand if their system can successfully generalise to unseen DBs, domains, SQL query patterns, NLQ vocabulary, etc. Another way to create a new benchmark could be by transforming similar benchmarks used in slightly different tasks, or different query languages. For example, a text-to-SPARQL dataset, such as the CFQ dataset [46] that


Table 5 A comparison of schema linking techniques used by the examined systems; the schema linking process is divided in the query candidate discovery, DB candidate discovery and candidate matching phases, as described in our taxonomy

Year   System         Benchmark
2017   Seq2SQL        WikiSQL
       SQLNet         WikiSQL
2018   IncSQL         WikiSQL
       TypeSQL        WikiSQL
       Coarse2Fine    WikiSQL
       SyntaxSQLNet   Spider
2019   SQLova         WikiSQL
       IRNet          Spider
       X-SQL          WikiSQL
       RAT-SQL        Spider
       GNN            Spider
       Global-GNN     Spider
2020   ValueNet       Spider
       BRIDGE         Spider
       HydraNet       WikiSQL
       IE-SQL         WikiSQL
       RYANSQL        Spider
       SmBoP          Spider
2021   DBTagger       -
       SDSQL          WikiSQL
       SeaD           WikiSQL
       T5-3B+PICARD   Spider
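The three schema-linking phases compared in Table 5 can be illustrated with a minimal, hypothetical sketch: question n-grams act as query candidates, schema element names as DB candidates, and exact/partial string matching links them. All names and matching rules below are illustrative assumptions, not taken from any of the surveyed systems.

```python
# A minimal, hypothetical sketch of the three schema-linking phases compared
# in Table 5: query candidate discovery, DB candidate discovery, matching.
import re

def ngrams(tokens, max_n=3):
    # Query candidate discovery: all word n-grams of the NLQ, longest first.
    for n in range(max_n, 0, -1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])

def schema_link(question, schema_elements):
    # Candidate matching: exact and partial string matches between question
    # n-grams and normalised schema element names.
    tokens = re.findall(r"[a-z]+", question.lower())
    links = []
    for gram in ngrams(tokens):
        for element in schema_elements:
            name = element.lower().replace("_", " ")
            if gram == name:
                links.append((gram, element, "exact"))
            elif gram in name:
                links.append((gram, element, "partial"))
    return links

# DB candidate discovery would normally scan the catalog and value indices;
# here the candidate schema elements are given directly.
schema = ["singer", "concert", "stadium_name"]
links = schema_link("What is the stadium name of each concert?", schema)
```

A real system would add value lookups against DB indices and, as the table shows, often a learned matching model (e.g. a classifier) instead of string rules.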

contains more than 200 thousand NLQ-to-SPARQL examples, or the LC-QuAD dataset [85] that contains 5 thousand examples, would be very beneficial if converted to SQL (similarly to how the WikiTableQuestions [67] dataset was used to create the SQUALL [80] text-to-SQL dataset).

Another critical limitation of existing benchmarks is that they fail to address the question of what type of NL and SQL queries a system can understand and build, respectively. This is due to the lack of a clear query categorisation. For instance, Spider has four very coarse-grained classes of queries. This highlights the need for new benchmarks and in-depth system evaluations, in the spirit of [6,30], that provide fine-grained query categories and allow researchers to understand the strengths and weaknesses of a system.

Furthermore, existing benchmarks assume that for each NL query, there is only one correct SQL query. This may be restricting. First, there are NL queries that may have more than one correct translation over the data. Second, equivalent SQL queries may be written in different ways but return the same results.

Finally, while the state-of-the-art systems are still dealing with "getting the answer right", they are mostly overlooking "getting the answer fast". The database community could come up with benchmarks that focus on efficiency (not just effectiveness) and allow evaluating systems based on execution time and resource consumption, in addition to translation accuracy.

8.2 System efficiency and technical feasibility

Focusing on the translation accuracy of the system is only one side of the coin. Evaluating system efficiency is important in order to understand the viability of a solution and pinpoint the pain points that need to be addressed. Deep learning text-to-SQL systems typically rely on very complex models, which have been trained and evaluated on toy databases (like the ones contained in existing benchmarks). Hence, it comes as no surprise that they have not yet seen practical applications in real-life use-cases and domains, and their usefulness is yet to be proven. Several important challenges need to be tackled first.

Firstly, while the use of PLMs for NL representation is highly favoured by newer systems, these models introduce a large overhead at inference time, and while using larger


PLMs usually translates to higher accuracy, it also translates to higher inference times. Output refinement techniques also add extra overhead that might make a system impractical to use in a real-world scenario. For example, one of the best-performing models on the Spider dataset, T5-3B+PICARD, uses a large PLM along with a computationally intensive output refinement technique. Adapting such a model to work with fewer resources, reducing its training time, or optimising its output refinement would be a significant scientific and engineering achievement.

Furthermore, input encoding techniques such as serialisation combined with PLMs have an input size limit (usually of 512 tokens), which poses no problem for the DBs in Spider, but is restricting when working with real-world database schemas. The challenge of creating a robust input encoding technique that can efficiently work with larger schemas must also be tackled in order to make text-to-SQL systems technically feasible.

Additionally, schema linking techniques have been shown to work and be beneficial for systems working on the Spider dataset, but they have yet to be tested on a real, large-scale DB. Even though using indices and other DB lookup techniques might speed up schema linking, it is still questionable whether looking up multiple words or n-grams for every NLQ is efficient in a real application. Advanced matching techniques, such as classifiers, also introduce additional overhead. There is a lot of room for contributions in optimising schema linking, and this could be the area where the DB community has the most to offer in order to make the breakthroughs of the NLP world usable in practice.

In a nutshell, improving translation speed by building efficient methods is necessary. But this may not be enough. Text-to-SQL translation adds overhead to the overall query execution time that the user will experience, and hence needs to be weighed in. Early text-to-SQL systems originating from the DB community [36,37,53,60,110] tried to generate SQL queries that were not only correct but also optimal in terms of execution speed. Hence, many of them contained logic for generating code that would return the desired results fast. Ultimately, allowing the user to express questions in natural language should free them from the technical details of how this query should be expressed in the underlying system language and how it should be executed efficiently.

8.3 Universality of the solution

Another challenge is the universality of the solution, i.e. performing equally well for different databases. This problem becomes highly relevant when applying a text-to-SQL system to an actual database [34] that is used in a business, research or any other real-world use case. Apart from the large number of tables and attributes that we have already discussed, such databases may contain table and column names that use domain-specific terminology. For example, the SDSS [83] database has attributes such as "speccobj" (spectroscopic object) and "photoobj" (photometric object), which are unknown to and hence cannot be translated by any of the available text-to-SQL systems. That is why in real-life applications, ontologies and domain knowledge are used to enable reliable text-to-SQL translations [4,73].

It is also important to enable natural language queries in languages other than English, which is the main focus of current efforts. Due to the problem's multidisciplinarity, database, ML, and NLP approaches can join forces to push the barrier further.

8.4 Data augmentation

The need of deep learning models for a high volume of training examples, combined with the relatively small size of available benchmarks and the cost of manually creating new examples, has made data augmentation an important problem.

DBPal [94] is a template-based approach that uses manually crafted templates of NL/SQL-pairs, which can be filled with the names of tables, columns and values in order to create training instances. The NLQs can be further augmented with the use of NL techniques such as paraphrasing, random deletions and synonym substitutions. Nevertheless, such templates and NL techniques cannot work consistently across all new DBs and might often result in "robotic" or unnatural NLQs. Another approach [32] uses a similar template-based method to create SQL queries by sampling column names and values from a given table, and then applies recurrent neural networks (RNNs) to generate the equivalent NLQ. A more recent work [95] proposes a pipeline that can generate examples spanning multiple tables of a relational database. SQL queries are created using an abstract syntax tree grammar and filling them with attributes from the database. The NLQs are then generated using a hierarchical, RNN-based neural model that recursively generates explanations for all parts of the queries and then concatenates them.

However, even though some initial efforts have been made, a systematic evaluation of how each approach affects different systems, as well as of the quality of the generated data in each case, is still missing. Additionally, another research question that arises is how to train a system using domain-specific or augmented data along with a general-domain dataset such as Spider. For example, should the system be trained simultaneously on domain-specific as well as general-use data, or only on domain-specific data, or should a more advanced sampling method be used [95]?
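The template-based augmentation idea can be sketched in a few lines. This is a hypothetical illustration in the spirit of DBPal [94]: the templates, the paraphrase table and the schema below are invented for the example and are not the actual DBPal artefacts.

```python
# A hypothetical sketch of template-based NL/SQL-pair generation: templates
# are filled with table and column names, and the NLQ side is varied with a
# simple synonym substitution on the leading verb.
import itertools

TEMPLATES = [
    ("show the {col} of all {table}s", "SELECT {col} FROM {table}"),
    ("how many {table}s are there", "SELECT COUNT(*) FROM {table}"),
]

# Synonym substitution; a real pipeline would also apply paraphrasing
# and random deletions.
PARAPHRASES = {"show": ["show", "list", "display"]}

def instantiate(schema):
    # Fill every template with every table/column combination.
    for (nl_t, sql_t), (table, cols) in itertools.product(TEMPLATES, schema.items()):
        for col in cols:
            nl = nl_t.format(col=col, table=table)
            sql = sql_t.format(col=col, table=table)
            head, *rest = nl.split()
            for verb in PARAPHRASES.get(head, [head]):
                yield (" ".join([verb] + rest), sql)

schema = {"singer": ["name", "age"]}
pairs = sorted(set(instantiate(schema)))
```

Even this toy generator shows the main weakness discussed above: every NLQ follows the same rigid template shapes, which is why template-only augmentation tends to produce "robotic" questions.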


8.5 The path to data democratisation

While the text-to-SQL problem is a major research challenge, it is also important to understand that it is a piece of the greater puzzle of data democratisation. In order to allow all users, no matter their technical knowledge, to easily access data and to derive value from it, we must consider complementary problems, such as query explanations, query result explanations, and query recommendations. These problems can also benefit from and be inspired by the models and methods presented in our study.

Query explanations When a user formulates a query using a text-to-SQL system, the question is how they can confirm that the obtained results match the intention of the NLQ. Query explanations in natural language would allow the user to cross-check their NLQ against the explanations of the predicted SQL queries and validate the results. This is the SQL-to-text problem [26,48,93], which has been understudied so far, and would greatly benefit from models and methods for the text-to-SQL problem. Many interesting questions arise, including: how to transfer existing text-to-SQL methods to solve this problem, what evaluation metrics to use, and whether we can design systems that can use the same model to solve both problems. In this direction, models such as T5 seem promising.

Query result explanations In a similar vein, results to a query are typically presented in a tabular form that is not self-explanatory. Generating NL explanations for query results is another open research area [18,81]. Interestingly, while there has been considerable work on the "sibling" area of data-to-text generation [7], the problem of query result explanations (or QR-to-text) has several intricacies that do not allow directly adapting methods from the data-to-text generation domain. The need to capture query semantics (that are implied by the results), the lack of appropriate benchmarks, and the fact that query results may contain several rows from different tables that are joined are just a few of the open issues.

Query recommendations Even when the user understands the data that is kept in the database, it might not always be clear what kind of queries can be asked and what kind of knowledge can be extracted. For this reason, query recommendations can help a user find interesting queries to ask the database, either based on the user preferences and history, or on queries that are frequently asked by other users of the same database [41], or by analysing the data [31]. In this context, adapting deep-learning models for query recommendations offers numerous challenges and opportunities.

Conversational text-to-SQL Developing a conversational DB interface is another promising task, very similar to earlier non-DL approaches such as Analyza [20], which heavily involves the user in the translation process. Since our ultimate goal is creating a user-friendly and seamless experience, it would be very interesting to allow the user to access and query data solely through the power of natural language and conversation. The release of a conversational (CoSQL [106]) and a context-dependent (SParC [108]) text-to-SQL dataset, both based on the Spider [107] dataset, has allowed for more focused progress in this domain. The conversational version of the problem carries new aspects and difficulties that candidate systems must tackle. First and foremost, for each prediction, the system must take into account all previous interactions with the user (i.e. all previous NLQs and the predicted SQL queries). Additionally, it is often necessary to ask the user for clarifications when facing vague questions, or to ask the user to choose between possible interpretations of an utterance in the conversation. While some of the systems presented in this work can be adapted to work in a conversational setting, heavier modifications are often necessary in order for the model to effectively encode the conversation history and the previous SQL predictions (note that we have only discussed encoding NL and DB schemas). Ultimately, this aspect of the problem opens the path towards "intelligent data assistants" [64], similar to but far more powerful than the intelligent personal assistants that are gaining more and more popularity through our smartphones and dedicated speaker devices.

9 Conclusions

The domain of text-to-SQL translation has received increasing attention from both the database and NLP communities. The recent introduction of two large text-to-SQL datasets [107,112] has enabled the use of deep learning models and spurred a new wave of innovation. To understand which milestones have been conquered and what obstacles lie ahead, it is necessary to provide a systematic and organised study of the field.

This work explained the text-to-SQL problem and the available benchmarks, before diving into the systems. We provided a fine-grained taxonomy of deep learning text-to-SQL systems, based on six axes: (a) schema linking, (b) natural language representation, (c) input encoding, (d) output decoding, (e) neural training, and (f) output refinement. For each axis of our taxonomy, we analysed all the approaches that have been presented so far and explained their strengths and weaknesses. We relied on this taxonomy to present some of the most important systems that have been proposed, grouping them together in order to highlight their similarities, differences and innovations.

Finally, having presented the current state of the art, we discussed open challenges and research opportunities that must be tackled in order to truly advance the field of text-to-SQL, as well as broader challenges that are closely related to it. It is important to keep in mind that the ultimate goal of text-to-SQL research is to empower the casual user to


access and derive value from data. This is a goal that requires the combined effort of multiple disciplines and cannot be measured by a single performance metric.

Acknowledgements This work has been partially funded by the European Union's Horizon 2020 research and innovation program (Grant Agreement No. 863410).

Funding Open access funding provided by HEAL-Link Greece.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

References

1. Abbas, S., Khan, M.U., Lee, S.U.-J., Abbas, A., Bashir, A.K.: A review of NLIDB with deep learning: findings, challenges and open issues. IEEE Access 10, 14927–14945 (2022)
2. Affolter, K., Stockinger, K., Bernstein, A.: A comparative survey of recent natural language interfaces for databases. VLDB J. 28(5), 793–819 (2019)
3. Ambiguity. https://stanford.io/2YXcECi
4. Amer-Yahia, S., Koutrika, G., Braschler, M., Calvanese, D., Lanti, D., Lücke-Tieke, H., Mosca, A., Mendes de Farias, T., Papadopoulos, D., Patil, Y., Rull, G., Smith, E., Skoutas, D., Subramanian, S., Stockinger, K.: INODE: building an end-to-end data exploration system in practice. SIGMOD Rec. 50(4), 23–29 (2022)
5. Androutsopoulos, I., Ritchie, G.D., Thanisch, P.: Natural language interfaces to databases—an introduction. Nat. Lang. Eng. 1(1), 29–81 (1995)
6. Belmpas, T., Gkini, O., Koutrika, G.: Analysis of database search systems with THOR. In: Maier, D., Pottinger, R., Doan, A., Tan, W., Alawini, A., Ngo, H.Q. (eds.) Proceedings of the 2020 International Conference on Management of Data, SIGMOD Conference 2020, online conference [Portland, OR, USA], June 14–19, 2020, pp. 2681–2684. ACM (2020)
7. Berant, J., Deutch, D., Globerson, A., Milo, T., Wolfson, T.: Explaining queries over web tables to non-experts. In: 2019 IEEE 35th International Conference on Data Engineering (ICDE), pp. 1570–1573 (2019)
8. Bogin, B., Berant, J., Gardner, M.: Representing schema structure with graph neural networks for text-to-SQL parsing. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 4560–4565. Association for Computational Linguistics, Florence, Italy (2019)
9. Bogin, B., Gardner, M., Berant, J.: Global reasoning over database structures for text-to-SQL parsing. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 3659–3664. Association for Computational Linguistics, Hong Kong, China (2019)
10. Brunner, U., Stockinger, K.: ValueNet: a neural text-to-SQL architecture incorporating values (2020). arXiv:2006.00888
11. Cai, R., Xu, B., Zhang, Z., Yang, X., Li, Z., Liang, Z.: An encoder–decoder framework translating natural language to database queries. In: Lang, J. (ed.) Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI 2018, July 13–19, 2018, Stockholm, Sweden, pp. 3977–3983. ijcai.org (2018)
12. Cao, R., Chen, L., Chen, Z., Zhao, Y., Zhu, S., Yu, K.: LGESQL: line graph enhanced text-to-SQL model with mixed local and non-local relations. In: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 2541–2555. Association for Computational Linguistics (2021)
13. Choi, D., Shin, M.C., Kim, E., Shin, D.R.: RYANSQL: recursively applying sketch-based slot fillings for complex text-to-SQL in cross-domain databases. Comput. Linguist. 47(2), 309–332. https://doi.org/10.1162/coli_a_00403
14. Codd, E.F.: Seven steps to rendezvous with the casual user. In: Klimbie, J.W., Koffeman, K.L. (eds.) Data Base Management, Proceeding of the IFIP Working Conference Data Base Management, Cargèse, Corsica, France, April 1–5, 1974, pp. 179–200. North-Holland (1974)
15. Dahl, D.A., Bates, M., Brown, M., Fisher, W., Hunicke-Smith, K., Pallett, D., Pao, C., Rudnicky, A., Shriberg, E.: Expanding the scope of the ATIS task: the ATIS-3 corpus. In: Proceedings of the Workshop on Human Language Technology, HLT '94, pp. 43–48. Association for Computational Linguistics, USA (1994)
16. Damerau, F.J.: A technique for computer detection and correction of spelling errors. Commun. ACM 7(3), 171–176 (1964)
17. Deng, N., Chen, Y., Zhang, Y.: Recent advances in text-to-SQL: a survey of what we have and what we expect. In: Proceedings of the 29th International Conference on Computational Linguistics, pp. 2166–2187. International Committee on Computational Linguistics, Gyeongju, Republic of Korea (2022)
18. Deutch, D., Frost, N., Gilad, A.: Explaining natural language query results. VLDB J. 29(1), 485–508 (2020)
19. Devlin, J., Chang, M.-W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, vol. 1 (Long and Short Papers), pp. 4171–4186. Association for Computational Linguistics, Minneapolis (2019). https://doi.org/10.18653/v1/N19-1423
20. Dhamdhere, K., McCurley, K.S., Nahmias, R., Sundararajan, M., Yan, Q.: Analyza: Exploring Data with Conversation. ACM, New York (2017)
21. Dong, L., Lapata, M.: Language to logical form with neural attention. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 33–43. Association for Computational Linguistics, Berlin (2016). https://doi.org/10.18653/v1/P16-1004
22. Dong, L., Lapata, M.: Coarse-to-fine decoding for neural semantic parsing. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 731–742. Association for Computational Linguistics, Melbourne (2018). https://doi.org/10.18653/v1/P18-1068
23. Dozat, T., Manning, C.D.: Deep biaffine attention for neural dependency parsing (2017)
24. Dozat, T., Manning, C.D.: Deep biaffine attention for neural dependency parsing. In: 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24–26, 2017, Conference Track Proceedings. OpenReview.net (2017)
25. Dozat, T., Manning, C.D.: Simpler but more accurate semantic dependency parsing. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pp. 484–490. Association for Computational


Linguistics, Melbourne (2018). https://doi.org/10.18653/v1/P18-2077
26. Eleftherakis, S., Gkini, O., Koutrika, G.: Let the database talk back: natural language explanations for SQL. In: Mottin, D., Lissandrini, M., Roy, S.B., Velegrakis, Y. (eds.) Proceedings of the 2nd Workshop on Search, Exploration, and Analysis in Heterogeneous Datastores (SEA-Data 2021) Co-located with 47th International Conference on Very Large Data Bases (VLDB 2021), Copenhagen, Denmark, August 20, 2021, volume 2929 of CEUR Workshop Proceedings, pp. 14–19. CEUR-WS.org (2021)
27. Finegan-Dollak, C., Kummerfeld, J.K., Zhang, L., Ramanathan, K., Sadasivam, S., Zhang, R., Radev, D.: Improving text-to-SQL evaluation methodology. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 351–360. Association for Computational Linguistics, Melbourne, Australia (2018)
28. Gan, Y., Chen, X., Huang, Q., Purver, M., Woodward, J.R., Xie, J., Huang, P.: Towards robustness of text-to-SQL models against synonym substitution. In: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 2505–2515. Association for Computational Linguistics (2021)
29. Gan, Y., Chen, X., Purver, M.: Exploring underexplored limitations of cross-domain text-to-SQL generalization. In: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 8926–8931, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics (2021)
30. Gkini, O., Belmpas, T., Ioannidis, Y., Koutrika, G.: An in-depth benchmarking of text-to-SQL systems. In: SIGMOD Conference. ACM (2021)
31. Glenis, A., Koutrika, G.: PyExplore: query recommendations for data exploration without query logs. In: Li, G., Li, Z., Idreos, S., Srivastava, D. (eds.) SIGMOD '21: International Conference on Management of Data, Virtual Event, China, June 20–25, 2021, pp. 2731–2735. ACM (2021)
32. Guo, D., Sun, Y., Tang, D., Duan, N., Yin, J., Chi, H., Cao, J., Chen, P., Zhou, M.: Question generation from SQL queries improves neural semantic parsing. In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 1597–1607. Association for Computational Linguistics, Brussels, Belgium (2018)
33. Guo, J., Zhan, Z., Gao, Y., Xiao, Y., Lou, J.-G., Liu, T., Zhang, D.: Towards complex text-to-SQL in cross-domain database with intermediate representation (2019)
34. Hazoom, M., Malik, V., Bogin, B.: Text-to-SQL in the wild: a naturally-occurring dataset based on Stack Exchange data. In: Proceedings of the 1st Workshop on Natural Language Processing for Programming (NLP4Prog 2021), pp. 77–87. Association for Computational Linguistics (2021)
35. He, P., Mao, Y., Chakrabarti, K., Chen, W.: X-SQL: reinforce schema representation with context (2019)
36. Hristidis, V., Gravano, L., Papakonstantinou, Y.: Efficient IR-style keyword search over relational databases. In: VLDB, pp. 850–861 (2003)
37. Hristidis, V., Papakonstantinou, Y.: Discover: keyword search in relational databases. In: VLDB, pp. 670–681 (2002)
38. Hui, B., Shi, X., Geng, R., Li, B., Li, Y., Sun, J., Zhu, X.: Improving text-to-SQL with schema dependency learning (2021)
39. Hwang, W., Yim, J., Park, S., Seo, M.: A comprehensive exploration on WikiSQL with table-aware word contextualization (2019)
40. Iacob, R.C.A., Brad, F., Apostol, E.-S., Truică, C.-O., Hosu, I.A., Rebedea, T.: Neural approaches for natural language interfaces to databases: a survey. In: Proceedings of the 28th International Conference on Computational Linguistics, pp. 381–395. International Committee on Computational Linguistics, Barcelona, Spain (2020)
41. Idreos, S., Papaemmanouil, O., Chaudhuri, S.: Overview of data exploration techniques. In: Sellis, T.K., Davidson, S.B., Ives, Z.G. (eds.) Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, Melbourne, Victoria, Australia, May 31–June 4, 2015, pp. 277–281. ACM (2015)
42. Iyer, S., Konstas, I., Cheung, A., Krishnamurthy, J., Zettlemoyer, L.: Learning a neural semantic parser from user feedback. In: Barzilay, R., Kan, M. (eds.) Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, ACL 2017, Vancouver, Canada, July 30–August 4, Volume 1: Long Papers, pp. 963–973. Association for Computational Linguistics (2017)
43. Kamath, A., Das, R.: A survey on semantic parsing. In: 1st Conference on Automated Knowledge Base Construction, AKBC 2019, Amherst, MA, USA, May 20–22, 2019 (2019)
44. Katsogiannis-Meimarakis, G., Koutrika, G.: A deep dive into deep learning approaches for text-to-SQL systems. In: Proceedings of the 2021 International Conference on Management of Data, SIGMOD/PODS '21, pp. 2846–2851, New York, NY, USA. Association for Computing Machinery (2021)
45. Katsogiannis-Meimarakis, G., Koutrika, G.: Deep learning approaches for text-to-SQL systems. In: EDBT, pp. 710–713 (2021)
46. Keysers, D., Schärli, N., Scales, N., Buisman, H., Furrer, D., Kashubin, S., Momchev, N., Sinopalnikov, D., Stafiniak, L., Tihon, T., et al.: Measuring compositional generalization: a comprehensive method on realistic data (2019). arXiv:1912.09713
47. Kim, H., So, B.-H., Han, W.-S., Lee, H.: Natural language to SQL: where are we today? Proc. VLDB Endow. 13(10), 1737–1750 (2020)
48. Kokkalis, A., Vagenas, P., Zervakis, A., Simitsis, A., Koutrika, G., Ioannidis, Y.E.: Logos: a system for translating queries into narratives. In: Candan, K.S., Chen, Y., Snodgrass, R.T., Gravano, L., Fuxman, A. (eds.) Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2012, Scottsdale, AZ, USA, May 20–24, 2012, pp. 673–676. ACM (2012)
49. Krishnamurthy, J., Dasigi, P., Gardner, M.: Neural semantic parsing with type constraints for semi-structured tables. In: Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 1516–1526. Association for Computational Linguistics, Copenhagen, Denmark (2017)
50. Lafferty, J.D., McCallum, A., Pereira, F.C.N.: Conditional random fields: probabilistic models for segmenting and labeling sequence data. In: Proceedings of the Eighteenth International Conference on Machine Learning, ICML '01, pp. 282–289. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA (2001)
51. Lee, C.-H., Polozov, O., Richardson, M.: KaggleDBQA: realistic evaluation of text-to-SQL parsers. In: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 2261–2273. Association for Computational Linguistics (2021)
52. Lewis, M., Liu, Y., Goyal, N., Ghazvininejad, M., Mohamed, A., Levy, O., Stoyanov, V., Zettlemoyer, L.: BART: denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 7871–7880. Association for Computational Linguistics (2020). https://doi.org/10.18653/v1/2020.acl-main.703
53. Li, F., Jagadish, H.V.: Constructing an interactive natural language interface for relational databases. PVLDB 8(1), 73–84 (2014)
54. Li, Y., Rafiei, D.: Natural language data management and interfaces: recent development and open challenges. In: Proceedings of the 2017 ACM International Conference on Management of


Data, SIGMOD '17, pp. 1765–1770. Association for Computing Machinery, New York, NY, USA (2017)
55. Li, Y., Rafiei, D.: Natural Language Data Management and Interfaces. Synthesis Lectures on Data Management. Morgan & Claypool Publishers, San Rafael (2018)
56. Li, Z., Qu, L., Haffari, G.: Context dependent semantic parsing: a survey. In: Proceedings of the 28th International Conference on Computational Linguistics, pp. 2509–2521. International Committee on Computational Linguistics, Barcelona, Spain (2020)
57. Lin, X.V., Socher, R., Xiong, C.: Bridging textual and tabular data for cross-domain text-to-SQL semantic parsing. In: Findings of the Association for Computational Linguistics: EMNLP 2020, pp. 4870–4888. Association for Computational Linguistics (2020)
58. Liu, X., He, P., Chen, W., Gao, J.: Multi-task deep neural networks for natural language understanding. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 4487–4496. Association for Computational Linguistics, Florence (2019). https://doi.org/10.18653/v1/P19-1441
59. Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: a robustly optimized BERT pretraining approach (2019)
60. Luo, Y., Lin, X., Wang, W., Zhou, X.: SPARK: top-k keyword query in relational databases. In: ACM SIGMOD, pp. 115–126 (2007)
61. Luong, T., Pham, H., Manning, C.D.: Effective approaches to attention-based neural machine translation. In: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp. 1412–1421. Association for Computational Linguistics, Lisbon (2015). https://doi.org/10.18653/v1/D15-1166
62. Lyu, Q., Chakrabarti, K., Hathi, S., Kundu, S., Zhang, J., Chen, Z.: Hybrid ranking network for text-to-SQL. Technical Report MSR-TR-2020-7, Microsoft Dynamics 365 AI, March (2020)
63. Ma, J., Yan, Z., Pang, S., Zhang, Y., Shen, J.: Mention extraction and linking for SQL query generation. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 6936–6942. Association for Computational Linguistics (2020)
71. Price, P.J.: Evaluation of spoken language systems: the ATIS domain. In: Proceedings of the Workshop on Speech and Natural Language, HLT '90, pp. 91–95. Association for Computational Linguistics, USA (1990)
72. Quamar, A., Efthymiou, V., Lei, C., Özcan, F.: Natural language interfaces to data. Found. Trends Databases 11(4), 319–414 (2022)
73. Quamar, A., Özcan, F., Miller, D., Moore, R.J., Niehus, R., Kreulen, J.: Conversational BI: an ontology-driven conversation system for business intelligence applications. Proc. VLDB Endow. 13(12), 3369–3381 (2020)
74. Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., Liu, P.J.: Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res. 21(1), 1532–4435 (2022)
75. Rubin, O., Berant, J.: SmBoP: semi-autoregressive bottom-up semantic parsing. In: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 311–324. Association for Computational Linguistics (2021). https://doi.org/10.18653/v1/2021.naacl-main.29
76. Scholak, T., Schucher, N., Bahdanau, D.: PICARD: parsing incrementally for constrained auto-regressive decoding from language models. In: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 9895–9901. Association for Computational Linguistics (2021). https://doi.org/10.18653/v1/2021.emnlp-main.779
77. Sen, J., Lei, C., Quamar, A., Özcan, F., Efthymiou, V., Dalmia, A., Stager, G., Mittal, A., Saha, D., Sankaranarayanan, K.: ATHENA++: natural language querying for complex nested SQL queries. Proc. VLDB Endow. 13(11), 2747–2759 (2020)
78. Shaw, P., Massey, P., Chen, A., Piccinno, F., Altun, Y.: Generating logical forms from graph representations of text and entities. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 95–106. Association for Computational Linguistics, Florence, Italy (2019)
79. Shi, T., Tatwawadi, K., Chakrabarti, K., Mao, Y., Polozov, O.,
64. Mandamadiotis, A., Koutrika, G., Eleftherakis, S., Glenis, A., Chen, W.: Incsql: training incremental text-to-sql parsers with
Skoutas, D., Stavrakas, Y.: Datagent: the imminent age of intel- non-deterministic oracles (2018)
ligent data assistants. Proc. VLDB Endow. 14(12), 2815–2818 80. Shi, T., Zhao, C., Boyd-Graber, J., Daumé III, H., Lee, L.: On
(2021) the potential of lexico-logical alignments for semantic parsing to
65. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation SQL queries. In: Findings of the Association for Computational
of word representations in vector space. In: Bengio, Y., LeCun, Y. Linguistics: EMNLP 2020, pp. 1849–1864 (2020). Association
(eds.) 1st International Conference on Learning Representations, for Computational Linguistics
ICLR 2013, Scottsdale, Arizona, USA, May 2–4, 2013, Workshop 81. Simitsis, A., Koutrika, G., Ioannidis, Y.: Précis: from unstructured
Track Proceedings (2013) keywords as queries to structured databases as answers. VLDB J.
66. Notes on ambiguity. http://bit.ly/2YTLFeR 17(1), 117–149 (2008)
67. Pasupat, P., Liang, P.: Compositional semantic parsing on semi- 82. Speer, R., Havasi, C.: Representing general relational knowledge
structured tables. In: Proceedings of the 53rd Annual Meeting of in conceptnet 5. In: LREC (2012)
the Association for Computational Linguistics and the 7th Interna- 83. Szalay, A.S., Gray, J., Thakar, A.R., Kunszt, P.Z., Malik, T., Rad-
tional Joint Conference on Natural Language Processing (Volume dick, J., Stoughton, C., vandenBerg, J.: The sdss skyserver: public
1: Long Papers), pp. 1470–1480. Association for Computational access to the sloan digital sky server data. In: Proceedings of the
Linguistics, Beijing, China (2015) 2002 ACM SIGMOD International Conference on Management
68. Pennington, J., Socher, R., Manning, C.: GloVe: Global vectors of data, pp. 570–581 (2002)
for word representation. In: Proceedings of the 2014 Conference 84. Tang, L.R., Mooney, R.J.: Automated construction of database
on Empirical Methods in Natural Language Processing (EMNLP), interfaces: intergrating statistical and relational learning for
pp. 1532–1543. Association for Computational Linguistics, Doha semantic parsing. In: 2000 Joint SIGDAT Conference on Empir-
(2014). https://doi.org/10.3115/v1/D14-1162 ical Methods in Natural Language Processing and Very Large
69. Popescu, A., Armanasu, A., Etzioni, O., Ko, D., Yates, A.: Mod- Corpora, pp. 133–141. Association for Computational Linguis-
ern natural language interfaces to databases: composing statistical tics, Hong Kong, China (2000)
parsing with semantic tractability. In: COLING (2004) 85. Trivedi, P., Maheshwari, G., Dubey, M., Lehmann, J.: Lc-
70. Popescu, A.-M., Etzioni, O., Kautz, H.: Towards a theory of nat- quad: a corpus for complex question answering over knowledge
ural language interfaces to databases. In: Proceedings of the 8th graphs. In: International Semantic Web Conference, pp. 210–218.
International Conference on Intelligent User Interfaces, IUI ’03, Springer (2017)
pp. 149–157. Association for Computing Machinery, New York,
NY, USA (2003)

936 G. Katsogiannis-Meimarakis, G. Koutrika

86. Usta, A., Karakayali, A., Ulusoy, O.: DBTagger: multi-task learning for keyword mapping in NLIDBs using bi-directional recurrent neural networks. Proc. VLDB Endow. 14(5), 813–821 (2021)
87. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS’17, pp. 6000–6010. Curran Associates Inc., Red Hook, NY (2017)
88. Vinyals, O., Fortunato, M., Jaitly, N.: Pointer networks. In: Cortes, C., Lawrence, N., Lee, D., Sugiyama, M., Garnett, R. (eds.) Advances in Neural Information Processing Systems, vol. 28. Curran Associates, Inc. (2015)
89. Wang, B., Shin, R., Liu, X., Polozov, O., Richardson, M.: RAT-SQL: relation-aware schema encoding and linking for Text-to-SQL parsers. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 7567–7578. Association for Computational Linguistics (2020). https://doi.org/10.18653/v1/2020.acl-main.677
90. Wang, C., Cheung, A., Bodík, R.: Synthesizing highly expressive SQL queries from input-output examples. In: 38th ACM SIGPLAN, pp. 452–466 (2017)
91. Wang, C., Tatwawadi, K., Brockschmidt, M., Huang, P.-S., Mao, Y., Polozov, O., Singh, R.: Robust text-to-SQL generation with execution-guided decoding (2018)
92. Wang, P., Shi, T., Reddy, C.K.: Text-to-SQL generation for question answering on electronic medical records. In: Proceedings of The Web Conference 2020, pp. 350–361 (2020)
93. Wang, W., Bhowmick, S.S., Li, H., Joty, S.R., Liu, S., Chen, P.: Towards enhancing database education: natural language generation meets query execution plans. In: Li, G., Li, Z., Idreos, S., Srivastava, D. (eds.) SIGMOD ’21: International Conference on Management of Data, Virtual Event, China, June 20–25, 2021, pp. 1933–1945. ACM (2021)
94. Weir, N., Utama, P., Galakatos, A., Crotty, A., Ilkhechi, A., Ramaswamy, S., Bhushan, R., Geisler, N., Hättasch, B., Eger, S., Cetintemel, U., Binnig, C.: DBPal: a fully pluggable NL2SQL training pipeline. In: Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data, SIGMOD ’20, pp. 2347–2361. Association for Computing Machinery, New York, NY, USA (2020)
95. Wu, K., Wang, L., Li, Z., Zhang, A., Xiao, X., Wu, H., Zhang, M., Wang, H.: Data augmentation with hierarchical SQL-to-question generation for cross-domain text-to-SQL parsing. In: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 8974–8983. Association for Computational Linguistics, Online and Punta Cana, Dominican Republic (2021)
96. Xu, X., Liu, C., Song, D.: SQLNet: generating structured queries from natural language without reinforcement learning (2017)
97. Xu, K., Wang, Y., Wang, Y., Wang, Z., Wen, Z., Dong, Y.: SeaD: end-to-end Text-to-SQL generation with schema-aware denoising. In: Findings of the Association for Computational Linguistics: NAACL 2022, pp. 1845–1853. Association for Computational Linguistics, Seattle (2022). https://doi.org/10.18653/v1/2022.findings-naacl.141
98. Yaghmazadeh, N., Wang, Y., Dillig, I., Dillig, T.: SQLizer: query synthesis from natural language. In: PACMPL, pp. 63:1–63:26 (2017)
99. Yin, P., Neubig, G.: A syntactic neural model for general-purpose code generation. In: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 440–450. Association for Computational Linguistics, Vancouver (2017). https://doi.org/10.18653/v1/P17-1041
100. Yin, P., Neubig, G.: TRANX: a transition-based neural abstract syntax parser for semantic parsing and code generation. In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp. 7–12. Association for Computational Linguistics, Brussels, Belgium (2018)
101. Yin, P., Neubig, G., Yih, W.-t., Riedel, S.: TaBERT: pretraining for joint understanding of textual and tabular data. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 8413–8426. Association for Computational Linguistics (2020). https://doi.org/10.18653/v1/2020.acl-main.745
102. Yoon, D., Lee, D., Lee, S.: Dynamic self-attention: computing attention over words dynamically for sentence embedding (2018)
103. Yu, T., Li, Z., Zhang, Z., Zhang, R., Radev, D.: TypeSQL: knowledge-based type-aware neural Text-to-SQL generation. In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, vol. 2 (Short Papers), pp. 588–594. Association for Computational Linguistics, New Orleans (2018). https://doi.org/10.18653/v1/N18-2093
104. Yu, T., Wu, C.-S., Lin, X.V., Wang, B., Tan, Y.C., Yang, X., Radev, D., Socher, R., Xiong, C.: GraPPa: grammar-augmented pre-training for table semantic parsing (2020)
105. Yu, T., Yasunaga, M., Yang, K., Zhang, R., Wang, D., Li, Z., Radev, D.: SyntaxSQLNet: syntax tree networks for complex and cross-domain text-to-SQL task. In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 1653–1663. Association for Computational Linguistics, Brussels (2018). https://doi.org/10.18653/v1/D18-1193
106. Yu, T., Zhang, R., Er, H.Y., Li, S., Xue, E., Pang, B., Lin, X.V., Tan, Y.C., Shi, T., Li, Z., Jiang, Y., Yasunaga, M., Shim, S., Chen, T., Fabbri, A., Li, Z., Chen, L., Zhang, Y., Dixit, S., Zhang, V., Xiong, C., Socher, R., Lasecki, W.S., Radev, D.: CoSQL: a conversational Text-to-SQL challenge towards cross-domain natural language interfaces to databases. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 1962–1979. Association for Computational Linguistics, Hong Kong (2019). https://doi.org/10.18653/v1/D19-1204
107. Yu, T., Zhang, R., Yang, K., Yasunaga, M., Wang, D., Li, Z., Ma, J., Li, I., Yao, Q., Roman, S., Zhang, Z., Radev, D.: Spider: a large-scale human-labeled dataset for complex and cross-domain semantic parsing and Text-to-SQL task. In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 3911–3921. Association for Computational Linguistics, Brussels (2018). https://doi.org/10.18653/v1/D18-1425
108. Yu, T., Zhang, R., Yasunaga, M., Tan, Y.C., Lin, X.V., Li, S., Er, H., Li, I., Pang, B., Chen, T., Ji, E., Dixit, S., Proctor, D., Shim, S., Kraft, J., Zhang, V., Xiong, C., Socher, R., Radev, D.: SParC: cross-domain semantic parsing in context. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 4511–4523. Association for Computational Linguistics, Florence (2019). https://doi.org/10.18653/v1/P19-1443
109. Zelle, J.M., Mooney, R.J.: Learning to parse database queries using inductive logic programming. In: Proceedings of the Thirteenth National Conference on Artificial Intelligence—Volume 2, AAAI’96, pp. 1050–1055. AAAI Press (1996)
110. Zeng, Z., Lee, M.L., Ling, T.W.: Answering keyword queries involving aggregates and groupby on relational databases. In: EDBT, pp. 161–172 (2016)
111. Zhao, L., Cao, H., Zhao, Y.: GP: context-free grammar pre-training for text-to-SQL parsers (2021)
112. Zhong, V., Xiong, C., Socher, R.: Seq2SQL: generating structured queries from natural language using reinforcement learning (2017)

Publisher’s Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
