RATSQL
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7567–7578
July 5–10, 2020. ©2020 Association for Computational Linguistics
[Figure 1]
Natural Language Question: For the cars with 4 cylinders, which model has the largest horsepower?
Desired SQL:
SELECT T1.model
FROM car_names AS T1 JOIN cars_data AS T2 ON T1.make_id = T2.id
WHERE T2.cylinders = 4
ORDER BY T2.horsepower DESC LIMIT 1
Schema: cars_data (id, mpg, cylinders, edispl, horsepower, weight, accelerate, year), …
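
To make the target of the task concrete, the desired SQL of Figure 1 can be executed against a toy database. The following is only a sketch: the two tables are cut down to the columns the query touches, the column types are guesses, and the rows ("alpha", "beta", "gamma" and their values) are hypothetical data invented for illustration, not taken from Spider.

```python
import sqlite3

QUERY = """
SELECT T1.model
FROM car_names AS T1 JOIN cars_data AS T2 ON T1.make_id = T2.id
WHERE T2.cylinders = 4
ORDER BY T2.horsepower DESC LIMIT 1
"""

def run_example():
    """Run the Figure 1 query on a tiny in-memory database (hypothetical rows)."""
    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()
    # Reduced, assumed table definitions: only the columns used by the query.
    cur.execute("CREATE TABLE car_names (make_id INTEGER PRIMARY KEY, model TEXT)")
    cur.execute(
        "CREATE TABLE cars_data (id INTEGER PRIMARY KEY, cylinders INTEGER, horsepower INTEGER)"
    )
    cur.executemany(
        "INSERT INTO car_names VALUES (?, ?)",
        [(1, "alpha"), (2, "beta"), (3, "gamma")],
    )
    cur.executemany(
        "INSERT INTO cars_data VALUES (?, ?, ?)",
        [(1, 4, 90), (2, 4, 115), (3, 8, 200)],
    )
    rows = cur.execute(QUERY).fetchall()
    conn.close()
    return rows

rows = run_example()
print(rows)
```

The 8-cylinder car has the highest horsepower overall but is filtered out by the WHERE clause, so the query returns the strongest 4-cylinder model.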
schema linking difficult after both the column representations and question word representations are built. Second, it limits information propagation during schema encoding to the predefined graph of foreign key relations. The advent of self-attentional mechanisms in NLP (Vaswani et al., 2017) shows that global reasoning is crucial to effective representations of relational structures. However, we would like any global reasoning to still take into account the aforementioned schema relations.

In this work, we present a unified framework, called RAT-SQL,¹ for encoding relational structure in the database schema and a given question. It uses relation-aware self-attention to combine global reasoning over the schema entities and question words with structured reasoning over predefined schema relations. We then apply RAT-SQL to the problems of schema encoding and schema linking. As a result, we obtain 57.2% exact match accuracy on the Spider test set. At the time of writing, this result is the state of the art among models unaugmented with pretrained BERT embeddings, and it further reaches the overall state of the art (65.6%) when RAT-SQL is augmented with BERT. In addition, we experimentally demonstrate that RAT-SQL enables the model to build more accurate internal representations of the question's true alignment with schema columns and tables.

¹ Relation-Aware Transformer.

2 Related Work

Semantic parsing of NL to SQL recently surged in popularity thanks to the creation of two new multi-table datasets with the challenge of schema generalization: WikiSQL (Zhong et al., 2017) and Spider (Yu et al., 2018b). Schema encoding is not as challenging in WikiSQL as in Spider because it lacks multi-table relations. Schema linking is relevant for both tasks but also more challenging in Spider due to the richer NL expressiveness and less restricted SQL grammar observed in it. The state of the art semantic parser on WikiSQL (He et al., 2019) achieves a test set accuracy of 91.8%, significantly higher than the state of the art on Spider.

The recent state-of-the-art models evaluated on Spider use various attentional architectures for question/schema encoding and AST-based structural architectures for query decoding. IRNet (Guo et al., 2019) encodes the question and schema separately with LSTM and self-attention respectively, augmenting them with custom type vectors for schema linking. They further use the AST-based decoder of Yin and Neubig (2017) to decode a query in an intermediate representation (IR) that exhibits higher-level abstractions than SQL. Bogin et al. (2019a) encode the schema with a GNN and a similar grammar-based decoder. Both works emphasize schema encoding and schema linking, but design separate featurization techniques to augment word vectors (as opposed to relations between words and columns) to resolve it. In contrast, the RAT-SQL framework provides a unified way to encode arbitrary relational information among inputs.

Concurrently with this work, Bogin et al. (2019b) published Global-GNN, a different approach to schema linking for Spider, which applies global reasoning between question words and schema columns/tables. Global reasoning is implemented by gating the GNN that encodes the schema using the question token representations. This differs from RAT-SQL in two important ways: (a) question word representations influence the schema representations but not vice versa, and (b) as in other GNN-based encoders, message propagation is limited to the schema-induced edges such as foreign key relations. In contrast, our relation-aware transformer mechanism allows encoding arbitrary relations between question words and schema elements explicitly, and these representations are computed jointly over all inputs using self-attention.

We use the same formulation of relation-aware self-attention as Shaw et al. (2018). However, they only apply it to sequences of words in the context of machine translation, and as such, their relation
types only encode the relative distance between two words. We extend their work and show that relation-aware self-attention can effectively encode more complex relationships within an unordered set of elements (in our case, columns and tables within a database schema as well as relations between the schema and the question). To the best of our knowledge, this is the first application of relation-aware self-attention to joint representation learning with both predefined and softly induced relations in the input structure. Hellendoorn et al. (2020) develop a similar model concurrently with this work, where they use relation-aware self-attention to encode data flow structure in source code embeddings.

Sun et al. (2018) use a heterogeneous graph of KB facts and relevant documents for open-domain question answering. The nodes of their graph are analogous to the database schema nodes in RAT-SQL, but RAT-SQL also incorporates the question in the same formalism to enable joint representation learning between the question and the schema.

3 Relation-Aware Self-Attention

First, we introduce relation-aware self-attention, a model for embedding semi-structured input sequences in a way that jointly encodes pre-existing relational structure in the input as well as induced “soft” relations between sequence elements in the same embedding. Our solutions to schema embedding and linking naturally arise as features implemented in this framework.

Consider a set of inputs $X = \{\mathbf{x}_i\}_{i=1}^{n}$ where $\mathbf{x}_i \in \mathbb{R}^{d_x}$. In general, we consider it an unordered set, although $\mathbf{x}_i$ may be imbued with positional embeddings to add an explicit ordering relation. A self-attention encoder, or Transformer, introduced by Vaswani et al. (2017), is a stack of self-attention layers where each layer (consisting of $H$ heads) transforms each $\mathbf{x}_i$ into $\mathbf{y}_i \in \mathbb{R}^{d_x}$ as follows:

$$e_{ij}^{(h)} = \frac{\mathbf{x}_i W_Q^{(h)} (\mathbf{x}_j W_K^{(h)})^\top}{\sqrt{d_z/H}}; \qquad \alpha_{ij}^{(h)} = \operatorname{softmax}_j \{ e_{ij}^{(h)} \}$$
$$\mathbf{z}_i^{(h)} = \sum_{j=1}^{n} \alpha_{ij}^{(h)} (\mathbf{x}_j W_V^{(h)}); \qquad \mathbf{z}_i = \operatorname{Concat}(\mathbf{z}_i^{(1)}, \cdots, \mathbf{z}_i^{(H)}) \tag{1}$$

[…] computes a learned relation between all the input elements $\mathbf{x}_i$, and the strength of this relation is encoded in the attention weights $\alpha_{ij}^{(h)}$. However, in many applications (including text-to-SQL parsing) we are aware of some preexisting relational features between the inputs, and would like to bias our encoder model toward them. This is straightforward for non-relational features (represented directly in each $\mathbf{x}_i$). We could limit the attention computation only to the “hard” edges where the preexisting relations are known to hold. This would make the model similar to a graph attention network (Veličković et al., 2018), and would also impede the Transformer’s ability to learn new relations. Instead, RAT provides a way to communicate known relations to the encoder by adding their representations to the attention mechanism.

Shaw et al. (2018) describe a way to represent relative position information in a self-attention layer by changing Equation (1) as follows:

$$e_{ij}^{(h)} = \frac{\mathbf{x}_i W_Q^{(h)} (\mathbf{x}_j W_K^{(h)} + \mathbf{r}_{ij}^{K})^\top}{\sqrt{d_z/H}}$$
$$\mathbf{z}_i^{(h)} = \sum_{j=1}^{n} \alpha_{ij}^{(h)} (\mathbf{x}_j W_V^{(h)} + \mathbf{r}_{ij}^{V}). \tag{2}$$

Here the $\mathbf{r}_{ij}$ terms encode the known relationship between the two elements $\mathbf{x}_i$ and $\mathbf{x}_j$ in the input. While Shaw et al. used it exclusively for relative position representation, we show how to use the same framework to effectively bias the Transformer toward arbitrary relational information.

Consider $R$ relational features, each a binary relation $\mathcal{R}^{(s)} \subseteq X \times X$ ($1 \le s \le R$). The RAT framework represents all the pre-existing features for each edge $(i, j)$ as $\mathbf{r}_{ij}^{K} = \mathbf{r}_{ij}^{V} = \operatorname{Concat}(\boldsymbol{\rho}_{ij}^{(1)}, \ldots, \boldsymbol{\rho}_{ij}^{(R)})$ where each $\boldsymbol{\rho}_{ij}^{(s)}$ is either a learned embedding for the relation $\mathcal{R}^{(s)}$ if the relation holds for the corresponding edge (i.e. if $(i, j) \in \mathcal{R}^{(s)}$), or a zero vector of appropriate size. In the following section, we will describe the set of relations our RAT-SQL model uses to encode a given database schema.
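
As an illustration of Equation (2), the following pure-Python sketch computes one relation-aware attention head on a toy input. It is a minimal sketch under stated assumptions, not the authors' implementation: the helper names (`matvec`, `relation_vectors`, `rat_head`) are hypothetical, the projection matrices are identity matrices, and the "learned" relation embedding is a fixed stand-in vector.

```python
import math

def matvec(x, W):
    """Row vector x (length d_in) times matrix W (d_in x d_out)."""
    return [sum(x[k] * W[k][c] for k in range(len(x))) for c in range(len(W[0]))]

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def relation_vectors(n, relations, embeddings):
    """r[i][j] = Concat(rho_ij^(1), ..., rho_ij^(R)): the embedding of relation s
    if (i, j) is in relations[s], otherwise a zero vector of the same size.
    `embeddings` are fixed stand-ins for learned parameters."""
    r = []
    for i in range(n):
        row = []
        for j in range(n):
            vec = []
            for rel, emb in zip(relations, embeddings):
                vec += emb if (i, j) in rel else [0.0] * len(emb)
            row.append(vec)
        r.append(row)
    return r

def rat_head(X, W_Q, W_K, W_V, r_K, r_V, scale):
    """One attention head following Equation (2); `scale` plays the role of
    sqrt(d_z / H). Returns the attention weights and the head outputs."""
    Q = [matvec(x, W_Q) for x in X]
    K = [matvec(x, W_K) for x in X]
    V = [matvec(x, W_V) for x in X]
    n = len(X)
    alpha, Z = [], []
    for i in range(n):
        # Relation vectors are added to the keys before the dot product.
        e_i = []
        for j in range(n):
            key = [k + r for k, r in zip(K[j], r_K[i][j])]
            e_i.append(sum(q * k for q, k in zip(Q[i], key)) / scale)
        a_i = softmax(e_i)
        # Relation vectors are also added to the values before averaging.
        z_i = [0.0] * len(V[0])
        for j in range(n):
            for c in range(len(z_i)):
                z_i[c] += a_i[j] * (V[j][c] + r_V[i][j][c])
        alpha.append(a_i)
        Z.append(z_i)
    return alpha, Z

# Toy input: three 2-dimensional elements, identity projections, one head.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
I2 = [[1.0, 0.0], [0.0, 1.0]]
scale = math.sqrt(2.0)

# A single relation type (R = 1) that holds only for the edge (0, 1).
r = relation_vectors(3, [{(0, 1)}], [[2.0, 0.0]])
r_none = relation_vectors(3, [set()], [[2.0, 0.0]])

alpha_rel, Z_rel = rat_head(X, I2, I2, I2, r, r, scale)
alpha_plain, Z_plain = rat_head(X, I2, I2, I2, r_none, r_none, scale)
print(alpha_plain[0][1], alpha_rel[0][1])
```

With all relation vectors set to zero the head reduces to Equation (1). In this toy setup the relation on edge (0, 1) raises the attention that element 0 pays to element 1, while rows without any relation are unchanged, which is the "bias without restricting" behavior described above.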
Type of x   Type of y   Edge label           Description
Column      Column      SAME-TABLE           x and y belong to the same table.
Column      Column      FOREIGN-KEY-COL-F    x is a foreign key for y.
Column      Column      FOREIGN-KEY-COL-R    y is a foreign key for x.
Column      Table       PRIMARY-KEY-F        x is the primary key of y.
Column      Table       BELONGS-TO-F         x is a column of y (but not the primary key).
Table       Column      PRIMARY-KEY-R        y is the primary key of x.
Table       Column      BELONGS-TO-R         y is a column of x (but not the primary key).
Table       Table       FOREIGN-KEY-TAB-F    Table x has a foreign key column in y.
Table       Table       FOREIGN-KEY-TAB-R    Same as above, but x and y are reversed.
Table       Table       FOREIGN-KEY-TAB-B    x and y have foreign keys in both directions.

Table 1: Description of edge types present in the directed graph G created to represent the schema. An edge exists from source node x ∈ S to target node y ∈ S if the pair fulfills one of the descriptions listed in the table, with the corresponding label. Otherwise, no edge exists from x to y.
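
The edge types of Table 1 can be derived mechanically from a schema description. The sketch below shows one way to do this for a two-table toy schema. Everything about it is an assumption made for illustration: the function name `schema_edges`, the input format, the choice of primary keys and of the foreign key direction in the toy schema, the reading of FOREIGN-KEY-TAB-F as "a column of x references a column of y", the omission of self-pairs, and the rule that a foreign key label takes precedence over SAME-TABLE.

```python
def schema_edges(tables, columns, primary_keys, foreign_keys):
    """Label directed edges between schema nodes with the types of Table 1.

    tables:       list of table names
    columns:      dict mapping a (qualified) column name to its table
    primary_keys: set of column names
    foreign_keys: set of (src, dst) pairs, meaning column src references column dst
    Returns a dict mapping (x, y) node pairs to an edge label, where a node is
    ("col", name) or ("tab", name).
    """
    edges = {}

    # Column -> Column edges (self-pairs skipped in this sketch).
    for x, tx in columns.items():
        for y, ty in columns.items():
            if x == y:
                continue
            if (x, y) in foreign_keys:
                label = "FOREIGN-KEY-COL-F"
            elif (y, x) in foreign_keys:
                label = "FOREIGN-KEY-COL-R"
            elif tx == ty:
                label = "SAME-TABLE"
            else:
                continue
            edges[("col", x), ("col", y)] = label

    # Column -> Table and Table -> Column edges.
    for c, t in columns.items():
        if c in primary_keys:
            fwd, rev = "PRIMARY-KEY-F", "PRIMARY-KEY-R"
        else:
            fwd, rev = "BELONGS-TO-F", "BELONGS-TO-R"
        edges[("col", c), ("tab", t)] = fwd
        edges[("tab", t), ("col", c)] = rev

    # Table -> Table edges, induced by the foreign keys between their columns.
    fk_tables = {(columns[s], columns[d]) for s, d in foreign_keys}
    for x in tables:
        for y in tables:
            if x == y:
                continue
            fwd, rev = (x, y) in fk_tables, (y, x) in fk_tables
            if fwd and rev:
                label = "FOREIGN-KEY-TAB-B"
            elif fwd:
                label = "FOREIGN-KEY-TAB-F"
            elif rev:
                label = "FOREIGN-KEY-TAB-R"
            else:
                continue
            edges[("tab", x), ("tab", y)] = label
    return edges

# Toy schema loosely based on Figure 1; keys and FK direction are hypothetical.
tables = ["car_names", "cars_data"]
columns = {
    "car_names.make_id": "car_names",
    "car_names.model": "car_names",
    "cars_data.id": "cars_data",
    "cars_data.cylinders": "cars_data",
    "cars_data.horsepower": "cars_data",
}
primary_keys = {"car_names.make_id", "cars_data.id"}
foreign_keys = {("car_names.make_id", "cars_data.id")}

edges = schema_edges(tables, columns, primary_keys, foreign_keys)
for (x, y), label in sorted(edges.items()):
    print(x, "->", y, ":", label)
```

Each labeled pair corresponds to one relation type whose embedding would be added to the attention computation for that edge; pairs without an entry have no schema edge.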
[Figure 2 depicts a schema graph whose nodes include the tables airports, airlines, and flights and columns such as city, airport code, airport name, country, airline id, airline name, flight number, source airport, and dest airport, connected by edges labeled "primary key" and "foreign key".]
Figure 2: An illustration of an example schema as a graph G. We do not depict all the edges and label types of Table 1 to reduce clutter.

[Figure 3 depicts question words ("How many airlines") and schema nodes (airline id, airline name, airports, city) connected by relation labels.]
Figure 3: One RAT layer in the schema encoder.
4.1 Problem Definition

Given a natural language question $Q$ and a schema $S = \langle C, T \rangle$ for a relational database, our goal is to generate the corresponding SQL $P$. Here the question $Q = q_1 \ldots q_{|Q|}$ is a sequence of words, and the schema consists of columns $C = \{c_1, \ldots, c_{|C|}\}$ and tables $T = \{t_1, \ldots, t_{|T|}\}$. Each column name $c_i$ contains words $c_{i,1}, \ldots, c_{i,|c_i|}$ and each table name $t_i$ contains words $t_{i,1}, \ldots, t_{i,|t_i|}$. The desired program $P$ is represented as an abstract syntax tree $T$ in the context-free grammar of SQL.

Some columns in the schema are primary keys, used for uniquely indexing the corresponding table, and some are foreign keys, used to reference a primary key column in a different table. In addition, each column has a type $\tau \in \{\text{number}, \text{text}\}$.

Formally, we represent the database schema as a directed graph $G = \langle V, E \rangle$. Its nodes $V = C \cup T$ are the columns and tables of the schema, each labeled with the words in its name (for columns, we prepend their type $\tau$ to the label). Its edges $E$ are defined by the pre-existing database relations, described in Table 1. Figure 2 illustrates an example graph (with a subset of actual edges and labels).

While $G$ holds all the known information about the schema, it is insufficient for appropriately encoding a previously unseen schema in the context of the question $Q$. We would like our representations of the schema $S$ and the question $Q$ to be joint, in particular for modeling the alignment between them. Thus, we also define the question-contextualized schema graph $G_Q = \langle V_Q, E_Q \rangle$ where $V_Q = V \cup Q = C \cup T \cup Q$ includes nodes for the question words (each labeled with a corresponding word), and $E_Q = E \cup E_{Q \leftrightarrow S}$ are the schema edges $E$ extended with additional special relations between the question words and schema members, detailed in the rest of this section.

For modeling text-to-SQL generation, we adopt the encoder-decoder framework. Given the input as a graph $G_Q$, the encoder $f_{\text{enc}}$ embeds it into joint representations $\mathbf{c}_i$, $\mathbf{t}_i$, $\mathbf{q}_i$ for each column $c_i \in C$, table $t_i \in T$, and question word $q_i \in Q$ respectively. The decoder $f_{\text{dec}}$ then uses them to compute a distribution $\Pr(P \mid G_Q)$ over the SQL programs.

4.2 Relation-Aware Input Encoding

Following the state-of-the-art NLP literature, our encoder first obtains the initial representations $\mathbf{c}_i^{\text{init}}$, $\mathbf{t}_i^{\text{init}}$ for every node of $G$ by (a) retrieving a pre-trained GloVe embedding (Pennington et al., 2014) for each word, and (b) processing the embeddings in each multi-word label with a bidirectional LSTM (BiLSTM) (Hochreiter and Schmidhuber, 1997). It also runs a separate BiLSTM over the question $Q$ to obtain initial word representations $\mathbf{q}_i^{\text{init}}$.

The initial representations $\mathbf{c}_i^{\text{init}}$, $\mathbf{t}_i^{\text{init}}$, and $\mathbf{q}_i^{\text{init}}$ are independent of each other and devoid of any relational information known to hold in $E_Q$. To produce joint representations for the entire input graph $G_Q$, we use the relation-aware self-attention mechanism (Section 3). Its input $X$ is the set of all the node representations in $G_Q$:

$$X = (\mathbf{c}_1^{\text{init}}, \cdots, \mathbf{c}_{|C|}^{\text{init}}, \mathbf{t}_1^{\text{init}}, \cdots, \mathbf{t}_{|T|}^{\text{init}}, \mathbf{q}_1^{\text{init}}, \cdots, \mathbf{q}_{|Q|}^{\text{init}}).$$

The encoder $f_{\text{enc}}$ applies a stack of $N$ relation-aware self-attention layers to $X$, with separate weight matrices in each layer. The final representations $\mathbf{c}_i$, $\mathbf{t}_i$, $\mathbf{q}_i$ produced by the $N$-th layer constitute the output of the whole encoder.

Alternatively, we also consider pre-trained BERT (Devlin et al., 2019) embeddings to obtain the initial representations. Following Huang et al. (2019) and Zhang et al. (2019), we feed $X$ to BERT and use the last hidden states as the initial representations before proceeding with the RAT layers.²

² In this case, the initial representations $\mathbf{c}_i^{\text{init}}$, $\mathbf{t}_i^{\text{init}}$, $\mathbf{q}_i^{\text{init}}$ are not strictly independent, although still uninfluenced by $E$.

Importantly, as detailed in Section 3, every RAT layer uses self-attention between all elements of the input graph $G_Q$ to compute new contextual representations of question words and schema members. However, this self-attention is biased toward some pre-defined relations using the edge vectors $\mathbf{r}_{ij}^{K}$, $\mathbf{r}_{ij}^{V}$ in each layer. We define the set of used relation types in a way that directly addresses the challenges of schema embedding and linking. Occurrences of these relations between the question and the schema constitute the edges $E_{Q \leftrightarrow S}$. Most of these relation types address schema linking (Section 4.3); we also add some auxiliary edges to aid schema encoding (see Appendix A).

4.3 Schema Linking

Schema linking relations in $E_{Q \leftrightarrow S}$ aid the model with aligning column/table references in the question to the corresponding schema columns/tables. This alignment is implicitly defined by two kinds of information in the input: matching names and matching values, which we detail in order below.

Name-Based Linking  Name-based linking refers to exact or partial occurrences of the column/table names in the question, such as the occurrences of “cylinders” and “cars” in the question in Figure 1. Textual matches are the most explicit evidence of question-schema alignment and as such, one might expect them to be directly beneficial to the encoder. However, in all our experiments the representations produced by vanilla self-attention were insensitive to textual matches even though their initial representations were identical. Brunner et al. (2020) suggest that representations produced by Transformers mix the information from different positions and cease to be directly interpretable after 2+ layers, which might explain our observations. Thus, to remedy this phenomenon, we explicitly encode name-based linking using RAT relations.

Specifically, for all n-grams of length 1 to 5 in the question, we determine (1) whether it exactly matches the name of a column/table (exact match); or (2) whether the n-gram is a subsequence of the name of a column/table (partial match).³ Then, for every $(i, j)$ where $x_i \in Q$, $x_j \in S$ (or vice versa), we set $r_{ij} \in E_{Q \leftrightarrow S}$ to QUESTION-COLUMN-M, QUESTION-TABLE-M, COLUMN-QUESTION-M or TABLE-QUESTION-M depending on the type of $x_i$ and $x_j$. Here M is one of EXACTMATCH, PARTIALMATCH, or NOMATCH.

³ This procedure matches that of Guo et al. (2019), but we use the matching information differently in RAT.

Value-Based Linking  Question-schema alignment also occurs when the question mentions any values that occur in the database and consequently participate in the desired SQL, such as “4” in Figure 1. While this example makes the alignment explicit by mentioning the column name “cylinders”, many real-world questions do not. Thus, linking a value to the corresponding column requires background knowledge.

The database itself is the most comprehensive and readily available source of knowledge about possible values, but also the most challenging to process in an end-to-end model because of the privacy and speed impact. However, the RAT framework allows us to outsource this processing to the database engine to augment $G_Q$ with potential value-based linking without exposing the model itself to the data. Specifically, we add a new COLUMN-VALUE relation between any word $q_i$ and column name $c_j$ s.t. $q_i$ occurs as a value
(or a full word within a value) of $c_j$. This simple approach drastically improves the performance of RAT-SQL (see Section 5). It also directly addresses the aforementioned DB challenges: (a) the model is never exposed to database content that does not occur in the question, (b) word matches are retrieved quickly via DB indices & textual search.

Memory-Schema Alignment Matrix  Our intuition suggests that the columns and tables which occur in the SQL $P$ will generally have a corresponding reference in the natural language question. To capture this intuition in the model, we apply relation-aware attention as a pointer mechanism between every memory element in $y$ and all the columns/tables to compute explicit alignment matrices $L^{\text{col}} \in \mathbb{R}^{|y| \times |C|}$ and $L^{\text{tab}} \in \mathbb{R}^{|y| \times |T|}$:

$$\tilde{L}^{\text{col}}_{i,j} = \frac{\mathbf{y}_i W_Q^{\text{col}} (\mathbf{c}_j^{\text{final}} W_K^{\text{col}} + \mathbf{r}_{ij}^{K})^\top}{\sqrt{d_x}}, \qquad \tilde{L}^{\text{tab}}_{i,j} = \frac{\mathbf{y}_i W_Q^{\text{tab}} (\mathbf{t}_j^{\text{final}} W_K^{\text{tab}} + \mathbf{r}_{ij}^{K})^\top}{\sqrt{d_x}}$$
$$L^{\text{col}}_{i,j} = \operatorname{softmax}_j \{ \tilde{L}^{\text{col}}_{i,j} \}, \qquad L^{\text{tab}}_{i,j} = \operatorname{softmax}_j \{ \tilde{L}^{\text{tab}}_{i,j} \} \tag{3}$$

Intuitively, the alignment matrices in Eq. (3) should resemble the real discrete alignments, and therefore should respect certain constraints like sparsity. When the encoder is sufficiently parameterized, sparsity tends to arise with learning, but we can also encourage it with an explicit objective. Appendix B presents this objective and discusses our experiments with sparse alignment in RAT-SQL.

4.4 Decoder

The decoder $f_{\text{dec}}$ of RAT-SQL follows the tree-structured architecture of Yin and Neubig (2017). It generates the SQL $P$ as an abstract syntax tree in depth-first traversal order, by using an LSTM to output a sequence of decoder actions that either (i) expand the last generated node into a grammar rule, called APPLYRULE; or when completing a leaf node, (ii) choose a column/table from the schema, called SELECTCOLUMN and SELECTTABLE. Formally, $\Pr(P \mid \mathcal{Y}) = \prod_t \Pr(a_t \mid a_{<t}, \mathcal{Y})$ where $\mathcal{Y} = f_{\text{enc}}(G_Q)$ is the final encoding of the question and schema, and $a_{<t}$ are all the previous actions. In a tree-structured decoder, the LSTM state is updated as $\mathbf{m}_t, \mathbf{h}_t = f_{\text{LSTM}}([\mathbf{a}_{t-1} \,\|\, \mathbf{z}_t \,\|\, \mathbf{h}_{p_t} \,\|\, \mathbf{a}_{p_t} \,\|\, \mathbf{n}_{f_t}], \mathbf{m}_{t-1}, \mathbf{h}_{t-1})$ where $\mathbf{m}_t$ is the LSTM cell state, $\mathbf{h}_t$ is the LSTM output at step $t$, $\mathbf{a}_{t-1}$ is the embedding of the previous action, $p_t$ is the step corresponding to expanding the parent AST node of the current node, and $\mathbf{n}_{f_t}$ is the embedding of the current node type. Finally, $\mathbf{z}_t$ is the context representation, computed using multi-head attention (with 8 heads) on $\mathbf{h}_{t-1}$ over $\mathcal{Y}$.

[Figure 4 depicts a tree-structured decoder on top of the self-attention layers: while expanding SELECT count(*) … WHERE, the decoder scores candidate columns (e.g. 0.1, 0.1, 0.8) for the question “How many airlines”.]
Figure 4: Choosing a column in a tree decoder.

For APPLYRULE[R], we compute $\Pr(a_t = \text{APPLYRULE}[R] \mid a_{<t}, y) = \operatorname{softmax}_R (g(\mathbf{h}_t))$ where $g(\cdot)$ is a 2-layer MLP with a tanh non-linearity. For SELECTCOLUMN, we compute

$$\tilde{\lambda}_i = \frac{\mathbf{h}_t W_Q^{\text{sc}} (\mathbf{y}_i W_K^{\text{sc}})^\top}{\sqrt{d_x}}, \qquad \lambda_i = \operatorname{softmax}_i \{ \tilde{\lambda}_i \}$$
$$\Pr(a_t = \text{SELECTCOLUMN}[i] \mid a_{<t}, y) = \sum_{j=1}^{|y|} \lambda_j L^{\text{col}}_{j,i}$$

and similarly for SELECTTABLE. We refer the reader to Yin and Neubig (2017) for details.

5 Experiments

We implemented RAT-SQL in PyTorch (Paszke et al., 2017). During preprocessing, the questions, column names and table names are tokenized and lemmatized with the Stanford CoreNLP toolkit (Manning et al., 2014). Within the encoder, we use GloVe (Pennington et al., 2014) word embeddings, held fixed in training except for the 50 most common words in the training set. For RAT-SQL BERT, we use the WordPiece tokenization. All word embeddings have dimension 300. The bidirectional LSTMs have hidden size 128 per direction, and use the recurrent dropout method of Gal and Ghahramani (2016) with rate 0.2. We stack 8 relation-aware self-attention layers on top of the bidirectional LSTMs. Within them, we set $d_x = d_z = 256$, $H = 8$, and use dropout with rate 0.1. The position-wise feed-forward network has inner layer dimension 1024. Inside the decoder, we use rule embeddings of size 128, node type embeddings of size 64, and a hidden size of 512 inside the LSTM with dropout of 0.21.
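
The SELECTCOLUMN probability of Section 4.4 combines two softmaxes: the decoder's attention over memory elements (λ) and the memory-to-column alignment of Eq. (3). The sketch below works through that marginalization on toy numbers. It is a sketch under stated assumptions: the function name `select_column_probs` is hypothetical, and the unnormalized scores are supplied directly rather than computed from the bilinear forms with learned weights.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def select_column_probs(lambda_logits, align_logits):
    """Pr(SELECTCOLUMN[i]) = sum_j lambda_j * L_col[j][i].

    lambda_logits: unnormalized decoder attention scores, one per memory element
    align_logits:  |y| x |C| unnormalized alignment scores (rows: memory
                   elements, columns: schema columns)
    """
    lam = softmax(lambda_logits)                     # lambda_j
    L_col = [softmax(row) for row in align_logits]   # each row sums to 1 over columns
    n_cols = len(align_logits[0])
    return [
        sum(lam[j] * L_col[j][i] for j in range(len(lam)))
        for i in range(n_cols)
    ]

# Toy scores (hypothetical): the decoder attends mostly to memory element 0,
# which is itself aligned mostly with column 1.
lambda_logits = [2.0, 0.0, 0.0]
align_logits = [
    [0.0, 3.0, 0.0],
    [1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0],
]
probs = select_column_probs(lambda_logits, align_logits)
print(probs)
```

Because λ sums to one and every row of the alignment matrix sums to one, the result is itself a distribution over columns; a sharply peaked alignment row lets the decoder pick a column simply by attending to the question word that mentions it.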
We used the Adam optimizer (Kingma and Ba, 2015) with the default hyperparameters. During the first warmup_steps = max_steps/20 steps of training, the learning rate linearly increases from 0 to $7.4 \times 10^{-4}$. Afterwards, it is annealed to 0 with $7.4 \times 10^{-4} \left(1 - \frac{\text{step} - \text{warmup\_steps}}{\text{max\_steps} - \text{warmup\_steps}}\right)^{0.5}$. We use a batch size of 20 and train for up to 40,000 steps. For RAT-SQL + BERT, we use a separate learning rate of $3 \times 10^{-6}$ to fine-tune BERT, a batch size of 24 and train for up to 90,000 steps.

Hyperparameter Search  We tuned the batch size (20, 50, 80), number of RAT layers (4, 6, 8), dropout (uniformly sampled from [0.1, 0.3]), hidden size of decoder RNN (256, 512), max learning rate (log-uniformly sampled from [$5 \times 10^{-4}$, $2 \times 10^{-3}$]). We randomly sampled 100 configurations and optimized on the dev set. RAT-SQL + BERT reuses most hyperparameters of RAT-SQL, only tuning the BERT learning rate ($1 \times 10^{-6}$, $3 \times 10^{-6}$, $5 \times 10^{-6}$), number of RAT layers (6, 8, 10), number of training steps ($4 \times 10^{4}$, $6 \times 10^{4}$, $9 \times 10^{4}$).

5.1 Datasets and Metrics

We use the Spider dataset (Yu et al., 2018b) for most of our experiments, and also conduct preliminary experiments on WikiSQL (Zhong et al., 2017) to confirm generalization to other datasets. As described by Yu et al., Spider contains 8,659 examples (questions and SQL queries, with the accompanying schemas), including 1,659 examples lifted from the Restaurants (Popescu et al., 2003; Tang and Mooney, 2000), GeoQuery (Zelle and Mooney, 1996), Scholar (Iyer et al., 2017), Academic (Li and Jagadish, 2014), Yelp and IMDB (Yaghmazadeh et al., 2017) datasets.

As Yu et al. (2018b) make the test set accessible only through an evaluation server, we perform most evaluations (other than the final accuracy measurement) using the development set. It contains 1,034 examples, with databases and schemas distinct from those in the training set. We report results using the same metrics as Yu et al. (2018b): exact match accuracy on all examples, as well as divided by difficulty levels. As in previous work on Spider, these metrics do not measure the model’s performance on generating values in the SQL.

5.2 Spider Results

Model                                      Dev    Test
IRNet (Guo et al., 2019)                   53.2   46.7
Global-GNN (Bogin et al., 2019b)           52.7   47.4
IRNet V2 (Guo et al., 2019)                55.4   48.5
RAT-SQL (ours)                             62.7   57.2
With BERT:
EditSQL + BERT (Zhang et al., 2019)        57.6   53.4
GNN + Bertrand-DR (Kelkar et al., 2020)    57.9   54.6
IRNet V2 + BERT (Guo et al., 2019)         63.9   55.0
RYANSQL V2 + BERT (Choi et al., 2020)      70.6   60.6
RAT-SQL + BERT (ours)                      69.7   65.6

Table 2: Accuracy on the Spider development and test sets, compared to the other approaches at the top of the dataset leaderboard as of May 1st, 2020. The test set results were scored using the Spider evaluation server.

Split   Easy   Medium   Hard   Extra Hard   All
RAT-SQL
Dev     80.4   63.9     55.7   40.6         62.7
Test    74.8   60.7     53.6   31.5         57.2
RAT-SQL + BERT
Dev     86.4   73.6     62.1   42.9         69.7
Test    83.0   71.3     58.3   38.4         65.6

Table 3: Accuracy on the Spider development and test sets, by difficulty as defined by Yu et al. (2018b).

In Table 2 we show accuracy on the (hidden) Spider test set for RAT-SQL and compare to all other approaches at or near state-of-the-art (according to the official leaderboard). RAT-SQL outperforms all other methods that are not augmented with BERT embeddings by a large margin of 8.7%. Surprisingly, it even beats several BERT-augmented models on the test set (EditSQL + BERT, GNN + Bertrand-DR, and IRNet V2 + BERT). When RAT-SQL is further augmented with BERT, it achieves the new state-of-the-art performance. Compared with other BERT-augmented models, our RAT-SQL + BERT has a smaller generalization gap between the development and test sets.

We also provide a breakdown of the accuracy by difficulty in Table 3. As expected, performance drops with increasing difficulty. The overall generalization gap between development and test of RAT-SQL was strongly affected by the significant drop in accuracy (9%) on the extra hard questions. When RAT-SQL is augmented with BERT, the generalization gaps of most difficulties are reduced.

Model                              Accuracy (%)
RAT-SQL + value-based linking      60.54 ± 0.80
RAT-SQL                            55.13 ± 0.84
  w/o schema linking relations     40.37 ± 2.32
  w/o schema graph relations       35.59 ± 0.85

Table 4: Accuracy (and ±95% confidence interval) of RAT-SQL ablations on the dev set.

Ablation Study  Table 4 shows an ablation study over different RAT-based relations. The ablations
are run on RAT-SQL without value-based linking to avoid interference with information from the database. Schema linking and graph relations make statistically significant improvements (p<0.001). The full model accuracy here slightly differs from Table 2 because the latter shows the best model from a hyper-parameter sweep (used for test evaluation) and the former gives the mean over five runs where we only change the random seeds.

5.3 WikiSQL Results

We also conducted preliminary experiments on WikiSQL (Zhong et al., 2017) to test generalization of RAT-SQL to new datasets. Although WikiSQL lacks multi-table schemas (and thus, its challenge of schema encoding is not as prominent), it still presents the challenges of schema linking and generalization to new schemas. For simplicity of experiments, we did not implement either BERT augmentation or execution-guided decoding (EG) (Wang et al., 2018), both of which are common in state-of-the-art WikiSQL models. We thus only compare to the models that also lack these two enhancements.

While not reaching state of the art, RAT-SQL still achieves competitive performance on WikiSQL as shown in Table 5. Most of the gap between its accuracy and state of the art is due to the simplified implementation of value decoding, which is required for WikiSQL evaluation but not in Spider. Our value decoding for these experiments is a simple token-based pointer mechanism, which often fails to retrieve multi-token value constants accurately. A robust value decoding mechanism in RAT-SQL is an important extension that we plan to address outside the scope of this work.

5.4 Discussions

Alignment  Recall from Section 4 that we explicitly model the alignment matrix between question words and table columns, used during decoding for column and table selection. The existence of the alignment matrix provides a mechanism for the model to align words to columns. An accurate alignment representation has other benefits such as identifying question words to copy to emit a constant value in SQL.

In Figure 5 we show the alignment generated by our model on the example from Figure 1.⁴ For the three words that reference columns (“cylinders”, “model”, “horsepower”), the alignment matrix correctly identifies their corresponding columns. The alignments of other words are strongly affected by these three keywords, resulting in a sparse span-to-column like alignment, e.g. “largest horsepower” to horsepower. The tables cars_data and car_names are implicitly mentioned by the word “cars”. The alignment matrix correctly infers that these two tables should be used instead of car_makers, using the evidence that they contain the three mentioned columns.

Figure 5: Alignment between the question “For the cars with 4 cylinders, which model has the largest horsepower” and the database car_1 schema (columns and tables) depicted in Figure 1.

⁴ The full alignment also maps from column and table names, but those end up simply aligning to themselves or the table they belong to, so we omit them for brevity.

The Need for Schema Linking  One natural question is how often the decoder fails to select the correct column, even with the schema encoding and linking improvements we have made. To
                                        Dev                   Test
Model                                   LF Acc%   Ex. Acc%    LF Acc%   Ex. Acc%
IncSQL (Shi et al., 2018)               49.9      84.0        49.9      83.7
MQAN (McCann et al., 2018)              76.1      82.0        75.4      81.4
RAT-SQL (ours)                          73.6      79.5        73.3      78.8
Coarse2Fine (Dong and Lapata, 2018)     72.5      79.0        71.7      78.5
PT-MAML (Huang et al., 2018)            63.1      68.3        62.8      68.0

Table 5: RAT-SQL accuracy on WikiSQL, trained without BERT augmentation or execution-guided decoding (EG). Compared to other approaches without EG. “LF Acc” = Logical Form Accuracy; “Ex. Acc” = Execution Accuracy.
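
As a worked illustration of the training setup in Section 5, the learning-rate schedule (linear warmup over the first max_steps/20 steps to a peak of 7.4 × 10⁻⁴, then annealing to 0) can be written as a small function. This is a sketch, not the authors' code: the function and parameter names are hypothetical, and the decay power is taken to be a positive 0.5, an assumption under which the rate actually reaches 0 at max_steps as the text describes.

```python
def learning_rate(step, max_steps=40000, peak_lr=7.4e-4, warmup_divisor=20, decay_power=0.5):
    """Learning rate at a given training step.

    Linear warmup from 0 to peak_lr over the first max_steps / warmup_divisor
    steps, then polynomial annealing to 0 at max_steps.
    decay_power=0.5 is an assumption of this sketch (see the lead-in).
    """
    warmup_steps = max_steps / warmup_divisor
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / (max_steps - warmup_steps)
    # Clamp so that steps past max_steps do not produce a negative base.
    return peak_lr * max(0.0, 1.0 - progress) ** decay_power

for s in (0, 1000, 2000, 21000, 40000):
    print(s, learning_rate(s))
```

With the default settings the warmup lasts 2,000 steps, the peak is reached exactly at step 2,000, and the rate falls to 0 at step 40,000.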
References Pengcheng He, Yi Mao, Kaushik Chakrabarti, and
Weizhu Chen. 2019. X-SQL: reinforce schema
Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hin- representation with context. arXiv preprint
ton. 2016. Layer Normalization. arXiv:1607.06450. arXiv:1908.08113.
Ben Bogin, Jonathan Berant, and Matt Gardner. 2019a.
Vincent J. Hellendoorn, Charles Sutton, Rishabh Singh,
Representing schema structure with graph neural
Petros Maniatis, and David Bieber. 2020. Global
networks for text-to-SQL parsing. In Proceedings
relational models of source code. In International
of the 57th Annual Meeting of the Association for
Conference on Learning Representations.
Computational Linguistics, pages 4560–4565.
Ben Bogin, Matt Gardner, and Jonathan Berant. 2019b. Global reasoning over database structures for text-to-SQL parsing. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3657–3662.

Gino Brunner, Yang Liu, Damian Pascual, Oliver Richter, Massimiliano Ciaramita, and Roger Wattenhofer. 2020. On identifiability in Transformers. In International Conference on Learning Representations.

DongHyun Choi, Myeong Cheol Shin, EungGyun Kim, and Dong Ryeol Shin. 2020. RYANSQL: Recursively applying sketch-based slot fillings for complex text-to-SQL in cross-domain databases. arXiv preprint arXiv:2004.03125.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186.

Li Dong and Mirella Lapata. 2018. Coarse-to-fine decoding for neural semantic parsing. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 731–742, Melbourne, Australia. Association for Computational Linguistics.

Catherine Finegan-Dollak, Jonathan K. Kummerfeld, Li Zhang, Karthik Ramanathan, Sesh Sadasivam, Rui Zhang, and Dragomir Radev. 2018. Improving Text-to-SQL Evaluation Methodology. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 351–360.

Yarin Gal and Zoubin Ghahramani. 2016. A Theoretically Grounded Application of Dropout in Recurrent Neural Networks. In Advances in Neural Information Processing Systems 29, pages 1019–1027.

Jiaqi Guo, Zecheng Zhan, Yan Gao, Yan Xiao, Jian-Guang Lou, Ting Liu, and Dongmei Zhang. 2019. Towards complex text-to-SQL in cross-domain database with intermediate representation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4524–4535.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural computation, 9(8):1735–1780.

Cheng-Zhi Anna Huang, Ashish Vaswani, Jakob Uszkoreit, Ian Simon, Curtis Hawthorne, Noam Shazeer, Andrew M. Dai, Matthew D. Hoffman, Monica Dinculescu, and Douglas Eck. 2019. Music Transformer. In International Conference on Learning Representations.

Po-Sen Huang, Chenglong Wang, Rishabh Singh, Wen-tau Yih, and Xiaodong He. 2018. Natural language to structured query generation via meta-learning. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 732–738.

Srinivasan Iyer, Ioannis Konstas, Alvin Cheung, Jayant Krishnamurthy, and Luke Zettlemoyer. 2017. Learning a neural semantic parser from user feedback. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 963–973.

Amol Kelkar, Rohan Relan, Vaishali Bhardwaj, Saurabh Vaichal, and Peter Relan. 2020. Bertrand-DR: Improving text-to-SQL using a discriminative re-ranker. arXiv preprint arXiv:2002.00557.

Diederik P. Kingma and Jimmy Ba. 2015. Adam: A Method for Stochastic Optimization. In International Conference on Learning Representations.

Fei Li and H. V. Jagadish. 2014. Constructing an interactive natural language interface for relational databases. Proceedings of the VLDB Endowment, 8(1):73–84.

Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In Association for Computational Linguistics (ACL) System Demonstrations, pages 55–60.

Bryan McCann, Nitish Shirish Keskar, Caiming Xiong, and Richard Socher. 2018. The natural language decathlon: Multitask learning as question answering. arXiv preprint arXiv:1806.08730.

Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. 2017. Automatic differentiation in PyTorch.
Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, Doha, Qatar. Association for Computational Linguistics.

Ana-Maria Popescu, Oren Etzioni, and Henry Kautz. 2003. Towards a theory of natural language interfaces to databases. In Proceedings of the 8th International Conference on Intelligent User Interfaces, pages 149–157.

Peter Shaw, Jakob Uszkoreit, and Ashish Vaswani. 2018. Self-Attention with Relative Position Representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 464–468.

Tianze Shi, Kedar Tatwawadi, Kaushik Chakrabarti, Yi Mao, Oleksandr Polozov, and Weizhu Chen. 2018. IncSQL: Training Incremental Text-to-SQL Parsers with Non-Deterministic Oracles. arXiv:1809.05054 [cs].

Haitian Sun, Bhuwan Dhingra, Manzil Zaheer, Kathryn Mazaitis, Ruslan Salakhutdinov, and William Cohen. 2018. Open domain question answering using early fusion of knowledge bases and text. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4231–4242, Brussels, Belgium. Association for Computational Linguistics.

Lappoon R. Tang and Raymond J. Mooney. 2000. Automated construction of database interfaces: Integrating statistical and relational learning for semantic parsing. In 2000 Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, pages 133–141.

Pengcheng Yin and Graham Neubig. 2017. A Syntactic Neural Model for General-Purpose Code Generation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 440–450.

Tao Yu, Michihiro Yasunaga, Kai Yang, Rui Zhang, Dongxu Wang, Zifan Li, and Dragomir Radev. 2018a. SyntaxSQLNet: Syntax Tree Networks for Complex and Cross-Domain Text-to-SQL Task. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1653–1663.

Tao Yu, Rui Zhang, Kai Yang, Michihiro Yasunaga, Dongxu Wang, Zifan Li, James Ma, Irene Li, Qingning Yao, Shanelle Roman, Zilin Zhang, and Dragomir Radev. 2018b. Spider: A Large-Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic Parsing and Text-to-SQL Task. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3911–3921.

John M. Zelle and Raymond J. Mooney. 1996. Learning to parse database queries using inductive logic programming. In Proceedings of the Thirteenth National Conference on Artificial Intelligence - Volume 2, pages 1050–1055.

Rui Zhang, Tao Yu, He Yang Er, Sungrok Shim, Eric Xue, Xi Victoria Lin, Tianze Shi, Caiming Xiong, Richard Socher, and Dragomir Radev. 2019. Editing-based SQL query generation for cross-domain context-dependent questions. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing.

Victor Zhong, Caiming Xiong, and Richard Socher. 2017. Seq2SQL: Generating Structured Queries from Natural Language using Reinforcement Learning. arXiv:1709.00103 [cs].
A Auxiliary Relations for Schema Encoding

In addition to the schema graph edges $E$ (Section 4.2) and schema linking edges (Section 4.3), the edges in $E_Q$ also include some auxiliary relation types to aid the relation-aware self-attention. Specifically, for each $x_i, x_j \in V_Q$:

• If $i = j$, then COLUMN-IDENTITY or TABLE-IDENTITY.

• $x_i \in Q$, $x_j \in Q$: QUESTION-DIST-$d$, where
  $$d = \mathrm{clip}(j - i, D), \qquad \mathrm{clip}(a, D) = \max(-D, \min(D, a)).$$
  We use $D = 2$.

• Otherwise, one of COLUMN-COLUMN, COLUMN-TABLE, TABLE-COLUMN, or TABLE-TABLE.

B Alignment Loss

The memory-schema alignment matrix is expected to resemble the real discrete alignments, and should therefore respect certain constraints such as sparsity. For example, the question word “model” in Figure 1 should be aligned with car_names.model rather than model_list.model or model_list.model_id. To further bias the soft alignment towards the real discrete structures, we add an auxiliary loss to encourage sparsity of the alignment matrix. Specifically, for a column/table that is mentioned in the SQL query, we treat the model’s current belief of the best alignment as the ground truth. Then we use a cross-entropy loss, referred to as the alignment loss, to strengthen the model’s belief:

$$\text{align\_loss} = -\frac{1}{|\mathrm{Rel}(C)|} \sum_{j \in \mathrm{Rel}(C)} \log \max_i L^{\text{col}}_{i,j} \;-\; \frac{1}{|\mathrm{Rel}(T)|} \sum_{j \in \mathrm{Rel}(T)} \log \max_i L^{\text{tab}}_{i,j}$$

[…] to increase encoding depth eliminated the need for explicit supervision of alignment. With few layers in the Transformer, the alignment matrix provided additional degrees of freedom, which became unnecessary once the Transformer was sufficiently deep to build a rich joint representation of the question and the schema.

C Consistency of RAT-SQL

In the Spider dataset, most SQL queries correspond to more than one question, making it possible to evaluate the consistency of RAT-SQL given paraphrases. We use two metrics to evaluate the consistency: 1) Exact Match: whether RAT-SQL produces the exact same predictions given paraphrases; 2) Correctness: whether RAT-SQL achieves the same correctness given paraphrases. The analysis is conducted on the development set.

The results are shown in Table 7. We found that when augmented with BERT, RAT-SQL becomes more consistent in terms of both metrics, indicating that the pre-trained representations of BERT are beneficial for handling paraphrases.

Model             Exact Match   Correctness
RAT-SQL           0.59          0.81
RAT-SQL + BERT    0.67          0.86

Table 7: Consistency of the two RAT-SQL models.
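As a rough illustration of the alignment loss in Appendix B, the following minimal sketch evaluates the formula on plain Python lists. It is not the authors' implementation: the function name `alignment_loss`, its argument names, and the list-of-lists layout (rows index memory positions i, columns index schema items j) are assumptions made for this example, as is the choice to let an empty relevant set contribute zero.

```python
import math


def alignment_loss(l_col, l_tab, rel_cols, rel_tabs):
    """Sketch of align_loss from Appendix B (hypothetical helper).

    l_col[i][j] / l_tab[i][j]: alignment weight between memory position i
    and column / table j. Entries are assumed to be strictly positive.
    rel_cols / rel_tabs: indices of the columns / tables that appear in
    the gold SQL query, i.e. Rel(C) and Rel(T).
    """

    def term(matrix, relevant):
        # Assumption: an empty relevant set contributes nothing to the loss.
        if not relevant:
            return 0.0
        # For each relevant schema item j, take the strongest alignment
        # over all memory positions i and penalise its negative log.
        total = sum(math.log(max(row[j] for row in matrix)) for j in relevant)
        return -total / len(relevant)

    return term(l_col, rel_cols) + term(l_tab, rel_tabs)


# Toy example: two memory positions, two columns, one table.
l_col = [[0.5, 0.5], [0.25, 0.75]]
l_tab = [[1.0], [0.5]]
loss = alignment_loss(l_col, l_tab, rel_cols=[1], rel_tabs=[0])
# Column 1 has best alignment 0.75 and table 0 has best alignment 1.0,
# so the loss equals -log(0.75).
```

The loss reaches zero only when every relevant column and table has some memory position aligned to it with weight 1, which is how the term pushes the soft alignment towards a sparse, discrete one.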
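The QUESTION-DIST-d relation of Appendix A can be sketched in a few lines. The helper names `clip` and `question_dist` and the tuple encoding of the relation label are assumptions made for illustration; only the clipping formula and D = 2 come from the text.

```python
def clip(a, d):
    """clip(a, D) = max(-D, min(D, a)), as defined in Appendix A."""
    return max(-d, min(d, a))


def question_dist(i, j, max_dist=2):
    """Relation between question words at positions i and j.

    Returns a (name, d) pair; this label encoding is hypothetical.
    max_dist corresponds to D, which the paper sets to 2.
    """
    return ("QUESTION-DIST", clip(j - i, max_dist))


# Relation matrix for a four-word question: entry [i][j] holds the
# clipped relative offset from word i to word j.
n_words = 4
relations = [[question_dist(i, j) for j in range(n_words)]
             for i in range(n_words)]
```

With D = 2 there are only five distinct question-question relation types (offsets -2 to 2), so all word pairs farther apart than two positions share a relation.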