Improving Generalization in Language Model-Based Text-To-SQL
ing, even for state-of-the-art semantic parsers based on pre-trained language models (LMs). In this study, we empirically investigate improving an LM's generalization in semantic parsing with two simple techniques: at the token level, we introduce a token preprocessing method to preserve the semantic boundaries of tokens produced by LM tokenizers; at the sequence level, we propose to use special tokens to mark the boundaries of components aligned between input and output. Our experimental results on two text-to-SQL semantic parsing datasets show that our token preprocessing, although simple, can substantially improve the LM performance on both types of generalization, and our component boundary marking method is particularly helpful for compositional generalization.1

Token Preprocessing (applied to SQL)
Before: select avg ( flight.price ) where flight.origin = ‘New York’
After:  select average ( flight . price ) where flight . origin = ‘New York’

Component Boundary Marking (applied to NL input and SQL output)
Before: How many heads of the departments are older than 56 ?
        select count (head.*) where head.age > 56
After:  [sep0] How many heads of the departments [/sep0] [sep1] are older than 56 ? [/sep1]
        [sep0] select count (head.*) [/sep0] [sep1] where head.age > 56 [/sep1]

Table 1: Our proposed techniques. Top: we preprocess the text such that its T5 tokenization aligns with word semantics. Coloring indicates tokenization; for example, "avg" is converted into three tokens of "a", "v" and "g". Bottom: we add separator tokens to mark the boundaries of aligned semantic components in the input and output.

1 Introduction
Pre-trained language models (LMs)2 such as T5 (Raffel et al., 2020) have now been more and more widely adopted for semantic parsing due to their promising performance and straightforward architectures (Shaw et al., 2021; Scholak et al., 2021; Yin et al., 2021; Qi et al., 2022; Xie et al., 2022; Qiu et al., 2021). However, recent work revealed that these LMs still struggle to generalize on out-of-distribution (OOD) samples (Lake and Baroni, 2018; Keysers et al., 2019; Shaw et al., 2021; Qiu et al., 2022b). For example, if a parser has learned "how many heads are in the department" and "how many people are older than 56", it is expected to generalize to "how many heads of the departments are older than 56".

1 The source code for our implementation is available at https://github.com/Dakingrai/ood-generalization-semantic-boundary-techniques.
2 We use "LMs" to refer to a broad set of models that are pre-trained with (masked/autoregressive) language modeling objectives, with encoder-decoder or decoder-only architectures.

Generalizing to such novel component compositions is known as compositional generalization. Additionally, generalizing to new domains (e.g., from "entertainment" to "flight") is referred to as domain generalization.

In this paper, we investigate these two types of generalization of LMs in text-to-SQL semantic parsing, i.e., given a natural language (NL) input and the database schema, producing a SQL query that can be executed against the database for the desired output. We conduct experiments using the cross-database Spider benchmark (Yu et al., 2018b) and its derivation Spider-CG (Gan et al., 2022). Compared with existing benchmarks (Keysers et al., 2019; Lake and Baroni, 2018), this task setting is both more realistic (e.g., containing larger language variations) and more challenging (e.g., requiring grounding to the database context).
Although previous work tackling the two types of generalization all requires non-trivial engineering effort (see Section 2), in this work, we present two simple yet effective techniques, which are extremely easy to implement with LMs (Table 1). Our techniques improve the generalization of LMs by preserving the semantic boundaries at the token and the sequence levels. At the token level, our first technique rewrites the inputs to handle naming conventions in database schemas and SQL queries such that a pre-trained LM tokenizer can split them into semantically meaningful tokens. At the sequence level, our second technique introduces special tokens to mark the semantic boundaries (e.g., phrases) aligned between the source NL and the target SQL. These special tokens implicitly help the LM-based parser build more precise input-output correspondences that are crucial for compositional generalization.

On five evaluation sets, the experimental results based on T5-base show that, albeit simple, our token-level technique dramatically improves both types of LM generalization, and our sequence-level technique is particularly helpful for compositional generalization. Combining them leads to further improvements. Our additional experiments further demonstrate the generalizability of our approaches (e.g., to text-to-LISP expression parsing (Semantic Machines et al., 2020)).

2 Related Work

Text-to-SQL Semantic Parsing. This task has received considerable attention since the creation of the WikiSQL (Zhong et al., 2017) and Spider (Yu et al., 2018b) datasets. While a large amount of existing work designed specialized architectures for this task (Yu et al., 2018a; Zhang et al., 2019; Wang et al., 2020; Lin et al., 2020), there has been a trend of directly fine-tuning pre-trained sequence-to-sequence models as semantic parsers (Shaw et al., 2021; Scholak et al., 2021; Xie et al., 2022; Qi et al., 2022). Our work follows the same line and proposes approaches to further improve the LM performance. On the other hand, Guo et al. (2019); Gan et al. (2021); Herzig et al. (2021) showed that simplifying the SQL representation so that it semantically aligns better with the NL can dramatically improve the parsing performance. In our work, we follow the NatSQL representation (Gan et al., 2021) as it has better alignments with the NL.

Injecting Priors into Semantic Parsers. Our two techniques can be viewed as injecting human prior knowledge into neural models for better generalization, which has been one of the major research efforts on improving domain and compositional generalization. The key consideration when injecting priors is the trade-off between the form and the generalizability. Strong priors in the form of specialized model architectures (Shaw et al., 2021; Herzig and Berant, 2021; Wang et al., 2021) are either too expensive or not applicable across domains. Weaker priors in terms of specialized training algorithms (Yin et al., 2021; Conklin et al., 2021) are more general, but often weaker in performance compared to other lines of methods. Our work is in the spirit of the third line, the use of data augmentation (Andreas, 2020; Akyürek et al., 2020; Qiu et al., 2022a). However, instead of synthesizing new data from scratch, we "annotate" the data with semantic boundary markers, which is not only much simpler but also brings better performance. The final line of work (Qiu et al., 2022b; Levy et al., 2022) is based on the in-context learning capacities of large LMs, which is out of the scope of this work.

3 Methods

3.1 Token Preprocessing

Before preprocessing            After preprocessing
Snake case in schema items (add space)
booking_status_code             booking _ status _ code
document_type                   document _ type
Dot notation in column references (add space)
farm.cows                       farm . cows
origin.flight                   origin . flight
SQL keyword (expand spelling)
avg                             average
desc                            descending

Table 2: Three token preprocessing types. Coloring indicates tokenization, same as Table 1.

We present our two techniques for improving the generalization of LM-based semantic parsers. LM pre-training learns high-quality contextualized word representations (Devlin et al., 2019), but to effectively use them on a downstream task, the tokenization needs to "make sense." For example, if the text "pet_age" is tokenized as "pet", "_" and "age", then the semantics of "pet" and "age" acquired during pretraining can be directly used. However, if it is
tokenized as "pe", "t_a" and "ge", then pre-training is hardly useful because the model does not even recognize the two semantic words.

Unfortunately, this latter case is very common when tokenizing non-natural language texts, such as database schemas and SQL queries. Thus, we propose a token preprocessing method to induce more natural tokenization by, at a high level, adding white spaces and handling the naming conventions in database schemas and SQL queries. We show examples in Table 2 and details in Appendix A.

3.2 Component Boundary Marking

At the sequence level, our second technique further assists LMs in recognizing the semantic boundaries of components aligned between input and output. An example is shown in Table 1. While prior work has attempted this goal via alignment-based attention supervision (Yin et al., 2021), we propose to insert special tokens in input and output to inject such bias. Specifically, we use pairs of "[sepN]" and "[/sepN]", N ∈ Z, to mark the boundaries, so as to hint the LM that components within the paired special tokens should be aligned.

In practice, we also observed cases where an NL component has to be aligned with a SQL component consisting of multiple non-continuous segments. To handle this, we apply the same pair of special tokens to each segment of the same component. An example is shown in Table 8 in the Appendix.

Finally, we note that our method assumes the availability of component annotations. Such annotations can be obtained via human labeling (Gan et al., 2021), heuristic rules (Yin et al., 2021), or other advanced machine learning algorithms, but this is beyond the scope of our work.

4 Experiments

4.1 Setup

Datasets. We use two datasets, Spider (Yu et al., 2018b) and Spider-CG (Gan et al., 2022). Spider consists of a training set (SpiderT) and a development set (SpiderD) with non-overlapping domains but otherwise similar data characteristics (e.g., length). Thus, we train the models on SpiderT and consider SpiderD as the evaluation for domain generalization. Spider-CG is derived from Spider by first dissecting each Spider instance into different components according to its dependency parse and then generating data in two ways: substituting a component in one instance with one from another instance, and appending one component from one instance to another instance. Depending on whether the instances come from the Spider training or development set, we get four splits: CG-SUBT, CG-SUBD, CG-APPT, and CG-APPD, all of which are only used for evaluation. The instances created under substitution share similar data characteristics, while those under appending are much longer, so a good model performance on the latter requires compositional generalization. Table 3 summarizes the dataset information. In addition, we use the NatSQL representation (Gan et al., 2021) throughout the experiments due to its better alignment with the NL input.

Dataset   Size    Usage  Generalization Type
SpiderT   7,000   Train  None (in-distribution)
SpiderD   1,034   Eval   Domain
CG-SUBT   20,686  Eval   None (in-distribution)
CG-SUBD   2,883   Eval   Domain
CG-APPT   18,793  Eval   Composition
CG-APPD   3,237   Eval   Domain & Composition

Table 3: Datasets in our experiments.

Evaluation Metrics. We follow the standard Spider benchmarking and employ two evaluation metrics. Exact Match (EM) compares the generated and the ground-truth queries by performing exact set matching at the lexical level (Yu et al., 2018b). Execution Match (EX) measures whether executing the generated query on the given database yields the same results as executing the ground truth. Notably, for a fair comparison with existing semantic parsers on the Spider leaderboard, we follow Gan et al. (2022): we convert each generated NatSQL query into a SQL query and report the evaluation results based on the converted SQL query.

Models, Baselines, and Implementation. We evaluate our proposed techniques by applying them to the pre-trained T5 model (Raffel et al., 2020). Our experiments are conducted using T5-base, with the use of database contents following Lin et al. (2020). As our second technique leverages component boundary labels to encourage the compositional generalization of LMs, we compare it with a baseline (Yin et al., 2021) which similarly assumes the labels but utilizes them in a more complicated way, i.e., transforming the component alignments into supervision on the cross attention between the input and output of the LM. We denote this baseline
Model                       SpiderD      CG-SUBT      CG-SUBD      CG-APPT      CG-APPD
                            EM    EX     EM    EX     EM    EX     EM    EX     EM    EX
Semantic Parsers with Specialized Architectures (Gan et al., 2022)
RATSQLB(S)                  71.9  -      91.0  -      72.6  -      79.8  -      61.5  -
RATSQLG(S)                  74.5  -      91.4  -      76.7  -      82.5  -      68.3  -
Semantic Parsers based on LMs
T5-base                     64.6  67.9   83.8  88.1   69.1  71.1   60.2  70.3   45.0  54.9
T5-base + Tok               71.8  75.6   85.9  89.5   74.1  78.6   65.2  73.8   54.2  65.9
T5-base + Comp              64.4  68.2   86.3  90.2   69.3  73.1   69.8  77.9   53.5  63.4
T5-base + Tok + Comp        69.4  73.2   86.6  90.7   76.6  79.8   71.1  77.8   61.0  69.4
T5-base + Tok + Attn. Sup   69.4  73.7   83.6  87.7   71.7  75.6   62.3  70.8   56.3  66.2

Table 4: Results (%) on different evaluation sets. Top: state-of-the-art models using a specialized architecture; numbers are collected from their paper and only EM is reported (code unavailable). Bottom: T5-base models with our proposed or baseline techniques; we report the average performance of each model over three runs. Tok: token preprocessing. Comp: component boundary marking. Attn. Sup: the attention supervision method of Yin et al. (2021).
as Attn. Sup.3 For both methods, we leverage component annotations from Spider-SS (Gan et al., 2022). These annotations were generated by applying a syntactic parser to decompose the NL question into sub-questions and then manually annotating their corresponding NatSQL components.

We also compare with the state-of-the-art models, RATSQLB(S) and RATSQLG(S), from Gan et al. (2022), although their models adopt a specialized architecture (i.e., RATSQL (Wang et al., 2020)) and RATSQLG(S) additionally employed task-specific pre-training (Shi et al., 2021). Both models used the same component annotations from Spider-SS.

Finally, for each of our model variants in Table 4, we repeat the experiment three times, using three random seeds consistently across all models, and report the average results. We include more implementation details in Appendix D.

3 In our implementation, we apply the supervision to the cross-attention distribution averaged across all decoder layers and heads. We also tried cross-attention from only the top decoder layer, but the results are similar.

4.2 Results

Main Results. We present our results in Table 4. First, all models obtain the best performance on the in-distribution evaluation set CG-SUBT while suffering from more than 10% performance drops on the others, confirming the challenges of domain and compositional generalization. As expected, all models have the worst performance on CG-APPD, which requires both types of generalization. Between the two types, it is also observed that compositional generalization (as measured by CG-APPT) is more challenging than domain generalization (as measured by SpiderD and CG-SUBD).

Second, our results show that the token preprocessing method, albeit simple, can improve both domain and compositional generalization of LMs dramatically. For example, comparing T5-base with T5-base+Tok, the latter is improved by around 5-7% EM and 7% EX for domain generalization (on SpiderD and CG-SUBD), 5% EM and 3.5% EX for compositional generalization (on CG-APPT), and 9% EM and 11% EX for the challenging case when both types occur (on CG-APPD). Additionally, we also show the effectiveness of token preprocessing with T5-3B on SpiderD in App. B.

Moving on to our proposed component boundary marking method, it proves to be particularly helpful for compositional generalization. Specifically, applying it to T5-base leads to a 9% EM and 7% EX increase on CG-APPT, and an 8% EM and 8% EX increase on CG-APPD. On the in-distribution evaluation set, this technique also gives a slight improvement, whereas for domain generalization there is no obvious impact from this technique.

Finally, augmenting T5-base with both techniques (i.e., T5-base+Tok+Comp) leads to better performance than applying each technique individually on most evaluation sets, implying that our two techniques are complementary to each other. Specifically, for in-distribution evaluation, using each technique individually or both of them together yields similar results; for domain generalization, there is no additional gain from applying component boundary marking on top of the token preprocessing; for compositional generalization, the two techniques together contribute the best EM across all models and baselines. Overall, combining the two techniques shrinks the performance gap between in-distribution and domain OOD by around 2-4% EM, composition OOD by 7%, and joint OOD by 13%.

Compared with Special Architectures. Despite its simplicity, our T5-base+Tok+Comp model achieves comparable or better performance than the two RATSQL variants on CG-SUBD. It also performs comparably to RATSQLB(S) on CG-APPD.

Compared with Attn. Sup. Surprisingly, the attention supervision has only led to around 2% EM and 1.5% EX gains on CG-APPD, while no further advantage is observed on other evaluation sets. In our conjecture, this is due to the misalignment between the objective of Attn. Sup (Yin et al., 2021) and the attention mechanism of pre-trained LMs. Specifically, Attn. Sup encourages the attention distributions of different heads to be consistent with the component alignment supervision. However, prior work (Voita et al., 2019) suggests that different attention heads of even the same layer may have different functions and roles. Thus, such a coarsely defined objective function may not allow for the most effective supervision. Furthermore, similar to our finding, Yin et al. (2021) did not observe a performance gain when they applied Attn. Sup to T5-base on CFQ (Keysers et al., 2020).

Qualitative Analysis on Tokenization. To qualitatively understand how our token preprocessing helps generalization, we randomly sampled 50 examples from SpiderD to analyze how frequently the T5 tokenizer divides tokens into less meaningful subtokens. Consequently, we found 243 tokenization issues in total, and 140 of them can be resolved by our token preprocessing. The remaining cases involve splits like dividing "id" into "i" and "d", as shown in Table 1, which is beyond our scope.

Error Analysis on Component Boundary Marking. We manually examined 50 error predictions from T5-base+Tok+Comp and contrasted them with the errors of T5-base+Tok. Intriguingly, we observed much more frequent schema item or value hallucinations from the former. For example, it may generate queries accessing non-existing columns in a table, or misspell the literal values in the queries. We conjecture that this is because our component boundaries are only applied to the NL input, not the database schema (note that literal values are grounded and attached to schema items in their input representations; see Appendix D for details). This reveals a new challenge of LM generalization in text-to-SQL semantic parsing, i.e., how to properly handle the database schema when injecting prior knowledge into LMs for compositional generalization.

Generalizing to Other Semantic Parsing Tasks. While our main focus in this work is on text-to-SQL parsing, we also investigate whether our approaches can generalize beyond this specific task. To this end, we applied both of our techniques to SMCalFlow-CS (Yin et al., 2021), a compositional generalization dataset for text-to-LISP expression parsing (Semantic Machines et al., 2020). For "+Comp", we utilize the span-level alignments heuristically derived by Yin et al. (2021) as component annotations.4 Our results in Table 5 show that: (1) our token preprocessing can be universally helpful for LMs to model schema items, predicates, etc., leading to a 1.2% performance gain over T5-base; (2) our component boundary marking method is highly effective for compositional generalization, offering a 2.6% additional gain.

Model                                 Exact Match
COARSE2FINE + SS (Span-level Sup.)    47.4
T5-base                               63.9
T5-base + Tok                         65.1
T5-base + Tok + Comp                  67.7

Table 5: Results (%) on the SMCalFlow-Compositional Skills dataset (16-shot setting). Top: result from Yin et al. (2021). Bottom: T5-base models with our proposed or baseline techniques; we report the average performance of each model over three runs.

4 Yin et al.'s approach requires knowing the ground-truth LISP expression when deriving the component boundaries for the input question. In our experiment, we assume the availability of these question boundaries at test time and focus on showcasing the potential of "Comp", while automating this question decomposition is left as future work.

5 Conclusion

In this paper, we present two simple yet effective techniques to improve the domain and compositional generalization of LMs in text-to-SQL semantic parsing. Our techniques aid LMs in preserving the semantic boundaries of tokens and components in their input and output. We also demonstrate their potential to be generalized to other semantic parsing tasks.
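To make the component boundary marking of Section 3.2 concrete, here is a minimal sketch (the function name and span format are our own; the paper derives the actual spans from Spider-SS annotations):

```python
def mark_boundaries(spans):
    """Wrap each aligned component span in paired [sepN] ... [/sepN] tokens.

    `spans` is an ordered list of component strings; component i receives the
    separator index i on both the NL and SQL sides. For a non-contiguous
    component, the same index would be reused for each of its segments
    (Table 8 of the paper).
    """
    return " ".join(f"[sep{i}] {s} [/sep{i}]" for i, s in enumerate(spans))

# Hypothetical aligned components for the Table 1 example.
nl_spans = ["How many heads of the departments", "are older than 56 ?"]
sql_spans = ["select count (head.*)", "where head.age > 56"]

marked_nl = mark_boundaries(nl_spans)
marked_sql = mark_boundaries(sql_spans)
print(marked_nl)
print(marked_sql)
```

Because the same separator index brackets the i-th component on both sides, the parser is implicitly hinted that the paired spans should be aligned; in practice the `[sepN]`/`[/sepN]` strings would also be registered as special tokens in the LM tokenizer's vocabulary.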
Limitations deep bidirectional transformers for language under-
standing. In Proceedings of the 2019 Conference of
Future work can further apply our approaches the North American Chapter of the Association for
to other semantic parsing tasks. For example, Computational Linguistics: Human Language Tech-
nologies, Volume 1 (Long and Short Papers), pages
for parsing texts to lambda-calculus expressions
4171–4186, Minneapolis, Minnesota. Association for
for knowledge base question answering (Dong Computational Linguistics.
and Lapata, 2016), one can similarly preprocess
the schema items (e.g., “department_time” into Li Dong and Mirella Lapata. 2016. Language to logical
form with neural attention. In Proceedings of the
“department _ time”) and typed values (e.g., 54th Annual Meeting of the Association for Compu-
“dallas:ci” into “dallas : ci”) for more mean- tational Linguistics (Volume 1: Long Papers), pages
ingful subword tokenization results. In addition, 33–43, Berlin, Germany. Association for Computa-
our experiments are based on T5. To further verify tional Linguistics.
the effectiveness of our techniques, one can apply Yujian Gan, Xinyun Chen, Qiuping Huang, and
them to other pre-trained language models such as Matthew Purver. 2022. Measuring and improving
BART (Lewis et al., 2020) and GPT-2 (Radford compositional generalization in text-to-SQL via com-
et al., 2019) as well. ponent alignment. In Findings of the Association for
Computational Linguistics: NAACL 2022, pages 831–
843, Seattle, United States. Association for Compu-
Acknowledgments tational Linguistics.
We would like to thank all anonymous reviewers Yujian Gan, Xinyun Chen, Jinxia Xie, Matthew Purver,
for their constructive comments. We also thank Yu- John R. Woodward, John Drake, and Qiaofu Zhang.
jian Gan and Xinyun Chen for their help in using 2021. Natural SQL: Making SQL easier to infer from
natural language specifications. In Findings of the
the NatSQL and the Spider-SS datasets, as well as Association for Computational Linguistics: EMNLP
Pengcheng Yin for using the code base of Attn. Sup. 2021, pages 2030–2042, Punta Cana, Dominican Re-
This project was supported by resources provided public. Association for Computational Linguistics.
by the Office of Research Computing at George
Jiaqi Guo, Zecheng Zhan, Yan Gao, Yan Xiao, Jian-
Mason University (https://orc.gmu.edu) and Guang Lou, Ting Liu, and Dongmei Zhang. 2019. To-
funded in part by grants from the National Sci- wards complex text-to-SQL in cross-domain database
ence Foundation (Awards Number 1625039 and with intermediate representation. In Proceedings of
2018631). the 57th Annual Meeting of the Association for Com-
putational Linguistics, pages 4524–4535, Florence,
Italy. Association for Computational Linguistics.
References Jonathan Herzig and Jonathan Berant. 2021. Span-
based semantic parsing for compositional general-
Ekin Akyürek, Afra Feyza Akyürek, and Jacob An- ization. In Proceedings of the 59th Annual Meet-
dreas. 2020. Learning to recombine and resample ing of the Association for Computational Linguistics
data for compositional generalization. arXiv preprint and the 11th International Joint Conference on Natu-
arXiv:2010.03706. ral Language Processing (Volume 1: Long Papers),
pages 908–921, Online. Association for Computa-
Jacob Andreas. 2020. Good-enough compositional data tional Linguistics.
augmentation. In Proceedings of the 58th Annual
Meeting of the Association for Computational Lin- Jonathan Herzig, Peter Shaw, Ming-Wei Chang, Kelvin
guistics, pages 7556–7566, Online. Association for Guu, Panupong Pasupat, and Yuan Zhang. 2021. Un-
Computational Linguistics. locking compositional generalization in pre-trained
models using intermediate representations. arXiv
Henry Conklin, Bailin Wang, Kenny Smith, and Ivan preprint arXiv:2104.07478.
Titov. 2021. Meta-learning to compositionally gen-
eralize. In Proceedings of the 59th Annual Meet- Daniel Keysers, Nathanael Schärli, Nathan Scales,
ing of the Association for Computational Linguistics Hylke Buisman, Daniel Furrer, Sergii Kashubin,
and the 11th International Joint Conference on Natu- Nikola Momchev, Danila Sinopalnikov, Lukasz
ral Language Processing (Volume 1: Long Papers), Stafiniak, Tibor Tihon, Dmitry Tsarkov, Xiao Wang,
pages 3322–3335, Online. Association for Computa- Marc van Zee, and Olivier Bousquet. 2020. Measur-
tional Linguistics. ing compositional generalization: A comprehensive
method on realistic data. In International Conference
DeepSpeed. 2023. https://github.com/microsoft/deepspeed. on Learning Representations.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Daniel Keysers, Nathanael Schärli, Nathan Scales,
Kristina Toutanova. 2019. BERT: Pre-training of Hylke Buisman, Daniel Furrer, Sergii Kashubin,
Nikola Momchev, Danila Sinopalnikov, Lukasz Colin Raffel, Noam Shazeer, Adam Roberts, Katherine
Stafiniak, Tibor Tihon, et al. 2019. Measuring com- Lee, Sharan Narang, Michael Matena, Yanqi Zhou,
positional generalization: A comprehensive method Wei Li, Peter J Liu, et al. 2020. Exploring the limits
on realistic data. arXiv preprint arXiv:1912.09713. of transfer learning with a unified text-to-text trans-
former. J. Mach. Learn. Res., 21(140):1–67.
Brenden Lake and Marco Baroni. 2018. Generalization
without systematicity: On the compositional skills Torsten Scholak, Nathan Schucher, and Dzmitry Bah-
of sequence-to-sequence recurrent networks. In Pro- danau. 2021. PICARD: Parsing incrementally for
ceedings of the 35th International Conference on constrained auto-regressive decoding from language
Machine Learning, volume 80 of Proceedings of Ma- models. In Proceedings of the 2021 Conference on
chine Learning Research, pages 2873–2882. PMLR. Empirical Methods in Natural Language Processing,
pages 9895–9901, Online and Punta Cana, Domini-
Itay Levy, Ben Bogin, and Jonathan Berant. 2022. can Republic. Association for Computational Lin-
Diverse demonstrations improve in-context guistics.
compositional generalization. arXiv preprint
arXiv:2212.06800. Semantic Machines, Jacob Andreas, John Bufe, David
Burkett, Charles Chen, Josh Clausman, Jean Craw-
Mike Lewis, Yinhan Liu, Naman Goyal, Marjan ford, Kate Crim, Jordan DeLoach, Leah Dorner, Ja-
Ghazvininejad, Abdelrahman Mohamed, Omer Levy, son Eisner, Hao Fang, Alan Guo, David Hall, Kristin
Veselin Stoyanov, and Luke Zettlemoyer. 2020. Hayes, Kellie Hill, Diana Ho, Wendy Iwaszuk, Sm-
BART: Denoising sequence-to-sequence pre-training riti Jha, Dan Klein, Jayant Krishnamurthy, Theo Lan-
for natural language generation, translation, and com- man, Percy Liang, Christopher H. Lin, Ilya Lints-
prehension. In Proceedings of the 58th Annual Meet- bakh, Andy McGovern, Aleksandr Nisnevich, Adam
ing of the Association for Computational Linguistics, Pauls, Dmitrij Petters, Brent Read, Dan Roth, Subhro
pages 7871–7880, Online. Association for Computa- Roy, Jesse Rusak, Beth Short, Div Slomin, Ben Sny-
tional Linguistics. der, Stephon Striplin, Yu Su, Zachary Tellman, Sam
Thomson, Andrei Vorobev, Izabela Witoszko, Jason
Xi Victoria Lin, Richard Socher, and Caiming Xiong. Wolfe, Abby Wray, Yuchen Zhang, and Alexander
2020. Bridging textual and tabular data for cross- Zotov. 2020. Task-oriented dialogue as dataflow syn-
domain text-to-SQL semantic parsing. In Findings thesis. Transactions of the Association for Computa-
of the Association for Computational Linguistics: tional Linguistics, 8:556–571.
EMNLP 2020, pages 4870–4888, Online. Association
for Computational Linguistics. Peter Shaw, Ming-Wei Chang, Panupong Pasupat, and
Kristina Toutanova. 2021. Compositional generaliza-
Jiexing Qi, Jingyao Tang, Ziwei He, Xiangpeng Wan, tion and natural language variation: Can a semantic
Chenghu Zhou, Xinbing Wang, Quanshi Zhang, and parsing approach handle both? In Proceedings of the
Zhouhan Lin. 2022. Rasat: Integrating relational 59th Annual Meeting of the Association for Compu-
structures into pretrained seq2seq model for text-to- tational Linguistics and the 11th International Joint
sql. arXiv preprint arXiv:2205.06983. Conference on Natural Language Processing (Vol-
ume 1: Long Papers), pages 922–938, Online. Asso-
Linlu Qiu, Peter Shaw, Panupong Pasupat, Pawel ciation for Computational Linguistics.
Nowak, Tal Linzen, Fei Sha, and Kristina Toutanova.
2022a. Improving compositional generalization with latent structure and data augmentation. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4341–4362, Seattle, United States. Association for Computational Linguistics.

Linlu Qiu, Peter Shaw, Panupong Pasupat, Paweł Krzysztof Nowak, Tal Linzen, Fei Sha, and Kristina Toutanova. 2021. Improving compositional generalization with latent structure and data augmentation. arXiv preprint arXiv:2112.07610.

Linlu Qiu, Peter Shaw, Panupong Pasupat, Tianze Shi, Jonathan Herzig, Emily Pitler, Fei Sha, and Kristina Toutanova. 2022b. Evaluating the impact of model scale for compositional generalization in semantic parsing. arXiv preprint arXiv:2205.12253.

Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners.

Peng Shi, Patrick Ng, Zhiguo Wang, Henghui Zhu, Alexander Hanbo Li, Jun Wang, Cicero Nogueira dos Santos, and Bing Xiang. 2021. Learning contextual representations for semantic parsing with generation-augmented pre-training. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 13806–13814.

Elena Voita, David Talbot, Fedor Moiseev, Rico Sennrich, and Ivan Titov. 2019. Analyzing multi-head self-attention: Specialized heads do the heavy lifting, the rest can be pruned. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5797–5808, Florence, Italy. Association for Computational Linguistics.

Bailin Wang, Mirella Lapata, and Ivan Titov. 2021. Structured reordering for modeling latent alignments in sequence transduction. Advances in Neural Information Processing Systems, 34:13378–13391.

Bailin Wang, Richard Shin, Xiaodong Liu, Oleksandr Polozov, and Matthew Richardson. 2020. RAT-SQL: Relation-aware schema encoding and linking for text-to-SQL parsers. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7567–7578, Online. Association for Computational Linguistics.

Victor Zhong, Caiming Xiong, and Richard Socher. 2017. Seq2SQL: Generating structured queries from natural language using reinforcement learning. arXiv preprint arXiv:1709.00103.
Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Remi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander Rush. 2020. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45, Online. Association for Computational Linguistics.

Tianbao Xie, Chen Henry Wu, Peng Shi, Ruiqi Zhong, Torsten Scholak, Michihiro Yasunaga, Chien-Sheng Wu, Ming Zhong, Pengcheng Yin, Sida I. Wang, et al. 2022. UnifiedSKG: Unifying and multi-tasking structured knowledge grounding with text-to-text language models. arXiv preprint arXiv:2201.05966.

Pengcheng Yin, Hao Fang, Graham Neubig, Adam Pauls, Emmanouil Antonios Platanios, Yu Su, Sam Thomson, and Jacob Andreas. 2021. Compositional generalization for neural semantic parsing via span-level supervised attention. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2810–2823, Online. Association for Computational Linguistics.

Tao Yu, Michihiro Yasunaga, Kai Yang, Rui Zhang, Dongxu Wang, Zifan Li, and Dragomir Radev. 2018a. SyntaxSQLNet: Syntax tree networks for complex and cross-domain text-to-SQL task. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1653–1663, Brussels, Belgium. Association for Computational Linguistics.

Tao Yu, Rui Zhang, Kai Yang, Michihiro Yasunaga, Dongxu Wang, Zifan Li, James Ma, Irene Li, Qingning Yao, Shanelle Roman, Zilin Zhang, and Dragomir Radev. 2018b. Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-SQL task. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3911–3921, Brussels, Belgium. Association for Computational Linguistics.

Rui Zhang, Tao Yu, Heyang Er, Sungrok Shim, Eric Xue, Xi Victoria Lin, Tianze Shi, Caiming Xiong, Richard Socher, and Dragomir Radev. 2019. Editing-based SQL query generation for cross-domain context-dependent questions. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5338–5349, Hong Kong, China. Association for Computational Linguistics.

A Token Preprocessing Details

We propose a simple token preprocessing method. Instead of directly feeding the input to the subword tokenizer, we introduce three preprocessing steps: (1) for schema items in the input and output, splitting snake case back into its constituent words, e.g., "pet_age" to "pet _ age"; (2) for any "Table.Column" reference, splitting the tokens around the access operator "." (i.e., "Table . Column"); and (3) replacing reserved words that cannot be properly handled in NatSQL, e.g., "avg" with "average". In practice, we also handle formalism-specific special tokens, e.g., adding the "less than" operator "<" to the vocabulary of the T5 tokenizer. While we showcase our token preprocessing on text-to-SQL parsing, the intuition generalizes easily to other formalisms (e.g., regex, λ-expressions).

In addition, we checked the tokenization of other popular LM tokenizers and found that the issue is not specific to T5. Examples of bad tokenization from the BERT (Devlin et al., 2019) and GPT2 (Radford et al., 2019) tokenizers, before and after our token preprocessing, are listed in Table 6.

GPT2 Tokenizer
Before: student_enrolment_courses
After: student _ enrolment _ courses
Before: transcripts.transcript_date
After: transcripts . transcript _ date
Before: avg
After: average

BERT Tokenizer
Before: singer.NetWorthMillions
After: singer . Net Worth Millions
Before: avg
After: average
Before: asc
After: ascending

Table 6: Tokenization of snake case, camel case, and token notation in the BERT and GPT2 tokenizers. Coloring indicates tokenization, same as Table 1.
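The token preprocessing steps described in Appendix A can be sketched in a few lines of Python. This is an illustrative reimplementation rather than the authors' code, and the reserved-word map shows only a few entries; the full set handled in the paper may differ.

```python
import re

# Illustrative replacement map for step (3); the paper's full list of
# reserved words may differ.
RESERVED_WORDS = {"avg": "average", "asc": "ascending", "desc": "descending"}

def preprocess_tokens(text: str) -> str:
    # (1) Split snake case: "pet_age" -> "pet _ age"
    text = text.replace("_", " _ ")
    # (2) Split "Table.Column" around the access operator ".":
    #     "flight.price" -> "flight . price"
    text = re.sub(r"(\w)\.(\w)", r"\1 . \2", text)
    # (3) Replace reserved words that subword tokenizers fragment badly
    tokens = [RESERVED_WORDS.get(tok, tok) for tok in text.split()]
    return " ".join(tokens)
```

For example, `preprocess_tokens("select avg ( flight.price )")` yields `"select average ( flight . price )"`, matching the preprocessing shown in Table 1.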
B T5-3B Experiment

To assess the effectiveness of our token preprocessing technique with larger LMs, we apply it to T5-3B and evaluate the model on SpiderD.

Model                          EM    EX
T5-3B (w/ deepspeed)           73.2  77.4
T5-3B (w/o deepspeed)          76.0  79.8
T5-3B + Tok (w/ deepspeed)     74.4  78.7
T5-3B + Tok (w/o deepspeed)    77.4  80.9

Table 7: Results (%) on SpiderD when T5-3B (+Tok) was trained with or without using DeepSpeed.

The results are shown in Table 7. T5-3B+Tok obtains a performance gain of 1.1%, indicating that our token preprocessing is helpful for larger LMs as well. Additionally, we provide results with and without DeepSpeed (2023), a deep learning optimization library used to train large models more efficiently. Surprisingly, although DeepSpeed (2023) helped us improve training speed, we found a performance drop of around 2.1–2.2% EX when using it. In both settings, however, our token preprocessing consistently leads to around 1.0% absolute performance gain.

C Component Boundary Marking Details

In Table 8, we present one more example of component boundary marking. In this example, the NL component "What is the most populace city" is aligned with two non-contiguous SQL segments, "select city.Name, city.Population" and "order by city.Population desc limit 1". To handle such cases, we apply the same pair of special tokens "[sep0]" and "[/sep0]" twice, once for each segment.

Component Boundary Marking Example
Before: What is the most populace city that speaks English?
select city.Name, city.Population where countrylanguage.Language = "English" order by city.Population desc limit 1
After: [sep0] What is the most populace city [/sep0] [sep1] that speaks English? [/sep1]
[sep0] select city.Name , city.Population [/sep0] [sep1] where countrylanguage.Language = "English" [/sep1] [sep0] order by city.Population desc limit 1 [/sep0]

Table 8: An additional example of component boundary marking, where one NL component aligns with two non-contiguous SQL segments.
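The marking scheme in Table 8 can be sketched as follows. This is a minimal illustration, not the paper's implementation: the alignment between NL and SQL components is assumed to be given as (component id, text) pairs, and the [sepK] markers would additionally need to be registered as special tokens in the LM tokenizer's vocabulary.

```python
def mark_components(segments):
    """Join (component_id, text) segments, wrapping each in [sepK] ... [/sepK].

    The same id may appear more than once, e.g., when one NL component
    aligns with two non-contiguous SQL segments.
    """
    return " ".join(f"[sep{i}] {text} [/sep{i}]" for i, text in segments)

# The NL side and the SQL side are marked with the same component ids:
nl = mark_components([(0, "How many heads of the departments"),
                      (1, "are older than 56 ?")])
sql = mark_components([(0, "select count (head.*)"),
                       (1, "where head.age > 56")])
```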
D Implementation Details

Our experiments are conducted based on the pre-trained T5 model. The input to T5 follows the same format and order as Scholak et al. (2021) (except for our additional token preprocessing, when applied), i.e., "Question | Database 1 | Table 1: Column 1, Column 2, ... | Table 2: Column 1, Column 2, ...". We also use the database contents as part of the input, following Lin et al. (2020). For example, if the NL question mentions a literal value (e.g., "New York") that can be found via fuzzy string matching in the contents of a certain "Column 1", then when we represent the database schema, we include it via "Database 1 | Table 1: Column 1 (New York), Column 2, ...".

We fine-tune the T5-base LM, which consists of 220 million parameters, on an NVIDIA A100 GPU for 10–12 hours. It was trained with a learning rate of 10⁻⁴ and a batch size of 16 for a maximum of 20K training steps. The model is evaluated on SpiderD every 1K training steps, and the best checkpoint is selected based on the model's EM on SpiderD. At inference time, we perform simple greedy decoding.

We use the PyTorch-Transformers library (Wolf et al., 2020), a library of state-of-the-art pre-trained models for NLP, to fine-tune our models. Specifically, our code for fine-tuning T5-base is adapted from PICARD's implementation (Scholak et al., 2021). Furthermore, we use DeepSpeed (2023) to fine-tune all of our T5-base models.

Datasets. We used Spider (Yu et al., 2018b), NatSQL (Gan et al., 2021), Spider-CG (Gan et al., 2022), and SMCalFlow-CS (Yin et al., 2021) in our work. They are under the CC BY-SA 4.0 license. Our use of these datasets is consistent with their intended use, i.e., scientific research. All datasets are in English. They contain annotated pairs of NL questions and SQL, NatSQL, or LISP expressions from the open domain.
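As an illustration of the input serialization described in Appendix D, the sketch below builds the flat input string. All function and variable names are ours, and the fuzzy matching step is assumed to have already produced the matched literal values.

```python
def serialize_input(question, db_name, schema, matched_values=None):
    """Serialize an NL question and database schema into a flat input string.

    `schema` maps table name -> list of column names; `matched_values`
    maps (table, column) -> a literal from the question found in that
    column's contents via fuzzy matching. Names here are illustrative.
    """
    matched_values = matched_values or {}
    parts = [question, db_name]
    for table, columns in schema.items():
        rendered = []
        for column in columns:
            value = matched_values.get((table, column))
            # Append matched database contents in parentheses,
            # e.g., "Column 1 (New York)"
            rendered.append(f"{column} ({value})" if value else column)
        parts.append(f"{table}: " + ", ".join(rendered))
    return " | ".join(parts)
```

For instance, a question mentioning "New York" with a fuzzy match against the `Hometown` column would serialize as `"... | concert_singer | singer: Name, Hometown (New York)"`.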