STaR-SQL: Self-Taught Reasoner for Text-to-SQL
Abstract

Generating step-by-step chain-of-thought rationales has proven effective for improving the performance of large language models on complex reasoning tasks. However, applying such techniques to structured tasks, such as text-to-SQL, remains largely unexplored. In this paper, we introduce the Self-Taught Reasoner for text-to-SQL (STaR-SQL), a novel approach that reframes SQL query generation as a reasoning-driven process. Our method prompts the LLM to produce detailed reasoning steps for SQL queries and fine-tunes it on rationales that lead to correct outcomes. Unlike traditional methods, STaR-SQL dedicates additional test-time computation to reasoning, thereby positioning LLMs as spontaneous reasoners rather than mere prompt-based agents. To further scale the inference process, we incorporate an outcome-supervised reward model (ORM) as a verifier, which enhances SQL query accuracy. Experimental results on the challenging Spider benchmark demonstrate that STaR-SQL significantly improves text-to-SQL performance, achieving an execution accuracy of 86.6%. This surpasses a few-shot baseline by 31.6% and a baseline fine-tuned to predict answers directly by 18.0%. Additionally, STaR-SQL outperforms agent-like prompting methods that leverage more powerful yet closed-source models such as GPT-4. These findings underscore the potential of reasoning-augmented training for structured tasks and open the door to extending self-improving reasoning models to text-to-SQL generation and beyond.

1 Introduction

Large Language Models (LLMs) have demonstrated remarkable potential in various language tasks (Brown et al., 2020; Achiam et al., 2023), including text-to-SQL translation (Rajkumar et al., 2022; Liu et al., 2023a). Interacting with complex relational databases typically requires both familiarity with SQL and knowledge of the underlying schema; text-to-SQL systems instead let users pose questions in natural language, automatically translating them into SQL queries and returning the results (Cai et al., 2017; Xu et al., 2017; Yaghmazadeh et al., 2017).

Despite significant advancements in this field, most existing approaches primarily harness LLMs for their instruction-following capabilities, focusing on schema selection optimization and result refinement (Pourreza and Rafiei, 2024a), as illustrated in Figure 1. However, these prompts can be rigid and consume a substantial portion of the available context tokens. Smaller open-source models may also struggle to interpret and follow the carefully crafted prompts on which these methods rely. Moreover, this narrow emphasis on prompt engineering frequently overlooks the powerful reasoning capabilities inherent in LLMs (Liu et al., 2023b; Frieder et al., 2024). While these methods perform well on simple queries, they tend to falter when confronted with more complex ones (Eyal et al., 2023). This shortcoming is particularly problematic for non-experts, who may have trouble verifying whether the generated SQL queries accurately capture their original intent. Complex misalignments in SQL queries can be especially difficult for users to detect and correct.

To address these challenges, we reconceptualize text-to-SQL as a reasoning-driven process, enabling LLMs to handle complex queries by generating step-by-step rationales.
Figure 1: A comparison of different text-to-SQL methods: Traditional PLM-based methods focus on how to encode
the schema (e.g., RATSQL (Wang et al., 2019)). Current LLM-based methods employ carefully designed prompts
and subtask flows to simplify and understand the task, functioning in an agent-like manner and using many tokens
in the prompt (e.g., DIN-SQL (Pourreza and Rafiei, 2024a)). We treat text-to-SQL as a reasoning-driven process.
By leveraging the LLM’s existing reasoning capabilities, we iteratively bootstrap its ability to generate high-quality
rationales. In addition, by allocating more test-time computation, we further improve the reliability of the process.
This approach offers several key advantages:

• Robustness for Complex Queries: A step-by-step chain-of-thought reasoning method enables the model to systematically break down complex queries, handle intricate database schemas more effectively, and produce more accurate results.

• Scalability through Reasoning: By allocating additional computational resources at inference time, reasoning performance can be improved. Techniques such as best-of-N sampling (Nakano et al., 2021; Askell et al., 2021; Cobbe et al., 2021) can further boost accuracy.

• Enhanced Transparency: Step-by-step rationales provide outputs that are more interpretable and verifiable compared to traditional end-to-end generation approaches.

Therefore, we introduce the Self-Taught Reasoner for text-to-SQL (STaR-SQL), a scalable bootstrapping method that enables LLMs to learn to generate high-quality rationales for text-to-SQL. Specifically, we employ few-shot prompting to have an LLM self-generate rationales and then refine its capabilities by fine-tuning on rationales that yield correct answers. To further improve performance on complex queries, we provide the correct answer to the model to guide the generation of useful rationales. These rationales are incorporated into the training data, allowing the model to learn to solve increasingly challenging queries. We repeat this procedure, using the improved model to generate subsequent training sets. Recently, some works have shown that LLMs can leverage additional test-time computation to improve their outputs (Snell et al., 2024; Brown et al., 2024; He et al., 2024). In our experiments, we introduce a verification mechanism that ensures result accuracy by employing an Outcome-supervised Reward Model (ORM) (Cobbe et al., 2021; Yu et al., 2023a), a straightforward yet effective verifier that demonstrably improves overall performance.

We demonstrate the effectiveness of our method on the challenging cross-domain benchmark Spider. Using the two official evaluation metrics (execution accuracy and exact set match accuracy (Zhong et al., 2020)), our method achieves an execution accuracy of 86.6%, outperforming both a few-shot baseline (+31.6%) and a baseline fine-tuned to predict answers directly (+18.0%). It even surpasses prompting methods (Pourreza and Rafiei, 2024a; Gao et al., 2023) that rely on more powerful closed-source models such as GPT-4, setting a new standard for reasoning-driven text-to-SQL approaches.

2 Related Work

2.1 Text-to-SQL

Text-to-SQL (Cai et al., 2017; Zelle and Mooney, 1996; Xu et al., 2017; Yu et al., 2018a; Yaghmazadeh et al., 2017), which aims to convert natural language instructions or questions into SQL queries, has drawn significant attention. Since the work of Dong and Lapata (2016), leading text-to-SQL models have adopted attention-based sequence-to-sequence architectures to translate questions and schemas into well-formed SQL queries. These models have increasingly benefited from pre-trained transformer architectures, ranging from BERT (Hwang et al., 2019; Lin et al., 2020) to larger language models such as T5 (Raffel et al., 2020) in Scholak et al. (2021), OpenAI CodeX (Chen et al., 2021), and GPT variants (Rajkumar et al., 2022; Liu and Tan, 2023; Pourreza and Rafiei, 2024a).
Along with using pre-trained models, various task-specific enhancements have been introduced, including improved schema encoding via more effective representation learning (Bogin et al., 2019) and fine-tuned attention mechanisms for sequence-to-sequence models (Wang et al., 2019). On the decoding side, some methods incorporate the syntactic structure of SQL (Hwang et al., 2019; Xu et al., 2017; Hui et al., 2021).

Recent advances in LLMs have also extended their multi-task capabilities to text-to-SQL. In zero-shot scenarios, a task-specific prompt is added before the schema and the question, guiding the LLM to generate an SQL query. Rajkumar et al. (2022); Liu et al. (2023a) showed that OpenAI CodeX can achieve 67% execution accuracy using this approach. Building on this, few-shot prompting strategies have been investigated. In particular, Pourreza and Rafiei (2024a); Liu and Tan (2023) proposed GPT-4-based DIN-SQL, which divides the problem into four subtasks (schema linking, classification, generation, and self-correction) and achieves strong performance on the Spider benchmark. However, Pourreza and Rafiei (2024a) also noted that DIN-SQL encounters difficulties when dealing with complex queries. In contrast to these approaches, our method reframes text-to-SQL as a reasoning task. By doing so, it leverages the inherent reasoning capabilities of LLMs to boost performance and facilitates the integration of additional reasoning techniques into text-to-SQL systems.

2.2 Multi-step Reasoning

Complex reasoning tasks have sparked extensive research in LLMs, which are crucial for handling challenging queries (Kaddour et al., 2023; Lightman et al., 2023; Huang et al., 2023). One prominent strategy is the Chain-of-Thought (CoT) prompting technique (Wei et al., 2022), along with its variants (Kojima et al., 2022; Wang et al., 2022; Yao et al., 2024), which decompose the reasoning process into sequential steps and systematically approach problem-solving in a human-like manner. To further enhance the accuracy of these intermediate steps, recent studies leverage extensive synthetic datasets, which are either distilled from cutting-edge models (Yu et al., 2023b; Luo et al., 2023) or composed of self-generated rationales (Zelikman et al., 2022; Yuan et al., 2023; Ni et al., 2022), to fine-tune the LLMs. Such a training strategy effectively sharpens the models' ability to produce correct chain-of-thought reasoning.

Additionally, there is growing interest in test-time verification, which involves generating multiple candidate solutions and ranking them with a separate verifier (Cobbe et al., 2021; He et al., 2024) to select the most accurate one. For example, the DIVERSE framework (Li et al., 2022) employs a variety of CoT prompts together with a verifier to address reasoning challenges, while CoRe (Zhu et al., 2022) fine-tunes both the generator and verifier in a dual-process system, improving LLM performance on math word problems.

3 STaR-SQL

In this section, we introduce STaR-SQL, a method that evokes the intrinsic reasoning capabilities of LLMs to enhance performance on complex text-to-SQL tasks. We begin by describing the problem formulation (§ 3.1), followed by an explanation of how we generate step-by-step rationales (§ 3.2) for self-improvement. Finally, we outline our approach to verifier training and scaling up test-time compute to further enhance accuracy (§ 3.3). A schematic overview of the framework is provided in Figure 2.
3.1 Problem Formulations

The text-to-SQL task involves mapping a question Q = (q_1, ..., q_m) and a database schema S = (table_1(col_1^1 ... col_{c_1}^1), ..., table_T(col_1^T ... col_{c_T}^T)) to a valid SQL query Y = (y_1, ..., y_n). Performance is typically evaluated using two metrics: 1) exact match, which compares the predicted query to the golden query in terms of overall structure and within each field token by token, and 2) execution match, which checks whether the prediction produces the same results as the golden query when executed on the database.
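To make the notation concrete, one training instance (Q, S, Y) might be represented as follows; the question, schema, and query are illustrative placeholders in the style of Spider, not examples taken from the paper.

```python
# Illustrative only: a single text-to-SQL instance (Q, S, Y).
# Table names, columns, and the query below are hypothetical.
example = {
    # Question Q = (q_1, ..., q_m)
    "question": "How many singers are from France?",
    # Schema S = (table_1(col_1 ... ), ..., table_T(col_1 ... )),
    # serialized here as CREATE TABLE statements
    "schema": (
        "CREATE TABLE singer (singer_id INT, name TEXT, country TEXT, age INT);\n"
        "CREATE TABLE concert (concert_id INT, venue TEXT, year INT);"
    ),
    # Golden SQL query Y = (y_1, ..., y_n)
    "sql": "SELECT count(*) FROM singer WHERE country = 'France';",
}
```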
3.2 Self-Taught Reasoner

Self-Taught Reasoner (STaR; Zelikman et al. (2022)) is an iterative approach in which a language model improves itself using correctness feedback. We begin with a pre-trained LLM π_θ as a generator and an initial text-to-SQL dataset D = {(Q_i, S_i, Y_i)}_{i=1}^{|D|}, where each instance comprises a question Q_i, a database schema S_i, and a corresponding golden SQL query Y_i. Our method also assumes a small prompt set P of examples with intermediate rationales R: P = {(Q_i^p, S_i^p, R_i^p, Y_i^p)}_{i=1}^{|P|}, where |P| ≪ |D| (for instance, |P| = 3).
Figure 2: An overview of the STaR-SQL framework. It consists of three main steps: step-by-step rationale generation
for self-improvement, verifier training, and test-time verification. We transform text-to-SQL into a reasoning task
and further explore scaling up test-time computation by incorporating a verifier and employing best-of-N sampling.
Following the standard few-shot prompting procedure, we concatenate this prompt set P to each example in D, then sample k rationales followed by an answer from the generator:

$$\{(R_i^j, \hat{Y}_i^j) \sim \pi_\theta(R, \hat{Y} \mid P, Q_i, S_i)\}_{j=1}^{k}.$$

Having access to golden SQL queries Y_i, we can assign a binary correctness label to each generated query Ŷ_i^j using the indicator 1[Ŷ = Y]. A rationale is labeled as correct if its final query Ŷ matches the golden query Y. Intuitively, correct queries should stem from higher-quality rationales, so we only retain those correct rationales. However, under these conditions, models tend to over-sample solutions for simpler queries while under-sampling solutions for more complex queries, a phenomenon known as tail narrowing (Ding et al., 2024). This results in a training set for the next iteration dominated by rationales for simpler problems, with limited coverage of more challenging queries, thereby introducing sampling bias.

To address this issue, we employ a straightforward difficulty-based resampling strategy, which has proven sufficiently effective in practice. Specifically, for each question, we resample L times, where L is the number of incorrect initial responses for that question. To improve accuracy, we provide the golden SQL query as a hint to the model and ask it to generate rationales in the same style as during the previous rationale-generation step. Given the golden SQL query, the model can more easily reason backwards to produce a rationale that yields the correct answer. For correct initial responses, we directly add them to the training set.
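The collection procedure described above can be sketched as follows. This is a minimal reading of the method, assuming simple helper functions (`generate`, `execution_match`, `format_example`, `hint`) that are not part of the paper.

```python
def collect_iteration(dataset, generate, execution_match, prompt_set, k=8):
    """One STaR-SQL data-collection pass (sketch, not the authors' code).

    generate(prompt, n):        assumed helper sampling n (rationale, sql) dicts
                                from the current generator pi_theta.
    execution_match(p, g, db):  assumed helper returning True when the predicted
                                and golden queries yield the same execution result.
    format_example / hint:      assumed prompt-formatting helpers.
    """
    sft_data, verifier_data = [], []
    for ex in dataset:
        prompt = prompt_set + format_example(ex)          # few-shot prompt P + (Q_i, S_i)
        samples = generate(prompt, n=k)                   # k sampled rationales + SQL
        labels = [execution_match(s["sql"], ex["gold_sql"], ex["db"]) for s in samples]

        # Every sample, correct or not, is kept with its label for verifier training.
        verifier_data += [dict(ex, rationale=s["rationale"], sql=s["sql"], label=lab)
                          for s, lab in zip(samples, labels)]
        # Only correct rationales enter the SFT set directly.
        sft_data += [dict(ex, rationale=s["rationale"], sql=s["sql"])
                     for s, lab in zip(samples, labels) if lab]

        # Difficulty-based resampling: one retry per incorrect initial response,
        # this time revealing the golden SQL as a hint (reasoning backwards).
        num_wrong = labels.count(False)
        for _ in range(num_wrong):
            hinted = generate(prompt + hint(ex["gold_sql"]), n=1)[0]
            if execution_match(hinted["sql"], ex["gold_sql"], ex["db"]):
                sft_data.append(dict(ex, rationale=hinted["rationale"],
                                     sql=hinted["sql"]))
    return sft_data, verifier_data
```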
We then form a new dataset, D_SFT, and perform supervised fine-tuning (SFT) of the generator π_θ using the negative log-likelihood objective:

$$\mathcal{L}_{\mathrm{SFT}} = -\,\mathbb{E}_{(X,R,Y)\sim \mathcal{D}_{\mathrm{SFT}}} \sum_{i=1}^{|R|+|Y|} \log \pi_\theta\big(t_i \mid t_{<i}, X\big) \quad (1)$$

where X is the concatenation of the question Q and the schema S, i.e., X = (Q, S).
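In practice, Equation (1) corresponds to the standard causal-LM loss with the prompt tokens masked out, so that only the rationale and SQL tokens contribute. The sketch below assumes a Hugging Face-style causal LM and tokenizer; it illustrates the objective rather than reproducing the authors' training code.

```python
import torch
import torch.nn.functional as F

def sft_loss(model, tokenizer, prompt_text, target_text, device="cuda"):
    """Negative log-likelihood of the rationale+SQL tokens given X (sketch of Eq. 1)."""
    prompt_ids = tokenizer(prompt_text, return_tensors="pt").input_ids.to(device)
    target_ids = tokenizer(target_text, return_tensors="pt",
                           add_special_tokens=False).input_ids.to(device)
    input_ids = torch.cat([prompt_ids, target_ids], dim=1)

    labels = input_ids.clone()
    labels[:, : prompt_ids.size(1)] = -100        # ignore the prompt X = (Q, S)

    logits = model(input_ids).logits[:, :-1, :]   # position i predicts token t_{i+1}
    shifted_labels = labels[:, 1:]
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        shifted_labels.reshape(-1),
        ignore_index=-100,
    )
```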
The newly fine-tuned generator is used in subsequent iterations. Once we collect a new dataset, we always return to the original pre-trained model π_θ for re-initialization (as opposed to continually fine-tuning the same model) to mitigate overfitting. This process is repeated until performance plateaus.

3.3 Test-time verification

Previous self-improvement methods such as RFT (Yuan et al., 2023), STaR, and ReST (Gulcehre et al., 2023) typically discard incorrect model-generated solutions. However, even incorrect solutions can contain useful information: a language model may learn from the discrepancies between correct and incorrect solutions, identifying common error patterns and thereby improving its overall accuracy. In this work, we propose utilizing both correct and incorrect solutions in the iterative process to train a verifier. Following Cobbe et al. (2021), we introduce a verifier, also known as an outcome-supervised reward model (ORM). An ORM estimates the probability that a candidate rationale T is correct for a given problem. It is built upon an LLM with an additional randomly initialized linear layer that outputs a scalar value. The ORM is trained with a binary classification loss:

$$\mathcal{L}_{\mathrm{ORM}} = -\big[A_T \log r_T + (1 - A_T)\log(1 - r_T)\big] \quad (2)$$

where A_T is the correctness label (A_T = 1 if T is correct, otherwise A_T = 0), and r_T is the ORM's sigmoid output. In our context, A_T is defined by the execution match label, i.e., whether the generated SQL query matches the golden query when executed. Since each generated rationale is labeled during every iteration, these labeled pairs form an ideal training set D_VER for the verifier.
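A minimal sketch of such an ORM, assuming a Hugging Face-style backbone that exposes hidden states; the class and variable names are ours, not the paper's:

```python
import torch
import torch.nn as nn

class OutcomeRewardModel(nn.Module):
    """Sketch of an ORM: LLM backbone plus a randomly initialized scalar head (Eq. 2)."""

    def __init__(self, backbone, hidden_size):
        super().__init__()
        self.backbone = backbone                      # pre-trained LLM
        self.value_head = nn.Linear(hidden_size, 1)   # randomly initialized scalar head

    def forward(self, input_ids, attention_mask):
        hidden = self.backbone(
            input_ids, attention_mask=attention_mask, output_hidden_states=True
        ).hidden_states[-1]                           # (batch, seq, hidden)
        last_token = attention_mask.sum(dim=1) - 1    # index of final non-pad token
        pooled = hidden[torch.arange(hidden.size(0)), last_token]
        return torch.sigmoid(self.value_head(pooled)).squeeze(-1)   # r_T in [0, 1]

# Binary classification loss from Eq. (2); A_T = 1 for execution-matched rationales.
bce = nn.BCELoss()
# loss = bce(orm(input_ids, attention_mask), labels.float())
```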
We further scale up test-time compute through a best-of-N sampling strategy (Nakano et al., 2021; Askell et al., 2021; Cobbe et al., 2021), which improves the reliability of the final answer. Specifically, at test time, the language model generates N candidate solutions in parallel, and the one with the highest verifier score is chosen as the final output.
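The selection rule itself is simple; `sample_candidates` and `orm_score` below stand in for the generator and verifier interfaces and are assumptions, not the paper's API:

```python
def best_of_n(question, schema, sample_candidates, orm_score, n=16):
    """Sample n candidate (rationale, SQL) pairs and keep the one the verifier scores highest."""
    candidates = sample_candidates(question, schema, n=n)          # n parallel samples
    return max(candidates, key=lambda c: orm_score(question, schema, c))
```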
4 Experiments

4.1 Experimental Setup

Datasets Several large text-to-SQL datasets have been created, some with single schemas (Wang et al., 2019) or with simple queries (Zhong et al., 2017). Notably, the Spider dataset (Yu et al., 2018b) consists of 10,181 questions and 5,693 unique complex SQL queries across 200 databases, covering 138 domains, each containing multiple tables. The standard protocol for this dataset divides it into 8,659 training examples across 146 databases and 1,034 development examples across 20 databases, with non-overlapping databases in each set. SQL queries are categorized into four difficulty levels, based on the number of SQL keywords used, the presence of nested subqueries, and the usage of column selections and aggregations. The dataset is used to assess the generalization capabilities of text-to-SQL models on complex queries with unseen schemas. We focus on this dataset for our experiments, as it enables comparison with many previous methods.

Metrics The performance of our models is evaluated using the official metrics of Spider (Zhong et al., 2020): exact set match accuracy (EM) and execution accuracy (EX). The exact set match accuracy (EM) treats each clause as a set and compares the prediction for each clause to its corresponding clause in the reference query. A predicted SQL query is considered correct only if all of its components match the ground truth. This metric does not take values into account. The execution accuracy (EX) compares the execution output of the predicted SQL query with that of the ground truth SQL query on some database instances. Execution accuracy provides a more precise estimate of the model's performance, since there may be multiple valid SQL queries for a given question, and exact set match accuracy only evaluates the predicted SQL against one of them.
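As a rough illustration of the EX metric (the official evaluation uses Spider's scripts and distilled test suites; the snippet below is a simplified sketch over a single SQLite database):

```python
import sqlite3
from collections import Counter

def execution_match(pred_sql, gold_sql, db_path):
    """Simplified EX check: do both queries return the same multiset of rows?"""
    with sqlite3.connect(db_path) as conn:
        try:
            pred_rows = conn.execute(pred_sql).fetchall()
        except sqlite3.Error:
            return False                      # un-executable predictions count as wrong
        gold_rows = conn.execute(gold_sql).fetchall()
    # Compare as multisets so that row order does not matter
    return Counter(pred_rows) == Counter(gold_rows)
```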
Parameter Setting We used Llama-3.1-8B-Instruct as our base language model. This open-source model demonstrates non-trivial performance on the text-to-SQL task while leaving room for further improvements, making it an ideal testbed for our study. To construct the training dataset, we selected 7,000 problems from the Spider training set and sampled 8 solutions for each problem. We then filtered the correct solutions to train the generator and used the entire dataset to train the verifier. We ran STaR-SQL until performance plateaued and report the best results observed.
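Putting the pieces together, the outer self-improvement loop under this configuration looks roughly as follows. All names here (`load_pretrained`, `spider_train`, `PROMPTS`, `finetune`, `dev_accuracy`, `generator.sample`) are placeholders tied to the sketches above and to standard training utilities, not the authors' code.

```python
base_model = load_pretrained("Llama-3.1-8B-Instruct")      # assumed loader
problems = spider_train[:7000]                              # 7,000 Spider training problems

generator, verifier_data, best_dev = base_model, [], 0.0
while True:
    sft_data, iter_data = collect_iteration(
        problems, generate=generator.sample,                # assumed sampling interface
        execution_match=execution_match, prompt_set=PROMPTS, k=8)
    verifier_data += iter_data                              # reused later to train the ORM
    generator = finetune(base_model, sft_data)              # always restart from the base model
    acc = dev_accuracy(generator)                           # assumed Spider dev evaluation
    if acc <= best_dev:                                     # stop once performance plateaus
        break
    best_dev = acc
```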
Baselines We conducted a comparative evaluation against several well-established methods, including traditional pre-trained transformer-based models (PLM-based) that directly predict SQL or intermediary representations. For LLM-based methods, we compared STaR-SQL with several notable prompt-engineering approaches utilizing strong closed-source LLMs, with particular emphasis on DAIL-SQL (Gao et al., 2023), which is currently the SOTA approach of this kind. We also compared our method with fine-tuned specialized code LLMs, such as CodeS (Li et al., 2024), DTS-SQL (Pourreza and Rafiei, 2024b), and ROUTE (Qin et al., 2024). Regarding training data generation, we considered Question Decomposition (QD) (Eyal et al., 2023) as a baseline. In this approach, the model is instructed to first produce a custom intermediary language, QPL, which is then translated into the rationale. To assess data quality, we compared a model trained on QD-generated data with our own approach. Finally, we included an LLM fine-tuned to predict answers directly, without revealing its reasoning steps, to demonstrate the importance of incorporating a reasoning process.

4.2 Main Results

Most of our evaluation during development was conducted on the Spider development set, which was easily accessible, unlike the test set, which was only accessible through the evaluation server provided by Yu et al. (2018b).
| Classification | Method | Model | EX | EM |
| --- | --- | --- | --- | --- |
| PLM-based | NatSQL (Gan et al., 2021) | RAT-SQL (Wang et al., 2019) | 73.7 | - |
| | QPL (Eyal et al., 2023) | Flan-T5-XL (Chung et al., 2024) | 77.4 | - |
| | Graphix-T5 (Li et al., 2023) | Graphix-T5 | 78.2 | 75.6 |
| Prompting with LLMs | Few-shot | Llama-3.1-8B-Instruct | 55.0 | 34.2 |
| | | Qwen2.5-7B (Yang et al., 2024a) | 72.5 | - |
| | | CodeX Cushman | 43.1 | 30.9 |
| | | CodeX Davinci | 61.5 | 50.2 |
| | | GPT-4 | 67.4 | 54.3 |
| | DIN-SQL (Pourreza and Rafiei, 2024a) | Llama-3.1-8B-Instruct | 45.2 | 26.5 |
| | | GPT-4 | 74.2 | 60.1 |
| | MAC-SQL (Wang et al., 2023) | Llama-3-8B | 64.3 | - |
| | | Qwen2.5-7B | 71.7 | - |
| | MCP (Qin et al., 2024) | Llama-3-8B | 75.0 | - |
| | | Qwen2.5-7B | 78.3 | - |
| | DAIL-SQL (Gao et al., 2023) | GPT-3.5-Turbo | 77.8 | 63.9 |
| | | GPT-4 | 81.7 | 69.1 |
| Fine-tuning with open-source LLMs | Predict SQL-only | Llama-3.1-8B-Instruct | 68.6 | 57.9 |
| | QD (Eyal et al., 2023) | Llama-3.1-8B-Instruct | 64.5 | 54.3 |
| | CodeS (Li et al., 2024) | StarCoder | 69.8 | - |
| | DTS-SQL (Pourreza and Rafiei, 2024b) | Mistral-7B | 77.1 | 69.3 |
| | SENSE-7B (Yang et al., 2024b) | CodeLlama-7B | 83.2 | - |
| | ROUTE (Qin et al., 2024) | Qwen2.5-7B | 83.6 | - |
| | STaR-SQL | Llama-3.1-8B-Instruct | 75.0 | 64.9 |
| | STaR-SQL ORM@16 | Llama-3.1-8B-Instruct | 86.6 | 72.5 |

Table 1: Execution accuracy (EX) and exact set match accuracy (EM), both in %, on the dev set of Spider.
As shown in Table 1, our proposed method significantly enhances the original performance of Llama-3.1-8B-Instruct, improving its accuracy from 55.0% to 75.0% (+20.0%). Although small open-source models cannot directly apply reasoning to the text-to-SQL task and thus perform poorly, they demonstrate the potential to employ reasoning abilities when trained on correct rationales. Our approach also outperforms naive few-shot prompting methods, showing that it is crucial for LLMs to be familiar with the reasoning patterns required for this task: STaR-SQL surpasses few-shot prompting with stronger closed-source LLMs like GPT-4 by a large margin (+7.6%), and it is comparable to advanced prompt engineering techniques and specialized code LLMs like CodeS and DTS-SQL. Notably, it even outperforms DIN-SQL, which relies on extensive compute to simplify schemas and refine the output. Compared to predicting only the final SQL, our results demonstrate the necessity of integrating the reasoning process during inference, as this improves accuracy by an additional 6.4%.

When we scale up test-time compute, the benefit of reframing the text-to-SQL task as a reasoning process becomes even more evident. By sampling 16 solutions for each problem and applying the ORM for selection, our approach significantly surpasses other PLM-based and LLM-based methods in terms of both execution accuracy and exact set match. For example, it achieves the highest execution accuracy of 86.6%, outperforming DAIL-SQL (the best GPT-4 prompting method) by 4.9% and the previous state-of-the-art ROUTE by 3.0%. Furthermore, training the ORM does not require additional data because it is derived entirely from STaR-SQL's iterative training process. As a result, this method is both data-efficient and straightforward, leveraging both correct and incorrect solutions from an iteratively trained generator to build a robust verifier. These results highlight STaR-SQL's strong performance and scalability when increasing test-time compute.

We attribute these improvements to the following factors: 1) Reasoning Integration: Beyond leveraging the large language model's understanding capability, we also utilize its reasoning ability during inference. This transforms the model from a mere "agent" into a "reasoner," enabling it to handle complex query problems more effectively. 2) Expanded Test-Time Computation: We scale up test-time computation, which complements our approach of reframing text-to-SQL as a reasoning task.
Figure 3: Execution accuracy comparison across different query difficulty levels on the Spider development set. (Compared methods: NatSQL + RAT-SQL, QPL, Few-shot GPT-4, DIN-SQL GPT-4, ROUTE, and STaR-SQL; y-axis: EX accuracy in %.)
Binyuan Hui, Xiang Shi, Ruiying Geng, Binhua Li, Yongbin Li, Jian Sun, and Xiaodan Zhu. 2021. Improving text-to-sql with schema dependency learning. arXiv preprint arXiv:2103.04399.

Wonseok Hwang, Jinyeong Yim, Seunghyun Park, and Minjoon Seo. 2019. A comprehensive exploration on wikisql with table-aware word contextualization. arXiv preprint arXiv:1902.01069.

Jean Kaddour, Joshua Harris, Maximilian Mozes, Herbie Bradley, Roberta Raileanu, and Robert McHardy. 2023. Challenges and applications of large language models. arXiv preprint arXiv:2307.10169.

Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. 2022. Large language models are zero-shot reasoners. Advances in Neural Information Processing Systems, 35:22199–22213.

Haoyang Li, Jing Zhang, Hanbing Liu, Ju Fan, Xiaokang Zhang, Jun Zhu, Renjie Wei, Hongyan Pan, Cuiping Li, and Hong Chen. 2024. Codes: Towards building open-source language models for text-to-sql. Proceedings of the ACM on Management of Data, 2(3):1–28.

Jinyang Li, Binyuan Hui, Reynold Cheng, Bowen Qin, Chenhao Ma, Nan Huo, Fei Huang, Wenyu Du, Luo Si, and Yongbin Li. 2023. Graphix-t5: Mixing pre-trained transformers with graph-aware layers for text-to-sql parsing. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pages 13076–13084.

Yifei Li, Zeqi Lin, Shizhuo Zhang, Qiang Fu, B. Chen, Jian-Guang Lou, and Weizhu Chen. 2022. Making language models better reasoners with step-aware verifier. In Annual Meeting of the Association for Computational Linguistics.

Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. 2023. Let's verify step by step. arXiv preprint arXiv:2305.20050.

Xi Victoria Lin, Richard Socher, and Caiming Xiong. 2020. Bridging textual and tabular data for cross-domain text-to-sql semantic parsing. arXiv preprint arXiv:2012.12627.

Haipeng Luo, Qingfeng Sun, Can Xu, Pu Zhao, Jianguang Lou, Chongyang Tao, Xiubo Geng, Qingwei Lin, Shifeng Chen, and Dongmei Zhang. 2023. Wizardmath: Empowering mathematical reasoning for large language models via reinforced evol-instruct. arXiv preprint arXiv:2308.09583.

Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu, Long Ouyang, Christina Kim, Christopher Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders, et al. 2021. Webgpt: Browser-assisted question-answering with human feedback. arXiv preprint arXiv:2112.09332.

Ansong Ni, Jeevana Priya Inala, Chenglong Wang, Oleksandr Polozov, Christopher Meek, Dragomir Radev, and Jianfeng Gao. 2022. Learning math reasoning from self-sampled correct and partially-correct solutions. arXiv preprint arXiv:2205.14318.

Mohammadreza Pourreza and Davood Rafiei. 2024a. Din-sql: Decomposed in-context learning of text-to-sql with self-correction. Advances in Neural Information Processing Systems, 36.

Mohammadreza Pourreza and Davood Rafiei. 2024b. Dts-sql: Decomposed text-to-sql with small large language models. arXiv preprint arXiv:2402.01117.

Yang Qin, Chao Chen, Zhihang Fu, Ze Chen, Dezhong Peng, Peng Hu, and Jieping Ye. 2024. Route: Robust multitask tuning and collaboration for text-to-sql. arXiv preprint arXiv:2412.10138.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67.

Nitarshan Rajkumar, Raymond Li, and Dzmitry Bahdanau. 2022. Evaluating the text-to-sql capabilities of large language models. arXiv preprint arXiv:2204.00498.

Torsten Scholak, Nathan Schucher, and Dzmitry Bahdanau. 2021. Picard: Parsing incrementally for constrained auto-regressive decoding from language models. arXiv preprint arXiv:2109.05093.

Charlie Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. 2024. Scaling llm test-time compute optimally can be more effective than scaling model parameters. arXiv preprint arXiv:2408.03314.

Bailin Wang, Richard Shin, Xiaodong Liu, Oleksandr Polozov, and Matthew Richardson. 2019. Rat-sql: Relation-aware schema encoding and linking for text-to-sql parsers. arXiv preprint arXiv:1911.04942.

Bing Wang, Changyu Ren, Jian Yang, Xinnian Liang, Jiaqi Bai, Qian-Wen Zhang, Zhao Yan, and Zhoujun Li. 2023. Mac-sql: Multi-agent collaboration for text-to-sql. arXiv preprint arXiv:2312.11242.

Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. 2022. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171.

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. 2022. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837.

Yangzhen Wu, Zhiqing Sun, Shanda Li, Sean Welleck, and Yiming Yang. 2024. Inference scaling laws: An empirical analysis of compute-optimal inference for problem-solving with language models. arXiv preprint arXiv:2408.00724.

Xiaojun Xu, Chang Liu, and Dawn Song. 2017. Sqlnet: Generating structured queries from natural language without reinforcement learning. arXiv preprint arXiv:1711.04436.

Navid Yaghmazadeh, Yuepeng Wang, Isil Dillig, and Thomas Dillig. 2017. Sqlizer: Query synthesis from natural language. Proceedings of the ACM on Programming Languages, 1(OOPSLA):1–26.

An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, et al. 2024a. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115.

Longhui Yu et al. 2023. Metamath: Bootstrap your own mathematical questions for large language models. arXiv preprint arXiv:2309.12284.

Tao Yu, Zifan Li, Zilin Zhang, Rui Zhang, and Dragomir Radev. 2018a. Typesql: Knowledge-based type-aware neural text-to-sql generation. arXiv preprint arXiv:1804.09769.

Tao Yu, Rui Zhang, Kai Yang, Michihiro Yasunaga, Dongxu Wang, Zifan Li, James Ma, Irene Li, Qingning Yao, Shanelle Roman, et al. 2018b. Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-sql task. arXiv preprint arXiv:1809.08887.

Zheng Yuan, Hongyi Yuan, Chengpeng Li, Guanting Dong, Chuanqi Tan, and Chang Zhou. 2023. Scaling relationship on learning mathematical reasoning with large language models. arXiv preprint arXiv:2308.01825.

Eric Zelikman, Yuhuai Wu, Jesse Mu, and Noah Goodman. 2022. Star: Bootstrapping reasoning with reasoning. Advances in Neural Information Processing Systems, 35:15476–15488.

John M Zelle and Raymond J Mooney. 1996. Learning to parse database queries using inductive logic programming. In Proceedings of the National Conference on Artificial Intelligence, pages 1050–1055.

Ruiqi Zhong, Tao Yu, and Dan Klein. 2020. Semantic evaluation for text-to-sql with distilled test suites. arXiv preprint arXiv:2010.02840.

Victor Zhong, Caiming Xiong, and Richard Socher. 2017. Seq2sql: Generating structured queries from natural language using reinforcement learning. arXiv preprint arXiv:1709.00103.

Xinyu Zhu, Junjie Wang, Lin Zhang, Yuxiang Zhang, Yongfeng Huang, Ruyi Gan, Jiaxing Zhang, and Yujiu Yang. 2022. Solving math word problems via cooperative reasoning induced language models. arXiv preprint arXiv:2210.16257.