
Solid-SQL: Enhanced Schema-linking based In-context Learning for Robust Text-to-SQL

Geling Liu1*†, Yunzhi Tan2, Ruichao Zhong2, Yuanzhen Xie2, Lingchen Zhao1†, Qian Wang1†, Bo Hu2*, Zang Li2
1 School of Cyber Science and Engineering, Wuhan University, China
2 Big Data and AI Platform Department, Tencent, China
{liugl-, lczhaocs, qianwang}@whu.edu.cn
{boristan, answerzhong, ashexie, harryyfhu, gavinzli}@tencent.com

arXiv:2412.12522v1 [cs.CL] 17 Dec 2024

Abstract

Recently, large language models (LLMs) have significantly improved the performance of text-to-SQL systems. Nevertheless, many state-of-the-art (SOTA) approaches have overlooked the critical aspect of system robustness. Our experiments reveal that while LLM-driven methods excel on standard datasets, their accuracy is notably compromised when faced with adversarial perturbations. To address this challenge, we propose a robust text-to-SQL solution, called Solid-SQL, designed to integrate with various LLMs. We focus on the pre-processing stage, training a robust schema-linking model enhanced by LLM-based data augmentation. Additionally, we design a two-round, structural similarity-based example retrieval strategy for in-context learning. Our method achieves SOTA SQL execution accuracy levels of 82.1% and 58.9% on the general Spider and Bird benchmarks, respectively. Furthermore, experimental results show that Solid-SQL delivers an average improvement of 11.6% compared to baselines on the perturbed Spider-Syn, Spider-Realistic, and Dr. Spider benchmarks.

[Figure: the question "What are all distinct countries where singers above age 20 are from?" and the database schema pass through pre-processing to build a prompt ("Your task is to generate SQL statements based on the given question and database schema ..."); SQL-generation produces "SELECT DISTINCT Country FROM Singer WHERE Age ≥ 20;"; post-processing refines it to "SELECT DISTINCT Country FROM Singer WHERE Age > 20;".]
Figure 1: The general three-stage pipeline of LLM-based text-to-SQL systems.

1 Introduction

Text-to-SQL serves as an automated tool that facilitates the transformation of natural language into structured query language (SQL) commands, enabling individuals without specialized knowledge and skills to write SQL and query databases (Baig et al., 2022). Traditional text-to-SQL techniques have relied on rigid syntax tree templates (Xu et al., 2017; Guo et al., 2019; Wang et al., 2020) or supervised fine-tuning of sequence-to-sequence models (Xie et al., 2022; Scholak et al., 2021) to execute the transition from text to SQL. However, the recent past has witnessed a surge in the application of LLMs to text-to-SQL operations, proving their efficacy (Gao et al., 2024; Pourreza and Rafiei, 2023). State-of-the-art approaches that top text-to-SQL leaderboards, such as Spider (Yu et al., 2018) and BIRD (Li et al., 2023), leverage advanced Large Language Models (LLMs) like GPT-4 (Achiam et al., 2023) for SQL generation.

Considering the role of text-to-SQL in sensitive domains such as finance and healthcare, where system reliability and security are of critical importance, the robustness of text-to-SQL systems is essential; it has, however, not received adequate consideration in LLM-based text-to-SQL systems. A robust text-to-SQL system should maintain the correct SQL output when faced with adversarial perturbations in the text or database (Pi et al., 2022), such as changes in sentence structure, synonym descriptions, etc. Experimental results reveal that leading LLM-based methods (Dong et al., 2023; Wang et al., 2024; Li et al., 2024a) perform poorly on benchmarks that aim at testing text-to-SQL robustness, like Spider-Syn (Gan et al., 2021), Spider-Realistic (Deng et al., 2021), and Dr. Spider (Chang et al., 2023).

* Corresponding author.
† Key Laboratory of Aerospace Information Security and Trusted Computing, Ministry of Education.
However, efforts to enhance text-to-SQL robustness (Fürst et al., 2024; Shen et al., 2023; Zhuo et al., 2023) continue to be centered around traditional sequence-to-sequence architectures, with limited exploration of LLM-based alternatives (Zhuo et al., 2023).

As illustrated in Figure 1, an LLM-based text-to-SQL system comprises three distinct stages: pre-processing, SQL-generation, and post-processing. These stages are tasked with parsing the input question to synthesize effective prompts, querying the LLM to produce SQL statements, and refining the generated SQL, respectively. In the SQL-generation stage, to achieve good results, the employed LLMs typically have a very large size or are closed-source, making them difficult to fine-tune for robustness. The post-processing phase entails refining the already generated SQL, with these refinements being independent of disturbances on the input side. The pre-processing stage, in contrast, deals with disruptions originating from the textual and tabular inputs. Therefore, handling perturbations in the pre-processing stage is crucial for enhancing the robustness of LLM-based text-to-SQL systems. Specifically, how to process the text and schema to obtain a prompt that stabilizes performance in the SQL-generation stage is the problem we need to solve.

In this paper, to address the aforementioned issues, we design a robust pre-processing pipeline, called Solid-SQL, to generate prompts for SQL-generation. We craft the necessary pre-processing steps based on the components required for the prompt. A SQL statement consists of two components: first, the syntactic framework that determines the structure and logic of the statement; and second, the database schema, which includes the specific names of the tables and columns being accessed. To guide from both aspects, we aim to include pre-selected schemas and SQL statement examples with similar structures within our prompt. For robust schema selection, we utilize LLMs to generate varied data for adversarial training, and we format training data specifically for the schema linking task to fine-tune a language model, addressing the lack of relevant datasets. To assist with in-context learning, we design effective methods for extracting text and SQL skeletons based on the chosen schemas, and we retrieve relevant SQL statement examples based on the similarity of these skeletons. When constructing the prompt, we incorporate explicit attention mechanisms to stabilize the output for inputs that have been perturbed.

Our contributions are summarized as follows:

• We address the existing gap in discussions on the robustness of LLM-based text-to-SQL systems. To address this, we propose Solid-SQL, a pre-processing pipeline designed to enhance the robustness of LLM-based text-to-SQL systems in generating SQL.

• We design several effective modules, including a robust schema-linking model, example retrieval methods, and an explicit attention mechanism. Moreover, we validate Solid-SQL's universality and applicability through its integration with various SQL-generation LLMs.

• We conduct extensive experiments, demonstrating that Solid-SQL achieves SOTA performance on general benchmarks and significantly outperforms existing solutions on robustness benchmarks. Additionally, the effectiveness of the modules is validated through ablation studies.

2 Related Work

2.1 LLM-based Text-to-SQL Techniques

In tandem with the rapid advancement and widespread adoption of large language models (LLMs) across various natural language processing (NLP) domains, the text-to-SQL field has also seen significant benefits from recent methodological breakthroughs involving LLMs, achieving notable performance milestones. LLMs demonstrate impressive zero-shot reasoning and domain generalization capabilities, contributing to unprecedented achievements on the cross-domain Spider leaderboard (Yu et al., 2018). For instance, C3 (Dong et al., 2023) is a zero-shot text-to-SQL methodology that enhances ChatGPT through three designs: Clear Prompting for effective input; Calibration with Hints to correct model biases; and Consistent Output to ensure query reliability. The Chain-of-Thought approach (Wei et al., 2022) has also been applied to text-to-SQL tasks. DIN-SQL (Pourreza and Rafiei, 2023) tackles the text-to-SQL task by decomposing it into four modules: schema linking, query classification and decomposition, SQL generation, and self-correction, each implemented using prompting techniques to leverage the granular capabilities of LLMs.
Some methods further explore the in-context learning ability of LLMs. DAIL-SQL (Gao et al., 2024) has revitalized the SOTA on Spider through a comprehensive examination of in-context learning, investigating the optimal selection of examples and their proper organization in prompts within a few-shot scenario. Other research has explored the selection of few-shot demonstrations by synthesizing in-domain examples (Chang and Fosler-Lussier, 2023) and retrieving question skeletons (Guo et al., 2023). Furthermore, MAC-SQL (Wang et al., 2024) and CHESS (Talaei et al., 2024) employ multi-agent collaboration for text-to-SQL tasks. In addition to maximally exploiting LLMs without modifying them, other options involve fine-tuning the model. Approaches in DAIL-SQL (Gao et al., 2024), DTS-SQL (Pourreza and Rafiei, 2024), and CodeS (Li et al., 2024b) aim to enhance the capabilities of open-source LLMs through supervised fine-tuning, striving to compete with or surpass their larger, proprietary counterparts.

Although these methods have achieved impressive results on the leaderboards, only a few of them have been evaluated for robustness (Li et al., 2024b; Gao et al., 2024). Moreover, according to our experiments, the performance of many in-context learning based methods on robustness benchmarks appears to be somewhat inferior compared to their performance on the Spider and BIRD (Li et al., 2023) leaderboards.

2.2 Adversarial Robustness

Despite the remarkable performance of neural networks across various domains, they continue to exhibit significant vulnerabilities when subjected to perturbations (Szegedy et al., 2014). This susceptibility is not only evident in traditional neural networks but has also been observed in systems such as text-to-SQL models (Shen et al., 2023), where adversarial inputs can lead to degraded performance. LLMs show potential as zero-shot text-to-SQL parsers, but their performance declines when faced with adversarial attacks and domain generalization disturbances, exhibiting varying levels of robustness in response to different types of perturbations (Zhang et al., 2023). It has been substantiated that removing explicitly stated column names (Deng et al., 2021) or replacing database schema-related content with synonyms (Gan et al., 2021) in the question will compromise the accessibility and accuracy of the generated SQL. Besides, confusion on the table side (e.g., substituting column descriptions or incorporating distracting columns within the table) will further undermine the precision of text-to-SQL systems (Pi et al., 2022). For a holistic robustness assessment, Dr. Spider (Chang et al., 2023) has been unveiled: a diagnostic benchmark encompassing 15,000 perturbed examples that cover a multitude of perturbation types from three perspectives: the database, natural language questions, and SQL.

To enhance the robustness of text-to-SQL systems, several strategies have been employed, including manually adding synonym annotations to the schema to provide a more precise description (Gan et al., 2021), generating adversarial examples for adversarial training of the sequence-to-sequence model (Pi et al., 2022; Gan et al., 2021), designing specialized training frameworks (Deng et al., 2021), and crafting innovative encoding strategies to transition from text to SQL (Shen et al., 2023). However, these methods either require significant manual effort, are not suitable for new domains and large databases, or can only be applied to traditional encoder-decoder frameworks rather than the currently popular LLMs with large-scale parameters. In contrast, our robustness strategy is compatible with LLMs and ensures that text-to-SQL systems utilizing it perform no worse than SOTA methods on conventional benchmarks, while markedly enhancing performance on robustness evaluation benchmarks.

3 Methodology

3.1 Problem Definition

Text-to-SQL is a task that generates a SQL statement for querying a database based on a natural language text that expresses a demand for some information about the database. It can be represented as S = M(Q, SC), where S is the generated SQL statement, M is the text-to-SQL system, Q is the natural language text, i.e., the question, and SC is the database schema.

Our robustness goal is to stabilize the output of the text-to-SQL system when faced with perturbations. Specifically, without affecting the fundamental goal of Q, adding perturbations to Q yields Q*. A robust LLM-based text-to-SQL strategy satisfies DB(M(Q, SC)) = DB(M(Q*, SC)), where DB denotes executing a query against the database.
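This execution-level criterion can be checked directly with a database engine. The sketch below illustrates the DB(M(Q, SC)) = DB(M(Q*, SC)) comparison on a toy SQLite database; the table, columns, and rows are invented for illustration and are not from the paper's benchmarks.

```python
import sqlite3

def db_result(conn: sqlite3.Connection, sql: str):
    """Execute a query and return its rows as an order-insensitive list."""
    return sorted(conn.execute(sql).fetchall())

def is_robust(conn: sqlite3.Connection, sql_clean: str, sql_perturbed: str) -> bool:
    """Check DB(M(Q, SC)) == DB(M(Q*, SC)): the SQL generated for the clean
    question and for the perturbed question must yield identical results."""
    return db_result(conn, sql_clean) == db_result(conn, sql_perturbed)

# Toy database in the spirit of Figure 1 (schema and rows invented for illustration).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE singer (name TEXT, country TEXT, age INT)")
conn.executemany("INSERT INTO singer VALUES (?, ?, ?)",
                 [("A", "US", 25), ("B", "UK", 19), ("C", "US", 31)])

print(is_robust(conn,
                "SELECT DISTINCT country FROM singer WHERE age > 20",
                "SELECT DISTINCT country FROM singer WHERE age > 20 ORDER BY country"))  # -> True
```

Comparing sorted result sets rather than SQL text is the same idea behind the Execution Accuracy metric used later in Section 4: two syntactically different queries count as equivalent if they return the same rows.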
[Figure 2: The pipeline of Solid-SQL. (A) Schema linking model training: an LLM generates variations of each training question (robust data enhancement), and the expanded question-SQL data is used to train the schema linking model. (B) Schema linking prediction: given an input question (e.g., "How many classes are professor whose last name is Graztevski has?") and the database schema, the model predicts the chosen tables (employee, class) and columns (employee.emp_num, class.prof_num, employee.emp_lname). (C) Skeleton library construction: each question and gold SQL query is parsed into skeletons, e.g., "How many [ ] are [ ] whose last name is [ ] has?" and "SELECT COUNT(*) FROM [ ] AS T1 JOIN [ ] AS T2 ON T1.[ ] = T2.[ ] WHERE T1.[ ] = [ ]". (D) Complete SQL generation pipeline: Round 1 combines the question, predicted schema, and question-skeleton-similarity based examples to produce SQL(round 1); Round 2 combines SQL(round 1), the predicted schema, and SQL-skeleton-similarity based examples to produce the final SQL.]

3.2 Overview

Solid-SQL is a novel plug-in solution for robust text-to-SQL, which uses a carefully designed pre-processing pipeline to extract the elements required for prompt composition. It includes a robust schema linking model, an effective example retrieval method, and an explicit attention mechanism. Figure 2 shows the pipeline of Solid-SQL.

Firstly, we employ an LLM to introduce minor perturbations into the text of the text-to-SQL training data while preserving its semantic integrity. In this way, we create an augmented dataset of clean and perturbed data. This augmented dataset is then used to fine-tune a language model for schema linking (Figure 2(A)). Subsequently, leveraging the outcomes of schema linking (Figure 2(B)), we ask an LLM to extract and remove domain-specific information and value information from the input text query to derive the query's skeleton (Figure 2(C)). This skeleton is matched against a pool of candidate skeletons to retrieve an appropriate number of relevant samples as examples based on similarity. The question, complete schema, and examples are combined into a prompt, with explicit emphasis on the filtered schema, to query the LLM for SQL generation in the first round (Figure 2(D), Round 1). Following this, the SQL generated in the first round is parsed to extract its backbone, and additional examples are retrieved for the second round of SQL generation, culminating in the final SQL output (Figure 2(D), Round 2).

3.3 Schema Linking

The schema linking task is a preliminary step of text-to-SQL that simplifies the generation of SQL queries. Its goal is to select the actual tables and columns to be accessed from the entire database schema based on a given question. For schema linking, since the input contains numerous SQL statements that define the database structure, it is challenging for a base LLM with general capabilities to understand such input and produce output in the expected format. Therefore, we fine-tune a model to complete the task.

3.3.1 Robust Data Enhancement

To improve the robustness of our schema linking model, we first expand the original text-to-SQL training dataset. The data in the training set takes the form of a triplet (Q, SC, S), where Q is the input text query, SC is the complete database schema, and S is the correct SQL query output. We use an LLM to rewrite Q, including changing the sentence structure and replacing synonyms (e.g., substituting 'singer' with 'musician'), resulting in new questions Q1 and Q2. Then, we add the new triplets (Q1, SC, S) and (Q2, SC, S) to the training dataset. This expanded training set introduces perturbations and adversarial examples, and a model trained with this augmented dataset can effectively improve its robustness.
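The augmentation step above can be sketched as follows. The paper calls an LLM to paraphrase each question; `rewrite_with_llm` below is a trivial synonym-table stub (an assumption for illustration, not the paper's prompt) so the example runs without an LLM, and it produces one rewrite per question instead of the paper's two.

```python
# Sketch of the robust data enhancement step: expand (Q, SC, S) triplets with
# perturbed questions while keeping the schema SC and gold SQL S unchanged.
SYNONYMS = {"singer": "musician", "nation": "country"}  # illustrative only

def rewrite_with_llm(question: str) -> str:
    # Placeholder for the real LLM paraphrase call (sentence restructuring,
    # synonym substitution).
    out = question
    for word, replacement in SYNONYMS.items():
        out = out.replace(word, replacement)
    return out

def augment(dataset):
    """Return the dataset plus one perturbed copy of each triplet whose
    question actually changed under rewriting."""
    augmented = list(dataset)
    for q, sc, s in dataset:
        q_rewritten = rewrite_with_llm(q)
        if q_rewritten != q:
            augmented.append((q_rewritten, sc, s))
    return augmented

data = [("How old is the youngest singer?", "singer(name, age)",
         "SELECT MIN(age) FROM singer")]
print(augment(data)[1][0])  # -> How old is the youngest musician?
```

The key invariant is that only Q changes: the perturbed triplets keep the original schema and gold SQL, so the schema linking model learns to map differently worded questions to the same target tables and columns.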
3.3.2 Model Training

We set the target operation to choose the schema in Solid-SQL, as shown in Figure 2(B), formulated as {T, C} = G(Q, SC), where T = {T1, T2, ..., T|T|} and C = {C1, C2, ..., C|C|} are the selected tables and columns, respectively, and G is a fine-tuned generation model for schema linking. The absence of a training set for schema linking is problematic: existing text-to-SQL training datasets offer data in the form of (Q, SC)-S pairs without providing the chosen schema. Therefore, we parse the SQL statement S related to question Q to obtain the ground truth T, C.

3.4 Relative Example Retrieval

In crafting prompts for LLMs, the selection of task-relevant examples is paramount, as it enhances the model's comprehension and task performance by leveraging contextual adaptability, knowledge transfer, and ambiguity mitigation.

As verified by DAIL-SQL (Gao et al., 2024), examples formatted as pairs consisting of a text query and the corresponding correct SQL statement benefit in-context learning for the text-to-SQL task more than examples containing only the text or only the SQL query. We define the example set as E = {E1, E2, ..., EN}, where each Ei corresponds to a question-SQL pair, denoted as (Qi, Si).

3.4.1 Question Skeleton Similarity-Based

The correlation between two SQL statements, Si and Sj, suggests a corresponding relationship between their associated questions, Qi and Qj. To guide the LLM towards generating the desired SQL, it is reasonable to select analogous questions from the candidate set as examples, based on the target question Q.

However, the emphasis on similarity should focus on the structural alignment of the SQL statements rather than their thematic proximity. To achieve this, it is crucial to abstract the target question Q by removing domain-specific details and value-related content, revealing its core structure, or 'skeleton', denoted as Q*. Q* serves as the foundation for identifying analogous questions within the candidate set, with a focus on those exhibiting a similar structural pattern. By aligning the examples with Q*, the model can more accurately identify the underlying patterns and relationships essential for converting natural language queries into executable SQL commands. This method ensures that the LLM's in-context learning is attuned to the structural intricacies critical for the task.

To derive Q* from Q, we leverage the language understanding capabilities of an LLM and employ a prompting-based technique. We input the question Q, along with the schema inferred by our schema linking model, into a universal LLM and parse the output to extract Q*. As shown in Figure 2(C), this process obscures the domain-specific information and values, leaving only the question's skeletal structure. This extraction is applied both to the questions Qi within the candidate library (Q1, S1), (Q2, S2), ..., (Q|E|, S|E|) and to the given target question Q itself. Based on the cosine similarity between Q* and the set Q*1, Q*2, ..., Q*|E|, we can identify the top N most similar candidate skeletons, corresponding to example pairs (Q1, S1), (Q2, S2), ..., (QN, SN).

3.4.2 SQL Skeleton Similarity-Based

In addition to indirectly using questions to match the samples to be retrieved, direct matching can also be achieved through the similarity of SQL statements. As shown in Figure 2(D), after SQL generation is completed by the LLM in round 1, we can select examples for the SQL generation in round 2 based on the similarity of the SQL skeletons.

We extract the skeleton S* of an SQL statement S by identifying and manipulating its various components using an SQL parsing tool (https://github.com/tobymao/sqlglot). This process involves parsing the SQL statement to generate a syntax tree, then recognizing and replacing the table names, column names, and values within it, while preserving the SQL keywords and logical structure. Ultimately, a skeleton that contains only placeholders and the structure of the SQL is produced. This extraction process is applied to both the candidate library and the generated SQL from round 1.

We employ the edit distance derived from the parse tree of an SQL skeleton to quantify structural similarity, which provides an analytical approach that emphasizes the logical framework of SQL statements rather than their superficial textual similarities. This technique enables a more precise identification of key element correspondences than calculating the cosine similarity of embedded vectors (Gao et al., 2024).
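The skeleton extraction described above can be illustrated without the full parser. The paper's implementation works on the sqlglot syntax tree; the sketch below is a deliberately simplified, regex-based stand-in (keyword list and tokenization are assumptions, not the paper's code) that shows the same idea: keep SQL keywords and logical structure, and replace table names, column names, and literals with a placeholder.

```python
import re

# Keywords to preserve; this list is illustrative, not exhaustive.
SQL_KEYWORDS = {
    "SELECT", "DISTINCT", "FROM", "WHERE", "GROUP", "BY", "ORDER", "HAVING",
    "JOIN", "INNER", "LEFT", "ON", "AS", "AND", "OR", "NOT", "IN", "LIKE",
    "BETWEEN", "LIMIT", "ASC", "DESC", "COUNT", "SUM", "AVG", "MIN", "MAX",
    "UNION", "INTERSECT", "EXCEPT",
}

def sql_skeleton(sql: str) -> str:
    """Replace literals, table names, and column names with a placeholder,
    keeping SQL keywords and the logical structure."""
    sql = re.sub(r"'[^']*'", " [] ", sql)            # mask string literals
    sql = re.sub(r"\b\d+(?:\.\d+)?\b", " [] ", sql)  # mask numeric literals
    tokens = re.findall(r"\[\]|[A-Za-z_][\w.]*|[<>!=]=|[^\s\w]", sql)
    out = []
    for tok in tokens:
        if tok == "[]" or not (tok[0].isalpha() or tok[0] == "_"):
            out.append(tok)          # placeholder, operator, or punctuation
        elif tok.upper() in SQL_KEYWORDS:
            out.append(tok.upper())  # keep keywords
        else:
            out.append("[]")         # identifier -> placeholder
    return " ".join(out)

print(sql_skeleton("SELECT DISTINCT Country FROM Singer WHERE Age > 20;"))
# -> SELECT DISTINCT [] FROM [] WHERE [] > [] ;
```

A proper parse-tree-based extractor (as the paper uses) additionally distinguishes aliases, subqueries, and qualified names, which a token-level pass cannot do reliably; the placeholder output format, however, is the same kind of structure-only skeleton shown in Figure 2(C).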
It also effectively discounts disparities arising from diverse expressive forms or functional applications. Consequently, it facilitates a more robust assessment of the conceptual similitude within SQL statements.

3.5 Information Utilization

We design suitable prompt templates to integrate the existing information and query LLMs to accomplish SQL-generation effectively. A key consideration is the use of an explicit attention mechanism to embed the table and column names filtered through schema linking into the prompt. Existing approaches (Gao et al., 2024) often present only the filtered schema information to the LLM, reducing token count and excluding unnecessary information. However, this method has a significant drawback: if crucial schema information is omitted in this last step, the LLM will be unable to generate the correct SQL statement. In contrast, we use "focus on" to emphasize to the LLM the schema elements that are more likely to appear in the final SQL statement. Thus the model retains a comprehensive view of the entire schema while understanding the priority, ensuring the stability and fault tolerance of the LLM in generating SQL statements when faced with perturbation.

4 Experiments

4.1 Experimental Setup

4.1.1 Datasets

We evaluate the performance of Solid-SQL on a simple clean test set (Spider), a difficult clean test set (Bird), and three perturbed test sets.

Spider (Yu et al., 2018) is a dataset for semantic parsing and text-to-SQL, created by Yale students. It contains 10,181 questions and 5,693 SQL queries across 200 databases and 138 domains.

Bird (Li et al., 2023) is a large-scale text-to-SQL benchmark developed by Alibaba DAMO Academy. It features 12,751 question-SQL pairs, 95 databases, and spans over 37 domains.

Spider-Syn (Gan et al., 2021) is an adapted version of the Spider dataset with 5,672 questions modified by replacing words with their synonyms, using 273 synonyms and 189 phrases. On average, there is one alteration per question.

Spider-Realistic (Deng et al., 2021) is a perturbed evaluation set based on Spider. It has been manually adjusted to remove explicit column names while keeping SQL queries unchanged.

Dr. Spider (Chang et al., 2023) is a robustness evaluation benchmark for text-to-SQL models. It applies perturbations to the database, natural language query, and SQL components, and contains 15,000 pre- and post-perturbation examples.

4.1.2 Evaluation Metrics

We assess the performance of the text-to-SQL model by evaluating the quality of the generated SQL. Execution Accuracy (EX) is defined as the proportion of questions in the evaluation set for which the execution outcomes of the predicted SQL query and the ground-truth query are the same, calculated relative to the total number of queries. The EX of the generated SQL indicates how well the model meets the availability and precision requirements of real-world scenarios. We also use Exact Match Accuracy (EM) as an adjunct: EM is the proportion of generated SQL queries that exactly match the ground-truth SQL statements.

4.1.3 LLMs

Solid-SQL employs a prompting methodology that supports the use of various interchangeable LLMs. To validate the compatibility and generalizability of our proposed solution, we conducted experiments using four distinct LLMs for SQL generation, including both open-source and closed-source options.

Llama3-70b: An open-source LLM with 70 billion parameters by Meta AI, optimized for diverse NLP tasks including text generation and translation.

Deepseek-coder-33b-instruct: A 33-billion-parameter model from the Deepseek Coder series, leading in open-source code generation across multiple programming languages.

GPT-4o-mini: A compact version of GPT-4o, retaining core text capabilities with faster inference due to fewer parameters.

GPT-4: OpenAI's advanced generative pre-trained transformer, adept at complex tasks like essay writing and coding with high accuracy and creativity.

4.1.4 Baselines

We compare with other prompting-based text-to-SQL solutions that have SOTA performance on Spider and Bird.
Table 1: The EX (Execution Accuracy) and EM (Exact Match) of Solid-SQL on the Spider dev set, Spider-Syn test set and Spider-Realistic test set, compared with SOTA open-source prompting-based methods.

                                                      Spider      Spider-Syn  Spider-Realistic  Avg
Alternative LLM  Method                               EX    EM    EX    EM    EX    EM          EX    EM
Llama3-70b       DAIL-SQL (Gao et al., 2024)          60.5  42.4  58.0  33.3  59.8  41.5        59.4  39.1
                 MAC-SQL (Wang et al., 2024)          76.3  27.9  65.1  21.7  71.7  24.4        71.0  24.7
                 DIN-SQL (Pourreza and Rafiei, 2023)  \     \     \     \     \     \           \     \
                 Solid-SQL (round1)                   81.5  59.3  74.2  52.3  76.6  52.0        77.4  54.5
                 Solid-SQL (round2)                   82.1  61.0  74.6  54.7  77.1  54.3        77.9  56.7
Deepseek-33b     DAIL-SQL (Gao et al., 2024)          68.6  52.9  50.1  37.4  60.2  48.0        59.6  46.1
                 MAC-SQL (Wang et al., 2024)          50.0  5.8   39.7  3.7   37.4  3.9         42.4  4.5
                 DIN-SQL (Pourreza and Rafiei, 2023)  \     \     \     \     \     \           \     \
                 Solid-SQL (round1)                   77.3  52.5  68.4  43.4  71.4  50.8        72.3  48.9
                 Solid-SQL (round2)                   77.8  51.6  68.5  43.4  72.3  50.4        72.9  48.5
GPT-4o-mini      DAIL-SQL (Gao et al., 2024)          77.4  51.4  66.1  37.8  70.5  49.8        71.3  46.3
                 MAC-SQL (Wang et al., 2024)          78.1  37.8  68.7  29.4  76.0  39.0        74.3  35.4
                 DIN-SQL (Pourreza and Rafiei, 2023)  70.3  36.4  64.3  33.4  62.0  38.8        65.5  36.2
                 Solid-SQL (round1)                   80.4  60.0  74.2  52.3  76.1  54.7        76.9  55.7
                 Solid-SQL (round2)                   79.9  60.3  74.6  53.3  76.7  54.3        77.1  56.0

DAIL-SQL (Gao et al., 2024): Ranked second on the Spider leaderboard, DAIL-SQL employs a variety of example selection methods and a structured format for example organization. Leveraging GPT-4, it achieves high performance in SQL generation quality and query efficiency.

MAC-SQL (Wang et al., 2024): This method features a core decomposer agent for text-to-SQL with few-shot chain-of-thought reasoning, supported by two auxiliary agents for obtaining sub-databases and refining SQL queries. The agents work in tandem, with the flexibility to integrate new tools or features for enhanced text-to-SQL parsing.

DIN-SQL (Pourreza and Rafiei, 2023): DIN-SQL breaks down the text-to-SQL task into sub-problems: schema linking, query classification & decomposition, SQL generation, and self-correction. Utilizing prompting techniques, it demonstrates that LLMs can effectively solve these sub-problems when appropriately decomposed.

CodeS (Li et al., 2024b): CodeS is an open-source language model series designed for text-to-SQL, offering high accuracy with fewer parameters compared to closed-source LLMs. It employs an incremental pre-training strategy on a SQL-specific corpus and addresses schema linking and domain adaptation challenges. Evaluations show CodeS achieves state-of-the-art performance on multiple text-to-SQL benchmarks.

4.1.5 Solid-SQL Details

In the deployment of Solid-SQL, we employ the Llama3-8B-Instruct model as the foundational architecture for the schema linking task. The model is trained for five epochs on an augmented dataset comprising approximately 22,000 question-SQL pair instances. For the extraction of question skeletons, we use the same LLM as for SQL-generation. Furthermore, when assessing the cosine similarity between two question embeddings, we employ the bge-large-en-v1.5 embedding model to compute the embeddings.

4.2 Overall Performance

We test the performance of Solid-SQL across various benchmarks and compare it with the baselines.

Table 1 presents the performance of Solid-SQL on the Spider benchmark as well as its robustness test variants, Spider-Syn and Spider-Realistic. Leveraging the prompting-based nature of Solid-SQL and the compared baseline methods, we implemented a plugin-style approach to substitute various LLMs for SQL generation. Solid-SQL significantly outperforms the baselines in both execution accuracy (EX) and exact match (EM) across all datasets, with an average execution accuracy 12.4% higher than the baselines and an average exact match that also exceeds the baselines.
Table 2: The EX (Execution Accuracy) of Solid-SQL (ours) and baselines on Dr. Spider. "Pert Level" is where the perturbation is added and "Pert Type" is how the perturbation is added.

                                      CodeS-15B    Llama3-70b             GPT-4o-mini
Pert Level  Pert Type                 (Li et al.)  MAC-SQL  ours-round1   MAC-SQL  ours-round1
DB          DBcontent-equivalence     47.6         58.4     62.8          59.9     63.2
            schema-abbreviation       78.7         72.9     77.9          74.1     78.0
            schema-synonym            66.9         65.9     72.8          66.8     72.6
NLQ         column-attribute          68.9         63.9     69.2          65.7     68.9
            column-carrier            79.1         66.3     78.9          68.9     79.8
            column-synonym            64.7         53.3     70.3          54.8     70.1
            column-value              76.3         68.8     78.8          69.6     77.4
            keyword-carrier           91.7         89.5     91.1          90.5     91.2
            keyword-synonym           73.5         62.0     74.8          64.6     73.1
            multitype                 69.4         58.3     71.5          61.2     70.9
            others                    81.2         69.8     79.2          71.1     81.0
            value-synonym             71.9         60.1     73.2          62.5     73.9
SQL         comparison                71.9         75.3     77.6          76.2     77.8
            DB-number                 85.9         80.5     83.7          81.8     84.6
            DB-text                   80.7         72.8     79.6          75.2     80.2
            nonDB-number              84.0         84.0     87.0          86.3     89.7
            sort-order                84.9         62.0     77.3          66.1     79.5

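As a quick sanity check of the robustness claim, the Llama3-70b columns of Table 2 can be aggregated to find the average EX gain of ours-round1 over MAC-SQL across the 17 perturbation types. This is a simple aggregation of the published numbers, not a figure reported in the text:

```python
# (MAC-SQL, ours-round1) EX with Llama3-70b, per perturbation type (Table 2).
table2_llama = {
    "DBcontent-equivalence": (58.4, 62.8), "schema-abbreviation": (72.9, 77.9),
    "schema-synonym": (65.9, 72.8), "column-attribute": (63.9, 69.2),
    "column-carrier": (66.3, 78.9), "column-synonym": (53.3, 70.3),
    "column-value": (68.8, 78.8), "keyword-carrier": (89.5, 91.1),
    "keyword-synonym": (62.0, 74.8), "multitype": (58.3, 71.5),
    "others": (69.8, 79.2), "value-synonym": (60.1, 73.2),
    "comparison": (75.3, 77.6), "DB-number": (80.5, 83.7),
    "DB-text": (72.8, 79.6), "nonDB-number": (84.0, 87.0),
    "sort-order": (62.0, 77.3),
}
gains = {k: round(ours - mac, 1) for k, (mac, ours) in table2_llama.items()}
avg_gain = sum(gains.values()) / len(gains)  # average EX gain, roughly 8.3 points
largest = max(gains, key=gains.get)          # perturbation type with the biggest gain
```

The biggest single gain (17.0 EX points) comes on column-synonym, consistent with the observation below that Solid-SQL is most helpful against synonym perturbations.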
Table 3: The EX (Execution Accuracy) on the Bird dev set.

Method               Bird
GPT-4                46.2
DIN-SQL + GPT-4      50.7
DAIL-SQL + GPT-4     54.8
MAC-SQL + GPT-4      50.6
Solid-SQL + GPT-4    58.9

all datasets, with an average execution accuracy 12.4% higher than the baselines and an average exact match that also exceeds the baselines. It is noteworthy that certain methods exhibit a strong dependency on specific LLMs: DAIL-SQL's performance decreases heavily on Llama compared with the GPT series, and DIN-SQL is even unable to output reasonable SQL statements when using DeepSeek or Llama. In contrast, Solid-SQL performs well on all the tested LLMs, showing versatility and compatibility.

Table 2 demonstrates the execution accuracy of Solid-SQL when generating SQL queries under various levels and types of perturbation in the Dr.Spider dataset. Because MAC-SQL shows the most stable performance across different LLMs among the baselines in Table 1, we choose it as the object of comparison. The results clearly show that our Solid-SQL approach significantly outperforms MAC-SQL and matches the current state-of-the-art model in robustness, CodeS-15B, which has been fine-tuned with extensive data. Additionally, it is evident that Solid-SQL has a distinct advantage in robustness against perturbations that involve the use of synonyms.

Table 3 displays the experimental results on the Bird benchmark, which similarly demonstrate the consistent performance of Solid-SQL under complex requirements.

4.3 Ablation Study & Hyper-parameter Study

4.3.1 Schema Linking Training

Table 4 presents the results of an ablation study on schema-linking training. Compared to the baseline model without supervised fine-tuning (SFT), the model fine-tuned with the basic training set shows a significant improvement of approximately 25% in the accuracy of column-name selection, underscoring the necessity and efficacy of SFT. Moreover, employing a robustness-enhanced augmented training set with added perturbations further improves schema-linking accuracy, especially on perturbed benchmarks such as Spider-Syn and Spider-Realistic, with an enhancement of about 2%, highlighting its contribution to the robustness of schema linking.
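Table 4 reports the "searching accuracy of column names". The exact metric definition is not spelled out in this excerpt; one plausible reading, exact-set match between the predicted and gold column sets per example, can be sketched as:

```python
def column_selection_accuracy(predicted, gold):
    # Fraction of examples whose predicted column set exactly matches the
    # gold column set (one plausible reading of the Table 4 metric, not
    # necessarily the authors' definition).
    assert len(predicted) == len(gold)
    hits = sum(set(p) == set(g) for p, g in zip(predicted, gold))
    return hits / len(gold)
```

A softer per-column precision/recall variant would credit partial overlaps; the exact-match form is the stricter choice.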
Table 4: The ablation study of SFT-based schema linking in Solid-SQL. Values in the table are the search accuracy of column names. "w/o SFT" refers to the base model without fine-tuning, "w/o Enhance" to the model fine-tuned with the basic training data, and "with Enhance" to the model fine-tuned with the enhanced training data.

              Spider       Syn          Realistic
w/o SFT       65.6         59.9         62.5
w/o Enhance   88.6         82.4         83.1
with Enhance  89.7 (↑1.1)  84.2 (↑1.9)  85.1 (↑2.0)

Table 5: The ablation study of the design of explicit attention in prompt construction. Values in the table are the EX of SQL generated by one-round Solid-SQL with Llama3-70b. N is the number of examples retrieved for in-context learning. "with focus" and "w/o focus" refer to having the design or not, respectively.

N  Setting     Spider       Syn          Realistic
3  w/o focus   78.5         66.9         74.2
   with focus  81.7 (↑3.2)  73.8 (↑6.9)  76.0 (↑1.8)
9  w/o focus   79.8         67.5         74.8
   with focus  81.1 (↑1.3)  73.9 (↑6.4)  76.4 (↑1.6)
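The "with focus" rows of Table 5 correspond to adding an explicit attention cue to the prompt. The concrete wording of that cue is not given in this excerpt, so the template below (function name, labels, and layout) is a hypothetical illustration of the idea: the columns chosen by the schema-linking model are listed explicitly so the LLM attends to them when generating SQL:

```python
def build_prompt(question, schema, linked_columns, examples, with_focus=True):
    # Assemble an in-context-learning prompt from retrieved examples.
    parts = [f"Q: {q}\nSQL: {sql}" for q, sql in examples]
    parts.append(f"Schema: {schema}")
    if with_focus:
        # Explicit attention: spell out the schema-linked columns
        # (hypothetical phrasing, not the paper's exact template).
        parts.append("Focus on these columns: " + ", ".join(linked_columns))
    parts.append(f"Q: {question}\nSQL:")
    return "\n\n".join(parts)
```

Toggling `with_focus` reproduces the two settings compared in the ablation.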

4.3.2 Explicit Attention in Prompt

Table 5 presents the results of an ablation study of the explicit attention mechanism, examining how its presence in the prompt affects SQL generation. The study reveals that incorporating this mechanism significantly improves SQL exact-match accuracy across the three datasets, especially when dealing with synonym perturbations, as seen on Spider-Syn. Additionally, it is noteworthy that the positive effect of this design is more pronounced when the number of examples in the prompt is reduced. This suggests that, in scenarios where conserving tokens is crucial, our design can effectively enhance the performance of LLMs in generating SQL queries.

4.3.3 Number of Retrieved Examples

Table 6: The study of how many examples should be added to the prompt. N is the number of examples retrieved, ranging from 1 to 9. Values in the table are the execution accuracy of SQL generated in round 1 of Solid-SQL.

N          1      3      5      7      9
Spider     80.9   81.7   81.2   81.5   81.1
Syn        74.0   73.8   73.7   74.2   73.9
Realistic  75.6   76.0   76.4   76.6   76.4
Avg        76.83  77.17  77.10  77.43  77.13

Table 6 presents a study on the optimal number of examples to include in prompts for in-context learning. Although performance on the various benchmarks varies with different numbers of examples, there is a general trend in which the ability of LLMs to generate SQL queries first strengthens and then weakens as the number of examples increases. Based on the average performance, we ultimately set the number of examples to be recalled, denoted as N, to 7.

5 Limitations

The Solid-SQL approach offers opportunities for enhancement, particularly in the procedural design. We could define conditions for advancing to a second round of queries, which would streamline the process by eliminating unnecessary steps and increase algorithmic efficiency. Furthermore, the plugin-based architecture of Solid-SQL suggests the possibility of integrating it with other methodologies to achieve performance improvements, an option we have not yet fully investigated due to time constraints.

6 Conclusion

In this paper, we present Solid-SQL, a robust text-to-SQL solution designed to address the robustness limitations of current SOTA LLM-based methods. By focusing on pre-processing techniques and integrating a robust schema-linking model, along with a two-round example retrieval strategy, Solid-SQL significantly improves SQL execution accuracy on both standard and adversarially perturbed benchmarks. Solid-SQL achieves SOTA performance on general benchmarks and an average advancement of 11.6% over baselines on perturbed benchmarks, proving its effectiveness in enhancing the robustness of text-to-SQL systems. This work highlights the importance of system robustness in the development of text-to-SQL models and lays the groundwork for future research in this field.

Acknowledgments

We thank the anonymous reviewers for their helpful and valuable feedback. This work was partially supported by NSFC under Grants U2441240, U21B2018, and 62302344.
References

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. 2023. Gpt-4 technical report. arXiv preprint arXiv:2303.08774.

Muhammad Shahzaib Baig, Azhar Imran, Aman Ullah Yasin, Abdul Haleem Butt, and Muhammad Imran Khan. 2022. Natural language to sql queries: A review. Proc. of IJIST.

Shuaichen Chang and Eric Fosler-Lussier. 2023. Selective demonstrations for cross-domain text-to-sql. In Proc. of Findings-ACL.

Shuaichen Chang, Jun Wang, Mingwen Dong, Lin Pan, Henghui Zhu, Alexander Hanbo Li, Wuwei Lan, Sheng Zhang, Jiarong Jiang, Joseph Lilien, et al. 2023. Dr. spider: A diagnostic evaluation benchmark towards text-to-sql robustness. In Proc. of ICLR.

Xiang Deng, Ahmed Hassan, Christopher Meek, Oleksandr Polozov, Huan Sun, and Matthew Richardson. 2021. Structure-grounded pretraining for text-to-sql. In Proc. of NAACL.

Xuemei Dong, Chao Zhang, Yuhang Ge, Yuren Mao, Yunjun Gao, Jinshu Lin, Dongfang Lou, et al. 2023. C3: Zero-shot text-to-sql with chatgpt. arXiv preprint arXiv:2307.07306.

Jonathan Fürst, Catherine Kosten, Farhard Nooralahzadeh, Yi Zhang, and Kurt Stockinger. 2024. Evaluating the data model robustness of text-to-sql systems based on real user queries. arXiv preprint arXiv:2402.08349.

Yujian Gan, Xinyun Chen, Qiuping Huang, Matthew Purver, John R Woodward, Jinxia Xie, and Pengsheng Huang. 2021. Towards robustness of text-to-sql models against synonym substitution. In Proc. of ACL and IJCNLP.

Dawei Gao, Haibin Wang, Yaliang Li, Xiuyu Sun, Yichen Qian, Bolin Ding, and Jingren Zhou. 2024. Text-to-sql empowered by large language models: A benchmark evaluation. Proc. of VLDB.

Chunxi Guo, Zhiliang Tian, Jintao Tang, Pancheng Wang, Zhihua Wen, Kang Yang, and Ting Wang. 2023. Prompting gpt-3.5 for text-to-sql with de-semanticization and skeleton retrieval. In Proc. of PRICAI.

Jiaqi Guo, Zecheng Zhan, Yan Gao, Yan Xiao, Jian-Guang Lou, Ting Liu, and Dongmei Zhang. 2019. Towards complex text-to-sql in cross-domain database with intermediate representation. In Proc. of ACL.

Boyan Li, Yuyu Luo, Chengliang Chai, Guoliang Li, and Nan Tang. 2024a. The dawn of natural language to sql: Are we fully ready? arXiv preprint arXiv:2406.01265.

Haoyang Li, Jing Zhang, Hanbing Liu, Ju Fan, Xiaokang Zhang, Jun Zhu, Renjie Wei, Hongyan Pan, Cuiping Li, and Hong Chen. 2024b. Codes: Towards building open-source language models for text-to-sql. Proc. of PACMMOD.

Jinyang Li, Binyuan Hui, Ge Qu, Jiaxi Yang, Binhua Li, Bowen Li, Bailin Wang, Bowen Qin, Ruiying Geng, Nan Huo, et al. 2023. Can llm already serve as a database interface? a big bench for large-scale database grounded text-to-sqls. In Proc. of NeurIPS.

Qing Li, Tao You, Jinchao Chen, Ying Zhang, and Chenglie Du. Li-emrsql: Linking information enhanced text2sql parsing on complex electronic medical records. Proc. of IEEE Trans Reliab.

Xinyu Pi, Bing Wang, Yan Gao, Jiaqi Guo, Zhoujun Li, and Jian-Guang Lou. 2022. Towards robustness of text-to-sql models against natural and realistic adversarial table perturbation. In Proc. of ACL.

Mohammadreza Pourreza and Davood Rafiei. 2023. Din-sql: Decomposed in-context learning of text-to-sql with self-correction. In Proc. of NeurIPS.

Mohammadreza Pourreza and Davood Rafiei. 2024. Dts-sql: Decomposed text-to-sql with small large language models. arXiv preprint arXiv:2402.01117.

Torsten Scholak, Nathan Schucher, and Dzmitry Bahdanau. 2021. Picard: Parsing incrementally for constrained auto-regressive decoding from language models. In Proc. of EMNLP.

Hao Shen, Ran Shen, Gang Sun, Yiling Li, Yifan Wang, and Pengcheng Zhang. 2023. Sequential feature augmentation for robust text-to-sql. In Proc. of ACDP.

Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. 2014. Intriguing properties of neural networks. In Proc. of ICLR.

Shayan Talaei, Mohammadreza Pourreza, Yu-Chen Chang, Azalia Mirhoseini, and Amin Saberi. 2024. Chess: Contextual harnessing for efficient sql synthesis. arXiv preprint arXiv:2405.16755.

Bailin Wang, Richard Shin, Xiaodong Liu, Oleksandr Polozov, and Matthew Richardson. 2020. Rat-sql: Relation-aware schema encoding and linking for text-to-sql parsers. In Proc. of ACL.

Bing Wang, Changyu Ren, Jian Yang, Xinnian Liang, Jiaqi Bai, Linzheng Chai, Zhao Yan, Qian-Wen Zhang, Di Yin, Xing Sun, et al. 2024. Mac-sql: A multi-agent collaborative framework for text-to-sql. arXiv preprint arXiv:2312.11242.

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H Chi, Quoc V Le, and Denny Zhou. 2022. Chain-of-thought prompting elicits reasoning in large language models. In Proc. of NeurIPS.

Tianbao Xie, Chen Henry Wu, Peng Shi, Ruiqi Zhong, Torsten Scholak, Michihiro Yasunaga, Chien-Sheng Wu, Ming Zhong, Pengcheng Yin, Sida I Wang, et al. 2022. Unifiedskg: Unifying and multi-tasking structured knowledge grounding with text-to-text language models. In Proc. of EMNLP.

Xiaojun Xu, Chang Liu, and Dawn Song. 2017. Sqlnet: Generating structured queries from natural language without reinforcement learning. arXiv preprint arXiv:1711.04436.

Tao Yu, Rui Zhang, Kai Yang, Michihiro Yasunaga, Dongxu Wang, Zifan Li, James Ma, Irene Li, Qingning Yao, Shanelle Roman, et al. 2018. Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-sql task. In Proc. of EMNLP.

Weixu Zhang, Yu Wang, and Ming Fan. 2023. Towards robustness of large language models on text-to-sql task: An adversarial and cross-domain investigation. In Proc. of ICANN.

Terry Yue Zhuo, Zhuang Li, Yujin Huang, Fatemeh Shiri, Weiqing Wang, Gholamreza Haffari, and Yuan-Fang Li. 2023. On robustness of prompt-based semantic parsing with large pre-trained language model: An empirical study on codex. In Proc. of EACL.
