
PROJECT WORK

NATURAL LANGUAGE TO SQL: AN INTELLIGENT
CONVERSATIONAL INTERFACE FOR DATABASE QUERYING

PROJECT SUBMITTED IN PARTIAL FULFILLMENT OF
THE REQUIREMENTS FOR THE AWARD OF THE
DEGREE OF BACHELOR OF ENGINEERING IN
COMPUTER SCIENCE AND ENGINEERING
OF ANNA UNIVERSITY

2025
Submitted by
THOUFEEQ A 71772117146
VAITHEESHWARAN S 71772117147
VISWESWARAN G S 71772117150
BOOPATHI HARI R 71772117L03

Under the Guidance of


Dr. J. C. Miraclin Joyce Pamila, M.E., Ph.D.

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING


GOVERNMENT COLLEGE OF TECHNOLOGY
(An Autonomous Institution affiliated to Anna University)
COIMBATORE - 641 013
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
GOVERNMENT COLLEGE OF TECHNOLOGY
(An Autonomous Institution affiliated to Anna University)
COIMBATORE - 641 013

PROJECT WORK

APRIL 2025

This is to certify that this project work entitled

NATURAL LANGUAGE TO SQL: AN INTELLIGENT
CONVERSATIONAL INTERFACE FOR DATABASE QUERYING
is the bonafide record of project work done by
THOUFEEQ A [ 71772117146 ]
VAITHEESHWARAN S [ 71772117147 ]
VISWESWARAN G S [ 71772117150 ]
BOOPATHI HARI R [ 71772117L03 ]

of B.E. (COMPUTER SCIENCE AND ENGINEERING) during the year 2024 - 2025

Dr. J. C. Miraclin Joyce Pamila,                 Dr. J. C. Miraclin Joyce Pamila,
Project Guide                                    Head of the Department

Submitted for the Project Viva-Voce Examination held on _______________
at Government College of Technology, Coimbatore - 13.

Internal Examiner                                External Examiner


ACKNOWLEDGEMENT

Great achievements are not possible without standing on the shoulders of giants. Without the active involvement of the following experts, this project would not have been a reality.

We express our sincere gratitude to Dr. K. Manonmani, M.E., Ph.D., Principal, Government College of Technology, Coimbatore, for providing all the facilities we needed for the completion of this project.

We owe our thankfulness and gratitude to our respectable project guide, Dr. J. C. Miraclin Joyce Pamila, M.E., Ph.D., Professor and Head of the Department of Computer Science and Engineering, who has been an immense help through the various phases of the project. With her potent ideas and excellent guidance, we were able to comprehend the essential aspects involved.

We extend our sincere thanks to our respected panel members, Dr. A. Meena Kowshalya, M.E., Ph.D., Associate Professor, and Dr. T. Raja Senbagam, M.E., Ph.D., Assistant Professor, for all their valuable suggestions toward the completion of the project.

We would like to thank our faculty advisor, Prof. L. Sumathi, M.E., Assistant Professor, for her continuous support and encouragement throughout this project.

We dedicate this work to our parents for their constant encouragement throughout the project. We also thank all our friends for their cooperation and suggestions towards the successful completion of this project.

SYNOPSIS

This project presents a Text-to-SQL system that converts natural language queries into SQL and responds with paraphrased, easily understandable answers. The ultimate goal is to make relational databases accessible to non-technical users by enabling interaction in everyday language.

The system utilizes Large Language Models (LLMs) to generate SQL queries from user questions. To improve accuracy, it employs dynamic few-shot learning: relevant examples are selected based on their similarity to the user query, thereby enhancing relevance.

One of the most important features of the system is dynamic table selection, wherein the system scans input keywords and includes only the relevant tables from the database schema. This improves the accuracy of SQL results by avoiding unnecessary or irrelevant table references.

The SQL query is constructed via GPT and then executed on a networked MySQL database. The query result is passed through GPT once more to produce a paraphrased, natural language response, making the output more user-friendly. To handle multi-turn conversations and follow-up queries, the system uses LangChain's Conversation Buffer Memory. With this feature, the chatbot maintains context awareness between questions, making conversations more natural and coherent.

The frontend is built on Next.js, which provides an interactive user interface for entering questions and receiving answers in real time. The backend is based on Django, which handles request routing, database operations, and integration with external services such as GPT.

This project illustrates the effective utilization of chat-based AI, prompt engineering, and memory-aware models in the creation of intelligent database systems. It represents a major advancement in making advanced data operations accessible through everyday language.

TABLE OF CONTENTS

CHAPTER NO.  TITLE                                          PAGE NO.

             BONAFIDE CERTIFICATE                           II
             ACKNOWLEDGEMENT                                III
             SYNOPSIS                                       IV
             TABLE OF CONTENTS                              V

1            INTRODUCTION                                   1
             1.1  DESCRIPTION
             1.2  EXISTING SYSTEM
             1.3  PROBLEM DEFINITION
             1.4  PROPOSED SYSTEM
             1.5  ORGANIZATION OF THE PROJECT

2            LITERATURE REVIEW                              4
             2.1  NATURAL LANGUAGE TO SQL GENERATION        4
                  2.1.1  DESCRIPTION
                  2.1.2  MERIT
                  2.1.3  DEMERIT
             2.2  FEW-SHOT LEARNING FOR TEXT-TO-SQL         5
                  2.2.1  DESCRIPTION
                  2.2.2  MERIT
                  2.2.3  DEMERIT
             2.3  DYNAMIC TABLE SELECTION                   5
                  2.3.1  DESCRIPTION
                  2.3.2  MERIT
                  2.3.3  DEMERIT
             2.4  PROMPT OPTIMIZATION                       6
                  2.4.1  DESCRIPTION
                  2.4.2  MERIT
                  2.4.3  DEMERIT
             2.5  QUERY REPHRASING USING LLMs               7
                  2.5.1  DESCRIPTION
                  2.5.2  MERIT
                  2.5.3  DEMERIT
             2.6  LANGCHAIN MEMORY FOR CONTEXT RETENTION    8
                  2.6.1  DESCRIPTION
                  2.6.2  MERIT
                  2.6.3  DEMERIT

3            SYSTEM SPECIFICATION                           9
             3.1  SYSTEM REQUIREMENTS                       9
                  3.1.1  HARDWARE REQUIREMENTS
             3.2  SOFTWARE REQUIREMENTS                     9
                  3.2.1  PACKAGES USED

4            METHODOLOGY                                    13
             4.1  METHOD USED
             4.2  ARCHITECTURE DESCRIPTION
             4.3  MODULE DESCRIPTION
             4.4  MODEL BUILDING
             4.5  TRAINING
             4.6  TESTING AND EVALUATION
             4.7  TOOLS AND TECHNOLOGIES
             4.8  DATASETS USED

5            IMPLEMENTATION AND RESULTS                     21
             5.1  IMPLEMENTATION                            21
                  5.1.1  REQUIREMENTS
                  5.1.2  BACKEND IMPLEMENTATION CODE
                  5.1.3  FRONTEND IMPLEMENTATION CODE
             5.2  OUTPUT                                    27

6            CONCLUSION                                     29
             6.1  CONCLUSION

7            REFERENCES                                     30
CHAPTER 1
INTRODUCTION

1.1 DESCRIPTION

Text-to-SQL generation is a process that converts natural language questions into structured SQL queries. It bridges the gap between non-technical users and complex databases by allowing users simply to ask questions. The technology uses natural language processing and machine learning to understand the intent behind a query, and it is particularly useful in business intelligence and data analytics platforms. The model learns patterns from large datasets containing text and corresponding SQL queries. It simplifies database interaction and makes data more accessible to everyone.

1.2 EXISTING SYSTEM

1. Model Selection: With the emergence of various LLMs, selecting the most suitable model for Text-to-SQL is challenging due to variations in architecture, size, training data, and computational requirements.
2. No Dynamic Table Selection: Existing models typically consider the entire schema during query generation, which leads to longer response times and reduced accuracy.
3. Minimal Use of Few-Shot Learning: While few-shot learning could enhance adaptability, existing systems rarely use contextual examples to improve performance on previously unseen query patterns.
4. Lack of Memory Cache: Query results are not stored for reuse, leading to redundant computations and increased processing time.

1.3 PROBLEM DEFINITION

In many businesses, non-technical users struggle to retrieve insights from databases due to a lack of SQL knowledge. This dependency on database administrators or analysts often leads to delays in decision-making. To overcome this challenge, a real-time Natural Language to SQL (NL2SQL) system is developed. It allows users to enter natural language queries and instantly receive structured SQL results, simplifying database interaction and enabling faster, data-driven decisions.

1.4 PROPOSED SYSTEM

The proposed system consists of the following key components:

1. LLM-Driven Query Generation: Utilizes large language models with LangChain to convert natural language queries into accurate SQL statements with improved contextual understanding.
2. Dynamic Table Selection: Relevant tables are automatically selected from the database based on keyword extraction from the user's input, minimizing irrelevant data processing.
3. Few-Shot Learning: Dynamic few-shot examples are provided to the model to guide and optimize SQL generation based on the input context.
4. Memory Cache for Context Retention: Past user interactions are stored using LangChain's memory, allowing the system to handle follow-up queries with context awareness.
5. Chatbot UI: The final response is presented to the user through a conversational chatbot interface for a seamless and interactive experience.

1.5 ORGANIZATION OF THE PROJECT

● Literature reviews of existing proposals are discussed in Chapter 2.
● Chapter 3 presents the system specification, covering the software and hardware requirements.
● Chapter 4 discusses the overall design of the project, with a brief description of each module.
● Chapter 5 presents the implementation and experimental results of the project.
● Chapter 6 deals with the conclusion and future work.
● Finally, Chapter 7 lists the references.

CHAPTER 2

LITERATURE REVIEW

2.1 NATURAL LANGUAGE TO SQL GENERATION

2.1.1 DESCRIPTION

The base paper emphasizes the use of large language models (LLMs), such as GPT-based architectures, for translating natural language into SQL queries. It focuses on prompt formatting strategies and benchmark testing to evaluate model performance. While effective in structured experiments, this approach is limited by static prompt design and lacks integration with live database systems. Our project builds upon this by implementing a real-time, schema-aware solution using LangChain. Instead of relying on static schema inputs, we dynamically fetch and inject database schema information into prompts, enabling more accurate and context-sensitive SQL generation. Unlike the base paper, which evaluates outputs offline, our system interacts directly with live databases, delivering instant query execution and feedback. This shift makes the solution more practical and adaptable to real-world use cases.
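To illustrate the schema-aware approach, the minimal Python sketch below fetches live table information with LangChain's SQLDatabase utility (the same utility used in Chapter 5) and injects it into a prompt. The connection URL is a placeholder.

from langchain_community.utilities import SQLDatabase

# Placeholder connection URL; real credentials come from settings (Chapter 5).
db = SQLDatabase.from_uri("mysql+pymysql://user:password@localhost:3306/salesdb")

# Live schema (table and column definitions) fetched at request time,
# rather than a static schema hardcoded into the prompt.
schema = db.get_table_info()

prompt = (
    f"You are an expert SQL assistant.\n"
    f"Schema:\n{schema}\n\n"
    f"User: List all customers from France.\nSQL:"
)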

2.1.2 MERIT

●​ Improves database accessibility for non-technical users.


●​ Automates query generation without the need for SQL expertise.
●​ Schema-aware responses that adapt to different domains.

2.1.3 DEMERIT

●​ SQL generation was static and not executed in real-time.


●​ Ambiguous natural language can lead to incorrect SQL generation.
●​ Requires continuous prompt refinement for optimal performance.

2.2 FEW-SHOT LEARNING FOR TEXT-TO-SQL

2.2.1 DESCRIPTION

The base paper explores prompt-based learning using two types of fixed prompt structures (Type I and Type II), but it applies few-shot examples in a static manner, embedding hardcoded natural language and SQL pairs into prompts during benchmarking. While this improves performance, it lacks adaptability across dynamic schemas or varying user queries. In contrast, our solution adopts a dynamic few-shot strategy using LangChain's FewShotPromptTemplate. Here, 3–5 relevant natural language questions and their corresponding SQL queries are programmatically selected and inserted into the prompt based on the user's current query context. This method simulates how humans learn: by observing examples before attempting similar tasks. It improves generalization without the need for large training datasets or model fine-tuning. The dynamic nature allows better alignment with schema variations and query intent in real time.
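A minimal sketch of this dynamic strategy, assuming a small example pool (drawn from the examples in Chapter 5) and OpenAI embeddings; the selector retrieves the k most similar question/SQL pairs for the incoming query.

from langchain.prompts import FewShotPromptTemplate, PromptTemplate
from langchain.prompts.example_selector import SemanticSimilarityExampleSelector
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings

# A small pool of question/SQL pairs.
examples = [
    {"question": "List all customers from France.",
     "query": "SELECT customerName FROM customers WHERE country = 'France';"},
    {"question": "Which product has the highest price?",
     "query": "SELECT productName FROM products ORDER BY buyPrice DESC LIMIT 1;"},
    {"question": "What is the total payment amount received?",
     "query": "SELECT SUM(amount) FROM payments;"},
]

# Embed the examples in Chroma and pick the k most similar to the user's question.
selector = SemanticSimilarityExampleSelector.from_examples(
    examples, OpenAIEmbeddings(), Chroma, k=2, input_keys=["question"]
)

few_shot_prompt = FewShotPromptTemplate(
    example_selector=selector,
    example_prompt=PromptTemplate.from_template("Q: {question}\nA: {query}"),
    prefix="You are an expert SQL assistant. Follow the examples:",
    suffix="Q: {question}\nA:",
    input_variables=["question"],
)

print(few_shot_prompt.format(question="Which customers are located in Paris?"))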

2.2.2 MERIT

●​ Enhances model adaptability with minimal examples.


●​ Requires no fine-tuning or retraining for each schema.
●​ Improves SQL generation even with minimal data availability.

2.2.3 DEMERIT

●​ Few-shot examples were static.


●​ Can be inconsistent if the prompt structure is not standardized.
●​ Struggles with complex or novel queries not covered in examples.

2.3 DYNAMIC TABLE SELECTION

2.3.1 DESCRIPTION

In multi-table databases, accurately identifying which tables to reference is essential for generating valid SQL queries. The base paper relies on static prompt injection of entire schemas, where all table names and columns are included in the prompt regardless of their relevance to the user's query. This increases prompt length, introduces noise, and reduces accuracy, especially in large or complex databases. Our solution overcomes these limitations through dynamic table selection using semantic embeddings: each table name and its metadata are converted into vector representations using a pre-trained embedding model, and the tables most similar to the user query are retained. This allows the model to focus on only the necessary schema elements, improving the quality of SQL generation and minimizing the inclusion of irrelevant or conflicting tables.
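A minimal sketch of embedding-based table selection, assuming short hand-written table descriptions as the metadata; cosine similarity between the query vector and each table vector determines the top matches.

import numpy as np
from langchain_openai import OpenAIEmbeddings

# Hand-written table descriptions (assumed metadata, for illustration only).
tables = {
    "customers": "customer names, contact details, addresses, credit limits",
    "orders": "order dates, shipment status, the ordering customer",
    "payments": "payment amounts and dates linked to each customer",
}

emb = OpenAIEmbeddings()
table_vectors = {
    name: np.array(emb.embed_query(f"{name}: {desc}"))
    for name, desc in tables.items()
}

def top_tables(question: str, k: int = 2) -> list[str]:
    q = np.array(emb.embed_query(question))
    def cosine(v: np.ndarray) -> float:
        return float(v @ q / (np.linalg.norm(v) * np.linalg.norm(q)))
    ranked = sorted(table_vectors, key=lambda name: cosine(table_vectors[name]),
                    reverse=True)
    return ranked[:k]

print(top_tables("How much has each customer paid in total?"))
# Only the selected tables' schema is then injected into the SQL prompt.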

2.3.2 MERIT

●​ Reduces SQL generation errors by avoiding irrelevant tables.


●​ Enhances contextual awareness in multi-table databases.
●​ Scales effectively with growing schema complexity.

2.3.3 DEMERIT

●​ Includes all schema tables in the prompt, regardless of relevance.


●​ May return incorrect tables if the query is ambiguous.
●​ Manual schema injection makes it harder to scale or adapt dynamically.

2.4 PROMPT OPTIMIZATION

2.4.1 DESCRIPTION

The base paper explores two prompt types (Type I and Type II), showing that changing the structure and verbosity of prompts significantly impacts performance. However, these prompts were manually crafted and static, requiring trial-and-error tuning per use case. There was no support for dynamically adapting the prompt based on schema, task, or query complexity. Our implementation addresses this limitation through LangChain's PromptTemplate and FewShotPromptTemplate, which allow programmatic and flexible prompt construction. Each prompt includes a clear task instruction, dynamically injected schema, and optionally relevant few-shot examples, tailored to the user's current query. This reduces manual overhead, ensures consistency, and allows the system to handle varied schemas and query types more effectively.

2.4.2 MERIT

●​ Improves consistency and quality of SQL outputs.


●​ Reduces ambiguity in model interpretation by clearly separating schema,
task, and input.
●​ Essential for tailoring the model to specific domains or schemas.

2.4.3 DEMERIT

●​ Manually crafting effective prompts can be time-consuming.


●​ Hard to generalize across very different schema or domains.
● Overly long prompts may exceed model token limits.

2.5 QUERY REPHRASING USING LLMs

2.5.1 DESCRIPTION

Natural language queries can be vague or ambiguous, making it difficult for the model to interpret them accurately. To address this, the concept of query rephrasing is introduced. In earlier versions of the system, user queries were automatically rephrased into multiple semantically similar versions using LLMs. Each variant was then tested, and the one that yielded the most accurate SQL was chosen. Though not used in the current implementation, this method forms a critical part of modern NL2SQL research. It allows the system to overcome the limitations of unclear or grammatically incorrect user input. By rephrasing the question into more structured and explicit versions, the model is better able to generate the correct SQL, especially for edge cases.
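Since rephrasing is not part of the current implementation, the sketch below is purely illustrative: an LLM produces several explicit variants of a vague question, each of which could then be run through the SQL chain, with a selection step keeping the best result.

from langchain.prompts import PromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0.7)

rephraser = (
    PromptTemplate.from_template(
        "Rewrite the question below in {n} clearer, more explicit ways, "
        "one per line:\n\n{question}"
    )
    | llm
    | StrOutputParser()
)

variants = rephraser.invoke(
    {"question": "biggest buyers last yr?", "n": 3}
).splitlines()

# Each variant would be converted to SQL; a selection mechanism picks the best.
for v in variants:
    print(v)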

2.5.2 MERIT

●​ Increases reliability by reducing ambiguity.


●​ Allows for flexible interpretation of varied user phrasing.
●​ Can improve accuracy without modifying the underlying model.

2.5.3 DEMERIT

● Rephrasing is not part of the current system; it was used only in testing.


●​ Requires a selection mechanism to pick the best variant.
●​ May not significantly help with deeply complex queries.

2.6 LANGCHAIN MEMORY FOR CONTEXT RETENTION

2.6.1 DESCRIPTION

While the base paper provided a foundational approach to natural language queries over data, it lacked a mechanism for context retention across user interactions. This limited the system's ability to handle follow-up questions or sustain coherent conversations across multiple turns. To overcome this limitation, our solution integrates LangChain's memory module, which enables the system to retain and utilize past interactions. With memory, users no longer need to repeat details in every query. For instance, after asking for "clients in Bangalore," a user can simply follow up with, "How many total orders do they have?" The system understands that "they" refers to the clients retrieved in the previous query, thereby supporting natural, human-like dialogue.
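A minimal sketch of context retention with LangChain's ConversationBufferMemory (the component named in the synopsis); the buffered history is what lets the model resolve pronouns such as "they" in a follow-up. The stored answers here are illustrative.

from langchain.memory import ConversationBufferMemory

memory = ConversationBufferMemory(memory_key="chat_history", return_messages=True)

# Turn 1: the question and its answer are stored in the buffer.
memory.save_context(
    {"input": "List the clients in Bangalore."},
    {"output": "Acme Traders, Zenith Retail"},
)

# Turn 2: the history is injected into the next prompt, so the model can
# resolve "they" to the clients retrieved above.
print(memory.load_memory_variables({})["chat_history"])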

2.6.2 MERIT

●​ Maintains dialogue continuity for multi-turn queries.


●​ Reduces user effort by eliminating repetitive inputs.
●​ Users can explore data incrementally through chained queries.

2.6.3 DEMERIT

●​ The base system treats each query in isolation, discarding any context from
previous interactions.
●​ Users must repeat entire query details even for simple follow-ups, which
can be inefficient and frustrating.
●​ Potential risk of context leakage between user sessions.

CHAPTER 3

SYSTEM SPECIFICATION

3.1 SYSTEM REQUIREMENTS

3.1.1 HARDWARE REQUIREMENTS

System    : Intel(R) Core(TM) i5/i7 processor or higher
Hard Disk : 256 GB or higher
RAM       : 8 GB or higher

3.2 SOFTWARE REQUIREMENTS

Platform         : Google Colab (cloud-based platform with GPU/TPU support)
Operating System : Any system capable of accessing Google Colab (e.g., Windows, macOS)
Browser          : Google Chrome or Mozilla Firefox for the best Colab experience

3.2.1 PACKAGE: OPENAI

OpenAI provides tools and APIs for developing AI-based applications utilizing
advanced language models such as ChatGPT (GPT-3.5 / GPT-4).

KEY FEATURES

●​ Access to high-performance language models (ChatGPT).

●​ Supports natural language understanding and generation.

●​ Facilitates tasks such as text generation, summarization, translation, and
conversational AI.

●​ Easy integration using API keys in a developer-friendly environment.

3.2.2 PACKAGE: LANGCHAIN

LangChain is an open-source framework designed for building applications powered by language models.

KEY FEATURES

●​ Enables chaining of LLM calls for complex workflows.

●​ Provides tools for prompt engineering and output parsing.

●​ Supports integration with memory modules, agents, and external APIs.

3.2.3 PACKAGE: LANGCHAIN-GOOGLE-GENAI

Langchain-Google-GenAI provides connectors to utilize Google Generative AI models within the LangChain framework.

KEY FEATURES

●​ Simplified interface for interaction with Google Generative AI.

●​ Leverages Langchain's LLM functionality through Google's APIs.

●​ Easy integration for building scalable AI-powered systems.

3.2.4 PACKAGE: LANGSMITH

LangSmith is a debugging, monitoring, and management platform for LangChain-based applications.

KEY FEATURES

●​ Real-time observability and debugging support.

●​ Assists in tracing execution flow and performance of LLM applications.

●​ Provides tools for logging, error tracking, and execution tracing.

3.2.5 PACKAGE: PYMYSQL

PyMySQL is a pure-Python MySQL client library used to connect Python applications with MySQL databases.

KEY FEATURES

●​ Supports SQL query execution within Python.

●​ Lightweight and easy-to-use interface.

●​ No need for MySQL client libraries on the system.

3.2.6 PACKAGE: SQLALCHEMY

SQLAlchemy is a powerful SQL toolkit and Object Relational Mapper (ORM) for Python; a short usage sketch follows the feature list below.

KEY FEATURES

●​ Provides high-level ORM features for database operations.

●​ Database-agnostic and supports multiple database engines.

●​ Seamlessly integrates with PyMySQL for MySQL operations.
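A minimal usage sketch, assuming placeholder credentials and database name; SQLAlchemy dispatches to PyMySQL through the mysql+pymysql URL scheme, exactly as in the backend code of Chapter 5.

from sqlalchemy import create_engine, text

# Placeholder credentials; the real values come from Django settings (Chapter 5).
engine = create_engine("mysql+pymysql://user:password@localhost:3306/salesdb")

with engine.connect() as conn:
    for row in conn.execute(text("SELECT COUNT(*) FROM customers")):
        print(row)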

3.2.7 PACKAGE: CHROMADB

ChromaDB is an open-source vector database for storing and querying vector embeddings; a short usage sketch follows the feature list below.

KEY FEATURES

●​ Efficient similarity search and vector data storage.

●​ Ideal for AI applications involving embedding-based search.

●​ Lightweight and scalable solution.
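A minimal sketch of ChromaDB storing table descriptions for similarity search; the collection name and document contents are illustrative assumptions.

import chromadb

client = chromadb.Client()  # in-memory instance; persistent storage is optional
collection = client.create_collection(name="table_descriptions")

collection.add(
    ids=["customers", "orders"],
    documents=[
        "customer contact details, addresses, and credit limits",
        "order dates, shipment status, and the ordering customer",
    ],
)

# Returns the stored table description most similar to the query text.
print(collection.query(query_texts=["who placed orders recently?"], n_results=1))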

3.2.8 PACKAGE: LANGCHAIN-COMMUNITY

Langchain-Community provides additional tools and integrations contributed by the open-source community for LangChain.

KEY FEATURES

●​ Includes connectors and components for third-party tools.

●​ Extends the functionality of the Langchain framework.

●​ Open-source and regularly updated by the community.

3.2.9 PACKAGE: LANGCHAIN-OPENAI

Langchain-OpenAI offers seamless integration of OpenAI GPT models into the LangChain framework.

KEY FEATURES

●​ Direct connectivity with OpenAI models via API.

●​ Supports embeddings, chat models, and prompt management.

●​ Easy configuration and customization for various use cases.

3.2.10 PACKAGE: PANDAS

Pandas is a powerful Python library for data manipulation and analysis.

KEY FEATURES

●​ Provides DataFrame structures for handling structured data.

●​ Supports efficient data cleaning, transformation, and analysis.

●​ Integrates well with other Python libraries like NumPy and Matplotlib.

CHAPTER 4
METHODOLOGY

4.1 METHOD USED

ARCHITECTURE DIAGRAM

Figure 1.1: System Architecture Overview

An overview of the system architecture depicting the core components, their interactions, and the flow of data within the system.

● The proposed system follows a text-to-SQL conversion approach using dynamic few-shot learning and dynamic table selection to generate SQL queries from natural language user prompts.
● The system is enhanced with LangChain memory to support conversational interactions and query refinement. The generated SQL queries are executed against a database schema, and the output is rephrased for better readability in a chatbot interface.

4.2 ARCHITECTURE DESCRIPTION

The architecture consists of multiple processing modules, as shown in Figure 1.1. The key steps in the workflow are listed below; a condensed code sketch of the pipeline follows the list.

1. User Prompt: The user provides a natural language query.
2. Dynamic Table Selection: Relevant tables are dynamically chosen based on keyword extraction from the prompt.
3. Few-Shot Learning: Dynamic few-shot examples are used to optimize SQL query generation.
4. Query Generation: ChatGPT generates SQL queries based on the optimized input.
5. Query Execution: The generated SQL is executed against the connected database schema.
6. LangChain Memory: Stores past conversations to provide context-aware responses for follow-up questions.
7. Query Answer Rephrasing: The retrieved results are rephrased to enhance readability.
8. Chatbot UI: Presents the final response to the user in a conversational manner.

4.3 MODULE DESCRIPTION

The proposed system is composed of multiple modules, each contributing uniquely to the functionality of the natural language interface for structured databases. The modular design enhances maintainability, scalability, and clarity of implementation. Below is a detailed explanation of each module involved in the system.

4.3.1 Description/Prompt Input

The Description/Prompt module is the entry point for the user to interact with the system. Here, the user types a natural language question or instruction to be translated into an SQL query. The module captures this prompt and forwards it to the back end for further processing. It ensures that the entire input is correctly captured, stored, and validated before moving to the next step. The module also supports multi-line prompts and edge cases such as incomplete or ambiguous queries. It acts as a bridge between human input and machine logic, and it enhances usability by allowing flexible and varied query formats. By handling user input gracefully, it increases the system's overall user-friendliness.

4.3.2 Dynamic Table Selection

The Dynamic Table Selection module plays a pivotal role in identifying the relevant tables needed to answer the user query. It analyzes the given prompt
and cross-references it with the schema to pinpoint the required database
components. The module uses semantic analysis or keyword extraction to map
parts of the question to corresponding table names. This approach ensures that
only the essential parts of the schema are used during query generation. It
optimizes performance by minimizing the number of tokens passed to the
language model. Additionally, it supports queries involving joins by recognizing
relationships between tables. This module ensures better accuracy, faster
processing, and efficient use of computational resources. Its inclusion significantly
improves the precision of the generated SQL queries.

4.3.3 Dynamic Few-Shot Learning

Dynamic Few-Shot Learning is a key enhancement over static prompt engineering in natural language to SQL systems. This module intelligently selects
a small set of highly relevant example pairs (natural language question + SQL
query) from a predefined dataset or logs. These examples are chosen based on
the current user query’s structure and keywords to maximize contextual similarity.
By dynamically tailoring the examples, the module improves the LLM’s ability to
generalize and generate accurate queries for unseen inputs. Unlike traditional
fixed-prompt methods, this approach is adaptive and scales better with diverse
query types. It reduces ambiguity and enhances model accuracy without requiring
model fine-tuning. Additionally, it ensures the prompts remain concise and
token-efficient for optimal inference performance. This dynamic strategy is
especially beneficial in real-time applications with varied query styles and user
behavior.

4.3.4 Query Generation using GPT

This module is responsible for generating syntactically and semantically correct SQL queries using large language models (LLMs). A few-shot
learning strategy is employed where the model is given a few sample pairs of
natural language questions and SQL queries along with the user’s input. The
prompt is dynamically constructed using both the current user query and selected
relevant schema details. This technique improves the model’s performance by
guiding it through example-driven learning. It allows the model to adapt to new or
unseen query types with minimal retraining. The resulting SQL query is then
passed for execution. This module is the heart of the system and plays a key role
in translating human intentions into database-understandable commands. It offers
flexibility, reusability, and language model compatibility.

4.3.5 LangChain Memory

LangChain Memory Integration enables the system to maintain the context of a conversation across multiple user turns. It stores the history of past
queries and responses to make interactions more dynamic and coherent. This is
particularly useful when users ask follow-up questions like “What about last
month’s sales?” after a previous query. The memory module allows the system to
recall previous context and combine it with the current prompt. By supporting
contextual continuity, this module simulates real-time human dialogue. It helps
avoid redundant explanations and supports conversational SQL generation.
LangChain memory increases the intelligence of the system and improves the
user experience by providing more meaningful and personalized responses. It is
especially beneficial in enterprise applications where multi-turn queries are
common.

4.3.6 Query Answer Rephrasing

Once the SQL query is executed and the raw result is retrieved from the database, this module converts the output into a user-friendly sentence. It ensures that the final result is not a plain table or number but a well-structured, grammatically correct English sentence. This is crucial for users who are not familiar with SQL output formats. The module uses template-based or model-based methods to transform tabular responses into human-readable narratives, and it may also include additional explanation or summarization for clarity. This layer polishes the user experience and bridges the gap between technical data and user understanding, making the system usable by non-technical stakeholders. Its goal is to provide accurate, readable, and meaningful results.

4.3.7 Chatbot UI

The ChatBot UI is the graphical user interface that allows the user to
communicate with the system. It features an input box for questions, a display
area for the responses, and a conversational layout for interaction history. The UI
is designed to be simple, clean, and responsive to accommodate all types of
users. Built with modern front-end technologies, it supports real-time interactions
and dynamic updates. The interface makes it easy to handle multi-turn queries
and lets users visualize both their input and the output clearly. The UI plays a key
role in increasing engagement and usability. It also offers basic validations, error
handling, and retry mechanisms. Overall, this module makes the underlying
complex system accessible to all users.

4.4 MODEL BUILDING

We leveraged OpenAI's GPT-3.5 model via the LangChain framework. Static few-shot examples were pre-defined, and dynamic examples were generated on the fly based on the prompt and schema. The entire prompt, including the user query, selected tables, and few-shot examples, was formatted into a template that guided the GPT model for SQL generation. Additional model behavior, such as memory management, was achieved through LangChain's ConversationBufferMemory.

4.5 TRAINING

No custom model training was done, as we used pretrained LLMs. However, extensive prompt engineering was carried out to improve model reliability and reduce hallucinations. We refined the structure of the few-shot examples and prompt templates for better performance.

4.6 TESTING AND EVALUATION

The generated SQL queries were tested by running them against a MySQL database using the provided schema. The correctness of the SQL was validated by comparing the result with the expected output. Evaluation criteria included:

●​ Query correctness

●​ Schema relevance

●​ Query execution success

●​ Response clarity

4.7 TOOLS AND TECHNOLOGIES

Tools used in this project:

● OpenAI GPT-3.5 – for natural language processing and query generation.
● Google Colab – for cloud-based experimentation.
● LangChain – for memory management and few-shot handling.
● MySQL – for storing and querying data.
● Django – for backend development.
● Python – core implementation language.
● Next.js – for frontend implementation.

4.8 DATASETS USED

The dataset used in this project is based on a comprehensive relational database schema designed for a sales and order management system. This database consists of multiple interrelated tables, each storing specific types of business data necessary for answering user queries through natural language.

The tables include:

● Customers: Contains customer-related information such as name, contact details, address, sales representative, and credit limit.
● Orders: Holds order-specific data including order date, shipment status, and the associated customer.
● OrderDetails: Stores details of each product in an order, such as quantity, unit price, and product information.
● Products: Maintains information about each product including name, vendor, stock level, and pricing.
● ProductLines: Categorizes products into different lines and includes descriptive information.
● Employees: Contains employee records, job titles, office locations, and reporting hierarchy.
● Offices: Holds details about office locations such as city, country, phone number, and address.
● Payments: Includes payment history linked to each customer along with payment amount and date.

This structured schema allows the system to perform complex SQL queries across multiple tables, supporting a wide range of user prompts. The dataset was synthetically generated to mimic real-world e-commerce data and is used as the backend for validating SQL queries and generating responses.

CHAPTER 5
IMPLEMENTATION AND RESULTS

5.1 IMPLEMENTATION

5.1.1 REQUIREMENTS

The following packages were installed using the command:

!pip install -r requirements.txt

●​ gunicorn
●​ Django>=4.2
●​ openai
●​ langchain>=0.1.13
●​ langchain-community>=0.0.26
●​ langchain-core>=0.1.25
●​ sqlalchemy>=2.0
●​ pymysql
●​ djangorestframework
●​ django-cors-headers

5.1.2 BACKEND IMPLEMENTATION CODE

langchain_core.py

import os
import re
from functools import lru_cache
from django.conf import settings

from langchain_community.chat_models import ChatOpenAI


from langchain_community.utilities import SQLDatabase
from langchain_community.chat_message_histories import ChatMessageHistory
from langchain.prompts import PromptTemplate
from langchain.schema.runnable import RunnableSequence
from langchain_core.runnables.history import RunnableWithMessageHistory
from langchain_core.output_parsers import StrOutputParser

# === Setup ===


os.environ["OPENAI_API_KEY"] = settings.OPENAI_API_KEY

# === Model ===


llm = ChatOpenAI(model_name="gpt-4", temperature=0)

# === Session Memory ===
store = {}

def get_memory(session_id):
    if session_id not in store:
        store[session_id] = ChatMessageHistory()
    return store[session_id]

# === DB Connection ===
db_user = settings.DB_USER
db_password = settings.DB_PASSWORD
db_host = settings.DB_HOST
db_name = settings.DB_NAME
db_port = settings.DB_PORT
db_url = f"mysql+pymysql://{db_user}:{db_password}@{db_host}:{db_port}/{db_name}"
db = SQLDatabase.from_uri(db_url)

# === Schema Description Caching ===
@lru_cache()
def get_schema_description():
    return db.get_table_info()

# === Few-shot SQL Examples ===
EXAMPLES = """
Examples:

Q: List all customers from France.
A:
SELECT customerName FROM customers WHERE country = 'France';

Q: Which product has the highest price?
A:
SELECT productName FROM products ORDER BY buyPrice DESC LIMIT 1;

Q: What is the total payment amount received?
A:
SELECT SUM(amount) FROM payments;

Q: How many orders were placed by each customer?
A:
SELECT customerNumber, COUNT(*) AS order_count FROM orders GROUP BY customerNumber;
"""

# === SQL Prompt Builder ===
def build_sql_prompt():
    schema = get_schema_description()
    return PromptTemplate.from_template(
        f"""You are an expert SQL assistant.
Use the schema and examples below to write a valid MySQL query for the user's question.
Only return the raw SQL query without explanation or markdown formatting.

Schema:
{schema}

{EXAMPLES}

Chat History:
{{chat_history}}

User: {{question}}
SQL:"""
    )

# === Strip markdown or extra explanation from SQL ===
def clean_sql_output(sql):
    sql = sql.strip()
    sql = re.sub(r"```sql|```", "", sql).strip()
    sql = re.split(r"\n(?=SELECT|WITH|INSERT|UPDATE|DELETE)", sql, maxsplit=1)[-1]
    return sql.strip()

# === Correct Common SQL Mistakes ===
def correct_common_sql_errors(query):
    corrections = {
        "customer_id": "customerNumber",
        "order_id": "orderNumber",
        "order_details": "orderdetails",
        "product_id": "productCode",
        "products.price": "products.buyPrice",
    }
    for wrong, right in corrections.items():
        query = query.replace(wrong, right)
    return query

# === Execute SQL ===
def execute_query(query):
    try:
        print(f"\nExecuting SQL Query:\n{query}")
        result = db.run(query)
        return result if result else {"info": "No results found."}
    except Exception as e:
        print(f"\nSQL Execution Error: {e}")
        return {"error": f"Query failed: {str(e)}"}

# === Rephrase Results ===
answer_prompt = PromptTemplate.from_template(
    """Given the user question, SQL query, and result, return a helpful, user-friendly answer.

Question: {question}
SQL Query: {query}
SQL Result: {result}
Answer:"""
)
rephrase_chain = answer_prompt | llm | StrOutputParser()

# === Final Processor ===
def process_question(question, session_id="user-1"):
    sql_prompt = build_sql_prompt()
    memory_chain = RunnableWithMessageHistory(
        RunnableSequence(sql_prompt | llm | StrOutputParser()),
        get_session_history=get_memory,
        input_messages_key="question",
        history_messages_key="chat_history",
    )

    # 1. Generate SQL
    raw_sql = memory_chain.invoke(
        {"question": question},
        config={"configurable": {"session_id": session_id}},
    )
    clean_query = clean_sql_output(raw_sql)
    clean_query = correct_common_sql_errors(clean_query)

    # 2. Execute SQL
    sql_result = execute_query(clean_query)

    # 3. Rephrase answer
    return rephrase_chain.invoke({
        "question": question,
        "query": clean_query,
        "result": str(sql_result),
    })

urls.py

from django.urls import path
from .views import AskQuestion

urlpatterns = [
    path("ask/", AskQuestion.as_view(), name="ask-question"),
]

views.py

from rest_framework.views import APIView
from rest_framework.response import Response
from .langchain_core import process_question


class AskQuestion(APIView):
    def post(self, request):
        question = request.data.get("question")
        session_id = request.data.get("session_id", "default-session")
        answer = process_question(question, session_id)
        return Response({"answer": answer})
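A hypothetical local test of the endpoint, assuming the Django development server is running on port 8000 and the app's URLs are mounted under /api/ (the same path the frontend calls):

import requests

resp = requests.post(
    "http://localhost:8000/api/ask/",
    json={"question": "List all customers from France.", "session_id": "demo"},
)
print(resp.json()["answer"])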

5.1.3 FRONTEND IMPLEMENTATION CODE

index.tsx

import { useState, useEffect, useRef } from "react";
import axios from "axios";
import {
Send,
Trash2,
Database,
Code,
MessageSquare,
ChevronRight,
Search,
AlertCircle,
AlertTriangle,
CheckCircle,
HelpCircle,
Loader,
} from "lucide-react";

type Message = {
role: "user" | "assistant";
text: string;
timestamp: Date;
status?: "success" | "error" | "pending";
};

export default function Home() {


const [messages, setMessages] = useState<Message[]>([]);
const [question, setQuestion] = useState("");
const [loading, setLoading] = useState(false);
const [showWelcome, setShowWelcome] = useState(true);
const messagesEndRef = useRef<HTMLDivElement>(null);
const textareaRef = useRef<HTMLTextAreaElement>(null);
const [isExpanded, setIsExpanded] = useState(false);
const [theme, setTheme] = useState<"light" | "dark">("light");

// Sample suggestions for the welcome screen with categories


const suggestions = [
{ text: "Show all users who signed up last week", category: "Users" },
{
text: "Find products with inventory below 10 units",
category: "Inventory",
},
{
text: "What are the top 5 most ordered products?",
category: "Analytics",
},
{ text: "Show transactions over $1000", category: "Transactions" },
{ text: "List tables in the database", category: "Schema" },
{ text: "Find customers with no orders", category: "Relationships" },
];

// Auto-resize textarea as user types


useEffect(() => {
if (textareaRef.current) {
textareaRef.current.style.height = "56px";
textareaRef.current.style.height = `${Math.min(
textareaRef.current.scrollHeight,
150
)}px`;
}
}, [question]);

// Scroll to bottom whenever messages change


useEffect(() => {
scrollToBottom();
}, [messages]);

const scrollToBottom = () => {


messagesEndRef.current?.scrollIntoView({ behavior: "smooth" });
};

const askQuestion = async (text = question) => {


if (!text.trim()) return;

const newMessage: Message = {


role: "user",
text,
timestamp: new Date(),
};

const updatedMessages = [...messages, newMessage];


setMessages(updatedMessages);
setLoading(true);
setQuestion("");
setShowWelcome(false);

try {
const res = await axios.post(
"https://nl2sql-backend-zqrg.onrender.com/api/ask/",
{
question: text,
session_id: "frontend-user",
}
);

setMessages([
...updatedMessages,
{
role: "assistant",
text: res.data.answer,
timestamp: new Date(),
status: "success",
},
]);
} catch (err) {
setMessages([
...updatedMessages,
{
role: "assistant",
text: "Something went wrong. Please try again.",
timestamp: new Date(),
status: "error",
},
]);
} finally {
setLoading(false);
}
};

const handleKeyDown = (e: React.KeyboardEvent) => {


if (e.key === "Enter" && !e.shiftKey) {
e.preventDefault();
askQuestion();
}
};

const useSuggestion = (suggestion: string) => {


askQuestion(suggestion);
};

const formatTime = (date: Date) => {


return date.toLocaleTimeString([], { hour: "2-digit", minute: "2-digit" });
};

const clearConversation = () => {


setMessages([]);
setShowWelcome(true);
};

const toggleTheme = () => {


setTheme(theme === "light" ? "dark" : "light");
};

return (
<div
className={`flex flex-col h-screen ${
theme === "dark" ? "bg-gray-900 text-white" : "bg-gray-50 text-gray-900"
}`}
>
{/* Header with better branding */}
<header
className={`${
theme === "dark"
? "bg-gray-800 border-gray-700"
: "bg-white border-gray-200"
} border-b px-6 py-4 flex items-center justify-between shadow-sm`}
>
<div className="flex items-center space-x-3">
<div className="bg-gradient-to-r from-blue-600 to-purple-600 text-white
p-2 rounded-lg">
<Database size={24} />
</div>
<h1 className="text-2xl font-bold bg-gradient-to-r from-blue-600
to-purple-600 text-transparent bg-clip-text">
SQL Assistant
</h1>
</div>
<div className="flex items-center space-x-4">
<button
onClick={toggleTheme}
className={`${
theme === "dark"
? "text-gray-300 hover:text-white"
: "text-gray-500 hover:text-gray-700"
} flex items-center gap-1 text-sm px-3 py-1 rounded-md
hover:bg-opacity-10 hover:bg-gray-500`}
>
{theme === "dark" ? "☀️ Light" : "🌙 Dark"}
</button>
<button
onClick={clearConversation}
className={`${
theme === "dark"
? "text-gray-300 hover:text-white"
: "text-gray-500 hover:text-gray-700"
} flex items-center gap-1 text-sm px-3 py-1 rounded-md
hover:bg-opacity-10 hover:bg-gray-500`}
>
<Trash2 size={16} />
Clear chat
</button>
</div>
</header>

{/* Main chat area with improved styling */}


<main
className={`flex-1 overflow-y-auto p-6 ${
theme === "dark" ? "bg-gray-900" : "bg-gray-50"
}`}
>
<div className="max-w-4xl mx-auto space-y-6">
{showWelcome ? (
<div
className={`${
theme === "dark"
? "bg-gray-800 border-gray-700"
: "bg-white border-gray-200"
} rounded-xl shadow-md p-6 mb-6 border`}
>
<h2
className={`text-2xl font-bold ${
theme === "dark" ? "text-white" : "text-gray-800"
} mb-2`}
>
Welcome to SQL Assistant!
</h2>
<p
className={`${
theme === "dark" ? "text-gray-300" : "text-gray-600"
} mb-6`}
>
Ask me anything about your database or try one of these
examples:
</p>
<div className="grid grid-cols-1 md:grid-cols-2 gap-3">
{suggestions.map((suggestion, index) => (
<button
key={index}
className={`p-3 ${
theme === "dark"
? "border-gray-700 hover:bg-gray-700 text-gray-200"
: "border-gray-200 hover:bg-gray-50 text-gray-700"

29
} border rounded-lg text-left hover:border-blue-300 transition-all flex
items-start space-x-3`}
onClick={() => useSuggestion(suggestion.text)}
>
<div
className={`mt-1 ${
theme === "dark" ? "text-blue-400" : "text-blue-500"
}`}
>
<ChevronRight size={16} />
</div>
<div>
<div
className={`text-sm font-medium ${
theme === "dark" ? "text-blue-400" : "text-blue-600"
}`}
>
{suggestion.category}
</div>
<div>{suggestion.text}</div>
</div>
</button>
))}
</div>
</div>
) : null}

{messages.length === 0 && !showWelcome ? (


<div className="flex items-center justify-center h-64">
<div
className={`text-center ${
theme === "dark" ? "text-gray-400" : "text-gray-500"
}`}
>
<MessageSquare size={48} className="mx-auto mb-4 opacity-50" />
<p>No messages yet. Ask something about your database!</p>
</div>
</div>
) : null}

{messages.map((msg, i) => (
<div
key={i}
className={`flex ${
msg.role === "user" ? "justify-end" : "justify-start"
}`}
>
<div
className={`rounded-2xl p-4 max-w-3xl whitespace-pre-wrap
shadow-sm
${
msg.role === "user"
? "bg-gradient-to-r from-blue-500 to-blue-600 text-white"
: theme === "dark"
? "bg-gray-800 border-gray-700 text-gray-100"
: "bg-white border border-gray-200 text-gray-800"
}
`}
>
<div className="flex justify-between items-center mb-2">
<div className="flex items-center">
{msg.role === "user" ? (
<span className="font-semibold flex items-center">
You
</span>
) : (
<span className="font-semibold flex items-center">
<Database size={16} className="mr-1" /> SQL Assistant
</span>
)}
</div>
<span
className={`text-xs ${
msg.role === "user"
? "opacity-75"
: theme === "dark"
? "text-gray-400"
: "text-gray-500"
}`}
>
{formatTime(msg.timestamp)}
</span>
</div>
<div
className={`${msg.role === "assistant" ? "prose" : ""} ${
theme === "dark" && msg.role === "assistant"
? "prose-invert"
: ""

31
}`}
>
{msg.text}
</div>
{msg.status === "error" && (
<div className="flex items-center text-red-500 text-sm mt-2">
<AlertTriangle size={14} className="mr-1" /> Error: Unable
to process request
</div>
)}
</div>
</div>
))}

{loading && (
<div className="flex justify-start">
<div
className={`${
theme === "dark"
? "bg-gray-800 border-gray-700 text-gray-300"
: "bg-white border-gray-200 text-gray-600"
} rounded-2xl p-4 shadow-sm border flex items-center space-x-3`}
>
<Loader size={18} className="animate-spin" />
<span>Generating response...</span>
</div>
</div>
)}

<div ref={messagesEndRef} />


</div>
</main>

{/* Footer with better input design */}


<footer
className={`${
theme === "dark"
? "bg-gray-800 border-gray-700"
: "bg-white border-gray-200"
} border-t p-4`}
>
<div className="max-w-4xl mx-auto">
<div
className={`relative ${
theme === "dark" ? "bg-gray-700" : "bg-white"
} rounded-xl border ${
isExpanded
? "border-blue-400 shadow-md"
: theme === "dark"
? "border-gray-600 shadow-sm"
: "border-gray-300 shadow-sm"
} transition-all duration-200`}
>
<textarea
ref={textareaRef}
className={`w-full p-4 pr-24 resize-none focus:outline-none rounded-xl
max-h-36 ${
theme === "dark"
? "bg-gray-700 text-white placeholder-gray-400"
: "bg-white text-gray-700 placeholder-gray-500"
}`}
placeholder="Ask about your database..."
value={question}
onChange={(e) => {
setQuestion(e.target.value);
setIsExpanded(e.target.value.length > 0);
}}
onKeyDown={handleKeyDown}
onFocus={() => setIsExpanded(true)}
onBlur={() => setIsExpanded(question.length > 0)}
rows={1}
/>
<button
onClick={() => askQuestion()}
disabled={loading || !question.trim()}
className={`absolute bottom-3 right-3 p-2 rounded-lg transition-all ${
loading || !question.trim()
? theme === "dark"
? "bg-gray-600 text-gray-400"
: "bg-gray-100 text-gray-400"
: "bg-gradient-to-r from-blue-500 to-purple-600 text-white shadow-md
hover:shadow-lg"
}`}
>
<Send size={20} />
</button>
</div>
<div className="flex justify-between items-center mt-2">

33
<p
className={`text-xs ${
theme === "dark" ? "text-gray-400" : "text-gray-500"
}`}
>
Press Enter to send • Shift+Enter for new line
</p>
<div className="flex items-center">
<span
className={`text-xs mr-2 ${
theme === "dark" ? "text-gray-400" : "text-gray-500"
}`}
>
Powered by AI
</span>
<Code
size={14}
className={theme === "dark" ? "text-gray-400" : "text-gray-500"}
/>
</div>
</div>
</div>
</footer>
</div>
);
}

5.2 OUTPUT

[Output screenshots of the chatbot interface and sample query results appear here in the original report.]
CHAPTER 6
CONCLUSION

6.1 CONCLUSION

The Natural Language to SQL (NL2SQL) query generation system developed in this project presents a practical and intelligent solution for bridging the gap between natural language interfaces and relational database systems. Aimed at empowering non-technical users to retrieve data without writing SQL, the system integrates advanced techniques such as dynamic few-shot learning, semantic table selection using vector embeddings, and prompt optimization for accurate query formulation.

By leveraging the capabilities of GPT-4 and LangChain, the system dynamically interprets user intent, selects relevant database tables based on semantic similarity, and generates syntactically correct SQL queries. Furthermore, the inclusion of LangChain's memory module ensures contextual continuity in multi-turn conversations, enabling the system to respond intelligently to follow-up queries. This enhances usability and simulates natural, human-like interaction with structured databases.

The overall architecture demonstrates high adaptability and scalability, making it suitable for enterprise use cases where fast, self-service data access is required. Through this implementation, the project successfully reduces dependency on database administrators, speeds up data retrieval, and improves decision-making processes. With further enhancement, such as integrating advanced query validation or expanding schema generalization, the system holds strong potential for deployment in real-world analytical platforms and AI-driven data interfaces.

CHAPTER 7
REFERENCES

[1] Biderman, S., Schoelkopf, H., Anthony, Q.G., Bradley, H., O’Brien, K.,
Hallahan, E., Khan, M.A., Purohit, S., Prashanth, U.S.V.S.N., Raff, E., et al., 2023,
"Pythia: A Suite for Analyzing Large Language Models Across Training and
Scaling", International Conference on Machine Learning, PMLR, pp. 2397–2430

[2] Bubeck, S., Chandrasekaran, V., Eldan, R., Gehrke, J., Horvitz, E., Kamar,
E., Lee, P., Lee, Y.T., Li, Y., Lundberg, S., et al., 2023, "Sparks of Artificial General
Intelligence: Early Experiments with GPT-4", arXiv preprint, arXiv:2303.12712

[3] Dahl, D.A., Bates, M., Brown, M.K., Fisher, W.M., Hunicke-Smith, K., Pallett,
D.S., Pao, C., Rudnicky, A., Shriberg, E., 1994, "Expanding the Scope of the ATIS
Task: The ATIS-3 Corpus", Human Language Technology Workshop Proceedings,
Plainsboro, New Jersey, March 8–11, 1994

[4] Beeching, E., Fourrier, C., Habib, N., Han, S., Lambert, N., Rajani, N., Sanseviero, O., Tunstall, L., Wolf, T., 2023, "Open LLM Leaderboard", Hugging Face

[5] Geng, X., Gudibande, A., Liu, H., Wallace, E., Abbeel, P., Levine, S., Song,
D., 2023, "Koala: A Dialogue Model for Academic Research", Blog Post

[6] Hemphill, C.T., Godfrey, J.J., Doddington, G.R., 1990, "The ATIS Spoken Language Systems Pilot Corpus", Speech and Natural Language Workshop Proceedings, Hidden Valley, Pennsylvania, June 24–27, 1990

[7] Hoffmann, J., Borgeaud, S., Mensch, A., Buchatskaya, E., Cai, T.,
Rutherford, E., de Las Casas, D., Hendricks, L.A., Welbl, J., Clark, A., et al., 2022,
"Training Compute-Optimal Large Language Models", arXiv preprint,
arXiv:2203.15556

[8] Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen,
W., 2021, "LoRA: Low-Rank Adaptation of Large Language Models", arXiv
preprint, arXiv:2106.09685

[9] Katsogiannis-Meimarakis, G., Koutrika, G., 2023, "A Survey on Deep
Learning Approaches for Text-to-SQL", The VLDB Journal, pp. 1–32

[10] Kocetkov, D., Li, R., Ben Allal, L., Li, J., Mou, C., Muñoz Ferrandis, C., Jernite, Y., Mitchell, M., Hughes, S., Wolf, T., Bahdanau, D., von Werra, L., de Vries, H., 2022, "The Stack: 3 TB of Permissively Licensed Source Code", Preprint

[11] Köpf, A., Kilcher, Y., von Rütte, D., Anagnostidis, S., Tam, Z.R., Stevens, K.,
Barhoum, A., Duc, N.M., Stanley, O., Nagyfi, R., et al., 2023, "OpenAssistant
Conversations – Democratizing Large Language Model Alignment", arXiv preprint,
arXiv:2304.07327

[12] Koschke, R., Falke, R., Frenzel, P., 2006, "Clone Detection Using Abstract
Syntax Suffix Trees", 13th Working Conference on Reverse Engineering, IEEE,
pp. 253–262

[13] Numbers Station Labs, 2023, "NSText2SQL: An Open Source Text-to-SQL


Dataset for Foundation Model Training", Numbers Station Labs

[14] Li, H., Zhang, J., Li, C., Chen, H., 2023, "RESDSQL: Decoupling Schema
Linking and Skeleton Parsing for Text-to-SQL", AAAI Conference on Artificial
Intelligence, Vol. 37, pp. 13067–13075

[15] Li, J., Hui, B., Cheng, R., Qin, B., Ma, C., Huo, N., Huang, F., Du, W., Si, L.,
Li, Y., 2023, "Graphix-T5: Mixing Pre-Trained Transformers with Graph-Aware
Layers for Text-to-SQL Parsing", arXiv preprint, arXiv:2301.07507
