
Building NL2SQL with Schema Linking

This guide walks you through two complete approaches for building a Natural Language to SQL system with schema linking.

Goal

Translate natural language questions to SQL by understanding schema structure using:
1. NER-Based Schema Linking with a BERT classifier
2. Schema-Aware Prompting using T5

Part 1: Custom NER Model for Schema Linking + SQL Decoder
What:

Train a BERT-based NER model to label parts of the natural language query
as table names, column names, or values.

Why:

NL2SQL needs to know what parts of the question refer to schema elements. NER helps identify these tokens explicitly before decoding SQL.

⚙ How:

Using token classification with HuggingFace Transformers.

Step 1: Prepare Labeled Data

Input: "List employees hired after 2020 in the HR department"

NER Output (labels): [("employees", "B-TABLE"), ("hired", "B-COL"), ("2020", "B-VALUE"), ("HR", "B-VALUE"), ("department", "B-COL")]
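Word-level labels like these must be aligned to BERT's wordpiece tokens before training, since one word can split into several pieces. A minimal sketch of that alignment, assuming the four-tag scheme above; the `word_ids` list below is hand-written for illustration, not real tokenizer output (in practice it comes from `tokenizer(..., is_split_into_words=True).word_ids()`):

```python
# Hypothetical label-to-id mapping for the guide's four tags
LABEL2ID = {"O": 0, "B-TABLE": 1, "B-COL": 2, "B-VALUE": 3}

def align_labels(word_ids, word_labels):
    """Align word-level labels to wordpiece tokens.

    word_ids: per-token word index; None marks special tokens like [CLS]/[SEP].
    Special tokens and sub-word continuations get -100 so the loss ignores them.
    """
    aligned, prev = [], None
    for wid in word_ids:
        if wid is None:            # [CLS] / [SEP] / padding
            aligned.append(-100)
        elif wid != prev:          # first piece of a word keeps its label
            aligned.append(LABEL2ID[word_labels[wid]])
        else:                      # continuation pieces are ignored
            aligned.append(-100)
        prev = wid
    return aligned

# Illustrative word_ids for "[CLS] List em ##ploy ##ees hired [SEP]"
word_ids = [None, 0, 1, 1, 1, 2, None]
labels = ["O", "B-TABLE", "B-COL"]
print(align_labels(word_ids, labels))  # [-100, 0, 1, -100, -100, 2, -100]
```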

Step 2: Train BERT for Token Classification

```python
import torch
from torch.utils.data import Dataset
from transformers import (BertTokenizerFast, BertForTokenClassification,
                          Trainer, TrainingArguments)

# Tokenization
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

# Dataset class wrapping tokenized encodings and per-token labels
class NL2SQLNERDataset(Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

# Training (num_labels=4: O, B-TABLE, B-COL, B-VALUE)
model = BertForTokenClassification.from_pretrained("bert-base-uncased", num_labels=4)
training_args = TrainingArguments(output_dir="./ner_model",
                                  per_device_train_batch_size=8,
                                  num_train_epochs=5)

# train_dataset: an NL2SQLNERDataset built from your labeled examples
trainer = Trainer(model=model, args=training_args,
                  train_dataset=train_dataset, tokenizer=tokenizer)
trainer.train()
```

Step 3: Inference and Mapping

NER Output to Entity Mapping:

```json
{
  "table_candidates": ["employees"],
  "column_candidates": ["hire_date", "department"],
  "value_candidates": ["2020", "HR"]
}
```
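Grouping predicted tags into that candidate dict can be done with a small helper. A sketch, assuming the guide's tag scheme; `group_entities` is a hypothetical name, and note that mapping a surface token like "hired" to the actual column `hire_date` still requires a separate schema-matching step (e.g. string similarity against column names):

```python
def group_entities(tokens, tags):
    """Group (token, tag) pairs into the candidate dict used for prompt building.

    Tags follow the guide's scheme: B-TABLE, B-COL, B-VALUE, O.
    """
    out = {"table_candidates": [], "column_candidates": [], "value_candidates": []}
    key = {"B-TABLE": "table_candidates",
           "B-COL": "column_candidates",
           "B-VALUE": "value_candidates"}
    for tok, tag in zip(tokens, tags):
        if tag in key:
            out[key[tag]].append(tok)
    return out

tokens = ["List", "employees", "hired", "after", "2020", "in", "the", "HR", "department"]
tags   = ["O", "B-TABLE", "B-COL", "O", "B-VALUE", "O", "O", "B-VALUE", "B-COL"]
print(group_entities(tokens, tags))
```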

Step 4: SQL Prompt Construction + Decoding

```python
input_prompt = '''translate to SQL:
Question: List employees hired after 2020 in the HR department
Entities: Table=employees; Columns=hire_date, department; Values=2020, HR'''
```
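Constructing this prompt from the Step 3 entity mapping can be automated. A minimal sketch, where `build_prompt` is a hypothetical helper (not part of any library):

```python
def build_prompt(question, entities):
    """Serialize the entity-mapping dict into the guide's prompt format."""
    ents = "Table={}; Columns={}; Values={}".format(
        ", ".join(entities["table_candidates"]),
        ", ".join(entities["column_candidates"]),
        ", ".join(entities["value_candidates"]))
    return f"translate to SQL: Question: {question} Entities: {ents}"

prompt = build_prompt(
    "List employees hired after 2020 in the HR department",
    {"table_candidates": ["employees"],
     "column_candidates": ["hire_date", "department"],
     "value_candidates": ["2020", "HR"]})
print(prompt)
```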

Feed this to T5 for generation:

```python
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

inputs = tokenizer(input_prompt, return_tensors="pt")
outputs = model.generate(**inputs)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Part 2: Schema-Aware Prompting (T5 or PICARD)

What:

In this approach, we don't train a separate NER model. Instead, we provide schema context inline as part of the prompt.

Why:

This lets the model learn to align question tokens to schema elements using
attention mechanisms.

Step 1: Flatten Schema

```text
employees(id, name, department_id, hire_date); departments(id, name)
```
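Producing this flat string from a schema description is a one-liner. A sketch assuming the schema is held as a `{table: [columns]}` dict (`flatten_schema` is a hypothetical helper):

```python
def flatten_schema(schema):
    """Serialize {table_name: [column, ...]} into the inline prompt format."""
    return "; ".join(f"{t}({', '.join(cols)})" for t, cols in schema.items())

schema = {"employees": ["id", "name", "department_id", "hire_date"],
          "departments": ["id", "name"]}
print(flatten_schema(schema))
# employees(id, name, department_id, hire_date); departments(id, name)
```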

Step 2: Create Prompt

```text
translate to SQL: Question: List employees hired after 2020 in the HR department. Schema: employees(id, name, department_id, hire_date); departments(id, name)
```

Step 3: Tokenize and Generate SQL

```python
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

input_prompt = '''translate to SQL:
Question: List employees hired after 2020 in the HR department.
Schema: employees(id, name, department_id, hire_date); departments(id, name)'''

inputs = tokenizer(input_prompt, return_tensors="pt")
outputs = model.generate(**inputs)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Output:

```sql
SELECT * FROM employees WHERE hire_date > '2020-01-01' AND department = 'HR'
```
Summary

| Feature                   | NER + Decoder | Schema-Aware Prompt |
|---------------------------|---------------|---------------------|
| Explicit Schema Linking   | Yes           | No (learned)        |
| Modular Architecture      | Yes           | ⚠ Harder to trace   |
| Performance on small data | High          | Needs fine-tuning   |
| Pretraining Required      | No            | Yes (T5 pretrained) |

Next Steps
• Integrate grammar-constrained decoding using PICARD
• Evaluate using Spider dataset metrics: exact match, exec accuracy
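As a starting point for execution accuracy, gold and predicted SQL can be run against a SQLite copy of the database and their result sets compared. A simplified, order-insensitive sketch using a toy table (Spider's official evaluator is stricter; `execution_match` is a hypothetical helper):

```python
import sqlite3

def execution_match(db, gold_sql, pred_sql):
    """Execution accuracy: do gold and predicted SQL return the same rows?

    Order-insensitive comparison; an unexecutable prediction counts as a miss.
    """
    cur = db.cursor()
    try:
        pred = cur.execute(pred_sql).fetchall()
    except sqlite3.Error:
        return False
    gold = cur.execute(gold_sql).fetchall()
    return sorted(map(repr, pred)) == sorted(map(repr, gold))

# Toy database mirroring the guide's running example
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE employees (id INT, name TEXT, hire_date TEXT, department TEXT)")
db.executemany("INSERT INTO employees VALUES (?,?,?,?)",
               [(1, "Ana", "2021-03-01", "HR"), (2, "Bo", "2019-05-01", "HR")])

print(execution_match(
    db,
    "SELECT name FROM employees WHERE hire_date > '2020-01-01' AND department = 'HR'",
    "SELECT name FROM employees WHERE department = 'HR' AND hire_date > '2020-01-01'"))
# True
```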
