
Building NL2SQL with Schema Linking

This guide walks you through two complete approaches for building a Natural Language to SQL system with schema linking.

Goal

Translate natural language questions to SQL by understanding schema structure using:
1. NER-Based Schema Linking with a BERT classifier
2. Schema-Aware Prompting using T5

Part 1: Custom NER Model for Schema Linking + SQL Decoder
What:

Train a BERT-based NER model to label parts of the natural language query
as table names, column names, or values.

Why:

NL2SQL needs to know what parts of the question refer to schema elements. NER helps identify these tokens explicitly before decoding SQL.

⚙ How:

Using token classification with HuggingFace Transformers.

Step 1: Prepare Labeled Data

Input: "List employees hired after 2020 in the HR department"

NER Output (labels): [("employees", "B-TABLE"), ("hired", "B-COL"), ("2020", "B-VALUE"), ("HR", "B-VALUE"), ("department", "B-COL")]
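Word-level labels like these must be aligned to BERT's wordpiece tokens before training, since one word can split into several pieces. A minimal sketch of that alignment, assuming the four-tag scheme above; the `word_ids` list below is hand-written for illustration, not real tokenizer output (in practice it comes from `tokenizer(..., is_split_into_words=True).word_ids()`):

```python
# Hypothetical label-to-id mapping for the guide's four tags
LABEL2ID = {"O": 0, "B-TABLE": 1, "B-COL": 2, "B-VALUE": 3}

def align_labels(word_ids, word_labels):
    """Align word-level labels to wordpiece tokens.

    word_ids: per-token word index; None marks special tokens like [CLS]/[SEP].
    Special tokens and sub-word continuations get -100 so the loss ignores them.
    """
    aligned, prev = [], None
    for wid in word_ids:
        if wid is None:            # [CLS] / [SEP] / padding
            aligned.append(-100)
        elif wid != prev:          # first piece of a word keeps its label
            aligned.append(LABEL2ID[word_labels[wid]])
        else:                      # continuation pieces are ignored
            aligned.append(-100)
        prev = wid
    return aligned

# Illustrative word_ids for "[CLS] List em ##ploy ##ees hired [SEP]"
word_ids = [None, 0, 1, 1, 1, 2, None]
labels = ["O", "B-TABLE", "B-COL"]
print(align_labels(word_ids, labels))  # [-100, 0, 1, -100, -100, 2, -100]
```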

Step 2: Train BERT for Token Classification

```python
import torch
from torch.utils.data import Dataset
from transformers import (BertTokenizerFast, BertForTokenClassification,
                          Trainer, TrainingArguments)

# Tokenization
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

# Dataset class wrapping tokenized encodings and per-token labels
class NL2SQLNERDataset(Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

# Training (num_labels=4: O, B-TABLE, B-COL, B-VALUE)
model = BertForTokenClassification.from_pretrained("bert-base-uncased", num_labels=4)
training_args = TrainingArguments(output_dir="./ner_model",
                                  per_device_train_batch_size=8,
                                  num_train_epochs=5)

# train_dataset: an NL2SQLNERDataset built from your labeled examples
trainer = Trainer(model=model, args=training_args,
                  train_dataset=train_dataset, tokenizer=tokenizer)
trainer.train()
```

Step 3: Inference and Mapping

NER Output to Entity Mapping:

```json
{
  "table_candidates": ["employees"],
  "column_candidates": ["hire_date", "department"],
  "value_candidates": ["2020", "HR"]
}
```
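Grouping predicted tags into that candidate dict can be done with a small helper. A sketch, assuming the guide's tag scheme; `group_entities` is a hypothetical name, and note that mapping a surface token like "hired" to the actual column `hire_date` still requires a separate schema-matching step (e.g. string similarity against column names):

```python
def group_entities(tokens, tags):
    """Group (token, tag) pairs into the candidate dict used for prompt building.

    Tags follow the guide's scheme: B-TABLE, B-COL, B-VALUE, O.
    """
    out = {"table_candidates": [], "column_candidates": [], "value_candidates": []}
    key = {"B-TABLE": "table_candidates",
           "B-COL": "column_candidates",
           "B-VALUE": "value_candidates"}
    for tok, tag in zip(tokens, tags):
        if tag in key:
            out[key[tag]].append(tok)
    return out

tokens = ["List", "employees", "hired", "after", "2020", "in", "the", "HR", "department"]
tags   = ["O", "B-TABLE", "B-COL", "O", "B-VALUE", "O", "O", "B-VALUE", "B-COL"]
print(group_entities(tokens, tags))
```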

Step 4: SQL Prompt Construction + Decoding

```python
input_prompt = '''translate to SQL:
Question: List employees hired after 2020 in the HR department
Entities: Table=employees; Columns=hire_date, department; Values=2020, HR'''
```
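Constructing this prompt from the Step 3 entity mapping can be automated. A minimal sketch, where `build_prompt` is a hypothetical helper (not part of any library):

```python
def build_prompt(question, entities):
    """Serialize the entity-mapping dict into the guide's prompt format."""
    ents = "Table={}; Columns={}; Values={}".format(
        ", ".join(entities["table_candidates"]),
        ", ".join(entities["column_candidates"]),
        ", ".join(entities["value_candidates"]))
    return f"translate to SQL: Question: {question} Entities: {ents}"

prompt = build_prompt(
    "List employees hired after 2020 in the HR department",
    {"table_candidates": ["employees"],
     "column_candidates": ["hire_date", "department"],
     "value_candidates": ["2020", "HR"]})
print(prompt)
```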

Feed this to T5 for generation:

```python
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

inputs = tokenizer(input_prompt, return_tensors="pt")
outputs = model.generate(**inputs)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Part 2: Schema-Aware Prompting (T5 or PICARD)

What:

In this approach, we don't train a separate NER model. Instead, we provide schema context inline as part of the prompt.

Why:

This lets the model learn to align question tokens to schema elements using
attention mechanisms.

Step 1: Flatten Schema

```text
employees(id, name, department_id, hire_date); departments(id, name)
```
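Producing this flat string from a schema description is a one-liner. A sketch assuming the schema is held as a `{table: [columns]}` dict (`flatten_schema` is a hypothetical helper):

```python
def flatten_schema(schema):
    """Serialize {table_name: [column, ...]} into the inline prompt format."""
    return "; ".join(f"{t}({', '.join(cols)})" for t, cols in schema.items())

schema = {"employees": ["id", "name", "department_id", "hire_date"],
          "departments": ["id", "name"]}
print(flatten_schema(schema))
# employees(id, name, department_id, hire_date); departments(id, name)
```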

Step 2: Create Prompt

```text
translate to SQL: Question: List employees hired after 2020 in the HR department. Schema: employees(id, name, department_id, hire_date); departments(id, name)
```

Step 3: Tokenize and Generate SQL

```python
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

input_prompt = '''translate to SQL:
Question: List employees hired after 2020 in the HR department.
Schema: employees(id, name, department_id, hire_date); departments(id, name)'''

inputs = tokenizer(input_prompt, return_tensors="pt")
outputs = model.generate(**inputs)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Output:

```sql
SELECT * FROM employees WHERE hire_date > '2020-01-01' AND department = 'HR'
```
Summary

| Feature                   | NER + Decoder | Schema-Aware Prompt |
|---------------------------|---------------|---------------------|
| Explicit Schema Linking   | Yes           | No (learned)        |
| Modular Architecture      | Yes           | ⚠ Harder to trace   |
| Performance on small data | High          | Needs fine-tuning   |
| Pretraining Required      | No            | Yes (T5 pretrained) |

Next Steps
• Integrate grammar-constrained decoding using PICARD
• Evaluate using Spider dataset metrics: exact match, exec accuracy
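As a starting point for execution accuracy, gold and predicted SQL can be run against a SQLite copy of the database and their result sets compared. A simplified, order-insensitive sketch using a toy table (Spider's official evaluator is stricter; `execution_match` is a hypothetical helper):

```python
import sqlite3

def execution_match(db, gold_sql, pred_sql):
    """Execution accuracy: do gold and predicted SQL return the same rows?

    Order-insensitive comparison; an unexecutable prediction counts as a miss.
    """
    cur = db.cursor()
    try:
        pred = cur.execute(pred_sql).fetchall()
    except sqlite3.Error:
        return False
    gold = cur.execute(gold_sql).fetchall()
    return sorted(map(repr, pred)) == sorted(map(repr, gold))

# Toy database mirroring the guide's running example
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE employees (id INT, name TEXT, hire_date TEXT, department TEXT)")
db.executemany("INSERT INTO employees VALUES (?,?,?,?)",
               [(1, "Ana", "2021-03-01", "HR"), (2, "Bo", "2019-05-01", "HR")])

print(execution_match(
    db,
    "SELECT name FROM employees WHERE hire_date > '2020-01-01' AND department = 'HR'",
    "SELECT name FROM employees WHERE department = 'HR' AND hire_date > '2020-01-01'"))
# True
```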
