Query GPT
Link: https://www.uber.com/en-IN/blog/query-gpt/?uclick_id=6cfc9a34-aa3e-4140-9e8e-34e867b80b2b
Source: Uber
Date: October 22, 2024
Time: 15
authoring queries requires a lot of time, split between searching for relevant
datasets in the data dictionary and then authoring the query in the editor
Architecture
original architecture
relied on simple RAG to fetch (retrieve relevant data from a database) the
relevant samples to include in the query generation call to the LLM
(few-shot prompting) → take the prompt, vectorize it, and do a similarity
search on SQL samples and schemas to fetch 3 relevant tables and 7 relevant
SQL samples (sketched after this list)
SQL sample queries → provide the LLM guidance on how to use the
table schemas provided
schema samples → provide the LLM information about the columns that exist
on those tables
to help the LLM understand internal lingo and work with specific datasets,
some custom instructions were added to the LLM call
worked well for a small set of schemas and SQL samples, but as more
tables and SQL samples were added, accuracy declined
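a minimal sketch of that retrieval step, assuming a toy hash embedding in place of a real embedding model — the `embed` helper and corpus contents are illustrative, only the top-3/top-7 counts come from the note:

```python
# Sketch of the original RAG flow: embed the prompt, run cosine similarity
# against pre-embedded schemas and SQL samples, keep the top 3 tables and
# top 7 samples for the few-shot generation prompt.
import numpy as np

def embed(text: str, dim: int = 64) -> np.ndarray:
    # Toy bag-of-words hash embedding; a real system would call an
    # embedding model here.
    vec = np.zeros(dim)
    for token in text.lower().split():
        vec[hash(token) % dim] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

def top_k(prompt_vec: np.ndarray, corpus: list[tuple[str, np.ndarray]], k: int) -> list[str]:
    # Vectors are unit-normalized, so the dot product is cosine similarity.
    scored = sorted(corpus, key=lambda item: float(prompt_vec @ item[1]), reverse=True)
    return [text for text, _ in scored[:k]]

schemas = [(s, embed(s)) for s in ["CREATE TABLE trips (...)", "CREATE TABLE drivers (...)"]]
samples = [(q, embed(q)) for q in ["SELECT count(*) FROM trips WHERE city = 'SF'"]]

prompt_vec = embed("How many trips were completed in Seattle yesterday?")
relevant_tables = top_k(prompt_vec, schemas, k=3)   # 3 relevant tables
relevant_samples = top_k(prompt_vec, samples, k=7)  # 7 relevant SQL samples
```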
better RAG
a simple similarity search of the prompt against schema samples and SQL
queries doesn't return relevant results as the corpus grows
Current Design
workspaces → curated sets of SQL samples and tables scoped to a business domain
intent agent
incoming prompt first runs through an intent agent → map user question to
one or more business domains/workspaces (and by extension a set of SQL
samples and tables mapped to the domain)
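a hedged sketch of the intent-agent idea — an LLM used as a classifier over a fixed workspace list; `WORKSPACES`, the prompt wording, and the `call_llm` stub are illustrative stand-ins, not Uber's actual names or API:

```python
# Intent agent sketch: classify the user question into business domains
# ("workspaces"), which in turn select the SQL samples and tables passed
# to query generation.
WORKSPACES = ["Mobility", "Ads", "Core Services"]  # hypothetical domain list

INTENT_PROMPT = """You are a classifier. Given a user question about company data,
return a comma-separated list of matching business domains from:
{workspaces}

Question: {question}
Domains:"""

def call_llm(prompt: str) -> str:
    # Stub standing in for a real chat-completion call.
    return "Mobility"

def classify_intent(question: str) -> list[str]:
    raw = call_llm(INTENT_PROMPT.format(workspaces=", ".join(WORKSPACES), question=question))
    # Keep only answers that are actually valid workspaces.
    return [d.strip() for d in raw.split(",") if d.strip() in WORKSPACES]

print(classify_intent("How many trips were completed in Seattle yesterday?"))
```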
table agent
intermittent token-size issues → some requests included one or more
tables whose schemas consumed a large number of tokens
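the note doesn't say how the token issue was handled; one plausible mitigation is pruning columns from oversized schemas before prompting, sketched below — the heuristic tokenizer, ranking, and all names are assumptions:

```python
# Column-pruning sketch: estimate each schema's token cost and, for
# oversized tables, keep only the columns that appear related to the
# question, up to a token budget.
def rough_tokens(text: str) -> int:
    # Crude ~4 chars/token heuristic; a real system would use the
    # model's tokenizer.
    return max(1, len(text) // 4)

def prune_columns(question: str, columns: dict[str, str], budget: int) -> dict[str, str]:
    # Prefer columns whose names appear in the question, then fill the
    # budget greedily.
    q = question.lower()
    ranked = sorted(columns.items(), key=lambda kv: kv[0].lower() in q, reverse=True)
    kept, used = {}, 0
    for name, col_type in ranked:
        cost = rough_tokens(f"{name} {col_type}")
        if used + cost > budget:
            break
        kept[name] = col_type
        used += cost
    return kept

wide_table = {"trip_id": "BIGINT", "city": "VARCHAR", "status": "VARCHAR",
              "fare_usd": "DOUBLE", "driver_notes": "VARCHAR"}
print(prune_columns("completed trips by city", wide_table, budget=8))
```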
Evaluation
to track incremental improvements in performance → a standardized evaluation
procedure is needed
a set of real questions from logs, with manually verified correct intent,
required schemas, and golden SQL
evaluation procedure
table overlap → are the tables identified via Search + Table Agent correct?
run has output → does query execution return >0 records (to check for
hallucinations such as “Finished” instead of “Completed”)
also aggregate accuracy and latency metrics for each evaluation run to
track performance over time
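a minimal sketch of the two per-question signals, using an illustrative golden record and sqlite3 as a stand-in query engine (schema, data, and queries are made up):

```python
# Evaluation sketch: (1) overlap between the tables the agents picked and
# the golden tables, and (2) whether the generated query returns any rows.
import sqlite3

golden = {
    "question": "Which trips were completed?",
    "tables": {"trips"},
    "sql": "SELECT * FROM trips WHERE status = 'Completed'",
}

def table_overlap(predicted: set[str], expected: set[str]) -> float:
    # Fraction of the golden tables that Search + Table Agent identified.
    return len(predicted & expected) / len(expected) if expected else 0.0

def run_has_output(conn: sqlite3.Connection, sql: str) -> bool:
    # Execute the generated query; zero rows can signal a hallucinated
    # literal (e.g., 'Finished' where the data actually says 'Completed').
    try:
        return len(conn.execute(sql).fetchall()) > 0
    except sqlite3.Error:
        return False

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trips (trip_id INTEGER, status TEXT)")
conn.execute("INSERT INTO trips VALUES (1, 'Completed')")

generated_sql = "SELECT * FROM trips WHERE status = 'Finished'"
print(table_overlap({"trips"}, golden["tables"]))  # 1.0 -> correct tables found
print(run_has_output(conn, golden["sql"]))         # True -> golden query returns rows
print(run_has_output(conn, generated_sql))         # False -> likely hallucination
```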
limitations
hard to identify error patterns over longer time periods that could be
addressed by specific feature improvements
Learnings
LLMs are excellent classifiers (intermediate agents)
hallucinations (LLMs might generate queries that reference tables or columns
that don't exist)
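a hedged sketch of one way to catch such hallucinations before execution — validate identifiers in the generated SQL against the known catalog; the regex extraction and `SCHEMA` are illustrative, and a real check would use a proper SQL parser:

```python
# Hallucination guard sketch: flag any identifier in the generated SQL
# that is neither a known table/column nor a SQL keyword.
import re

# Illustrative catalog of known tables -> columns; not Uber's actual schema.
SCHEMA = {"trips": {"trip_id", "city", "status"}}

def find_unknown_identifiers(sql: str) -> set[str]:
    # Strip string literals so values like 'Completed' aren't flagged.
    stripped = re.sub(r"'[^']*'", "", sql)
    known = set(SCHEMA) | {c for cols in SCHEMA.values() for c in cols}
    keywords = {"select", "from", "where", "and", "or", "count", "group",
                "by", "as", "on", "join", "order", "limit"}
    tokens = set(re.findall(r"[a-zA-Z_][a-zA-Z_0-9]*", stripped.lower()))
    return tokens - known - keywords

sql = "SELECT count(*) FROM trips WHERE trip_status = 'Completed'"
unknown = find_unknown_identifiers(sql)
if unknown:
    # e.g. {'trip_status'}: the column doesn't exist, so regenerate or repair
    print("possible hallucinated identifiers:", unknown)
```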