Benchmarking PromptQL - Hasura PromptQL
In this post, we compare PromptQL with Claude.ai + MCP, connected to the same
set of tools.
3. Observations
We run each prompt 5 times and record the accuracy of the result. Each result is a list of data points. In the heat maps below, each column represents a particular run and each row represents a data point in the output.
Sample heatmap: each row is a data point in the output (Data #1 through Data #5) and each column is one of the 5 runs.
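As an illustration of how such a heatmap can be tallied (a sketch only, not the benchmark harness, and with hypothetical values), each cell records whether a given data point was correct in a given run:

```python
import numpy as np

# Hypothetical correctness matrix: rows are data points, columns are the 5 runs.
# 1 = the data point was correct in that run, 0 = it was not.
results = np.array([
    [1, 1, 1, 1, 1],  # Data #1
    [1, 1, 0, 1, 1],  # Data #2
    [1, 0, 1, 1, 1],  # Data #3
    [0, 1, 1, 1, 0],  # Data #4
    [1, 1, 1, 0, 1],  # Data #5
])

per_run_accuracy = results.mean(axis=0)       # accuracy of each run (column)
per_point_consistency = results.mean(axis=1)  # how often each data point is correct (row)
print(per_run_accuracy)
print(per_point_consistency)
```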
Give me a sorted list of the top 5 support tickets I should prioritize amongst the 30 most recent open tickets. For each ticket, include:
1. the project_id
2. the plan of the project
3. the criticality of the issue
4. the monthly average revenue for the project
5. list of recent ticket_ids for that project from the last 6 months
Determine the criticality of each ticket's issue by looking at the ticket and its comments. These are the categories of issue criticality, in descending order of importance:
1. Production downtime
2. Instability in production
3. Performance degradation in production
4. Bug
5. Feature request
6. How-to
Now, prioritize the tickets according to the following rules. Assign the highest priority
to production issues. Within this group, prioritize advanced plan > base plan > free
plan. Next, take non-production issues, and within this group, order by monthly
average revenue.
In case there are ties within the 2 groups above, break the tie using:
Classify each of these according to whether the delay was caused by the customer or by the agent.
Return the list of tickets sorted by ticket id, along with the comment id of
the slowest comment and the classification result.
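The computational core of the first prompt above, grouping production issues ahead of everything else and then ordering by plan and revenue, translates into a deterministic sort. Here is a minimal sketch; the field names (criticality, plan, monthly_revenue) are assumptions made for illustration, not the benchmark's schema:

```python
# Sketch of the prioritization rules from the first prompt.
PLAN_RANK = {"advanced": 0, "base": 1, "free": 2}
PRODUCTION_ISSUES = {
    "Production downtime",
    "Instability in production",
    "Performance degradation in production",
}

def priority_key(ticket: dict):
    if ticket["criticality"] in PRODUCTION_ISSUES:
        # Group 0: production issues, ordered advanced > base > free plan.
        return (0, PLAN_RANK[ticket["plan"]])
    # Group 1: everything else, ordered by monthly average revenue (descending).
    return (1, -ticket["monthly_revenue"])

def top_tickets(tickets: list[dict], n: int = 5) -> list[dict]:
    return sorted(tickets, key=priority_key)[:n]
```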
In-context approaches cannot separate the creation of a query plan from its execution. This results in a lack of repeatability. As the amount of data and information in the context window grows, accuracy deteriorates at each step of the query plan, which further reduces repeatability.
Ignores the rule that advanced plan > free plan in priority.
Ignores the rule that advanced plan > base plan in priority.
PromptQL's query plans are composed of steps that are either computational or cognitive. This ensures intelligence is applied precisely, only within the isolated context of a step (e.g., "extract project_id from the ticket description").
5. Evaluation Methodology
To measure the usefulness of introducing an AI system into business processes, we calculate a usefulness score that is a composite of 2 key measures:
1. Accuracy: How correct the result is for a given prompt.
Computational accuracy: The correctness of the result on a task that was possible to solve programmatically.
Cognitive accuracy: The correctness of the result on a task needing AI.
2. Repeatability: How often the result is the same across multiple runs of the same prompt.
Computational repeatability: The repeatability of the computational part of the result.
Cognitive repeatability: The repeatability of the AI part of the result.
5.1 Computational vs Cognitive
When working with business data and systems, computational accuracy and repeatability are usually more important than "cognitive" accuracy and repeatability. That is to say, there is higher sensitivity and mistrust when an AI system fails on an objective task than when it fails on a subjective task.
For example: Imagine working with a colleague or an intern who is a "data access
agent". This person can interact with data for you and take actions based on your
instructions. You would provide a lower performance rating to this agent if they
missed data or calculated something incorrectly. However, you would be more
forgiving if they made a judgement call on a subjective task.
Usefulness = (CompAcc^w × CogAcc^(1−w)) × (CompRep^w × CogRep^(1−w))
Where:
CompAcc (Computational Accuracy)
A user-assigned score between 0 and 1 indicating how accurately the system handles computational (algorithmic) tasks.
CogAcc (Cognitive Accuracy)
A user-assigned score between 0 and 1 indicating how accurately the system handles cognitive (subjective) tasks.
CompRep (Computational Repeatability)
This transforms the raw computational repeatability measure (C, ranging from 0 to 1) into a value between 0 and 1. A higher αC means that any deviation from perfect computational repeatability (C=1) is penalized more severely. In plain English: this score heavily rewards systems that deliver the same computational results every time.
CogRep (Cognitive Repeatability)
Similar to CompRep, this takes a raw cognitive repeatability measure (K, 0 to 1) and maps it
to a 0-to-1 scale. Here, αK controls how harshly deviations from perfect repeatability are
penalized for cognitive tasks. In other words, this score measures how consistently the
system provides similar cognitive outputs across multiple runs, typically allowing for a bit
more variation than computational tasks if αK is smaller than αC.
w (Weighting Factor)
A weighting factor between 0 and 1 that determines the relative importance of computational versus cognitive factors in both accuracy and repeatability.
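As a worked example (a minimal sketch; the input values and w = 0.7 are hypothetical, not benchmark results), the composite score combines the four terms as follows:

```python
def usefulness(comp_acc: float, cog_acc: float,
               comp_rep: float, cog_rep: float, w: float = 0.7) -> float:
    """Composite usefulness: weighted product of the accuracy terms multiplied
    by the weighted product of the repeatability terms."""
    accuracy = (comp_acc ** w) * (cog_acc ** (1 - w))
    repeatability = (comp_rep ** w) * (cog_rep ** (1 - w))
    return accuracy * repeatability

# Hypothetical inputs: perfect computational behavior, weaker cognitive behavior.
print(usefulness(1.0, 0.8, 1.0, 0.6, w=0.7))  # ≈ 0.80
```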
We evaluate the usefulness of PromptQL and Claude.ai + MCP on the same user goal. This allows flexibility in optimizing the prompt for each system in order to achieve the best possible score. In practice, this often meant that the Claude.ai prompt had to stress that the input guidelines be followed strictly, while the PromptQL prompt only needed basic awareness of the available AI primitives.
6. Coming Up
In the near future, we plan to expand our benchmark to include:
We welcome contributions to our benchmark efforts. You can access our open-
source benchmark toolkit, contribute test cases, or run the benchmark on your
own systems by visiting our Hasura Agentic Data Access Benchmark GitHub
repository. Your participation helps us continually improve and validate the
effectiveness of PromptQL and other agentic data access methods.