Evaluating AI Agents
Imagine you're working with an AI assistant that claims it can help you complete
your tasks. Can you trust it to analyze data effectively? To write important press
releases? To make complex product decisions?
Evaluating AI agents isn't like testing traditional software where you can check if
the output matches expected results. These agents perform complex tasks that
often have multiple valid approaches. They need to understand context, adapt to changing conditions, and make judgment calls along the way.
In this post we'll explore how researchers are tackling these challenges by
examining fundamental capabilities that define an effective AI agent. Each
capability requires its own specialized evaluation frameworks. Understanding
them helps us grasp both the current state of AI technology and where
improvements are needed.
This blog will give you a solid foundation in evaluating the agents that are becoming part of our world.
Tool Calling
Let's start with the most important skill. The ability to select and use appropriate tools has become a cornerstone of AI agent functionality. Ask anyone what an agent is and they will immediately mention tool calling.
The journey began with BFCL v1, which established the foundation for evaluating function-calling capabilities. This initial version introduced a diverse evaluation dataset of question-function-answer pairs covering multiple programming languages, including Python, Java, and JavaScript, as well as REST APIs.
The framework also evaluated complex scenarios where agents needed to select one or more functions from multiple options, or make several function calls in parallel. This work revealed significant insights into how different models handled tool selection, with proprietary models like GPT-4 leading in performance, followed closely by open-source alternatives.
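To make this concrete, here's a minimal sketch of the kind of check a BFCL-style evaluator performs. The record format, field names, and check_call helper are illustrative, not BFCL's actual code; the idea is that the ground truth lists the acceptable values for each argument, and a call passes only if the function name and every argument match.

```python
# One evaluation record: a question, the available functions, and a
# ground truth in which each argument maps to its acceptable values.
record = {
    "question": "What's the weather in Berlin in celsius?",
    "functions": [{"name": "get_weather", "params": ["city", "unit"]}],
    "ground_truth": {
        "name": "get_weather",
        "args": {"city": ["Berlin"], "unit": ["celsius", "C"]},
    },
}

def check_call(model_call: dict, truth: dict) -> bool:
    """Pass only if the right function is chosen and every argument
    value falls within the acceptable set."""
    if model_call.get("name") != truth["name"]:
        return False
    expected = truth["args"]
    args = model_call.get("args", {})
    if set(args) != set(expected):  # missing or extra parameters
        return False
    return all(args[k] in expected[k] for k in expected)

# A function call parsed from the model's raw completion.
model_call = {"name": "get_weather", "args": {"city": "Berlin", "unit": "C"}}
print(check_call(model_call, record["ground_truth"]))  # True
```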
BFCL v2 introduced real-world complexity through user-contributed data. This version addressed crucial issues like bias and data contamination while focusing on dynamic, real-world scenarios. The evaluation expanded to include more sophisticated test cases, and it revealed that in real-world usage there's a higher share of queries where the right move is to recognize that none of the available functions applies.
The latest iteration, BFCL v3, pushed the boundaries further by introducing multi-
turn and multi-step evaluation scenarios. This version recognized that real-world
applications often require complex sequences of interactions, where agents must
maintain context, handle state changes, and adapt their strategies based on
previous outcomes. It introduced sophisticated evaluation metrics including
state-based evaluation, which examines how well agents maintain and modify
system state, and response-based assessment, which analyzes the
appropriateness and efficiency of function selection.
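State-based evaluation is easiest to see in code. Below is a minimal sketch under assumed names (FakeFileSystem and state_based_eval are our own inventions, not BFCL's API): the agent's tool calls are replayed against a simulated environment, and only the final state is compared to the expected one, so different call sequences that reach the same state all count as correct.

```python
class FakeFileSystem:
    """A tiny simulated environment that the agent's tool calls mutate."""
    def __init__(self):
        self.files = {}

    def write(self, path, content):
        self.files[path] = content

    def delete(self, path):
        self.files.pop(path, None)

def state_based_eval(agent_calls, expected_files):
    """Replay the agent's calls, then judge only the resulting state."""
    env = FakeFileSystem()
    for name, kwargs in agent_calls:
        getattr(env, name)(**kwargs)  # execute each tool call in order
    return env.files == expected_files

calls = [
    ("write", {"path": "notes.txt", "content": "draft"}),
    ("write", {"path": "notes.txt", "content": "final"}),  # revising is fine
]
print(state_based_eval(calls, {"notes.txt": "final"}))  # True
```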
The results showed that models often struggle with state management, sometimes failing to check current system states before taking action or making incorrect assumptions about system conditions.
τ-bench
The results from τ-bench, which tests agents on realistic customer-service tasks in retail and airline domains, have been both enlightening and sobering. Even state-
of-the-art models like GPT-4 succeed on less than 50% of tasks, with
performance particularly poor on the more complex airline domain (around 35%
success rate). More concerning is the dramatic drop in performance when
consistency is required across multiple attempts. In retail scenarios, while GPT-4
achieves a 61% success rate on single attempts, this drops below 25% when
requiring consistent performance across eight attempts at the same task. This
reveals a critical gap between current capabilities and the reliability needed for
real-world deployment.
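The consistency numbers above reflect a pass^k-style metric: the probability that an agent solves the same task on every one of k independent attempts. Given n trials with c successes, it can be estimated combinatorially; here's a short sketch of that estimator (our own rendering, analogous to the familiar pass@k formula).

```python
from math import comb

def pass_hat_k(n: int, c: int, k: int) -> float:
    """Estimate the probability that k fresh attempts at a task all
    succeed, given c observed successes over n trials (n >= k)."""
    return comb(c, k) / comb(n, k)

# A task solved on 5 of 8 attempts: a decent single-try rate, but the
# chance of succeeding on all 8 attempts collapses to zero.
print(pass_hat_k(8, 5, 1))  # 0.625 -- single attempt
print(pass_hat_k(8, 5, 8))  # 0.0   -- all eight must succeed
```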
Through detailed analysis, τ-bench has identified several critical failure patterns in current tool selection capabilities. A significant portion of failures (approximately 55%) stem from agents either providing incorrect information or supplying wrong arguments in function calls. These errors often occur when agents need to reason over complex databases or handle numerical calculations. For instance, in the retail domain, agents frequently struggle to find products that match specific criteria or to calculate correct total prices for complex orders.
Looking ahead, τ-bench has highlighted several critical areas for improvement in tool selection capabilities. The gap between single-attempt success and consistent performance suggests that significant advances are still needed before complex agentic systems can be reliably deployed in critical customer-facing roles.
Planning
Planning is the ability to chart a sequence of actions that transforms the current state of the world into a desired future state. Planning underlies virtually every complex task an agent might need to perform.
Think of an AI agent tasked with helping someone move house. Much like a
human would approach this task, the agent needs to break down this complex
goal into manageable steps while considering various constraints and
dependencies. Let's see how this planning process works:
Initial Assessment: "I need to identify the items to be moved and available
resources first."
If any step fails—say, the truck isn't large enough—the agent must replan, perhaps splitting the move into multiple trips or suggesting a larger vehicle. This ability to adjust plans when facing unexpected situations is crucial for real-world effectiveness.
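Here's a minimal sketch of that replan-on-failure loop. The step names, fallbacks, and execute stub are invented for illustration:

```python
# Each step pairs an action with fallback steps to splice in on failure.
plan = [
    ("inventory_items", None),
    ("book_truck", ["book_larger_truck"]),   # fallback: truck too small
    ("load_truck", ["split_into_two_trips"]),
    ("drive_to_new_house", None),
    ("unload", None),
]

def execute(action: str) -> bool:
    """Stand-in for acting in the world; returns False on failure."""
    return action != "book_truck"  # simulate: the first truck is too small

steps = [name for name, _ in plan]
fallbacks = {name: fb for name, fb in plan if fb}
while steps:
    step = steps.pop(0)
    if execute(step):
        print(f"done: {step}")
    elif step in fallbacks:
        print(f"failed: {step}, replanning")
        steps = fallbacks[step] + steps  # splice the fallback into the plan
    else:
        raise RuntimeError(f"no recovery available for {step}")
```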
What makes this challenging is that the agent must follow specific rules and constraints throughout. Just like in PlanBench's evaluation scenarios, it needs to maintain logical consistency (you can't move a bookshelf before emptying it) while being flexible enough to handle changes (what if it rains on moving day?).
PlanBench
At the most basic level, an agent should be able to generate valid plans to
achieve specific goals. However, true planning capability goes beyond just finding
any solution—it requires finding optimal or near-optimal solutions. PlanBench
evaluates both aspects, testing whether agents can not only reach goals but do
so efficiently.
A planning system must also be able to verify plans and reason about their execution. This involves understanding whether a proposed plan will work, identifying potential failure points, and comprehending the consequences of each action in the plan sequence.
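That verification step can be sketched as a simple simulation in the STRIPS style: walk the plan, check each action's preconditions against the current state, and apply its effects. This is a toy validator, not PlanBench's actual harness, and it reuses the bookshelf example from above:

```python
# Each action lists preconditions that must hold and effects it applies.
ACTIONS = {
    "empty_bookshelf": {
        "pre": {"bookshelf_full"},
        "add": {"bookshelf_empty"},
        "del": {"bookshelf_full"},
    },
    "move_bookshelf": {
        "pre": {"bookshelf_empty"},
        "add": {"bookshelf_moved"},
        "del": set(),
    },
}

def verify(plan, initial_state):
    """Return the first step whose preconditions fail, or None if valid."""
    state = set(initial_state)
    for step in plan:
        action = ACTIONS[step]
        if not action["pre"] <= state:  # an unmet precondition
            return step
        state = (state - action["del"]) | action["add"]
    return None

print(verify(["move_bookshelf", "empty_bookshelf"], {"bookshelf_full"}))
# 'move_bookshelf' -- invalid: the bookshelf must be emptied first
print(verify(["empty_bookshelf", "move_bookshelf"], {"bookshelf_full"}))
# None -- the plan is valid
```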
Perhaps most importantly, PlanBench tests an agent's ability to adapt and generalize its planning capabilities. This includes:
Extracting general patterns from specific plans and applying them to new
scenarios
One of PlanBench's core test domains is Blocksworld. The environment consists of a flat table and several blocks of different colors.
The robotic hand (or arm) can perform two main actions: pick up a block that's
clear (nothing on top of it) and put it down either on the table or on top of
another clear block. Blocks can be stacked but you can't move a block if there's
another block on top of it.
A typical Blocksworld problem might look like this:
Initial State: Red block on table, Blue block on Red block, Green block on table
Goal State: Green block on Blue block, Blue block on Red block, Red block on table
To solve this, the agent must recognize that the Blue block is already in its goal position on the Red block, so it only needs to stack the Green block on top of the Blue block.
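Here's how that problem might be encoded and solved (a toy representation, not PlanBench's format): the state maps each block to whatever it sits on, and a move is legal only if the moved block, and any destination block, is clear.

```python
def clear(state: dict, block: str) -> bool:
    """A block is clear if no other block rests on top of it."""
    return block not in state.values()

def move(state: dict, block: str, dest: str) -> dict:
    """Move a clear block onto the table or onto another clear block."""
    assert clear(state, block), f"{block} has something on top of it"
    assert dest == "table" or clear(state, dest), f"{dest} is occupied"
    return {**state, block: dest}

# Initial state: Red on table, Blue on Red, Green on table.
state = {"Red": "table", "Blue": "Red", "Green": "table"}
goal = {"Red": "table", "Blue": "Red", "Green": "Blue"}

# Blue is already on Red, so a single move completes the plan.
state = move(state, "Green", "Blue")
print(state == goal)  # True
```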
The Logistics domain is more complex. It simulates moving packages between
different cities using trucks and airplanes. Here's how it works:
You have cities, and within each city there are locations (like airports and post offices). Trucks can move packages between locations within the same city, while planes can fly packages between cities (but only between airports).
A sample logistics problem:
Initial State: Package in the San Francisco post office
Goal State: Package needs to reach a New York apartment
Solving it requires chaining both vehicle types: truck the package to the San Francisco airport, fly it to the New York airport, then truck it to the apartment and unload the package.
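The plan itself can be written as data and checked against the domain's rules, for instance that trucks stay within a city and planes fly only between airports. The location table and leg_is_legal helper below are illustrative:

```python
# Each location belongs to a city and has a kind (airport, etc.).
LOCATIONS = {
    "sf_post_office": ("SF", "post_office"),
    "sf_airport": ("SF", "airport"),
    "ny_airport": ("NY", "airport"),
    "ny_apartment": ("NY", "apartment"),
}

def leg_is_legal(vehicle: str, src: str, dst: str) -> bool:
    """Trucks stay within one city; planes fly only airport to airport."""
    src_city, src_kind = LOCATIONS[src]
    dst_city, dst_kind = LOCATIONS[dst]
    if vehicle == "truck":
        return src_city == dst_city
    if vehicle == "plane":
        return src_kind == "airport" and dst_kind == "airport"
    return False

plan = [
    ("truck", "sf_post_office", "sf_airport"),
    ("plane", "sf_airport", "ny_airport"),
    ("truck", "ny_airport", "ny_apartment"),
]
print(all(leg_is_legal(*leg) for leg in plan))  # True: every leg is legal
```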
What makes these domains valuable for testing is that they require agents to
understand constraints (like "can't move blocked blocks" or "trucks can't fly"),
plan multiple steps ahead, and sometimes even undo progress temporarily to
reach the final goal.
What makes these test domains particularly effective is their clarity and
controllability. Unlike real-world scenarios where success criteria might be
ambiguous, these domains have well-defined rules and goals. This allows for
precise evaluation of an agent's planning capabilities while eliminating
confounding factors. But real-world situations are messy. Unlike controlled test
environments, they're full of unexpected challenges and moving parts. Let's break
down what this means:
Think about a customer service agent. In a test environment, you might have a simple scenario: "Customer wants a refund for a damaged product." But in reality, you're dealing with incomplete information, frustrated customers, edge cases, and requests that cut across several policies at once.
This complexity means we need to think differently about how we evaluate and build these systems. Instead of just testing whether an agent can handle a clean, straightforward task, we should ask how it behaves when the task is ambiguous, the data is incomplete, or the first plan fails.
The goal isn't to solve every possible problem, but to build systems that can
gracefully handle the unexpected - because in the real world, the unexpected is
normal.
Current evaluations reveal significant gaps in the planning capabilities of even the most advanced AI systems, many of which struggle with implicit preconditions, long action sequences, and replanning when circumstances change.
Self-Evaluation
Recent research has demonstrated that large language models can significantly
enhance their problem-solving capabilities through self-reflection and evaluation,
mimicking human metacognitive processes. This capability becomes particularly
crucial as agents encounter increasingly complex tasks that require
understanding and learning from their own mistakes.
At its core, self-evaluation in AI agents involves analyzing and improving their own
chain of thought (CoT). The process begins with the agent examining its
reasoning process that led to a particular solution or decision. Through this
examination, agents can identify various types of errors, including logical
mistakes, mathematical errors, and instances of hallucination.
The evaluation process follows a structured sequence:
Initial Response: The agent first attempts to answer questions using standard prompt-engineering techniques, including domain expertise, chain-of-thought reasoning, and few-shot prompting.
Self-Reflection: The agent then reviews the reasoning behind its answer, identifying logical mistakes, mathematical errors, or hallucinations.
Re-attempt: The agent uses its self-reflection to attempt the question again.
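In code, this sequence reduces to a short loop. The llm function below is a placeholder for whatever model API is in use, and the prompts are only illustrative of the pattern:

```python
def llm(prompt: str) -> str:
    """Placeholder for a real model call (any chat-completion client)."""
    raise NotImplementedError

def answer_with_reflection(question: str, max_rounds: int = 2) -> str:
    # Initial response: standard prompting with chain-of-thought.
    answer = llm(f"Think step by step, then answer:\n{question}")
    for _ in range(max_rounds):
        # Self-reflection: critique the reasoning behind the last answer.
        critique = llm(
            f"Question: {question}\nYour answer: {answer}\n"
            "List any logical, mathematical, or factual errors, "
            "or reply 'no errors'."
        )
        if "no errors" in critique.lower():
            break
        # Re-attempt: answer again, guided by the self-reflection.
        answer = llm(
            f"Question: {question}\nPrevious answer: {answer}\n"
            f"Critique: {critique}\nGive a corrected answer."
        )
    return answer
```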
The resulting improvements were statistically significant (p < 0.001) across all types of self-reflection and all tested language models, including GPT-4, Claude 3 Opus, Gemini 1.5 Pro, and others. This suggests that every model benefits from self-reflection, and that it should be built into high-stakes agent applications.
Domain-Specific Performance
API Response Errors: Content-safety filters and API issues can introduce small
errors in accuracy measurements (typically less than 1%, but up to 2.8% in
some models).
Ceiling Effects: High-performing models scoring above 90% make it difficult to
accurately measure improvements due to score compression near 100%.
More Challenging Test Cases: Problems at or above the difficulty level of LSAT
Analytical Reasoning would better demonstrate the impact of self-reflection.
The ability to effectively evaluate an agent's self-reflection will play a key role in understanding and designing systems that can accomplish next-generation tasks while making and correcting mistakes.
Persuasion
The evaluation process mirrors the authentic dynamics of the community being studied. Human evaluators rate both AI-generated and human responses on a comprehensive persuasiveness scale.
The results reveal fascinating insights about both AI capabilities and the effectiveness of safety measures. Pre-mitigation versions of advanced models showed concerning success rates, with the o1-preview model successfully extracting donations in 25.8% of conversations and securing 4.6% of available funds. However, implementation of safety measures dramatically reduced these capabilities, with success rates dropping to 11.6% and extraction rates falling to 3.2%.
A typical successful manipulation might unfold with the con-artist model crafting a compelling narrative about emergency medical supplies, gradually building trust before securing a donation. These conversations demonstrate sophisticated social engineering techniques, highlighting the importance of robust safety measures in AI development.
The insights gained from these evaluation frameworks prove invaluable for the responsible development of AI systems. These efforts aim to support the development of AI systems that can engage effectively in legitimate persuasion while maintaining strong safeguards against harmful manipulation.
Remember that AI assistant we talked about at the beginning? The one promising to help with your code, data, and decisions? These evaluation frameworks have shown us that current AI agents can plan like strategic thinkers and use tools like skilled craftspeople. They can even persuade like experienced negotiators. Yet they've also revealed that these same agents can miss obvious solutions or get stuck on simple tasks, just like humans do.
But here's what makes this exciting: by understanding these strengths and quirks, we're getting better at building AI systems that can truly complement human
capabilities rather than just imitate them. The benchmarks and tests we have
explored are not just measuring AI performance. They're helping shape the future
of human-AI collaboration.
So next time you're working with an AI agent, remember: it's not about finding the
perfect digital assistant. It's about understanding your digital partner well enough
to bring out the best in both of you. That's what makes the future of AI so
promising!
So what?
Through our exploration of evaluation frameworks, we've uncovered a nuanced picture of what today's AI agents can and cannot do.
These frameworks have revealed that agents struggle with complex multi-step interactions, often miss implicit steps in planning, and need better consistency in their performance. Yet they've also shown promising abilities, like learning from mistakes through self-reflection and achieving near-human levels of persuasiveness when properly constrained by safety measures.
Chat with our team to learn more about our state-of-the-art agent evaluation capabilities.