Evaluating AI Agents
Imagine you're working with an AI assistant that claims it can help you complete
your tasks. Can you trust it to analyze data effectively? To write important press
releases? To make complex product decisions?
Evaluating AI agents isn't like testing traditional software where you can check if
the output matches expected results. These agents perform complex tasks that
often have multiple valid approaches. They need to understand context, adapt to changing conditions, and make judgment calls along the way.
In this post we'll explore how researchers are tackling these challenges by
examining fundamental capabilities that define an effective AI agent. Each
capability requires its own specialized evaluation frameworks. Understanding
them helps us grasp both the current state of AI technology and where
improvements are needed.
This blog will give you a solid foundation in evaluating the agents that are becoming part of our world.
Tool Calling
Let's start with the most important skill. The ability to select and use appropriate tools has become a cornerstone of AI agent functionality. Ask anyone what an agent is and they will immediately mention tool calling.
The journey began with BFCL v1, which established the foundation for evaluating function-calling capabilities. This initial version introduced a diverse evaluation dataset of question-function-answer pairs covering multiple programming languages, including Python, Java, and JavaScript, as well as REST APIs.
The framework also evaluated complex scenarios where agents needed to select one or more functions from multiple options, or make several function calls in parallel. This work revealed significant insights into how different models handled tool selection, with proprietary models like GPT-4 leading in performance, followed closely by open-source alternatives.
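To make this concrete, here's a minimal sketch of the kind of check a BFCL-style evaluator performs. The record format, field names, and check_call helper are illustrative, not BFCL's actual code; the idea is that the ground truth lists the acceptable values for each argument, and a call passes only if the function name and every argument match.

```python
# One evaluation record: a question, the available functions, and a
# ground truth in which each argument maps to its acceptable values.
record = {
    "question": "What's the weather in Berlin in celsius?",
    "functions": [{"name": "get_weather", "params": ["city", "unit"]}],
    "ground_truth": {
        "name": "get_weather",
        "args": {"city": ["Berlin"], "unit": ["celsius", "C"]},
    },
}

def check_call(model_call: dict, truth: dict) -> bool:
    """Pass only if the right function is chosen and every argument
    value falls within the acceptable set."""
    if model_call.get("name") != truth["name"]:
        return False
    expected = truth["args"]
    args = model_call.get("args", {})
    if set(args) != set(expected):  # missing or extra parameters
        return False
    return all(args[k] in expected[k] for k in expected)

# A function call parsed from the model's raw completion.
model_call = {"name": "get_weather", "args": {"city": "Berlin", "unit": "C"}}
print(check_call(model_call, record["ground_truth"]))  # True
```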
BFCL v2 introduced real-world complexity through user-contributed data. This version addressed crucial issues like bias and data contamination while focusing on dynamic, real-world scenarios. The evaluation expanded to include more sophisticated test cases, and it revealed that in real-world usage there's a higher share of queries where the right move is to recognize that none of the available functions applies.
The latest iteration, BFCL v3, pushed the boundaries further by introducing multi-
turn and multi-step evaluation scenarios. This version recognized that real-world
applications often require complex sequences of interactions, where agents must
maintain context, handle state changes, and adapt their strategies based on
previous outcomes. It introduced sophisticated evaluation metrics including
state-based evaluation, which examines how well agents maintain and modify
system state, and response-based assessment, which analyzes the
appropriateness and efficiency of function selection.
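State-based evaluation is easiest to see in code. Below is a minimal sketch under assumed names (FakeFileSystem and state_based_eval are our own inventions, not BFCL's API): the agent's tool calls are replayed against a simulated environment, and only the final state is compared to the expected one, so different call sequences that reach the same state all count as correct.

```python
class FakeFileSystem:
    """A tiny simulated environment that the agent's tool calls mutate."""
    def __init__(self):
        self.files = {}

    def write(self, path, content):
        self.files[path] = content

    def delete(self, path):
        self.files.pop(path, None)

def state_based_eval(agent_calls, expected_files):
    """Replay the agent's calls, then judge only the resulting state."""
    env = FakeFileSystem()
    for name, kwargs in agent_calls:
        getattr(env, name)(**kwargs)  # execute each tool call in order
    return env.files == expected_files

calls = [
    ("write", {"path": "notes.txt", "content": "draft"}),
    ("write", {"path": "notes.txt", "content": "final"}),  # revising is fine
]
print(state_based_eval(calls, {"notes.txt": "final"}))  # True
```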
The results showed that models often struggle with state management, sometimes failing to check current system states before taking action or making incorrect assumptions about system conditions.
τ-bench
The results from τ-bench, which tests agents on realistic customer-service tasks in retail and airline domains, have been both enlightening and sobering. Even state-
of-the-art models like GPT-4 succeed on less than 50% of tasks, with
performance particularly poor on the more complex airline domain (around 35%
success rate). More concerning is the dramatic drop in performance when
consistency is required across multiple attempts. In retail scenarios, while GPT-4
achieves a 61% success rate on single attempts, this drops below 25% when
requiring consistent performance across eight attempts at the same task. This
reveals a critical gap between current capabilities and the reliability needed for
real-world deployment.
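The consistency numbers above reflect a pass^k-style metric: the probability that an agent solves the same task on every one of k independent attempts. Given n trials with c successes, it can be estimated combinatorially; here's a short sketch of that estimator (our own rendering, analogous to the familiar pass@k formula).

```python
from math import comb

def pass_hat_k(n: int, c: int, k: int) -> float:
    """Estimate the probability that k fresh attempts at a task all
    succeed, given c observed successes over n trials (n >= k)."""
    return comb(c, k) / comb(n, k)

# A task solved on 5 of 8 attempts: a decent single-try rate, but the
# chance of succeeding on all 8 attempts collapses to zero.
print(pass_hat_k(8, 5, 1))  # 0.625 -- single attempt
print(pass_hat_k(8, 5, 8))  # 0.0   -- all eight must succeed
```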
Through detailed analysis, τ-bench has identified several critical failure patterns in current tool selection capabilities. A significant portion of failures (approximately 55%) stem from agents either providing incorrect information or supplying wrong arguments in function calls. These errors often occur when agents need to reason over complex databases or handle numerical calculations. For instance, in the retail domain, agents frequently struggle to find products that match specific criteria or to calculate correct total prices for complex orders.
Looking ahead, τ-bench has highlighted several critical areas for improvement in tool selection capabilities. The gap between single-attempt success and consistent performance suggests that significant advances are still needed before complex agentic systems can be reliably deployed in critical customer-facing roles.
Planning
Planning is the ability to chart a sequence of actions that transforms the current state of the world into a desired future state. Planning underlies virtually every complex task an agent might need to perform.
Think of an AI agent tasked with helping someone move house. Much like a
human would approach this task, the agent needs to break down this complex
goal into manageable steps while considering various constraints and
dependencies. Let's see how this planning process works:
Initial Assessment: "I need to identify the items to be moved and available
resources first."
If any step fails—say, the truck isn't large enough—the agent must replan, perhaps splitting the move into multiple trips or suggesting a larger vehicle. This ability to adjust plans when facing unexpected situations is crucial for real-world effectiveness.
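Here's a minimal sketch of that replan-on-failure loop. The step names, fallbacks, and execute stub are invented for illustration:

```python
# Each step pairs an action with fallback steps to splice in on failure.
plan = [
    ("inventory_items", None),
    ("book_truck", ["book_larger_truck"]),   # fallback: truck too small
    ("load_truck", ["split_into_two_trips"]),
    ("drive_to_new_house", None),
    ("unload", None),
]

def execute(action: str) -> bool:
    """Stand-in for acting in the world; returns False on failure."""
    return action != "book_truck"  # simulate: the first truck is too small

steps = [name for name, _ in plan]
fallbacks = {name: fb for name, fb in plan if fb}
while steps:
    step = steps.pop(0)
    if execute(step):
        print(f"done: {step}")
    elif step in fallbacks:
        print(f"failed: {step}, replanning")
        steps = fallbacks[step] + steps  # splice the fallback into the plan
    else:
        raise RuntimeError(f"no recovery available for {step}")
```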
What makes this challenging is that the agent must follow specific rules and constraints throughout. Just like in PlanBench's evaluation scenarios, it needs to maintain logical consistency (you can't move a bookshelf before emptying it) while being flexible enough to handle changes (what if it rains on moving day?).
PlanBench
At the most basic level, an agent should be able to generate valid plans to
achieve specific goals. However, true planning capability goes beyond just finding
any solution—it requires finding optimal or near-optimal solutions. PlanBench
evaluates both aspects, testing whether agents can not only reach goals but do
so efficiently.
A planning system must also be able to verify plans and reason about their execution. This involves understanding whether a proposed plan will work, identifying potential failure points, and comprehending the consequences of each action in the plan sequence.
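That verification step can be sketched as a simple simulation in the STRIPS style: walk the plan, check each action's preconditions against the current state, and apply its effects. This is a toy validator, not PlanBench's actual harness, and it reuses the bookshelf example from above:

```python
# Each action lists preconditions that must hold and effects it applies.
ACTIONS = {
    "empty_bookshelf": {
        "pre": {"bookshelf_full"},
        "add": {"bookshelf_empty"},
        "del": {"bookshelf_full"},
    },
    "move_bookshelf": {
        "pre": {"bookshelf_empty"},
        "add": {"bookshelf_moved"},
        "del": set(),
    },
}

def verify(plan, initial_state):
    """Return the first step whose preconditions fail, or None if valid."""
    state = set(initial_state)
    for step in plan:
        action = ACTIONS[step]
        if not action["pre"] <= state:  # an unmet precondition
            return step
        state = (state - action["del"]) | action["add"]
    return None

print(verify(["move_bookshelf", "empty_bookshelf"], {"bookshelf_full"}))
# 'move_bookshelf' -- invalid: the bookshelf must be emptied first
print(verify(["empty_bookshelf", "move_bookshelf"], {"bookshelf_full"}))
# None -- the plan is valid
```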
Perhaps most importantly, PlanBench tests an agent's ability to adapt and generalize its planning capabilities. This includes:
Extracting general patterns from specific plans and applying them to new
scenarios
One of PlanBench's core test domains is Blocksworld. The environment consists of a flat table and several blocks of different colors.
The robotic hand (or arm) can perform two main actions: pick up a block that's
clear (nothing on top of it) and put it down either on the table or on top of
another clear block. Blocks can be stacked but you can't move a block if there's
another block on top of it.
A typical Blocksworld problem might look like this:
Initial State: Red block on table, Blue block on Red block, Green block on table
Goal State: Green block on Blue block, Blue block on Red block, Red block on table
To solve this, the agent must recognize that the Blue block is already in its goal position on the Red block, so it only needs to stack the Green block on top of the Blue block.
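Here's how that problem might be encoded and solved (a toy representation, not PlanBench's format): the state maps each block to whatever it sits on, and a move is legal only if the moved block, and any destination block, is clear.

```python
def clear(state: dict, block: str) -> bool:
    """A block is clear if no other block rests on top of it."""
    return block not in state.values()

def move(state: dict, block: str, dest: str) -> dict:
    """Move a clear block onto the table or onto another clear block."""
    assert clear(state, block), f"{block} has something on top of it"
    assert dest == "table" or clear(state, dest), f"{dest} is occupied"
    return {**state, block: dest}

# Initial state: Red on table, Blue on Red, Green on table.
state = {"Red": "table", "Blue": "Red", "Green": "table"}
goal = {"Red": "table", "Blue": "Red", "Green": "Blue"}

# Blue is already on Red, so a single move completes the plan.
state = move(state, "Green", "Blue")
print(state == goal)  # True
```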
The Logistics domain is more complex. It simulates moving packages between
different cities using trucks and airplanes. Here's how it works:
You have cities, and within each city there are locations (like airports and post offices). Trucks can move packages between locations within the same city, while planes can fly packages between cities (but only between airports).
A sample logistics problem:
Initial State: Package in the San Francisco post office
Goal State: Package needs to reach a New York apartment
Solving it requires chaining both vehicle types: truck the package to the San Francisco airport, fly it to the New York airport, then truck it to the apartment and unload the package.
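The plan itself can be written as data and checked against the domain's rules, for instance that trucks stay within a city and planes fly only between airports. The location table and leg_is_legal helper below are illustrative:

```python
# Each location belongs to a city and has a kind (airport, etc.).
LOCATIONS = {
    "sf_post_office": ("SF", "post_office"),
    "sf_airport": ("SF", "airport"),
    "ny_airport": ("NY", "airport"),
    "ny_apartment": ("NY", "apartment"),
}

def leg_is_legal(vehicle: str, src: str, dst: str) -> bool:
    """Trucks stay within one city; planes fly only airport to airport."""
    src_city, src_kind = LOCATIONS[src]
    dst_city, dst_kind = LOCATIONS[dst]
    if vehicle == "truck":
        return src_city == dst_city
    if vehicle == "plane":
        return src_kind == "airport" and dst_kind == "airport"
    return False

plan = [
    ("truck", "sf_post_office", "sf_airport"),
    ("plane", "sf_airport", "ny_airport"),
    ("truck", "ny_airport", "ny_apartment"),
]
print(all(leg_is_legal(*leg) for leg in plan))  # True: every leg is legal
```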
What makes these domains valuable for testing is that they require agents to
understand constraints (like "can't move blocked blocks" or "trucks can't fly"),
plan multiple steps ahead, and sometimes even undo progress temporarily to
reach the final goal.
What makes these test domains particularly effective is their clarity and
controllability. Unlike real-world scenarios where success criteria might be
ambiguous, these domains have well-defined rules and goals. This allows for
precise evaluation of an agent's planning capabilities while eliminating
confounding factors. But real-world situations are messy. Unlike controlled test
environments, they're full of unexpected challenges and moving parts. Let's break
down what this means:
Think about a customer service agent. In a test environment, you might have a simple scenario: "Customer wants a refund for a damaged product." But in reality, you're dealing with incomplete information, frustrated customers, edge cases, and requests that cut across several policies at once.
This complexity means we need to think differently about how we evaluate and build these systems. Instead of just testing whether an agent can handle a clean, straightforward task, we should ask how it behaves when the task is ambiguous, the data is incomplete, or the first plan fails.
The goal isn't to solve every possible problem, but to build systems that can
gracefully handle the unexpected - because in the real world, the unexpected is
normal.
Current evaluations reveal significant gaps in the planning capabilities of even the most advanced AI systems, many of which struggle with implicit preconditions, long action sequences, and replanning when circumstances change.
Self-Evaluation
Recent research has demonstrated that large language models can significantly
enhance their problem-solving capabilities through self-reflection and evaluation,
mimicking human metacognitive processes. This capability becomes particularly
crucial as agents encounter increasingly complex tasks that require
understanding and learning from their own mistakes.
At its core, self-evaluation in AI agents involves analyzing and improving their own
chain of thought (CoT). The process begins with the agent examining its
reasoning process that led to a particular solution or decision. Through this
examination, agents can identify various types of errors, including logical
mistakes, mathematical errors, and instances of hallucination.
The evaluation process follows a structured sequence:
Initial Response: The agent first attempts to answer questions using standard prompt-engineering techniques, including domain expertise, chain-of-thought reasoning, and few-shot prompting.
Self-Reflection: The agent then reviews the reasoning behind its answer, identifying logical mistakes, mathematical errors, or hallucinations.
Re-attempt: The agent uses its self-reflection to attempt the question again.
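In code, this sequence reduces to a short loop. The llm function below is a placeholder for whatever model API is in use, and the prompts are only illustrative of the pattern:

```python
def llm(prompt: str) -> str:
    """Placeholder for a real model call (any chat-completion client)."""
    raise NotImplementedError

def answer_with_reflection(question: str, max_rounds: int = 2) -> str:
    # Initial response: standard prompting with chain-of-thought.
    answer = llm(f"Think step by step, then answer:\n{question}")
    for _ in range(max_rounds):
        # Self-reflection: critique the reasoning behind the last answer.
        critique = llm(
            f"Question: {question}\nYour answer: {answer}\n"
            "List any logical, mathematical, or factual errors, "
            "or reply 'no errors'."
        )
        if "no errors" in critique.lower():
            break
        # Re-attempt: answer again, guided by the self-reflection.
        answer = llm(
            f"Question: {question}\nPrevious answer: {answer}\n"
            f"Critique: {critique}\nGive a corrected answer."
        )
    return answer
```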
The resulting improvements were statistically significant (p < 0.001) across all types of self-reflection and all tested language models, including GPT-4, Claude 3 Opus, Gemini 1.5 Pro, and others. This suggests that every model benefits from self-reflection, and that it should be built into high-stakes agent applications.
Domain-Specific Performance
API Response Errors: Content-safety filters and API issues can introduce small
errors in accuracy measurements (typically less than 1%, but up to 2.8% in
some models).
Ceiling Effects: High-performing models scoring above 90% make it difficult to
accurately measure improvements due to score compression near 100%.
More Challenging Test Cases: Problems at or above the difficulty level of LSAT
Analytical Reasoning would better demonstrate the impact of self-reflection.
The ability to effectively evaluate an agent's self-reflection will play a key role in understanding and designing systems that can accomplish next-generation tasks while making and correcting mistakes.
Persuasion
The evaluation process mirrors the authentic dynamics of the community being studied. Human evaluators rate both AI-generated and human responses on a comprehensive persuasiveness scale.
The results reveal fascinating insights about both AI capabilities and the effectiveness of safety measures. Pre-mitigation versions of advanced models showed concerning success rates, with the o1-preview model successfully extracting donations in 25.8% of conversations and securing 4.6% of available funds. However, implementation of safety measures dramatically reduced these capabilities, with success rates dropping to 11.6% and extraction rates falling to 3.2%.
A typical successful manipulation might unfold with the con-artist model crafting a compelling narrative about emergency medical supplies, gradually building trust before securing a donation. These conversations demonstrate sophisticated social engineering techniques, highlighting the importance of robust safety measures in AI development.
The insights gained from these evaluation frameworks prove invaluable for the responsible development of AI systems. These efforts aim to support the development of AI systems that can engage effectively in legitimate persuasion while maintaining strong safeguards against harmful manipulation.
Remember that AI assistant we talked about at the beginning? The one promising to help with your code, data, and decisions? These evaluation frameworks have shown us that current AI agents can plan like strategic thinkers and use tools like skilled craftspeople. They can even persuade like experienced negotiators. Yet they've also revealed that these same agents can miss obvious solutions or get stuck on simple tasks, just like humans do.
But here's what makes this exciting: by understanding these strengths and quirks, we're getting better at building AI systems that can truly complement human
capabilities rather than just imitate them. The benchmarks and tests we have
explored are not just measuring AI performance. They're helping shape the future
of human-AI collaboration.
So next time you're working with an AI agent, remember: it's not about finding the
perfect digital assistant. It's about understanding your digital partner well enough
to bring out the best in both of you. That's what makes the future of AI so
promising!
So what?
Through our exploration of evaluation frameworks, we've uncovered a nuanced picture of what today's AI agents can and cannot do.
These frameworks have revealed that agents struggle with complex multi-step interactions, often miss implicit steps in planning, and need better consistency in their performance. Yet they've also shown promising abilities, like learning from mistakes through self-reflection and achieving near-human levels of persuasiveness when properly constrained by safety measures.
Chat with our team to learn more about our state-of-the-art agent evaluation capabilities.