
1. Overview

For this task, you are given one prompt and two responses, each of which includes a Tools Code
Output (TCO) and the corresponding Bard Response.

PROMPT

TOOLS CODE OUTPUTS (TCOs)

RESPONSES
For each Response, you are asked to rate:

• Whether it is in the correct target language,

• How well it fulfills the user's request, and

• How factually correct it is, given the TCO.

DIMENSIONS (RATINGS)
At the end of the task, you are asked the question, "Which response is better?" You should explain
your choice - and any errors you noticed - in the justification box at the bottom of the task.

2. Task Components: The Tools Code Outputs

Below the prompt, you will see a purple box that says "Chain of Thought viewer":

This is the Tools Code Output, where the model makes calls to different tools to find the information
the User requested. There are two TCOs: the one on the left shows the tool results for Response A,
while the one on the right shows the tool results for Response B.

When evaluating Factuality of the responses (more on this later), you will refer to the TCO to see
what it reported for prices, distances, durations, URL links, and more.

Edge Cases

The TCO doesn't always output the information the User is looking for. Instead, sometimes it might:

1. return no results

2. return an error

3. be empty

4. be missing
Each of these must be rated differently for Fulfillment and Factuality. We will cover this later in the
course.

Examples of these scenarios:

The TCO returns no results when it calls the relevant tools but cannot find the information
requested. In this case, the TCO may show a series of "none" query results, like this:

The TCO sometimes returns a red error message, like this:


Sometimes the TCO exists, but it is empty:

And sometimes the TCO is missing entirely, meaning there is no purple box above the response.
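
The sketch below is a purely hypothetical illustration of how these four edge cases differ. Real TCOs appear inside the purple box in the task UI; the marker strings in this sketch are assumptions based on the examples in this course, not real tool output formats.

# Hypothetical illustration only: the marker strings are assumptions based on
# this course's examples, not real tool output formats.
def classify_tco(tco_text):
    """Label a TCO as one of the four edge cases, or as having results."""
    if tco_text is None:
        return "missing"       # no purple box above the response at all
    if tco_text.strip() == "":
        return "empty"         # the box exists but contains nothing
    lowered = tco_text.lower()
    if "error" in lowered:
        return "error"         # the red error message case
    if "none = none" in lowered or "no results" in lowered:
        return "no_results"    # the tools ran but found nothing
    return "has_results"       # the tools returned usable information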

Notes

• There may be multiple Execution Steps per response.

• Execution Step 1 usually does not contain any information relevant to the Bard
response. Do not refer to it when judging the accuracy of a response. Instead, focus only
on the "Result" box with the blue button that says "see execution steps".

• The TCO often extends beyond the visible box in the tool. You can scroll left or right to view all
the information.
3. Task Components: The Responses

The Bard Responses take the results of the Tools Code Outputs and give the User a readable version
of that information.

We would consider this a regular response - the TCO outputs some information, and the Response
gives that information in a readable way (and in the correct language!).

There are a few other kinds of responses that we can see:

1. Hallucinations

2. Tool Punt Responses

3. Generic Punt Responses

Each of these is rated differently, as we will see later in the course.

A hallucination is when the model invents information that isn't present in the TCO. For example, the
TCO may be an error message or return no results, but the Bard Response still says something like,
"Here are your flights..."

A tool punt response is when the model informs the User that the request cannot be fulfilled
because either 1) no results were returned for the Tools Code Output calls, or 2) the tool does not
currently have the capability. In this case, the Response will say something that acknowledges the
User's specific request, for example, "I wasn't able to find flights from Paris to London on the date
you requested."
If it doesn't refer to the User's specific request, we call this instead a generic punt response. The
model will give a generic reason for being unable to fulfill the request, like, "I'm just a large language
model," or, "I'm unable to help you with that request."

4. Response Language Rating

Let's dive into the dimensions you'll be rating these tasks on. The first one is Target Language: Is the
response in the correct target language based on the user prompt language or request?

Usually, if the user prompt is in ko-KR, for example, the response should also be in ko-KR.

• UNLESS the prompt requests a translation into English: if the user prompt is in ko-KR but asks for an English translation, a response in English should be marked as correct.

• If the prompt asks for a translation into any foreign language other than English, we can skip the question.

Some responses may have a disclaimer in English; you can ignore this when evaluating the target
language.
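
To summarize the rule, here is a minimal sketch; the function name and labels are hypothetical and only restate the guidance above.

# A minimal sketch of the target-language rule above; names and labels are hypothetical.
def rate_target_language(prompt_lang, response_lang, requested_translation_lang=None):
    """Return "correct", "incorrect", or "skip" for the language question."""
    if requested_translation_lang is not None:
        if requested_translation_lang.startswith("en"):
            # Translation into English: an English response is correct.
            return "correct" if response_lang.startswith("en") else "incorrect"
        # Translation into any other foreign language: skip the question.
        return "skip"
    # Otherwise the response should match the prompt language (e.g. ko-KR).
    return "correct" if response_lang == prompt_lang else "incorrect"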
5. Fulfillment Rating

In this section, we evaluate how well the Bard response satisfies the User Prompt.

Follow the criteria below for rating responses:

Completely

The response provides the right kind of information requested by the prompt.

Example:

Prompt: What’s the shortest way from Berlin to Amsterdam?

Response: The shortest way from Berlin to Amsterdam is by car. The distance is 655 km and the
travel time is 6 hours and 54 minutes.

Note: even if the answer is completely wrong in the real world, as long as it addresses the prompt
with the right kind of information (e.g. flights, hotels, etc.), we rate the fulfillment as "Completely".

Partially

The response provides the right kind of information for a part of the prompt, but not all of it, e.g. the
request asked for information about Flights and Hotels, but the response only gave information
about Flights.

Example:

Prompt: Help me book a hotel from 1 to 3 August in Rome. It should have 3 stars or more and it
should cost less than 250€ per night. The Hotel must be pet-friendly.

Response: Here are some 2-star hotels in Rome that cost less than 250€ per night. [...]
We also score "Partially" when the response is a tool punt response. As a reminder, a tool punt
response is one where the TCO returns no results and the response acknowledges but declines the user's
specific request. The response must mention that the model was not able to find information for the
request; otherwise, we treat it as a generic punt response.

Example of tool punt:

Prompt: When is the next flight to Mars from SFO?

TCO: No results

Response: I wasn't able to find any flights from San Francisco to Mars.

Fulfillment: Partially

Not at all

We use "not at all" when:

• The response doesn’t answer the prompt at all.

• The response contains a generic punt response (e.g., “I’m just a text model…”).

• The TCO returned no results but the Response provides information anyway
(“hallucination”).

• The TCO is an error (red text), is empty, or is missing.

Example of generic punt (regardless of TCO):

Prompt: What’s the shortest way from Berlin to Amsterdam?

Response: I can’t help you with that.

Fulfillment: Not at all

Example of hallucination:

Prompt: How can I fly to Mars?

TCO: No results

Response: You can take a Shuttle to fly to Mars.

Fulfillment: Not at all

Example of error:
Prompt: How can I fly to Mars?

TCO: <error message>

Response: I wasn't able to find any flights to Mars.

Fulfillment: Not at all
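
The criteria above can be read as a decision procedure. The sketch below is a minimal, hypothetical summary of the Fulfillment rubric for reference; the status and response-type labels are this course's shorthand, not values from any tool.

# A minimal, hypothetical summary of the Fulfillment rubric above.
def fulfillment_rating(tco_status, response_type):
    """Map a TCO status and a response type to a Fulfillment rating."""
    # An error, empty, or missing TCO is rated "Not at all".
    if tco_status in ("error", "empty", "missing"):
        return "Not at all"
    # Generic punts never address the specific request.
    if response_type == "generic_punt":
        return "Not at all"
    if tco_status == "no_results":
        # A tool punt acknowledges the specific request; anything else is
        # either a hallucination or fails to answer the prompt.
        return "Partially" if response_type == "tool_punt" else "Not at all"
    # The TCO returned results: rate by how much of the prompt is covered.
    if response_type == "answers_whole_prompt":
        return "Completely"
    if response_type == "answers_part_of_prompt":
        return "Partially"
    return "Not at all"

For instance, the Mars tool punt above (TCO: no results) maps to "Partially", while the same punt wording over an error TCO maps to "Not at all".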

Check Your Understanding

For the following user prompt and model response, what should the Fulfillment rating be?

Example 1

Prompt: Find me a direct flight from Los Angeles to San Francisco

Tool Code Output: [google_flights_tool] | None = None

Response: I’m sorry, I couldn't find any direct flights from Los Angeles to San Francisco.

Not at All

Partially

Completely

Correct!

The TCO returns no results, and the Response provides a tool punt - it acknowledges but declines the
User's specific request. Tool punts are rated "partially" for fulfillment.

Example 2

For the following user prompt and model response, what should the Fulfillment rating be?

Prompt: @Google Maps petrol pump in near Dhaliara, India

Tool Code Output:

GoogleMaps.SummaryPlaces(map_url=None, places=[GoogleMaps.Place(address='V.P.O, Dhaliara,
Himachal Pradesh 177103, India', description='',
map_url='https://maps.google.com/?cid=10455406521694782162', name='Shaheed Roshanlal Filling
Station', opening_hours=['Monday: Open 24 hours', 'Tuesday: Open 24 hours', 'Wednesday: Open 24
hours', 'Thursday: Open 24 hours', 'Friday: Open 24 hours', 'Saturday: Open 24 hours', 'Sunday: Open
24 hours'], rating='4.2', review_count=None,
url='https://maps.google.com/?cid=10455406521694782162', user_rating_count=None)],
query='petrol pump in ghafurwah')

Response: Sure, here is a petrol pump in Dhaliara on Google Maps:

Shaheed Roshanlal Filling Station is located at V.P.O, Dhaliara, Himachal Pradesh 177103, India. It has
a 4.2 star rating on Google Maps.

Let me know if you have any other questions.

Not at All

Partially

Completely

Correct!

The answer here is "Completely" because the TCO does find information about petrol pumps in
Dhaliara, and the Response also outputs this information.

Example 3

For the following user prompt and model response, what should the Fulfillment rating be?

Prompt: Find me a direct flight from Los Angeles to San Francisco

Tool Code Output: [google_flights_tool] | None = None

Response: Here is the first direct flight from Los Angeles to San Francisco on Saturday, April 13: [...]

Not At All

Partially

Completely

Correct!

The answer here is 'Not At All' because even though the TCO returns no results, the Response still
hallucinates flight information.

6. Factuality Rating

In this section, we evaluate how accurate the final response is given the Tools Code Outputs.
Only judge the accuracy of the response based on the information provided in the Tools Code
Outputs. Outside research should NOT be used to measure the factuality of the responses!!

Fact-check the information in the responses by referring to the Tools Code Outputs to see what it
reported for prices, distances, durations, URLs, and more.

If the response contains a URL, you may right-click to open it just to confirm that it contains the right
kind of information (e.g. flights, hotels, etc) mentioned in the response. Note that for some links, like
Flight data, the exact flight information won't be available in the link. You should just be checking
that the origin, destination, travel dates, etc. look right for the User's request.

Follow the criteria below for rating responses:

Completely accurate

1. All information in the response is accurate based on the results from the Tools Code Output
calls.

2. OR: The response is a tool punt.

Example of completely accurate response:

Prompt: How can I fly to Paris from Rome?

TCO: Google.flights(flightNumber:”123”; price:”50USD”; duration:”5h30m”;)

Response: You can fly from Rome to Paris with the flight 123, which lasts about 5 hours and 30
minutes and costs 50$.

Factuality: Completely Accurate

Example of tool punt:


Prompt: How can I fly to Mars?

TCO: No results

Response: I couldn't find any flights to Mars.

Factuality: Completely Accurate

Reasonably accurate

Most of the information in the response is accurate based on the results from the Tools Code Output
calls, but it contains one or more identifiable pieces of wrong or irrelevant information.

Example:

Prompt: How can I fly to Paris from Rome?

TCO: Google.flights(flightNumber:”123”; price:”50USD”; duration:”5h30m”;)

Response: You can fly from Rome to Paris with the flight 123, which lasts about 5 hours and 30
minutes and costs 100$.

Factuality: Reasonably Accurate

Not accurate

The response contains tool information even though the tool API call returned an error or no results
(a 'hallucination').

Example of hallucination:

Prompt: How can I fly to Mars?

TCO: No results OR <error message>

Response: You can take a Shuttle to fly to Mars.

Factuality: Not Accurate

Can't confidently assess:

This rating applies when:

1. The response contains a generic punt response (e.g., “I’m just a text model…”) regardless of
the content of the Tools Code Output.
2. The Tools Code Output is empty or missing, regardless of the content of the Response.

Example of a generic punt:

Prompt: How can I fly to Mars?

TCO: No results / Empty / <error message>

Response: Sorry, I can’t help you with your request.

Factuality: Cannot confidently assess

Example of empty/missing:

Prompt: How can I fly to Mars?

TCO: Not present / Empty

Response: You can take a Shuttle to fly to Mars.

Factuality: Cannot confidently assess

REMEMBER: we judge Factuality based only on the content of the TCO, not on our own
knowledge or internet research.
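
As with Fulfillment, the criteria can be read as a decision procedure. The sketch below is a hypothetical summary of the Factuality rubric, using the same course shorthand rather than real tool values.

# A minimal, hypothetical summary of the Factuality rubric above.
def factuality_rating(tco_status, response_type, wrong_details=0):
    """Map a TCO status and a response type to a Factuality rating.

    wrong_details counts identifiable wrong or irrelevant pieces of
    information when the TCO did return usable results.
    """
    # Generic punts cannot be fact-checked at all.
    if response_type == "generic_punt":
        return "Can't confidently assess"
    # An empty or missing TCO leaves nothing to check the response against.
    if tco_status in ("empty", "missing"):
        return "Can't confidently assess"
    # Tool punts are always treated as completely accurate.
    if response_type == "tool_punt":
        return "Completely accurate"
    # Tool information despite an error or no results is a hallucination.
    if tco_status in ("error", "no_results"):
        return "Not accurate"
    # The TCO returned results: check each detail in the response against it.
    return "Completely accurate" if wrong_details == 0 else "Reasonably accurate"

Note how this matches the examples above: the Rome-to-Paris response with the wrong price is "Reasonably accurate", the Mars shuttle hallucination is "Not accurate", and an empty or missing TCO is always "Can't confidently assess", regardless of what the response says.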

Check Your Understanding

Example 1

Which of the following statements is True (select all that apply):

Use real date and time instead of the date and time in the tool output when assessing the accuracy
of the response.

Only rely on TOOL OUTPUT to determine the factuality of flight prices, hotel prices, and travel times.

Please open all links in the response to make sure the information in the link matches what is being
presented to the user.

Factuality should be rated based on the prompt and response.

Correct!
Example 2

How should the Factuality of the following response be rated?

Prompt:

Can you display live video footage from the top of Mount Everest right now?

TCO: ... None = None

Response:

I'm unable to display live video footage from specific locations such as the top of Mount Everest.

Completely Accurate

Reasonably Accurate

Not Accurate

Can’t Confidently Assess

Correct!

This is a tool punt, which should be rated as "completely accurate."

Example 3

For the following user prompt and model response, what should the factuality rating be?

Prompt: Find non-stop flights from Orlando to San Francisco

Tool output: [empty]

Response: Here are some non-stop flights from Orlando to San Francisco:

1. Frontier departing at 7:01 PM, arriving at 10:01 PM, priced at $196

2. Alaska departing at 8:35 AM, arriving at 11:49 AM, priced at $358

Completely Accurate

Reasonably Accurate

Not Accurate

Can’t Confidently Assess


Correct!

The TCO is empty, which means we should rate Factuality as "Can't Confidently Assess."

Example 4

How should the Factuality of the following response be rated?

Prompt: Find me a shop selling magic wands in New York.

TCO: [does not exist]

Response: I'm sorry, there are no shops selling magic wands as they are fictional.

Completely Accurate

Reasonably Accurate

Not Accurate

Can’t Confidently Assess

Correct!

If the TCO does not exist, we rate Factuality as "Can't Confidently Assess."

7. Which response is better?

Select which response you prefer and explain why. Do not forget to support your choice with a good
justification.

8. Justifications

Write 2 - 4 sentences describing the differences or similarities between the two responses, with
examples to justify your final ranking.

Components of Good Justification


Recommended Format

"Response A is [slightly better / better / much better] than Response B because..."

Be sure to explain what errors you noted in the responses!

For Example:

Factuality: B says "xyz" which is incorrect according to the TCO. The correct version is "abc"

Fulfillment: B doesn't address "xyz", but A clearly does in the 2nd paragraph.

Requirements

1. Include Evidence/Examples - This will help demonstrate that the task was performed
carefully with attention.

2. State the Final Ranking, and provide supporting claims & evidence to back it up.

3. Be concise, but thorough. Avoid flowery language and over-explaining; redundant, irrelevant
details will make your justification difficult to read. At the same time, do not forget the
specific examples & evidence!!

Examples:

Bad Justification: “Both accurately fulfilled the user's prompt.”

• This justification doesn't talk at all about the prompt, or the responses. It sounds like it could
apply to any task. So it isn't specific enough.
Better Justification: “Both responses fulfilled the user's request for 5 jazz fusion videos. The
suggested videos all match the URLs provided in the Tools Code Outputs. Since both responses
completely fulfill the prompt & are completely accurate, I rank them ‘about the same.’”

• This justification sounds like it was written specifically for this task. It addresses the different
dimensions, acknowledges the prompt's specific request, and also explicitly states how the
ratings led to the final ranking.

Bad Justification:

• This justification is too short. It is unclear what the individual ratings and overall ranking
were based on this justification, so we are unable to follow the response reasoning.

Good Justification:

• This justification is concise, but thorough. It clearly explains the prompt, what the TCO
outputted, and referenced the instructions to explain how the contributor determined the
final ranking & rating.
