Course Overview
For this task, you are given one prompt and two responses, each of which includes a Tools Code
Output (TCO) and the corresponding Bard Response. For each Response, you are asked to rate a set
of dimensions, described in the sections below.
At the end of the task, you are asked the question, "Which response is better?" You should explain
your choice - and any errors you noticed - in the justification box at the bottom of the task.
Below the prompt, you will see a purple box that says "Chain of Thought viewer".
This is the Tools Code Output, where the model makes calls to different tools to find the information
the User requested. There are two TCOs: the one on the left shows the tool results for Response A,
while the one on the right shows the tool results for Response B.
When evaluating the Factuality of the responses (more on this later), you will refer to the TCO to see
what it reported for prices, distances, durations, URLs, and more.
Edge Cases
The TCO doesn't always output the information the User is looking for. Instead, sometimes it might:
1. return no results
2. return an error
3. be empty
4. be missing
Each of these must be rated differently for Fulfillment and Factuality. We will cover this later in the
course.
The TCO returns no results when it calls the relevant tools but cannot find the information
requested. In this case, the TCO may show a series of "none" query results.
And sometimes the TCO is missing entirely, meaning there is no purple box above the response.
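As a rough mental model, the TCO states described above can be sketched as an enumeration. This is purely illustrative (the class and member names are assumptions for this sketch, not part of the rating tool):

```python
from enum import Enum

class TCOState(Enum):
    HAS_RESULTS = "results"      # tool calls succeeded and found information
    NO_RESULTS = "no_results"    # tool calls ran but found nothing (e.g. "none" query results)
    ERROR = "error"              # a tool call returned an error
    EMPTY = "empty"              # the purple box is present but contains nothing
    MISSING = "missing"          # there is no purple box above the response at all
```

Keeping these five states distinct matters because, as covered later in the course, each one maps to a different Fulfillment and Factuality rating.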
Notes
• Execution Step 1 usually does not contain any information relevant to the Bard
response. Do not refer to it when judging the accuracy of a response. Instead, focus only
on the "Result" box, which has a blue "see execution steps" button.
• The TCO output often extends beyond the visible box in the tool. Scroll right or left to view all
the information.
3. Task Components: The Responses
The Bard Responses take the results of the Tools Code Outputs and give the User a readable version
of that information.
We would consider this a regular response - the TCO outputs some information, and the Response
gives that information in a readable way (and in the correct language!).
1. Hallucinations
A hallucination is when the model invents information that isn't present in the TCO. For example, the
TCO may be an error message or return no results, but the Bard Response still says something like,
"Here are your flights..."
A tool punt response is when the model informs the User that the request cannot be fulfilled
because either 1) no results were returned for the Tools Code Output calls, or 2) the tool does not
currently have the capability. In this case, the Response will say something that acknowledges the
User's specific request, for example, "I wasn't able to find flights from Paris to London on the date
you requested."
If it doesn't refer to the User's specific request, we call this instead a generic punt response. The
model will give a generic reason for being unable to fulfill the request, like, "I'm just a large language
model," or, "I'm unable to help you with that request."
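The distinction between a tool punt and a generic punt can be sketched as a minimal Python helper. This is a simplification for illustration only (a human rater makes this judgment; the function name and the keyword heuristic are assumptions, not part of the rating tool):

```python
def classify_punt(response: str, request_keywords: list[str]) -> str:
    """Classify a declining response as a "tool punt" or a "generic punt".

    A tool punt acknowledges the User's specific request (e.g. it names
    the route or the dates); a generic punt gives only a generic refusal.
    The keyword check below is a crude stand-in for that human judgment.
    """
    text = response.lower()
    if any(keyword.lower() in text for keyword in request_keywords):
        return "tool punt"
    return "generic punt"

classify_punt(
    "I wasn't able to find flights from Paris to London on the date you requested.",
    ["Paris", "London"],
)  # -> "tool punt"
classify_punt("I'm just a large language model.", ["Paris", "London"])  # -> "generic punt"
```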
4. Target Language Rating
Let's dive into the dimensions you'll be rating these tasks on. The first one is Target Language: is the
response in the correct target language, based on the user prompt's language or request?
Usually, for example, if the user prompt is in ko-KR, then the response should also be in ko-KR.
• UNLESS: If the user prompt is in ko-KR and is requesting a translation into English, then a
response in English should be marked as correct.
• If the prompt asks for a translation into any language other than English, we can skip
the question.
Some responses may have a disclaimer in English; you can ignore this when evaluating the target
language.
5. Fulfillment Rating
In this section, we evaluate how well the Bard response satisfies the User Prompt.
Completely
The response provides the right kind of information requested by the prompt.
Example:
Response: The shortest way from Berlin to Amsterdam is by car. The distance is 655 km and the
travel time is 6 hours and 54 minutes.
Note: even if the answer is completely wrong in the real world, as long as it addresses the prompt
with the right kind of information (e.g. flights, hotels, etc.), we rate the fulfillment as "Completely".
Partially
The response provides the right kind of information for a part of the prompt, but not all of it, e.g. the
request asked for information about Flights and Hotels, but the response only gave information
about Flights.
Example:
Prompt: Help me book a hotel from 1 to 3 August in Rome. It should have 3 stars or more and it
should cost less than 250€ per night. The Hotel must be pet-friendly.
Response: Here are some 2-star hotels in Rome that cost less than 250€ per night. [...]
We also score "Partially" when the response is a tool punt response. As a reminder, a tool punt
response is when the TCO returns no results, and the response acknowledges but declines the user's
specific request. The response must mention that the model was not able to find information for the
request, otherwise we treat that as a generic punt response.
TCO: No results
Response: I wasn't able to find any flights from San Francisco to Mars.
Fulfillment: Partially
Not at all
• The response contains a generic punt response (e.g., “I’m just a text model…”).
• The TCO returned no results but the Response provides information anyway
(“hallucination”).
Example of hallucination:
TCO: No results
Response: Here are your flights... (the Response invents flight details despite the empty results)
Example of error:
Prompt: How can I fly to Mars?
TCO: Error
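Before moving on to the practice examples, the Fulfillment rules above can be summarized as a minimal decision sketch. This is illustrative only (the function and parameter names are assumptions, not part of the rating tool):

```python
def fulfillment(tco_has_results: bool, response_kind: str, covers_all_parts: bool = True) -> str:
    """Map the Fulfillment rules to a rating.

    response_kind: "answer" (gives the requested kind of information),
                   "tool_punt", or "generic_punt".
    covers_all_parts: whether an answer addresses every part of the prompt.
    """
    if response_kind == "generic_punt":
        return "Not at all"
    if response_kind == "tool_punt":
        return "Partially"
    # response_kind == "answer"
    if not tco_has_results:
        return "Not at all"    # hallucination: information invented despite no results
    return "Completely" if covers_all_parts else "Partially"
```

Note that real-world correctness plays no role here: an answer backed by TCO results is rated "Completely" as long as it gives the right kind of information for the whole prompt.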
Example 1
For the following user prompt and model response, what should the Fulfillment rating be?
Response: I’m sorry, I couldn't find any direct flights from Los Angeles to San Francisco.
Not at All
Partially
Completely
Correct!
The TCO returns no results, and the Response provides a tool punt - it acknowledges but declines the
User's specific request. Tool punts are rated "partially" for fulfillment.
Example 2
For the following user prompt and model response, what should the Fulfillment rating be?
Response: Shaheed Roshanlal Filling Station is located at V.P.O, Dhaliara, Himachal Pradesh 177103,
India. It has a 4.2-star rating on Google Maps.
Not at All
Partially
Completely
Correct!
The answer here is "Completely" because the TCO does find information about petrol pumps in
Dhaliara, and the Response also outputs this information.
Example 3
For the following user prompt and model response, what should the Fulfillment rating be?
TCO: No results
Response: Here is the first direct flight from Los Angeles to San Francisco on Saturday, April 13: [...]
Not At All
Partially
Completely
Correct!
The answer here is 'Not At All' because even though the TCO returns no results, the Response still
hallucinates flight information.
6. Factuality Rating
In this section, we evaluate how accurate the final response is given the Tools Code Outputs.
Only judge the accuracy of the response based on the information provided in the Tools Code
Outputs. Outside research should NOT be used to measure the factuality of the responses!!
Fact-check the information in the responses by referring to the Tools Code Outputs to see what it
reported for prices, distances, durations, URLs, and more.
If the response contains a URL, you may right-click to open it just to confirm that it contains the right
kind of information (e.g. flights, hotels, etc) mentioned in the response. Note that for some links, like
Flight data, the exact flight information won't be available in the link. You should just be checking
that the origin, destination, travel dates, etc. look right for the User's request.
Completely accurate
All information in the response is accurate based on the results from the Tools Code Output
calls.
Example:
TCO: Flight 123 from Rome to Paris, lasting 5 hours and 30 minutes and costing 50$.
Response: You can fly from Rome to Paris with the flight 123, which lasts about 5 hours and 30
minutes and costs 50$.
Reasonably accurate
Most of the information in the response is accurate based on the results from the Tools Code Output
calls, but it contains one or more identifiable pieces of wrong or irrelevant information.
Example:
Response: You can fly from Rome to Paris with the flight 123, which lasts about 5 hours and 30
minutes and costs 100$. (The TCO reported a price of 50$, so the price here is wrong.)
Not accurate
1. The response contains information from tools despite there being either an error or no results
from the tool API call ("hallucination").
2. The response contains a generic punt response (e.g., "I'm just a text model…"), regardless of
the content of the Tools Code Output.
Example of hallucination:
TCO: No results
Response: Here are your flights...
Can't Confidently Assess
The Tools Code Output is empty or missing, regardless of the content of the Response. An empty
TCO shows a purple box with no results in it; a missing TCO means there is no purple box above the
response at all (see Examples 3 and 4 below).
REMEMBER: we judge Factuality only based off of the content of the TCO, not based off our own
knowledge or internet research.
Example 1
Which of the following statements about Factuality is correct?
• Use the real date and time instead of the date and time in the tool output when assessing the
accuracy of the response.
• Only rely on the TOOL OUTPUT to determine the factuality of flight prices, hotel prices, and travel
times.
• Open all links in the response to make sure the information in the link matches exactly what is
being presented to the user.
Correct!
Only the second statement is correct: factuality is judged solely against the tool output. Links only
need a quick check that the origin, destination, travel dates, etc. look right for the User's request.
Example 2
Prompt:
Can you display live video footage from the top of Mount Everest right now?
Response:
I'm unable to display live video footage from specific locations such as the top of Mount Everest.
Completely Accurate
Reasonably Accurate
Not Accurate
Correct!
Example 3
For the following user prompt and model response, what should the factuality rating be?
Tool output: (empty)
Response: Here are some non-stop flights from Orlando to San Francisco:
Completely Accurate
Reasonably Accurate
Not Accurate
Can't Confidently Assess
Correct!
The TCO is empty, which means we should rate Factuality as "Can't Confidently Assess."
Example 4
Response: I'm sorry, there are no shops selling magic wands as they are fictional.
Completely Accurate
Reasonably Accurate
Not Accurate
Can't Confidently Assess
Correct!
If the TCO does not exist, we rate Factuality as "Can't Confidently Assess."
7. Final Ranking
Select which response you prefer and explain why, with a good justification.
8. Justifications
Write 2-4 sentences describing the differences or similarities between the two responses, with
examples to justify your final ranking.
For Example:
Factuality: B says "xyz" which is incorrect according to the TCO. The correct version is "abc"
Fulfillment: B doesn't address "xyz", but A clearly does in the 2nd paragraph.
Requirements
1. Include Evidence/Examples - This will help demonstrate that the task was performed
carefully with attention.
2. State the Final Ranking, and provide supporting claims & evidence to back it up.
3. Be concise, but thorough. Avoid flowery language and over-explaining; redundant, irrelevant
details will make your justification difficult to read. At the same time, do not forget the
specific examples & evidence!!
Examples:
• This justification doesn't talk at all about the prompt or the responses. It sounds like it could
apply to any task, so it isn't specific enough.
Better Justification: “Both responses fulfilled the user's request for 5 jazz fusion videos. The
suggested videos all match the URLs provided in the Tools Code Outputs. Since both responses
completely fulfill the prompt & are completely accurate, I rank them ‘about the same.’”
• This justification sounds like it was written specifically for this task. It addresses the different
dimensions, acknowledges the prompt's specific request, and also explicitly states how the
ratings led to the final ranking.
Bad Justification:
• This justification is too short. It is unclear what the individual ratings and overall ranking
were based on this justification, so we are unable to follow the response reasoning.
Good Justification:
• This justification is concise, but thorough. It clearly explains the prompt, what the TCO
outputted, and referenced the instructions to explain how the contributor determined the
final ranking & rating.