Code Extensions - Instructions
Table of Contents
Project Resources
1. Extensions Tool API docs
2. Tool Expectation Cheat Sheet
Task Goals
1. Understand the user’s needs based on the conversation history.
2. Evaluate the quality of the Code for each model.
3. Evaluate the quality of the Response for each model.
4. Identify the better Response and justify why.
Task Overview
Each task will include a conversation between the user and the model, along with two Code
and Code Output sections and two Responses, which you will evaluate and compare.
The user’s request needs to be determined from the entire conversation, with the most recent
request as the main focus. Based on your interpretation of the user’s request, you will evaluate
each Code, Code Output, and Response along specific dimensions. It’s important to note that
your evaluation and ratings of the Code and Code Output section should not affect your
evaluation of the Responses, and vice versa.
Task Workflow
1. Read the conversation and the final prompt. Interpret the user’s expectation from the
model by putting yourself in the shoes of the user.
User: Can you find me flights from San Francisco to New York this weekend that depart before 8AM?
Model: Sure! Here are some flights to New York departing before 8AM this weekend!
…list of flights…
User: Actually, can you change the destination to Miami?
The user is updating the flight destination with their last request. There are no other
instructions given to the model regarding the flight departure date or the time. The model
needs to carry over the “weekend” and “8AM” requirements from the previous conversation
history and find flights from San Francisco to Miami that depart this weekend before 8AM.
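To make this concrete, the ideal execution for this turn might look something like the sketch
below. This is purely illustrative: the google_flights tool, its search function, and the parameter
names are assumptions rather than the actual API, so always refer to the Extensions Tools API
Docs for the available tools, functions, and parameters. The key point is that the destination is
updated to Miami while the “weekend” and “8AM” constraints are carried over.
Python
# Hypothetical sketch only: the tool, function, and parameter names below are assumed
# for illustration and are not taken from the Extensions Tools API Docs.
print(google_flights.search(
    origin="San Francisco",
    destination="Miami",              # updated per the user's latest request
    departure_date="this weekend",    # carried over from the earlier turn
    latest_departure_time="8:00 AM",  # carried over from the earlier turn
))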
User: Can you show me a list of 5 star hotels that are in New York?
Model: Sure! Here are some 5 star hotels in New York!
…list of hotels…
User: Tell me about dinosaurs.
The user is switching directions and is requesting information about dinosaurs. Although the
user requested hotel information in the previous turn, this has no relevance to the user’s most
recent request. The model needs to disregard the previous conversation history when it is
irrelevant.
There is no previous conversation and the model only needs the final request to fulfill the
prompt, which is to find YouTube videos.
The goal is to analyze the conversation to identify user constraints and user expectations.
This step is extremely important since you will be assessing the Code, Code Output and
Responses based on your interpretation of the prompt. It’s helpful to keep notes for tasks that
have many constraints and/or requests. Try to visualize the ideal execution steps the model
should take and the ideal response that would be fulfilling to the user.
The Code and Code Output section contains the model’s Chain of Thought. You will analyze the
code comments and the tool executions taken by the model and give a rating along the
following dimensions.
*For available tools / functions / parameters refer to: Extensions Tools API Docs
Minor Issues: The code partially satisfies the user intent given the prompt and context with the
tools, functions, and parameters used. However, there may have been a better tool, function, or
parameter that would have better satisfied the intent of the user, resulting in a more useful
response.
Major Issues: The code fails to satisfy the intent of the user and will not generate a useful
response. The code involves the incorrect tool, tool function(s), and/or is missing multiple
critical parameters given the prompt & context.
● All or most of the tool was not used. For example, only a call to `print` with a string as an
argument (see the sketch below).
● When the prompt is too ambiguous (one word, missing context, etc.).
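For example, a code block like the following hypothetical sketch uses no tool at all, only a
`print` of a hard-coded string, and would not generate any useful grounding information:
Python
# No tool, function, or parameters are used; the model simply prints a hard-coded string,
# so nothing grounds the final response.
print("Here are some flights from San Francisco to Miami departing this weekend before 8AM!")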
Good: The grounding information provides sufficient information to fully satisfy the user intent.
Bad: The grounding information fails to satisfy the intent of the user and will not generate a
useful response.
(e.g. User asks for public transit directions but Google Maps returns only driving directions)
When the model believes it has performed all necessary steps to address the prompt, it
engages in reasoning by considering its chain of thought and tool executions to synthesize this
information into a final Response. You will analyze and rate each Response along the following
dimensions.
Instruction Following
No Issues: Response completely follows all instructions from the prompt.
Minor Issues: Response satisfies the primary aspect of the user’s request, but misses some
elements. Response is a soft punt: the model accurately states why it can’t do something and
gives helpful and relevant suggestions.
Major Issues: Response ignores, circumvents, or violates key components of the prompt,
rendering the response unhelpful to the user. Response is an invalid punt that does not align
with the user’s implicit/explicit request.
Response is blank.
Truthfulness
No Issues: All claims are accurate and can be verified by reputable web evidence.
Minor Issues: Primary claims are accurate, but at least one secondary claim is inaccurate,
unsupported, or can be disputed by reputable web evidence.
Major Issues: At least one primary claim is inaccurate, unsupported, or can be disputed by
reputable web evidence.
Cannot Assess: Verifying the claims in the response would take more than 15 minutes.
Response is blank.
Minor Issues: Response is generally relevant to the prompt but contains a small portion of
unnecessary content that is repetitive, unhelpful, or irrelevant.
Content Completeness
No Issues: The response gives enough information with sufficient detail to completely fulfill the prompt.
Minor Issues: There is some relevant information that is missing from the response, reducing its
helpfulness. The response might be technically correct but far too terse, leaving more to be desired.
Major Issues: Too much content is missing to fulfill the user’s request in a meaningful way.
N/A: Response is a full punt.
Minor Issues: Response has minor issues in writing quality that make it sound unnatural.
Response has some stylistic issues that lessen its overall engagement. Response is overly
formatted in a distracting way.
Collaborativity
No Issues: Model exhibits characteristics of a collaborative partner by proactively offering
relevant suggestions. Model demonstrates a strong understanding of the user's broader
objectives and actively contributes to achieving them. Response does not solely rely on the
user to maintain momentum of the conversation.
Minor Issues: Model generally acted as a collaborative partner, but there are a few instances
where it could have been more proactive or helpful. Model maintains a collaborative approach
to addressing the user's needs, but the follow-up questions are too generic, and the
suggestions are slightly off-target.
Major Issues: Model focuses primarily on responding to the immediate query without
considering the user's overall goal. Seems to be trying to end the conversation.
Contextual Awareness
No Issues: Response consistently recalled and built upon information from the entire
conversation history, demonstrating a strong understanding of the ongoing context. Response
effectively references and incorporates past details, delivering relevant and personalized
replies.
Minor Issues: Model remembers and builds upon context from previous turns, but there are
instances where it could have done so more effectively. Response misses some minor details,
or contains slight misinterpretation of prior statements.
Major Issues: Response shows clear signs of struggling to remember or build upon information
and instructions from the conversation history. Model fails to take into account previously
communicated details and provides a response that is disconnected from the ongoing
conversation.
After rating the responses along each dimension, you will give an Overall Quality score for the
response.
Overall Quality
Cannot be improved: The response is flawless and cannot be meaningfully improved. There
are no major or minor issues in any Response rating dimensions.
Okay: Response addresses the main user intent but does not completely fulfill it. There are no
major issues, but there are several minor issues.
Pretty bad: Response has at least one major issue along any of the response rating
dimensions. Response does not satisfy the user’s intent, with the exception of avoiding safety
issues.
Horrible: Response has multiple major issues and is unhelpful and frustrating.
Embedded UI
For example, a response may read:
I searched for business class flights from Mountain View (SFO) to various destinations
departing in July. Here are some options for round trip flights, departing from Mountain View.
And the rest of the response is blank. While it’s understandable to think that this looks like a
broken response, we must check for embedded UI. This is when the model presents content in
a more dynamic way using images and other UI components that are not traditionally available
in text format.
In order to see whether embedded UI exists in the response, we have to turn off the “render in
markdown” format option.
Once turned off, you’ll see something like below added to the final response.
<Note to reviewer: an embedded UI with flights from Google Flights will be shown to the user
here>
If you see this, assume that all of the flight data the response claims to provide will be shown
to the user. Whenever a response claims that data will be given but the data appears to be
missing, always remember to check for embedded UI.
After evaluating both Responses, you will select the better response using the response
selector, provide a SxS score to specify to what extent one response is better over the other,
and write a justification to explain why the selected response is better. If no preference was
given, explain why neither response is favorable over the other.
Reflect on the work done with the Response rating dimensions. A good justification should
begin with why one response is better than the other, followed by a brief description of what
each response did and why these factors were relevant in selecting the better response.
A long justification doesn’t mean it’s a good justification for this project. Aim to provide enough
references to explain why one response is superior without including unnecessary details that
do not enhance the justification. Focus on what sets the selected response apart from the less
favorable one.
Highlighting what distinguishes the selected response from the less favorable one is the
goal.
Unset
@Response 2 is better than @Response 1 because @Response 2 gives the user the
answer to their mathematical equation while also pointing out the major highlights of
the response using bolded words. Both responses answer the user's prompt, but
@Response 2 provides a better, more understandable response and gives the user the
option to ask another question by ending the response with "Would you like to explore
another problem or concept related to complex numbers or the FFT". @Response 2
has a thorough explanation of the equation but highlights the key takeaways, which
the user would find beneficial. @Response 1 provides the same answer as
@Response 2, however @Response 1 has a more complex explanation that the user
may find not as clear and harder to understand.
Task Workflow Recap
You made it! We went over how to evaluate the Code and Code Output, how to evaluate and
compare the Responses, and how to write a good justification.
As we go through the project, we will inevitably run into nuanced situations. If you come across
a task where the instructions are insufficient, please share this in the project channels so we
can keep up with the changes. If there are any other changes you would like to see with the
instructions, please feel free to reach out to a project manager.
Punt:
The response can be what we call a punt. This is when the model refuses to answer the
prompt. Punts can be valid or invalid.
Valid Punt:
The punt is valid when the model truthfully claims it cannot perform a task. It’s important to
note that a punt is only valid when it makes sense with respect to the prompt.
Unset
“Summarize https://www.nytimes.com/2024/02/01/business/ai-impact-jobs.html.”
Code:
Python
print(browsing.browse(query="Can you summarize this article for me?", url="https://www.nytimes.com/2024/02/01/business/ai-impact-jobs.html"))
Output:
Unset
“I'm sorry. I'm not able to access the website(s) you've provided. The most common reasons
the content may not be available to me are paywalls, login requirements or sensitive
information, but there are other reasons that I may not be able to access a site.”
We can see that the correct tool, function, and parameters are used. After visiting the nytimes
link, the model is correct about there being a login requirement to view the article. This is a
valid punt.
Invalid Punt:
An invalid punt is when the model falsely claims that it cannot perform a task. Using the valid
punt example above, if there were no login requirement on The New York Times and the
articles were freely available, we can conclude that this is an invalid punt.
Full Punt:
A full punt is when the model simply states it can’t perform a task with no explanation.
● Full Punt Example 1: I'm sorry, but I'm not able to access the video/website you've
provided. Can I help with anything else?
● Full Punt Example 2: I'm just a language model, so I can't help you with that.
● Full Punt Example 3: I'd be happy to help you find flights for your ambitious trip! I'll
need some additional information: Travel Dates, Flight Preferences, Flexibility
● Full Punt Example 4: I currently cannot search for existing reservations using Google
Hotels, but I can search for hotel confirmation emails if you enable the Gmail
Workspace Extension.
Hallucinations:
Hallucinations are claims from the model that can’t be verified from the chain of thought or by
research. For creative assignments, hallucinations might be acceptable, but hallucinations that
give misleading, factually incorrect information are not acceptable.
The following are known tool behaviors to keep in mind (see the Tool Expectation Cheat Sheet):
● Google Flights: will assume the user’s location (SFO, SJC, the closest airports to Mountain View)
● Google Flights: will sometimes return flights with LAX as the destination, and will sometimes
return flights to different locations based on the different parameters
● Google Flights: will find flights for the following week with a trip duration of one week
● Google Hotels: will find hotels for the following week with a stay duration of one week
● Google Maps: will default to travel_mode="driving"
● Google Flights: will return round trip flights
● Browsing: will state to use Google Search to try answering the question
● Google Hotels: when the user asks to find hotels for more than 6 people, will correct this
down to 6 even if the parameter value is over 6
Desirable Model Behaviors
Only URL(s)
Prompt:
Unset
https://www.youtube.com/watch?v=wVpaP6IyyvY
Tool + Function: youtube.question_answer
Prompt:
Unset
https://en.wikipedia.org/wiki/United_States
Tool + Function: browsing.browse
Explanation:
If it’s a link to a YouTube video, we can assume the user wants a summary of the video with
the youtube tool. Anything else besides a summary is incorrect and unfulfilling.
If it’s a link to an article/non-YouTube website, we can assume the user wants a summary via
the browsing tool. Anything else besides a summary is incorrect and unfulfilling.
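As a rough sketch, the expected code for these two prompts might look like the calls below.
The browsing.browse call mirrors the valid punt example shown earlier; the
youtube.question_answer parameter names are assumptions, so refer to the Extensions Tools
API Docs for the actual signature.
Python
# Hypothetical sketch: youtube.question_answer parameter names are assumed.
print(youtube.question_answer(url="https://www.youtube.com/watch?v=wVpaP6IyyvY", query="Summarize this video."))
# browsing.browse with query and url parameters, as in the valid punt example above.
print(browsing.browse(query="Summarize this article.", url="https://en.wikipedia.org/wiki/United_States"))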
Videos
Prompt:
Unset
Find me a youtube videos of orange cats.
Tool + Function: youtube.search
Prompt:
Unset
Find me that video where the person goes ahhh and then the other person goes
woah
Tool + Function: google_search.search
Explanation:
If youtube is mentioned in the prompt, we want the model to use the youtube tool.
If youtube is not mentioned in the prompt, we want the model to use google search since it
has wider access to the general web.
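A rough sketch of the expected calls is below; the parameter names are assumptions, so refer
to the Extensions Tools API Docs for the actual function signatures.
Python
# Hypothetical sketch: parameter names are assumed.
print(youtube.search(query="orange cats"))  # youtube is mentioned in the prompt
print(google_search.search(query="video where the person goes ahhh and then the other person goes woah"))  # youtube is not mentioned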
Locations
Prompt:
Unset
Parkwest Bicycle Casino Bell Gardens, CA 90201, United States
Prompt:
Unset
CHRIS CAKES STL LLC
Tool + Function: google_maps.query_places or google_search.search
Explanation:
Both tools are valid to use when the prompt is just a location or a point of interest.
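Either call sketched below would be acceptable for these prompts; the parameter names are
assumptions, so refer to the Extensions Tools API Docs for the actual function signatures.
Python
# Hypothetical sketch: parameter names are assumed; both tools are valid here.
print(google_maps.query_places(query="Parkwest Bicycle Casino Bell Gardens, CA 90201, United States"))
print(google_search.search(query="CHRIS CAKES STL LLC"))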
Frequently Asked Questions (FAQ)
*TBD