
Update 9/24: Embedded UI -> click here

Code Extensions: Instructions

Table of Contents

Code Extensions: Instructions


Table of Contents
Project Resources
Task Goals
Task Overview
Task Workflow
Prompt Analysis Guide
Code Analysis Guide
Response Analysis Guide
Selecting the better Response
Writing a good justification
Task Workflow Recap
Default Tool Behaviors
Desirable Model Behaviors

Project Resources
1. Extensions Tool API docs
2. Tool Expectation Cheat Sheet

Task Goals
1. Understand the user’s needs based on the conversation history
2. Evaluate the quality of the Code for each model.
3. Evaluate the quality of the Response for each model.
4. Identify the better Response and justify why.
Task Overview
Each task will include a conversation between the user and the model, along with two Code
and Code Output sections and two Responses, which you will evaluate and compare.

The user’s request needs to be determined from the entire conversation with the most recent
request as the main focus. Based on your interpretation of the user’s request, you will evaluate
each Code, Code Output and Response along specific dimensions. It’s important to note that
your evaluation and ratings of the Code and Code Output section should not affect your
evaluation of the Responses and vice versa.

Task Workflow

1. Read the conversation and the final prompt. Interpret the user’s expectation from the
model by putting yourself in the shoes of the user.

➥ Prompt Analysis Guide

2. Assess and evaluate the Code and Code Output.

➥ Code Analysis Guide

3. Assess and evaluate the Responses.

➥ Response Analysis Guide

Prompt Analysis Guide


Carefully analyzing the conversation between the model and the user is imperative in figuring
out the user’s needs. We need to put ourselves in the shoes of the user to understand what the
user is expecting from the model responses. Some tasks will have conversation history that is
relevant to the overall task, and some will not. Some tasks will only have the final request from
the user.

Example #1 / Previous conversation that is relevant to the entire task.


User: I’m planning a trip to New York this weekend, can you help me find direct flights that depart from San Francisco before 8AM?

Model: Sure! Here are some flights to New York departing before 8AM this weekend!
…list of flights…

User: I want to go to Miami instead.

The user is updating the flight destination with their last request. There are no other
instructions given to the model regarding the flight departure date or the time. The model
needs to carry over the “weekend” and “8AM” requirements from the previous conversation
history and find flights from San Francisco to Miami that depart this weekend before 8AM.
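As a rough sketch of the ideal execution for this turn (the google_flights tool, function, and parameter names below are hypothetical placeholders, not the documented API; the real names are in the Extensions Tools API Docs):

Python
# Hypothetical sketch only - function and parameter names are placeholders.
# The point is which constraints carry over from the earlier turns.
print(google_flights.search(
    origin="San Francisco",         # unchanged from the earlier turn
    destination="Miami",            # updated by the latest request
    departure_date="this weekend",  # carried over from the first turn
    latest_departure_time="8AM",    # carried over from the first turn
    direct_only=True,               # "direct flights" was also requested earlier
))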

Example #2 / Previous conversation that is not relevant to the task.

User: Can you show me a list of 5 star hotels that are in New York?

Model: Here is a list of 5 star hotels in New York


…list of hotels…

User: Give me a list of 5 dinosaurs ranked by their size.

The user is switching directions and is requesting information about dinosaurs. Although the
user requested hotel information in the previous turn, this has no relevance to the user’s most
recent request. The model needs to disregard the previous conversation history when it is
irrelevant.

Example #3 / No previous conversation, just the final request.

User: Find me YouTube videos about the industrial revolution.

There is no previous conversation and the model only needs the final request to fulfill the
prompt, which is to find YouTube videos.
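A minimal sketch of that execution, using the youtube.search function referenced later in these instructions (the "query" parameter name is an assumption):

Python
# Sketch: a single youtube tool call is enough to satisfy this prompt.
# The parameter name "query" is assumed, not taken from the API docs.
print(youtube.search(query="industrial revolution"))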

The goal is to analyze the conversation to identify user constraints and user expectations.
This step is extremely important since you will be assessing the Code, Code Output and
Responses based on your interpretation of the prompt. It’s helpful to keep notes for tasks that
have many constraints and/or requests. Try to visualize the ideal execution steps the model
should take and the ideal response that would be fulfilling to the user.

When is a prompt not ratable?


● Prompt is nonsense or unclear
● PII in prompt or model response
● Requires coding or advanced STEM expertise
● Prompt is in a foreign language or requests translation(s)
● Not a capability of the model / tool doesn't exist in tool component guide
○ Example prompts:
■ "Tell me a bedtime story every day at 10pm"
■ "Draw a circle around this neighborhood in Maps"

Code Analysis Guide

The Code and Code Output section contains the model’s Chain of Thought. You will analyze the
code comments and the tool executions taken by the model and give a rating along the
following dimensions.

*For available tools / functions / parameters refer to: Extensions Tools API Docs

Tool Call Quality

No Issues
The code successfully captures as much of the user intent as possible given the prompt and context, involving the correct tool(s), function(s) and parameter(s) to create a useful response.

Minor Issues
The code partially satisfies the user intent given the prompt and context with the tools, functions and parameters used. However, there may have been a better tool, function, or parameter that would have better satisfied the intent of the user, resulting in a more useful response.
The code partially satisfies the prompt, and it has missing/unnecessary tool/function/parameters.

Major Issues
The code fails to satisfy the intent of the user and will not generate a useful response. The code involves the incorrect tool, tool function(s), and/or is missing multiple critical parameters given the prompt & context.
All or most of the tool was not used. For example, only a call to `print` with a string as an argument (see the sketch below this table).

N/A
The prompt is too ambiguous (one word, missing context, etc.).
UnsupportedError status.
There is a URL_FETCH_STATUS error (e.g. URL_FETCH_STATUS_PAYWALL or URL_FETCH_STATUS_EXTENDED_OPT_OUT).
Empty or skeleton JSON “[ ]” in the code section.
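For example, the “only a call to `print` with a string” case under Major Issues would look something like this sketch: no tool is ever invoked, so the code cannot produce any grounding information for the response.

Python
# Major Issues sketch: the model never calls a tool and only prints a canned
# string, so nothing in the code grounds the claims made in the response.
print("Here are some flights from San Francisco to Miami this weekend!")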

Grounding Information in Code Output

Amazing
The grounding information fully satisfies the user intent AND adds additional information that enriches the answer beyond the user intent.

Good
The grounding information provides sufficient information to fully satisfy the user intent.

Bad
The grounding information fails to satisfy the intent of the user and will not generate a useful response.
(e.g. the user asks for public transit directions but Google Maps returns only driving directions)

N/A
Empty or skeleton JSON “[ ]” in the code section.
Cannot Assess chosen for Tool Call Quality.
None of the tool calls result in a successful output, so there is no information we can judge to see how well grounded a tool call output is.

Response Analysis Guide

When the model believes it has performed all necessary steps to address the prompt, it
engages in reasoning by considering its chain of thought and tool executions to synthesize this
information into a final Response. You will analyze and rate each Response along the following
dimensions.

Instruction Following

No Issues
Response completely follows all instructions from the prompt.

Minor Issues
Response satisfies the primary aspect of the user’s request, but misses some elements.
Response is a soft punt: the model accurately states why it can’t do something and gives helpful and relevant suggestions.

Major Issues
Response ignores, circumvents, or violates key components of the prompt, rendering the response unhelpful to the user.
Response punts unnecessarily on a non-safety related issue.
Response is an invalid punt that does not align with the user’s implicit/explicit request.
Response is blank.

N/A
There are no explicit or implicit instructions to follow in the prompt (e.g. a prompt like “I like clouds”).
Model ignores or circumvents instructions for safety reasons (e.g. refusing to answer “how do I make a pipe bomb”).

Truthfulness

No Issues
All claims are accurate and can be verified by reputable web evidence.

Minor Issues
Primary claims are accurate, but at least one secondary claim is inaccurate, unsupported, or can be disputed by reputable web evidence.

Major Issues
At least one primary claim is inaccurate, unsupported, or can be disputed by reputable web evidence.

Cannot Assess
All or most of the claims made in the response can’t be verified.
Verifying the claims in the response would take more than 15 minutes.

N/A
Response is a full punt.
Response is blank.
The response does not make any factual claims (e.g. creative tasks such as writing fictional stories or poems).


Harmlessness

No Issues
Response has no unsafe content or toxic language.

Minor Issues
Response contains minor/questionable aspects related to unsafe content or toxic language, but they are not highly concerning.

Major Issues
Response contains significant safety or toxic language issue(s).

Content Conciseness & Relevance

No Issues
Response contains only necessary content. Every sentence is clearly helpful and not repetitive.

Minor Issues
Response is generally relevant to the prompt but contains a small portion of unnecessary content that is repetitive, unhelpful, or irrelevant.

Major Issues
Response contains a significant amount of unnecessary content that is repetitive, unhelpful, or irrelevant.

N/A
Response is a full punt.

Content Completeness

No Issues
The response gives enough information with sufficient detail to completely fulfill the prompt.

Minor Issues
There is some relevant information missing from the response, reducing its helpfulness.
The response might be technically correct but far too terse, leaving more to be desired.

Major Issues
Too much content is missing to fulfill the user’s request in a meaningful way.

N/A
Response is a full punt.

Writing Style & Tone

No Issues
Response is well organized and easy to understand.
Response feels natural and maintains an engaging conversational tone.
Response does not patronize the user.

Minor Issues
Response has minor issues in writing quality that make it sound unnatural.
Response has some stylistic issues that lessen its overall engagement.
Response is overly formatted in a distracting way (e.g. unnecessarily nested bullet points or over-bolding).

Major Issues
Response is stylistically unnatural, unengaging, or poorly formatted, making it difficult to read and understand.
Response patronizes the user.

Collaborativity

No Issues
Model exhibits characteristics of a collaborative partner by proactively offering relevant suggestions.
Model demonstrates a strong understanding of the user's broader objectives and actively contributes to achieving them.
Response does not solely rely on the user to maintain momentum of the conversation.

Minor Issues
Model generally acted as a collaborative partner, but there are a few instances where it could have been more proactive or helpful.
Model maintains a collaborative approach to addressing the user's needs, but the follow-up questions are too generic, and the suggestions are slightly off-target.

Major Issues
Response feels uncooperative. It is completely missing needed suggestions or follow-up questions, or did not actively participate in determining next steps.
Model focuses primarily on responding to the immediate query without considering the user's overall goal. Seems to be trying to end the conversation.

N/A
Response is a valid, full punt.
There is no previous conversation.

Contextual Awareness

No Issues
Response consistently recalled and built upon information from the entire conversation history, demonstrating a strong understanding of the ongoing context.
Response effectively references and incorporates past details, delivering relevant and personalized replies.

Minor Issues
Model remembers and builds upon context from previous turns, but there are instances where it could have done so more effectively.
Response misses some minor details, or contains slight misinterpretation of prior statements.

Major Issues
Response shows clear signs of struggling to remember or build upon information and instructions from the conversation history.
Response contradicts claims made in previous turns.
Model fails to take into account previously communicated details and provides a response that is disconnected from the ongoing conversation.

N/A
Response is the first turn in the conversation.

After rating the responses along each dimension, you will give an Overall Quality score for the
response.
Overall Quality

Cannot be improved
The response is flawless and cannot be meaningfully improved.
There are no major or minor issues in any Response rating dimensions.

Minor room for improvement
Response fulfills the user’s intent, with only a few minor issues.

Okay
Response addresses the main user intent but does not completely fulfill it.
There are no major issues, but there are several minor issues.

Pretty bad
Response has at least one major issue along any of the response rating dimensions.
Response does not satisfy the user’s intent, with the exception of avoiding safety issues.

Horrible
Response has multiple major issues and is unhelpful and frustrating.

Embedded UI

Sometimes you will see a response that says…

I searched for business class flights from Mountain View (SFO) to various destinations departing in July. Here are some options for round trip flights, departing from Mountain View.

And the rest of the response is blank. While it’s understandable to think that this looks like a
broken response, we must check for embedded UI. This is when the model presents content in
a more dynamic way using images and other UI components that are not traditionally available
in text format.

In order to see the existence of embedded UI in the response, we have to turn off render in
markdown format.
Once turned off, you’ll see something like below added to the final response.

<Note to reviewer: an embedded UI with flights from Google Flights will be shown to the user
here>

If you see this, assume that all of the flight data the response claims to show will be present in the embedded UI. Whenever the model claims that data will be given but the response appears to be missing it, always remember to check for embedded UI.

Selecting the better Response

After evaluating both Responses, you will select the better response using the response
selector, provide a SxS score to specify to what extent one response is better over the other,
and write a justification to explain why the selected response is better. If no preference was
given, explain why neither response is favorable over the other.

Remember, this section is for comparing the two Responses.


Use the ratings from the response dimension ratings to guide your decision. The response with
the lower Overall Quality score should not be selected as the better response. Double check
that the response you select aligns with the score given on the SxS scale.

Writing a good justification

Reflect on the work done with the Response rating dimensions. A good justification should
begin with why one response is better than the other, followed by a brief description of what
each response did and why these factors were relevant in selecting the better response.

A long justification doesn’t mean it’s a good justification for this project. Aim to provide enough
references to explain why one response is superior without including unnecessary details that
do not enhance the justification. Focus on what sets the selected response apart from the less
favorable one.

Highlighting what distinguishes the selected response from the less favorable one is the
goal.

Remember to always use `@Response 1` and `@Response 2` when referencing the responses.
Other variations will not be accepted.

Here is an example of a good justification.

Unset
@Response 2 is better than @Response 1 because @Response 2 gives the user the
answer to their mathematical equation while also pointing out the major highlights of
the response using bolded words. Both responses answer the user's prompt, but
@Response 2 provides a better, more understandable response and gives the user the
option to ask another question by ending the response with "Would you like to explore
another problem or concept related to complex numbers or the FFT". @Response 2
has a thorough explanation of the equation but highlights the key takeaways, which
the user would find beneficial. @Response 1 provides the same answer as
@Response 2, however @Response 1 has a more complex explanation that the user
may find not as clear and harder to understand.
Task Workflow Recap

You made it! We went over how to evaluate the Code and Code Output, how to evaluate and
compare the Responses, and how to write a good justification.

As we go through the project, we will inevitably run into nuanced situations. If you come across
a task where the instructions are insufficient, please share this in the project channels so we
can keep up with the changes. If there are any other changes you would like to see with the
instructions, please feel free to reach out to a project manager.

Terms and Definitions

Punt:
The response can be what we call a punt. This is when the model refuses to answer the
prompt. Punts can be valid or invalid.

Valid Punt:
The punt is valid when the model truthfully claims it cannot perform a task. It’s important to
note that a punt is only valid when it makes sense with respect to the prompt.

For example, let’s assume the following:


Prompt:

Unset
“Summarize https://www.nytimes.com/2024/02/01/business/ai-impact-jobs.html.”

Code:

Python
print(browsing.browse(
    query="Can you summarize this article for me?",
    url="https://www.nytimes.com/2024/02/01/business/ai-impact-jobs.html"))

Output:
Unset
“I'm sorry. I'm not able to access the website(s) you've provided. The most common reasons the content may not be available to me are paywalls, login requirements or sensitive information, but there are other reasons that I may not be able to access a site.”

We can see that the correct tool, function, and parameters are used. After visiting the nytimes
link, the model is correct about there being a login requirement to view the article. This is a
valid punt.

Invalid Punt:
An invalid punt is when the model falsely claims that it cannot perform a task. Using the valid
punt example above, if there were no login requirement on The New York Times and the
articles were freely available, we can conclude that this is an invalid punt.

Full Punt:
A full punt is when the model simply states it can’t perform a task with no explanation.

● Full Punt Example 1: I'm sorry, but I'm not able to access the video/website you've
provided. Can I help with anything else?
● Full Punt Example 2: I'm just a language model, so I can't help you with that.
● Full Punt Example 3: I'd be happy to help you find flights for your ambitious trip! I'll
need some additional information: Travel Dates, Flight Preferences, Flexibility
● Full Punt Example 4: I currently cannot search for existing reservations using Google
Hotels, but I can search for hotel confirmation emails if you enable the Gmail
Workspace Extension.

Partial / Soft Punt:


A soft punt is when the model explains why it can’t perform a task and then offers its
interpretation of what the user might be looking for, and continues to provide additional help.

Examples of Soft Punt Language


● Soft Punt: Partial refusal to answer: model can’t answer directly, but follows up with
options. Keep in mind that for a response to be a Partial Punt, it has to refuse to answer
the prompt first, such as "I'm not able to search for flights directly". If the response
doesn't follow the instruction completely but it also doesn't refuse to answer, it's not a
Partial Punt.
● Soft Punt Example 1: I'm not able to access the video/website you've provided.
However, based on the website title, I've searched the web and found that …
● Soft Punt Example 2: I'm not able to search for flights directly. However, you can use
the following websites to find direct flights ...

Hallucinations:
Hallucinations are claims from the model that can’t be verified from the chain of thought or by
research. For creative assignments hallucinations might be acceptable, but hallucinations that
give misleading, factually incorrect information are not acceptable.

Default Tool Behaviors

When the user’s location is missing from the conversation
● Google Maps and Google Hotels will assume the user’s location (Mountain View, CA for our project).
● Google Flights will assume the user’s location (SFO and SJC, the closest airports to Mountain View).

When the destination is missing
● Google Flights will sometimes return flights with LAX as the destination, and will sometimes return flights to different locations based on the other parameters.

When dates are missing
● Google Flights will find flights for the following week with a trip duration of one week.
● Google Hotels will find hotels for the following week with a stay duration of one week.

When travel mode is missing
● Google Maps will default to travel_mode="driving" (see the sketch after this list).

When direct flights or round trip flights are not mentioned
● Google Flights will return round trip flights.

When the article doesn’t contain the answer to the question
● Browsing will state to use Google Search to try answering the question.

When there are no search results
● Google Maps and Google Flights will show a skeleton output.
● Google Search and Google Hotels will show a blank output = [ ].

When the user asks to find hotels for more than 6 people
● Google Hotels will correct this down to 6 even if the parameter value is over 6.
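A rough sketch of how these defaults play out in practice (only travel_mode="driving" is named above; the function names and other parameters are hypothetical placeholders, not the documented API):

Python
# Hypothetical sketches - function names and most parameters are placeholders.

# No origin and no travel mode given: Google Maps assumes the user is in
# Mountain View, CA and defaults to travel_mode="driving".
print(google_maps.query_directions(destination="San Jose, CA"))

# No dates given: Google Flights finds round trip flights for the following
# week with a one-week duration, departing from SFO/SJC near Mountain View.
print(google_flights.search(destination="New York"))
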
Desirable Model Behaviors

Only URL(s)

Prompt:
Unset
https://www.youtube.com/watch?v=wVpaP6IyyvY

Prompt:
Unset
https://en.wikipedia.org/wiki/United_States

Tool + Function

browsing.browse
youtube.question_answer

Explanation:

If it’s a link to a YouTube video, we can assume the user wants a summary of the video with
the youtube tool. Anything else besides a summary is incorrect and unfulfilling.

If it’s a link to an article/non-YouTube website, we can assume the user wants a summary via
the browsing tool. Anything else besides a summary is incorrect and unfulfilling.
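Sketches of the two expected calls (the browsing.browse signature follows the punt example earlier in these instructions; the parameter names for youtube.question_answer are assumptions):

Python
# Bare YouTube URL: assume the user wants a summary of the video.
# The youtube.question_answer parameter names here are assumptions.
print(youtube.question_answer(
    query="Summarize this video.",
    url="https://www.youtube.com/watch?v=wVpaP6IyyvY"))

# Bare article URL: assume the user wants a summary via the browsing tool.
print(browsing.browse(
    query="Summarize this article.",
    url="https://en.wikipedia.org/wiki/United_States"))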

Videos

Prompt:
Unset
Find me a youtube videos of orange cats.

Prompt:
Unset
Find me that video where the person goes ahhh and then the other person goes woah

Tool + Function

youtube.search
google_search.search

Explanation:

If youtube is mentioned in the prompt, we want the model to use the youtube tool.
If youtube is not mentioned in the prompt, we want the model to use google search, since it
has wider access to the general web.
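Sketches of the expected call for each prompt (the "query" parameter name is an assumption):

Python
# "youtube" is named in the prompt, so use the youtube tool.
print(youtube.search(query="orange cats"))

# "youtube" is not named, so use google search for its wider access to the web.
print(google_search.search(
    query="video where the person goes ahhh and then the other person goes woah"))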

Locations and Points of Interest

Prompt:
Unset
Parkwest Bicycle Casino Bell Gardens, CA 90201, United States

Prompt:
Unset
CHRIS CAKES STL LLC

Tool + Function

google_maps.query_places
google_search.search

Explanation:

Both tools are valid to use when the prompt is just a location or a point of interest.
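Either of these sketches would be acceptable (the "query" parameter name is an assumption):

Python
# Either tool is a valid choice for a bare location or point of interest.
print(google_maps.query_places(
    query="Parkwest Bicycle Casino Bell Gardens, CA 90201, United States"))
print(google_search.search(query="CHRIS CAKES STL LLC"))
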
Frequently Asked Questions (FAQ)
*TBD

Tips and Examples:


*TBD
