Code Extensions - Instructions
Table of Contents
Project Resources
1. Extensions Tool API docs
2. Tool Expectation Cheat Sheet
Task Goals
1. Understand the user’s needs based on the conversation history.
2. Evaluate the quality of the Code for each model.
3. Evaluate the quality of the Response for each model.
4. Identify the better Response and justify why.
Task Overview
Each task will include a conversation between the user and the model, along with two Code
and Code Output sections and two Responses, which you will evaluate and compare.
The user’s request needs to be determined from the entire conversation, with the most recent
request as the main focus. Based on your interpretation of the user’s request, you will evaluate
each Code, Code Output, and Response along specific dimensions. It’s important to note that
your evaluation and ratings of the Code and Code Output section should not affect your
evaluation of the Responses, and vice versa.
Task Workflow
1. Read the conversation and the final prompt. Interpret the user’s expectation from the
model by putting yourself in the shoes of the user.
User: Can you find me flights from San Francisco to New York this weekend that depart before 8AM?
Model: Sure! Here are some flights to New York departing before 8AM this weekend!
…list of flights…
User: Actually, can you change the destination to Miami?
The user is updating the flight destination with their last request. There are no other
instructions given to the model regarding the flight departure date or the time. The model
needs to carry over the “weekend” and “8AM” requirements from the previous conversation
history and find flights from San Francisco to Miami that depart this weekend before 8AM.
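To make this concrete, the ideal execution for this turn might look something like the sketch
below. This is purely illustrative: the google_flights tool, its search function, and the parameter
names are assumptions rather than the actual API, so always refer to the Extensions Tools API
Docs for the available tools, functions, and parameters. The key point is that the destination is
updated to Miami while the “weekend” and “8AM” constraints are carried over.
Python
# Hypothetical sketch only: the tool, function, and parameter names below are assumed
# for illustration and are not taken from the Extensions Tools API Docs.
print(google_flights.search(
    origin="San Francisco",
    destination="Miami",              # updated per the user's latest request
    departure_date="this weekend",    # carried over from the earlier turn
    latest_departure_time="8:00 AM",  # carried over from the earlier turn
))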
User: Can you show me a list of 5 star hotels that are in New York?
Model: Sure! Here are some 5 star hotels in New York!
…list of hotels…
User: Tell me about dinosaurs.
The user is switching directions and is requesting information about dinosaurs. Although the
user requested hotel information in the previous turn, this has no relevance to the user’s most
recent request. The model needs to disregard the previous conversation history when it is
irrelevant.
There is no previous conversation and the model only needs the final request to fulfill the
prompt, which is to find YouTube videos.
The goal is to analyze the conversation to identify user constraints and user expectations.
This step is extremely important since you will be assessing the Code, Code Output and
Responses based on your interpretation of the prompt. It’s helpful to keep notes for tasks that
have many constraints and/or requests. Try to visualize the ideal execution steps the model
should take and the ideal response that would be fulfilling to the user.
The Code and Code Output section contains the model’s Chain of Thought. You will analyze the
code comments and the tool executions taken by the model and give a rating along the
following dimensions.
*For available tools / functions / parameters refer to: Extensions Tools API Docs
Minor Issues: The code partially satisfies the user intent given the prompt and context with the
tools, functions, and parameters used. However, there may have been a better tool, function, or
parameter that would have better satisfied the intent of the user, resulting in a more useful
response.
Major Issues: The code fails to satisfy the intent of the user and will not generate a useful
response. The code involves the incorrect tool, tool function(s), and/or is missing multiple
critical parameters given the prompt & context.
● All or most of the tool was not used. For example, only a call to `print` with a string as an
argument (see the sketch below).
● When the prompt is too ambiguous (one word, missing context, etc.).
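For example, a code block like the following hypothetical sketch uses no tool at all, only a
`print` of a hard-coded string, and would not generate any useful grounding information:
Python
# No tool, function, or parameters are used; the model simply prints a hard-coded string,
# so nothing grounds the final response.
print("Here are some flights from San Francisco to Miami departing this weekend before 8AM!")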
Good: The grounding information provides sufficient information to fully satisfy the user intent.
Bad: The grounding information fails to satisfy the intent of the user and will not generate a
useful response.
(e.g. User asks for public transit directions but Google Maps returns only driving directions)
When the model believes it has performed all necessary steps to address the prompt, it
engages in reasoning by considering its chain of thought and tool executions to synthesize this
information into a final Response. You will analyze and rate each Response along the following
dimensions.
Instruction Following
No Issues: Response completely follows all instructions from the prompt.
Minor Issues: Response satisfies the primary aspect of the user’s request, but misses some
elements. Response is a soft punt: the model accurately states why it can’t do something and
gives helpful and relevant suggestions.
Major Issues: Response ignores, circumvents, or violates key components of the prompt,
rendering the response unhelpful to the user. Response is an invalid punt that does not align
with the user’s implicit/explicit request.
Response is blank.
Truthfulness
No Issues: All claims are accurate and can be verified by reputable web evidence.
Minor Issues: Primary claims are accurate, but at least one secondary claim is inaccurate,
unsupported, or can be disputed by reputable web evidence.
Major Issues: At least one primary claim is inaccurate, unsupported, or can be disputed by
reputable web evidence.
Cannot Assess: Verifying the claims in the response would take more than 15 minutes.
Response is blank.
Minor Issues: Response is generally relevant to the prompt but contains a small portion of
unnecessary content that is repetitive, unhelpful, or irrelevant.
Content Completeness
No Issues: The response gives enough information with sufficient detail to completely fulfill the prompt.
Minor Issues: There is some relevant information that is missing from the response, reducing its
helpfulness. The response might be technically correct but far too terse, leaving more to be desired.
Major Issues: Too much content is missing to fulfill the user’s request in a meaningful way.
N/A: Response is a full punt.
Minor Issues: Response has minor issues in writing quality that make it sound unnatural.
Response has some stylistic issues that lessen its overall engagement. Response is overly
formatted in a distracting way.
Collaborativity
No Issues: Model exhibits characteristics of a collaborative partner by proactively offering
relevant suggestions. Model demonstrates a strong understanding of the user's broader
objectives and actively contributes to achieving them. Response does not solely rely on the
user to maintain momentum of the conversation.
Minor Issues: Model generally acted as a collaborative partner, but there are a few instances
where it could have been more proactive or helpful. Model maintains a collaborative approach
to addressing the user's needs, but the follow-up questions are too generic, and the
suggestions are slightly off-target.
Major Issues: Model focuses primarily on responding to the immediate query without
considering the user's overall goal. Seems to be trying to end the conversation.
Contextual Awareness
No Issues: Response consistently recalled and built upon information from the entire
conversation history, demonstrating a strong understanding of the ongoing context. Response
effectively references and incorporates past details, delivering relevant and personalized
replies.
Minor Issues: Model remembers and builds upon context from previous turns, but there are
instances where it could have done so more effectively. Response misses some minor details,
or contains slight misinterpretation of prior statements.
Major Issues: Response shows clear signs of struggling to remember or build upon information
and instructions from the conversation history. Model fails to take into account previously
communicated details and provides a response that is disconnected from the ongoing
conversation.
After rating the responses along each dimension, you will give an Overall Quality score for the
response.
Overall Quality
Cannot be improved: The response is flawless and cannot be meaningfully improved. There
are no major or minor issues in any Response rating dimensions.
Okay: Response addresses the main user intent but does not completely fulfill it. There are no
major issues, but there are several minor issues.
Pretty bad: Response has at least one major issue along any of the response rating
dimensions. Response does not satisfy the user’s intent, with the exception of avoiding safety
issues.
Horrible: Response has multiple major issues and is unhelpful and frustrating.
Embedded UI
For example, a response may read:
I searched for business class flights from Mountain View (SFO) to various destinations
departing in July. Here are some options for round trip flights, departing from Mountain View.
And the rest of the response is blank. While it’s understandable to think that this looks like a
broken response, we must check for embedded UI. This is when the model presents content in
a more dynamic way using images and other UI components that are not traditionally available
in text format.
In order to see whether embedded UI exists in the response, we have to turn off the “render in
markdown” format option.
Once turned off, you’ll see something like below added to the final response.
<Note to reviewer: an embedded UI with flights from Google Flights will be shown to the user
here>
If you see this, assume that all of the flight data the response claims to provide will be shown
to the user. Whenever a response claims that data will be given but the data appears to be
missing, always remember to check for embedded UI.
After evaluating both Responses, you will select the better response using the response
selector, provide a SxS score to specify to what extent one response is better over the other,
and write a justification to explain why the selected response is better. If no preference was
given, explain why neither response is favorable over the other.
Reflect on the work done with the Response rating dimensions. A good justification should
begin with why one response is better than the other, followed by a brief description of what
each response did and why these factors were relevant in selecting the better response.
A long justification doesn’t mean it’s a good justification for this project. Aim to provide enough
references to explain why one response is superior without including unnecessary details that
do not enhance the justification. Focus on what sets the selected response apart from the less
favorable one.
Highlighting what distinguishes the selected response from the less favorable one is the
goal.
Unset
@Response 2 is better than @Response 1 because @Response 2 gives the user the
answer to their mathematical equation while also pointing out the major highlights of
the response using bolded words. Both responses answer the user's prompt, but
@Response 2 provides a better, more understandable response and gives the user the
option to ask another question by ending the response with "Would you like to explore
another problem or concept related to complex numbers or the FFT". @Response 2
has a thorough explanation of the equation but highlights the key takeaways, which
the user would find beneficial. @Response 1 provides the same answer as
@Response 2, however @Response 1 has a more complex explanation that the user
may find not as clear and harder to understand.
Task Workflow Recap
You made it! We went over how to evaluate the Code and Code Output, how to evaluate and
compare the Responses, and how to write a good justification.
As we go through the project, we will inevitably run into nuanced situations. If you come across
a task where the instructions are insufficient, please share this in the project channels so we
can keep up with the changes. If there are any other changes you would like to see with the
instructions, please feel free to reach out to a project manager.
Punt:
The response can be what we call a punt. This is when the model refuses to answer the
prompt. Punts can be valid or invalid.
Valid Punt:
The punt is valid when the model truthfully claims it cannot perform a task. It’s important to
note that a punt is only valid when it makes sense with respect to the prompt.
Unset
“Summarize https://www.nytimes.com/2024/02/01/business/ai-impact-jobs.html.”
Code:
Python
print(browsing.browse(query="Can you summarize this article for me?", url="https://www.nytimes.com/2024/02/01/business/ai-impact-jobs.html"))
Output:
Unset
“I'm sorry. I'm not able to access the website(s) you've provided. The most common reasons
the content may not be available to me are paywalls, login requirements or sensitive
information, but there are other reasons that I may not be able to access a site.”
We can see that the correct tool, function, and parameters are used. After visiting the nytimes
link, the model is correct about there being a login requirement to view the article. This is a
valid punt.
Invalid Punt:
An invalid punt is when the model falsely claims that it cannot perform a task. Using the valid
punt example above, if there were no login requirement on The New York Times and the
articles were freely available, we can conclude that this is an invalid punt.
Full Punt:
A full punt is when the model simply states it can’t perform a task with no explanation.
● Full Punt Example 1: I'm sorry, but I'm not able to access the video/website you've
provided. Can I help with anything else?
● Full Punt Example 2: I'm just a language model, so I can't help you with that.
● Full Punt Example 3: I'd be happy to help you find flights for your ambitious trip! I'll
need some additional information: Travel Dates, Flight Preferences, Flexibility
● Full Punt Example 4: I currently cannot search for existing reservations using Google
Hotels, but I can search for hotel confirmation emails if you enable the Gmail
Workspace Extension.
Hallucinations:
Hallucinations are claims from the model that can’t be verified from the chain of thought or by
research. For creative assignments, hallucinations might be acceptable, but hallucinations that
give misleading, factually incorrect information are not acceptable.
The following are known tool behaviors to keep in mind (see the Tool Expectation Cheat Sheet):
● Google Flights: will assume the user’s location (SFO, SJC, the closest airports to Mountain View)
● Google Flights: will sometimes return flights with LAX as the destination, and will sometimes
return flights to different locations based on the different parameters
● Google Flights: will find flights for the following week with a trip duration of one week
● Google Hotels: will find hotels for the following week with a stay duration of one week
● Google Maps: will default to travel_mode="driving"
● Google Flights: will return round trip flights
● Browsing: will state to use Google Search to try answering the question
● Google Hotels: when the user asks to find hotels for more than 6 people, will correct this
down to 6 even if the parameter value is over 6
Desirable Model Behaviors
Only URL(s)
Prompt:
Unset
https://www.youtube.com/watch?v=wVpaP6IyyvY
Tool + Function: youtube.question_answer
Prompt:
Unset
https://en.wikipedia.org/wiki/United_States
Tool + Function: browsing.browse
Explanation:
If it’s a link to a YouTube video, we can assume the user wants a summary of the video with
the youtube tool. Anything else besides a summary is incorrect and unfulfilling.
If it’s a link to an article/non-YouTube website, we can assume the user wants a summary via
the browsing tool. Anything else besides a summary is incorrect and unfulfilling.
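As a rough sketch, the expected code for these two prompts might look like the calls below.
The browsing.browse call mirrors the valid punt example shown earlier; the
youtube.question_answer parameter names are assumptions, so refer to the Extensions Tools
API Docs for the actual signature.
Python
# Hypothetical sketch: youtube.question_answer parameter names are assumed.
print(youtube.question_answer(url="https://www.youtube.com/watch?v=wVpaP6IyyvY", query="Summarize this video."))
# browsing.browse with query and url parameters, as in the valid punt example above.
print(browsing.browse(query="Summarize this article.", url="https://en.wikipedia.org/wiki/United_States"))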
Videos
Prompt:
Unset
Find me a youtube videos of orange cats.
Tool + Function: youtube.search
Prompt:
Unset
Find me that video where the person goes ahhh and then the other person goes
woah
Tool + Function: google_search.search
Explanation:
If youtube is mentioned in the prompt, we want the model to use the youtube tool.
If youtube is not mentioned in the prompt, we want the model to use google search since it
has wider access to the general web.
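A rough sketch of the expected calls is below; the parameter names are assumptions, so refer
to the Extensions Tools API Docs for the actual function signatures.
Python
# Hypothetical sketch: parameter names are assumed.
print(youtube.search(query="orange cats"))  # youtube is mentioned in the prompt
print(google_search.search(query="video where the person goes ahhh and then the other person goes woah"))  # youtube is not mentioned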
Locations
Prompt:
Unset
Parkwest Bicycle Casino Bell Gardens, CA 90201, United States
Prompt:
Unset
CHRIS CAKES STL LLC
Tool + Function: google_maps.query_places or google_search.search
Explanation:
Both tools are valid to use when the prompt is just a location or a point of interest.
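Either call sketched below would be acceptable for these prompts; the parameter names are
assumptions, so refer to the Extensions Tools API Docs for the actual function signatures.
Python
# Hypothetical sketch: parameter names are assumed; both tools are valid here.
print(google_maps.query_places(query="Parkwest Bicycle Casino Bell Gardens, CA 90201, United States"))
print(google_search.search(query="CHRIS CAKES STL LLC"))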
Frequently Asked Questions (FAQ)
*TBD