Nightingale RLHF Code Onboarding WIP
Code Onboarding
Introduction
● Project assignment requirements
○ Attend at least one onboarding session
○ Engage and ask questions at the end of the presentation
○ Attend at least 2 daily webinars per week
Instructions: https://docs.google.com/document/u/1/d/e/2PACX-1vRNK1x15w0ZcsqhLCbxjtqSkKYnvPVquGPrTrRgLKEuu18MkQ_alVtC7q_hVNIDVUM3t9G6Djuljxnw/pub#h.65ikpbq61y4i
Resources
Access resources and the webinar link here.
Project Overview & Goal
Your work will help improve a cutting-edge language model to provide more helpful,
accurate, and concise coding responses. Specifically, we want the generated code to follow
the client instructions precisely.
In each task, you will:
1. Evaluate each response individually across the rating dimensions
2. Indicate which response is better with a side-by-side (SxS) rating and provide a written justification
How Does Feedback Work?
Training Tasks
● These are tasks that test your understanding of the instructions and Edge Cases without impacting production
● Failing training tasks results in EQ, so review each task thoroughly
● You cannot receive feedback on these tasks
● They look exactly like normal tasks
Normal Tasks
● You will be given a score from 1-5 and a message indicating areas of improvement
What do Tasks Look Like? (Prompt Evaluation)
● There will be 6-minute tasks used to evaluate the safety/validity of a prompt
● You do not evaluate the prompt and the model responses in the same task
What Do The Tasks Look Like?
The Prompt is the user’s request. The prompt might be a conversation with multiple turns, but the chatbot is responding to the final request.
[Task interface screenshot: the Prompt at the top, followed by Response 1 and Response 2, each with Correctness/Completeness and Coherence/Clarity ratings and a Save Changes button.]
What Do The Tasks Look Like?
SxS Rating:
You will provide a side-by-side score to specify which model is preferred based on the previously generated responses. The task asks "Which is the better response?" and has you rate your preference between the two responses on a scale from 1 to 6, where 1 means "Response 1 is much better than Response 2" and 6 means "Response 2 is much better than Response 1".

SxS Justification:
You will justify your answer and show that the score you have selected carries weight. The task prompt reads: "Explain how you chose this final comparison rating. Your justification should have a declaration in the beginning stating why one response is better than the other (or if they are the same) and an evidence-based reason for your decision that cites information directly from the prompts and responses."
Workflow
1. Understand what the user wants, putting yourself in the shoes of the user interacting with the chatbot.
Note: There may be previous conversations in your task. If so, focus only on the last prompt and model response pair; the rest should be used as context.
2. Evaluate each of the responses on the 5 dimensions.
a. Make sure to mark the appropriate checkboxes when a response has issues in a particular dimension.
b. Follow the Dimension Rating Rubric for a breakdown of how to analyze each response.
3. Choose the response that most correctly satisfies the requirements of the prompt.
4. Choose a score between 1 and 6 indicating which response is better, and by how much.
a. Consult the SxS Score Guide for a description of what each score means. Ensure that this score coincides with the response you selected in step 3.
Workflow
5. Write a justification.
b. Consult the Writing a Good Justification section for a guide on how to properly write your justification for choosing a particular response. For this project, refer to the model responses as @Response 1 and @Response 2.
c. Your SxS justification should typically be shorter than your helpfulness justification.
Code Response Analysis Guide - Correctness
Criteria: Correctness/Completeness

Description: The intent of Correctness/Completeness is to provide a response that is factual, accurate, and fully addresses all requirements in the prompt.
In addition to the rating, if you provide a score of 1 to 4, an "areas for improvement" box will appear; you MUST check all applicable options:
● Contains incorrect information
● Key information is missing
● Misses one or more specific prompt requirement(s)
● Contains unwarranted refusal
● Model response is outdated as of July 2024[1]
⚠ When assessing correctness, it’s important to populate a list of sources used to validate the accuracy of the response in the provided open text field. Sources must be URLs of publicly accessible, reliable, human-generated web pages or documents. For each source, copy and paste an excerpt (as concise as possible) containing the specific information from the source that was used to verify accuracy.[2]

Rubric:
5 - The response is completely correct and accurate to what is requested by the prompt, with no necessary details missing and without false, misleading, or hallucinated information. If the prompt asks the assistant to do a task, the task is completely done and addressed in the response (within the limits of the assistant’s capabilities and intended usage).
4 - The response is mostly accurate and correct with a small amount of missing information. It contains no misleading information or hallucinations. If the prompt asks the assistant to perform a task, the task is mostly successfully attempted.
3 - The response contains a mix of correct and incorrect information. The response may miss some details, contain misleading information, or contain minor hallucinations, but is more or less aligned with what the prompt asks for. If the prompt asks the assistant to perform a task, the task is attempted with moderate success but still has clear room for improvement.
2 - The response has some correct elements but is mostly wrong or incomplete. The response may contain multiple instances of hallucinated, false, and/or misleading information. If the prompt asks the assistant to do a task, the task was attempted with a small amount of success.
1 - The response is completely incorrect. All information provided is wrong, false, or hallucinated. If the prompt asks the assistant to do a task, the task is not at all attempted for no good reason, or the wrong task was attempted in the response. The response is completely irrelevant to the prompt.
Code Response Analysis Guide - Clarity
Criteria: Coherence/Clarity

Description: With this attribute we measure how lucid, cogent, and self-consistent the model’s response is. The Coherence/Clarity rating of the response should account for previous user and assistant turns in the conversation (so as to spot potential contradictions, repetitions, unwarranted style changes, etc.).
In addition to the rating, if you provide a score of 1 to 4, an "areas for improvement" box will appear; you MUST check all applicable options:
● Contains irrelevant information
● Contains repetitions
● Contains contradiction(s)
● Contains awkward phrasing/formatting issues[3]
● Contains style changes
● Should have addressed a false premise, mistake, or ambiguity in the prompt[4]

Rubric:
5 (Perfectly Coherent and Clear) - The response is perfectly clear and self-consistent throughout. There are no contradictory assertions or statements, the writing flows logically, and following the train of thought/story is not challenging.
4 (Mostly Coherent and Clear) - The response is mostly clear and coherent, but there may be one or two places where the wording is confusing, the flow of the response is a little hard to follow, or there is a small amount of repetition / irrelevant content. Overall, the response can mostly be followed with a little room for improvement.
3 (A Little Unclear and/or Incoherent) - The response is a little unclear. There are some inconsistencies or contradictions, run-on sentences, confusing statements, blatant repetitions, significant amounts of irrelevant content, or hard-to-follow sections of the response.
2 (Mostly Incoherent and/or Unclear) - The response is mostly hard to follow, with inconsistencies, contradictions, confusing logic flow, unclear language, constant repetitions, or mostly irrelevant content used throughout, but there are still some coherent/clear parts.
1 (Completely Incoherent and/or Unclear) - The response is completely incomprehensible or irrelevant, and no clear meaning or sensible message can be discerned from it.
Code Response Analysis Guide - Language
Criteria: Simple vs. Complex Language

Description: Rating of the response along a simple → complex spectrum: the response uses simple, easy-to-understand vocabulary and sentence structure that children can understand, vs. the model uses sophisticated language with elevated vocabulary that adults with advanced education or experts on the topic would use.
⚠ Make sure the rating aligns with the rubric. A 5 is not necessarily "better" for this metric. ⚠

Rubric:
5 (Expert) - Deep expertise in the field or area (typically associated with post-graduate education) is required to understand the response. It uses specific and technically relevant vocabulary, or elevated language that someone at the simple or basic level may not understand at all. The professional language of a lawyer, scientist, engineer, or doctor falls into this category.
4 (Advanced) - The response uses a fairly sophisticated vocabulary and terminology. Someone majoring in this subject at a university (post-18 education) would understand the response, while an average adult who does not work or study in this area would not.
3 (Intermediate) - People who have completed up through a high school education (up to age 18) will probably be able to understand the vocabulary and sentence structure used, but those at the basic level or children might struggle to understand the response.
2 (Simple) - The response uses relatively straightforward language and wording, but some schooling through elementary (age 7 to 12) or middle school (age 13 to 15) in the language might be required to understand the response.
1 (Basic) - The response uses very easy to understand language that is clear and completely interpretable by children under 6, adults, and anyone with a functional command of the language.
Code Response Analysis Guide - Language
Criteria: Succinct vs. Verbose Language

Description: The goal here is to place the response on a spectrum from the most short, crisp answers to the most lengthy, detailed, and/or wordy answers, under the context of the length expectations set by the prompt.
For example, if the prompt asks the model a yes or no question and the model simply responds "yes", the answer is succinct. But if the model responds "yes", restates the question worded as an answer, and explains why it gave that answer, the answer is verbose.
Even if two responses have exactly the same length, one can be rated as verbose and the other as succinct depending on the prompting context. This verbosity rating evaluates the response as a whole.
⚠ Make sure the rating aligns with the rubric. A 5 is not "better" for this metric. ⚠

Rubric:
5 (Verbose) - The response is particularly lengthy, wordy, and/or extensive with extra details given what the prompt requested from the assistant model. The response can be verbose regardless of whether the length is due to repetition and incoherency or due to rich and insightful detail.
4 (Moderately Long) - The response is on the longer side but could still have more added to it before it is considered fully detailed or rambling.
3 (Intermediate Length) - The response isn’t especially long or short given what the prompt is asking of the model. The length is adequate for conveying a full response but isn’t particularly wordy nor particularly concise.
2 (Pretty Short) - The response is on the shorter side but could still have words, details, and/or text removed before it’s at the bare minimum of what the response is trying to convey.
1 (Succinct) - The response is short, to the point, and the most concise it can be. No additional information is provided outside of what is requested by the prompt (regardless of whether the information or response itself is incorrect, hallucinated, or misleading: a response that gives an incorrect answer can still be succinct).
Code Response Analysis Guide - Helpfulness
Criteria: Helpfulness/Overall

Description: Overall quality rating summarizing how useful and helpful the response is.
⚠ For the Helpfulness/Overall rating, you must provide an explanation (50-250 words) of why you selected this rating. Be as detailed as possible, within the length bounds. Do not make references to the other response in this explanation. ⚠

Rubric:
5 - The response is perfectly helpful and completely aligned with the spirit of what the prompt was asking for. It acts on the user’s request accurately and to the point, without any unnecessary information. If a user request is not possible or not in line with desired model behavior, a helpful response provides useful context and rationale even if it does not act on the user’s request directly.
4 - The response is mostly helpful and mainly aligned with what the user was looking for, but there is still some room for improvement.
3 - The response is partially helpful but misses the overall goal of the user's query/input in some way. The response did not fully satisfy what the user was looking for.
2 - The response is slightly helpful and mostly does not capture what the user was looking for, but it is still usable and helpful in a small way.
1 - The response is not helpful. The response completely missed the essence of what the user wanted.
Fact Checking
● Do not use forums or blogs as sources (e.g., Quora, Stack Overflow, Wikipedia)
● Make sure to fact check any major factual claim made by the model that will not be verified by testing the code (see the sketch after the examples below)

Good Example
Source: https://www.python.org/doc/sunset-python-2/
Excerpt: “Python 2 was sunset Jan 1, 2020”
(Why is this good? It includes both the source link and the excerpt.)

Poor Example
Source: https://stackoverflow.com/questions/4836375/end-of-support-for-python-2-7
(Why is this poor? It does not include an excerpt, and it uses a crowd-sourced site rather than official documentation.)
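Claims about how code actually behaves often do not need a web source at all: they can be verified by running the code. The snippet below is a minimal, invented sketch of that kind of check (the claim and the test values are hypothetical, not taken from a real task):

```python
# Hypothetical claim to verify: "Python's sorted() is stable, i.e. items that
# compare equal keep their original relative order."
pairs = [("b", 2), ("a", 1), ("b", 1), ("a", 2)]

# Sort only by the first element; the ties ("a" vs "a", "b" vs "b") keep their
# original order if and only if the sort is stable.
result = sorted(pairs, key=lambda p: p[0])

expected = [("a", 1), ("a", 2), ("b", 2), ("b", 1)]
assert result == expected, f"claim not verified: {result}"
print("Claim verified by execution; no web source needed.")
```

Claims that cannot be checked by execution (for example, a statement about a sunset date or a library's supported versions) still need the source-plus-excerpt format shown in the Good Example above.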
Fact Checking Example
Helpfulness Justification
● Provide a detailed explanation of why you gave the response that rating on helpfulness
● Give examples to justify your point if applicable
● Your helpfulness justification should not reference the other response
● Helpfulness justifications must be 50-250 words
● The helpfulness justification must begin with “The response is {not/slightly/partially/mostly/perfectly} helpful” based on the score from 1 to 5. Please don’t use synonyms. This should be a sentence (end with a full stop).
● Your justification should not be in first person or mention the number score you gave
Preference Justification
● Preference justifications should ideally be no more than 50 words
● The first sentence of the preference reasoning should start with “@Response {1/2} is {slightly/<blank>/much} better than @Response {1/2}” based on the scores
● Response nomenclature: please include the “@” symbol and 1 or 2 when referring to responses. This will help identify and distinguish which response you are referring to.
● Do not refer to responses as A/B
● If both responses have ratings of 1 or 2, select “neither response is valid” and open your justification with “@Response 1 is as unhelpful as @Response 2”
Selecting The Better Response
Slightly Better
Response 1 is slightly better than Response 2 – Score: 3
OR
Response 2 is slightly better than Response 1 – Score: 4
● To be used when the responses are similarly appropriate and the difference is minor or a matter of personal preference.
● The difference in Helpfulness/Overall between responses should be at most 1.
● Minor differences in clarity and formatting warrant this rating.
● When you consider the responses to be tied, you should slightly prefer the shorter one (in the unlikely circumstance that they are the same length, use your own judgment).

Better
Response 1 is better than Response 2 – Score: 2
OR
Response 2 is better than Response 1 – Score: 5
● To be used when one response is clearly better than the other but not by a very large margin (a difference in Helpfulness/Overall of 1 or 2 points).
● If the better response fails to follow some but not all instructions or is misleading, but the worse response does not follow instructions at all or is completely wrong, this should be selected.
● If both answers follow instructions and are correct, but one is significantly clearer and/or better formatted, this should be selected.

Much Better
Response 1 is much better than Response 2 – Score: 1
OR
Response 2 is much better than Response 1 – Score: 6
● To be used when there is a significant difference between the two responses (a difference in Helpfulness/Overall of at least 2 points).
● If one answer is entirely correct and the other contains a major mistake, this should be selected.
● If one answer follows all instructions and the other does not, this should be selected.
● If there are major differences in readability and formatting, this should be selected.
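Purely as an illustration of how the SxS and helpfulness scores line up with the required opening sentences (the function names below are hypothetical and not part of the task tooling), the rules above can be summarized in a short Python sketch:

```python
# Hypothetical helper functions, illustrative only; not part of the task interface.

HELPFULNESS_ADJECTIVE = {1: "not", 2: "slightly", 3: "partially", 4: "mostly", 5: "perfectly"}

def helpfulness_opener(score: int) -> str:
    """Required first sentence of a helpfulness justification (score 1-5)."""
    return f"The response is {HELPFULNESS_ADJECTIVE[score]} helpful."

def sxs_opener(score: int) -> str:
    """Required first sentence of an SxS (preference) justification (score 1-6).

    Note: the tie case where both responses are rated 1 or 2 should instead
    open with "@Response 1 is as unhelpful as @Response 2".
    """
    better, worse = (1, 2) if score <= 3 else (2, 1)
    margin = {1: "much ", 2: "", 3: "slightly ", 4: "slightly ", 5: "", 6: "much "}[score]
    return f"@Response {better} is {margin}better than @Response {worse}."

print(helpfulness_opener(4))  # The response is mostly helpful.
print(sxs_opener(1))          # @Response 1 is much better than @Response 2.
print(sxs_opener(5))          # @Response 2 is better than @Response 1.
```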
Common Errors
● Code does not compile
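As a minimal, invented illustration of this error class (the snippet is hypothetical, not taken from a real task): code that cannot even be parsed should be flagged under Correctness/Completeness no matter how good the surrounding explanation is. One quick way to confirm a compile failure in Python without executing the code is to try parsing it:

```python
import ast

# Hypothetical model response snippet (invented for illustration). The function
# header is missing its closing parenthesis and colon, so Python raises a
# SyntaxError before any of it can run.
snippet = """
def average(numbers
    return sum(numbers) / len(numbers)
"""

try:
    ast.parse(snippet)
    print("Snippet parses; check for runtime or logic errors instead.")
except SyntaxError as err:
    # This is the "code does not compile" case: penalize Correctness/Completeness.
    print(f"Does not compile: {err}")
```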