
Mail Valley Reasoning

The Mail Valley Reasoning project aims to enhance AI's performance in complex reasoning tasks by creating prompts that lead to reasoning or calculation errors. Participants are instructed to write multiple-choice questions that challenge the model, annotate its responses, and rewrite any incorrect steps while justifying their changes. The document includes specifications for task execution, common errors to avoid, and helpful resources for participants.

Uploaded by

Joseph Kamwenya
Copyright
© All Rights Reserved

Mail Valley Reasoning

Attempter Specifications

Table of Contents
Project Introduction
Helpful Links
Task Steps Overview
Task Specifications
Step 1: Write a Prompt to Stump the Model
The Phrasing Of Your Prompt Should Not Reference The Fact There Are Answer Options Present
Please Make Sure you Copy The Entire Question into The “Write the question from the multiple-choice prompt” Field
Step 2: Annotate the Initial Response
When to Label a Step as LLM
When to Label a Step as Python
Some very brief examples
Step 3: Rewrite Incorrect Steps
Step 4: Assess Response Quality
Appendix
COMMON ERRORS
Prompt Examples
Grading Rubric
Project Introduction

Welcome to Mail Valley Reasoning! This prompt-focused project is designed to improve AI in
complex reasoning tasks.

Your job is to create a complex prompt (multiple choice question) that causes the model to make
a reasoning or calculation error, rate the model response, and rewrite the incorrect steps in
the response, justifying why the steps were wrong.

REMEMBER!
The goal of this project is to get the Model to fail (make a reasoning or
calculation error).
The more complex your prompts are, the better.

Tasks should be done in English.

If you are not comfortable tasking in English,
please skip the task and notify a QM in
Discourse.

Helpful Links
• Calculator.Net - Includes a number of useful calculators including quadratic formula,
LCM, GCF, prime factorization, permutations, combinations, triangles, volume, hex, and
much more.

• GeoGebra - Desmos but with more functionality for geometry. We recommend testing it
outside of tasking hours to get familiar with the tool, as there is a learning curve.
• WolframAlpha - a computational engine that answers questions and solves problems
across various subjects, including math, science, and finance.

Reviewers, please navigate to the Reviewer Instructions.

Task Steps Overview

1. Write a Multiple-Choice Prompt to Stump the Model (See Task Specifications)

• Write a complex prompt within the given skills
• The prompt should be a multiple-choice question with 4 options (A, B, C, D)
• The multiple choice question must have a single, objectively correct answer
that can be reached without looking at the answer options.
• Your prompt should cause the model to fail in at least one of its reasoning, logic, or
calculation steps

• NOTE: If you do not feel comfortable with the skills that you’ve been assigned, please
skip the task.

2. Annotate the Initial Response (See Task Specifications)

• Rate each reasoning step within the model’s response according to various
dimensions

3. Rewrite the Incorrect Steps and Justify the Rewrites (See Task Specifications)

• Suggest a rewrite for any incorrect steps
• Write a justification for why your rewrite is necessary
• Provide the Final Answer

4. Assess Response Quality (See Task Specifications)

• Indicate whether the response was accurate and followed the instructions given in the
prompt

Please take a look at the COMMON ERRORS that we frequently see in this project.

Task Specifications

Step 1: Write a Prompt to Stump the Model

I. IDENTIFY THE SKILLS

You will be given a task with two assigned skills.

• Skill one: Biology, Chemistry, Physics
• Skill two examples: Biology (Genetics, Molecular Biology), Chemistry (Organic
Chemistry, Physical Chemistry), Physics (Biophysics, Astrophysics)

IMPORTANT:
If you are uncomfortable with the skills provided
PLEASE SKIP THE TASK.

II. WRITE A MULTIPLE CHOICE PROMPT

Based on the assigned skills

• STEP 1, write a prompt that has exactly one clear objective answer.
• STEP 2, provide 4 possible answers, in a multiple choice format, so that your answer
is EXACTLY ONE of the choices.
o Multiple choice answers must be of the form “A. ” such as
A. Choice 1
B. Choice 2
C. Choice 3
D. Choice 4
• DO NOT use variations like A) or (A) or a:
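
The option format rule above can be checked mechanically. The following is a minimal sketch in Python; the function name `check_option_format` and the exact regular expression are illustrative assumptions, not part of the project tooling.

```python
import re

# Assumed pattern: a capital letter A-D, a period, one space, then non-empty text.
OPTION_RE = re.compile(r"^([A-D])\. \S.*$")


def check_option_format(option_lines):
    """Return True if the lines are exactly options A-D, each written as 'A. text'."""
    letters = []
    for line in option_lines:
        match = OPTION_RE.match(line.strip())
        if match is None:
            return False  # rejects variants such as 'A)', '(A)', 'a:' or 'A:'
        letters.append(match.group(1))
    return letters == ["A", "B", "C", "D"]


print(check_option_format(["A. 4.74", "B. 4.95", "C. 5.74", "D. 5.95"]))  # True
print(check_option_format(["A) 4.74", "B) 4.95", "C) 5.74", "D) 5.95"]))  # False
```

A check like this only covers the formatting; it says nothing about whether exactly one option is correct.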

Note: Doing STEP 1 and then STEP 2 will help you avoid the following Common Errors.
★ Common Errors ★

Error #1 – The prompt references the MC options!

Avoidance Tip: Don’t use phrases like:

• “Which of the following is correct?”
• “Which choice below best answers the question?”
• “Which of the following explain this situation?”
• “What is the least correct explanation for this?”

Error #2 – The MC question has multiple answers!

Avoidance Tip:

• Write prompts independent of the MC options.
• DO NOT include options like: D. All of the Above
• DO NOT use phrases like “Choose all that apply”

★ Remember ★

• Prompts must have a single objective answer that can be reached without
the MC options!
• Prompts with multiple answers are NOT VALID!
• Prompts that request a numerical value usually avoid these issues!
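
As a rough self-check before submitting, a prompt can be scanned for the phrasing flagged in this section. The sketch below is an assumption-laden illustration: the phrase list and the helper name `find_banned_phrases` are hypothetical and not exhaustive, and a clean result does not guarantee the prompt is valid.

```python
# Phrases drawn from the avoidance tips in this section; illustrative, not exhaustive.
BANNED_PHRASES = [
    "which of the following",
    "which choice below",
    "all of the above",
    "all the above",
    "none of the above",
    "choose all that apply",
]


def find_banned_phrases(prompt_text):
    """Return the banned phrases that occur in the prompt (case-insensitive)."""
    lowered = prompt_text.lower()
    return [phrase for phrase in BANNED_PHRASES if phrase in lowered]


print(find_banned_phrases("Which of the following is correct?"))
# ['which of the following']
print(find_banned_phrases("What is the pH of the solution?"))
# []
```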

Bad Prompt Examples and Why They Are Bad

Bad Prompt Example:
Which value is closest to the gravitational constant on Earth:
A. 8 m/s^2
B. 9 m/s^2
C. 10 m/s^2
D. None of the above

Why it’s Bad:
• The prompt references the MC options.
• The prompt does not have an objective answer.
• Using “None of the above” is not allowed.

Bad Prompt Example:
A 32-year-old pregnant woman presents to her OBGYN for a routine ultrasound at
22 weeks gestation. The US reveals that the fetus has a ventricular septal defect
(VSD), a single AV valve, and misalignment of the outflow tract with a
mispositioned conotruncal region.
Which developmental process is most likely to not be disrupted here?
A. Formation of the primitive heart tube
B. Development of the atrial septum
C. Septation of the conotruncal region
D. Formation of the endocardial cushion

Why it’s Bad:
• The prompt references the MC options.
• Does not have an objective, reliable answer
o Without MC options the model could reach different answers.

Bad Prompt Example:
A population of deer on a remote island has recently undergone a shift in its
dominant coat color. Coat color is determined by a single autosomal gene
with two alleles. Allele C corresponds to tawny coloration and is dominant to
allele c which corresponds to melanistic coloration. Historically, tawny coats
have been more prevalent, but within the last several years the number of
individuals with melanistic coats has increased. A biologist observing the
deer population theorizes that this is because the jungle has become
overgrown due to recent logging restrictions, so the deer now reside
under dense canopy. The deer's only predator is a species of big cat.
Observation of historical data also shows that a large fire on the island 50 years
ago greatly reduced the population of deer which has since approached
recovery. Current allele frequency estimates are C= 0.45 and c= 0.55.
Which of the following is likely the predominant reason why the c allele is
increasing in frequency?
A. Balanced polymorphism maintaining genetic diversity
B. A genetic bottleneck occurred due to the fire, increasing the frequency of allele
c via genetic drift
C. Gene flow from other deer populations on the mainland
D. Directional selection favoring the survival of melanistic individuals
E. All of the above

Why it’s Bad:
• The prompt references the MC options.
• Does not have an objective answer
• It includes more than 4 options
• Using “All of the above” is not allowed

Good Prompt Examples

Skills:
• Biology
• Genetics

Good Prompt Example:
You have a DNA sequence:
ATGAGATCCGCGAGCGAGCGATGGCGACGA.
You have mutated the 8th base counting from
the right to A, and the 7th base counting from the
left to C. What will the sequence translate into?
Note: Stop when encounter stop codon.
A: MRSASERWRR
B: MRASWR
C: MRSAESAWAA
D: MRPASER

Skills:
• Chemistry
• Analytical Chemistry

Good Prompt Example:
A 25.00 mL sample of 0.070 M acetic acid is
titrated with 0.090 M NaOH. After adding 18.30
mL of NaOH, what is the pH of the solution?
Assume the dissociation constant $K_a$ for
acetic acid is $1.8 \times 10^{-5}$.
A. 4.74
B. 4.95
C. 5.74
D. 5.95

Skills:
• Physics
• Statistical Mechanics

Good Prompt Example:
Consider a box containing 1 mole of hydrogen
atoms (which can be assumed to act as an ideal
gas) at a pressure of 2 atmospheres and a
temperature of $20^\circ \text{ C}$. What is the
ratio of the number of atoms in the first excited
state to the number of atoms in the ground
state?

A: $e^{404}$
B: $4e^{404}$
C: $e^{-404}$
D: $4e^{-404}$

III. Copy the question without the multiple-choice options

Please Make Sure you Copy The Entire Question (Including the context)
into The “Write the question from the multiple-choice prompt” Field. The
only thing you will not copy is the set of multiple choice options.

EXAMPLE:
Prompt:
Consider a box containing 1 mole of hydrogen atoms (which can be assumed to act as an ideal
gas) at a pressure of 2 atmospheres and a temperature of $20^\circ \text{ C}$. What is the ratio of
the number of atoms in the first excited state to the number of atoms in the ground state?
A. $e^{404}$
B. $4e^{404}$
C. $e^{-404}$
D. $4e^{-404}$
This is what you would write in the text box:
Consider a box containing 1 mole of hydrogen atoms (which can be assumed to act as an ideal
gas) at a pressure of 2 atmospheres and a temperature of $20^\circ \text{ C}$. What is the ratio of
the number of atoms in the first excited state to the number of atoms in the ground state?
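
The copy step above amounts to dropping the option lines and keeping everything else. A minimal Python sketch of that idea follows; `strip_options` is a hypothetical helper, and it assumes every option sits on its own line starting with “A. ”, “B. ”, “C. ” or “D. ”.

```python
import re

# Assumed convention: option lines start with 'A. ', 'B. ', 'C. ' or 'D. '.
OPTION_LINE = re.compile(r"^[A-D]\. ")


def strip_options(prompt_text):
    """Return the prompt with the multiple-choice option lines removed."""
    kept = [
        line
        for line in prompt_text.splitlines()
        if not OPTION_LINE.match(line.strip())
    ]
    return "\n".join(kept).strip()


prompt = """A 25.00 mL sample of 0.070 M acetic acid is titrated with 0.090 M NaOH.
After adding 18.30 mL of NaOH, what is the pH of the solution?
A. 4.74
B. 4.95
C. 5.74
D. 5.95"""

print(strip_options(prompt))
```

Note that the first line of the sample starts with “A ” rather than “A. ”, so it is kept. Always reread the result by hand to confirm that the full question and its context survived.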

IV. ASSIGN DIFFICULTY LEVEL

To assign a difficulty level to your prompt, answer the following questions to the best of
your estimation:

Estimate (in minutes) the time it takes a human to solve the problem once the correct
approach is identified. (Round up to the nearest multiple of 5): ______________________

Tag a difficulty level based on complexity and the likelihood of model success:

• Easy
o A high school student can solve this without difficulty.
• Medium
o An undergraduate student can solve this without help.
• Hard
o An undergraduate student can solve this but might need to check their notes
or look up a hint.
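
The time estimate asked for in this section is rounded up to the nearest multiple of 5. As a small illustration (the helper name `round_up_to_five` is hypothetical):

```python
import math


def round_up_to_five(minutes):
    """Round a time estimate up to the nearest multiple of 5 minutes."""
    return int(math.ceil(minutes / 5) * 5)


print(round_up_to_five(12))  # 15
print(round_up_to_five(20))  # 20
print(round_up_to_five(1))   # 5
```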
Step 2: Annotate the Initial Response

I. IDENTIFY INCORRECT REASONING or CALCULATION STEPS

The AI Model’s response will be separated into steps for easy analysis.

Evaluate each step as either:

• Correct - when the step has no math or logic issues
• Incorrect Calculation - when the step includes a computational mistake, including
rounding errors. For example:

We next simplify to obtain 2-(3+4) = 2-3+4 = 3

If the monthly interest rate is r = 3% then after 5 months your initial
balance grows by a factor of (1+0.3)^5 = 3.71293

If Mary eats half the pizza and Jose eats one third of the pizza then
together they ate ½+⅓ = ⅖ of the pizza.

• Incorrect Reasoning - when the step includes a logical reasoning mistake. For
example:

In the race we know Bob comes in either 1st or 2nd and John comes
in 2nd or 3rd and Abby comes in 3rd then we conclude that Bob
comes in 2nd.

As ice cream sales and drowning rates both increase in the summer,
we conclude that ice cream causes drowning.

We know that x< y and y>z so it follows that x<z.
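
The examples above are deliberately wrong. When annotating, it can help to recompute a suspicious step independently. The following Python sketch does that for several of the examples; the variable names are illustrative only.

```python
from fractions import Fraction
from itertools import permutations

# Parentheses: the minus sign applies to the whole bracket.
simplified = 2 - (3 + 4)
print(simplified)  # -5, not 3

# A 3% monthly rate is 0.03, not 0.3.
growth = round((1 + 0.03) ** 5, 6)
print(growth)  # 1.159274

# Adding fractions needs a common denominator.
eaten = Fraction(1, 2) + Fraction(1, 3)
print(eaten)  # 5/6

# Race example: enumerate every finishing order that fits the constraints.
valid_orders = [
    order
    for order in permutations(["Bob", "John", "Abby"])
    if order.index("Bob") in (0, 1)
    and order.index("John") in (1, 2)
    and order.index("Abby") == 2
]
print(valid_orders)  # [('Bob', 'John', 'Abby')], so Bob is 1st, not 2nd

# Counterexample to "x < y and y > z implies x < z".
x, y, z = 2, 5, 1
print(x < y and y > z, x < z)  # True False
```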

There may be times when a step contains both types of errors: reasoning and
calculation.

• In this case, select the most appropriate error.

In many cases, a reasoning error is more significant, as it takes the model down
the wrong line of reasoning:

We know 12 students are in Science and 8 students are in Art then
there are 12 - 8 = 3 students taking only Science.

The model may add preamble or summary steps which seem superfluous, but do
not mark them incorrect unless they contain an error. Remember, the
step needs to be factually, rationally, or computationally
wrong (not just lacking detail) to be marked incorrect.

II. RETURN TO PROMPT (if applicable)

REMEMBER:
Your prompt should be complex enough
to cause the model to fail (make a reasoning or calculation error)
in at least one (1) reasoning or calculation step.

If this did not happen, you must return to your prompt and make it more complex.

Go back to the prompt writing box and click on the “retry chat from here” button.

Refer to our Prompt Complexity Guide.


III. Python vs LLM

When to Label a Step as LLM


The model’s LLM capabilities are used for language-based reasoning and some symbolic
manipulation (rearrangement of equations, setting up conversion factors, etc.). If a step is primarily
language-based reasoning, or language-based reasoning with the statement of an equation and
what it means or how it is used, this is an LLM step.

When to Label a Step as Python

The model’s Python capabilities are used for math and some symbolic manipulations. If you
see a step that is sequential calculations, primarily number crunching (even with a brief preamble),
or implies a large calculation going on in the background (“Performing a linear regression, we see
the slope is…”), it is a Python step.
Some very brief examples
• A paragraph introducing or reiterating the problem: This is an LLM step.
• A paragraph describing the primary equations and variables needed to solve the problem:
This is an LLM step.
• A paragraph that involves describing and rearranging these equations into a useful form:
Could be LLM or Python, use your best judgement on how much math vs. text is involved.
• A paragraph setting up conversions of given values so they can plug into an equation:
Could be LLM or Python, use your best judgement on how much math vs. text is involved.
• A step converting the previously mentioned units, maybe with a preamble that says “We
convert the units…”: This is a Python step.
• A block of sequential calculations to plug in values and solve the equations: This is a
Python step.
• Arithmetic to subtract two numbers in the equation: This is a Python step.

Step 3: Rewrite Incorrect Steps As They Appear

Step 3: Rewrite Incorrect Steps

I. REWRITING INCORRECT STEPS

In this part of the task, rewrite any steps that were rated as Incorrect

Your Rewrite Should

• Either modify the original step or be a completely new step, depending on the
degree of correctness
• Address any logical errors in the overall problem approach
• Address any errors in execution
• Maintain the same level of detail as other steps in the response
• Be written clearly, using simple language that can be understood by a high school
student
• Include only the information required to solve the problem
• Ensure that the step is self-contained and understandable on its own
• DO NOT introduce any new errors (this is a common mistake)

Next, write a brief but detailed justification for why your rewrite was necessary.

II. PROVIDE RATIONALE

In this part of the task, provide a rationale for any steps that were rated as Incorrect

Your Rationale Should

• Clearly explain why the step is incorrect
• Use proper spelling and grammar
• Not contain the words “model”, “AI”, “LLM”, etc.
• Not be written in LaTeX
• Include at least 10 words
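
Some of the rationale requirements above are mechanical and can be pre-checked. The sketch below is a hypothetical helper (`rationale_problems`), not project tooling. The LaTeX detection is a rough heuristic that assumes LaTeX shows up as dollar signs or backslash commands, and the forbidden-word list covers only the three words named above.

```python
import re

# Words the rationale must not contain, matched as whole words.
FORBIDDEN = re.compile(r"\b(model|AI|LLM)\b", re.IGNORECASE)
# Rough LaTeX heuristic (an assumption): dollar signs or backslash commands.
LATEX_HINT = re.compile(r"\$|\\[A-Za-z]+")


def rationale_problems(text):
    """Return a list of rule violations found in a rationale."""
    problems = []
    if len(text.split()) < 10:
        problems.append("fewer than 10 words")
    if FORBIDDEN.search(text):
        problems.append("mentions model/AI/LLM")
    if LATEX_HINT.search(text):
        problems.append("looks like LaTeX")
    return problems


good = "The step distributes the minus sign incorrectly, so the result should be -5 instead of 3."
print(rationale_problems(good))                   # []
print(rationale_problems("The model is wrong."))  # ['fewer than 10 words', 'mentions model/AI/LLM']
```

Spelling, grammar, and whether the explanation is actually correct still need a human read.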

Answer the following question:

• Is this step the final step, and does it contain the solution to the prompt?
o Yes
o No
▪ If yes, the model will stop generating subsequent steps
▪ If no, the model will generate subsequent steps to solve it once you
save the response.
III. PROVIDE FINAL ANSWER IN THE LAST STEP

In this part you should supply the correct answer to your prompt.

• This has very specific formatting. If the model does not follow this
formatting, mark the step incorrect for reasoning and
correct the formatting.

Final Step Formatting Requirements

• The final step should state: “The best answer is” followed by
the letter of the correct option.
• Do not box the final answer or alter this formatting at all.

Example if the answer to the prompt was “A. 55 Joules”.

Text of Final Step:

The best answer is A.

Note: There can be statements before this, but the end of the final step should state,
“The best answer is” followed by the letter answer, with no additional formatting.

• No additional formatting, such as boxing the answer, adding a colon, or
calling the answer “final” instead of “best”, should be added to the statement
“The best answer is” followed by the letter.
• Your task will be rejected if you alter this formatting.
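
The final-step format described above is strict enough to test with a regular expression. This is a minimal sketch under two assumptions: the letter is one of A to D, and the trailing period shown in the example is required. The name `final_step_ok` is hypothetical.

```python
import re

# The final step must end with: The best answer is <letter>.
FINAL_RE = re.compile(r"The best answer is [A-D]\.$")


def final_step_ok(step_text):
    """Return True if the step ends with the required final-answer sentence."""
    return FINAL_RE.search(step_text.strip()) is not None


print(final_step_ok("The energy is 55 Joules. The best answer is A."))  # True
print(final_step_ok("The final answer is A."))                          # False
print(final_step_ok("The best answer is: A."))                          # False
print(final_step_ok(r"The best answer is $\boxed{A}$."))                # False
```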
Step 4: Assess Response Quality

I. LABEL THE RESPONSE

In this final step of the task, you will encounter the following:

Label your response according to the questions below.

Accuracy

• Is the first model response entirely correct?
o Yes
o No

Instruction Following

• Does the model follow instructions and understand the specifications of the prompt?
(Note: the model can be inaccurate and still follow instructions.)
o Yes
o No

NOTE: If you corrected the final step for not following the “The best answer
is” format, you must mark the Instruction Following box as “No”.

DEFINITIONS

• Accuracy
o Did the model make mistakes either in its calculation or reasoning in the first
turn?

• Instruction Following
o Thinking of the model as a child, did the model attempt to solve your prompt
even if it made mistakes along the way?
o NOTE: If you corrected the final step for not
following the “The best answer is” format, you
must mark the Instruction Following box as “No”.
Appendix

COMMON ERRORS
COMMON ERRORS in Mail Valley and how to avoid them!

Error: Ambiguity and Open-Ended Prompts

Explanation & Examples:
Many prompts lack specificity or clarity, allowing for multiple interpretations, leading to
confusion.
Some prompts have erroneous steps that impact overall task validity.
Others are too basic or common, reducing their effectiveness in challenging the model.
Includes unclear phrasing and prompts with multiple possible answers, which confuse the
model and reviewers.

How to avoid:
• Use specific, unambiguous language; avoid words or phrases with multiple interpretations to
prevent subjective answers.
• Ensure prompts lead to a single, correct solution; avoid open-ended or ambiguous
prompts.
• Make prompts challenging yet realistic and feasible, avoiding unnecessary complexity or
convoluted scenarios.
• Examples of Clear vs. Vague Prompts:
o Clear: “Calculate the equilibrium income when investment rises by $53 million,
given the initial equation C = 321 + 0.40Y.”
o Vague: “Explain how income changes when investment goes up.”

Error: Incorrect Assumptions and Inaccurate Domain Matching

Explanation & Examples:
Several prompts misrepresent the skills.

How to avoid:
• Confirm that the prompt requires domain-specific knowledge and adheres to the skills
provided.
• Ask yourself the following:
o "Does it match the skills assigned?"
o If not, rework it or skip.

Error: Poor Grammar, Syntax, and Typos

Explanation & Examples:
We often see grammatical errors, poor phrasing, and even typos in both prompts and
rewrites, which sometimes result in inaccurate information or misinterpretations.

How to avoid:
• Review each prompt carefully before submission, ensuring it’s free of grammar,
spelling, and syntax errors.
• Use grammar checks to catch simple mistakes.

Error: Incorrect Step Labeling and Poor Rewrite Placement

Explanation & Examples:
Steps marked as incorrect are sometimes correct, while others marked correct actually
contain errors.
Additionally, rewrites are occasionally placed in the justification box instead of in the
designated rewrite section, causing confusion.

How to avoid:
• Verify that all "incorrect" labels are applied only to genuinely incorrect steps.
• Rewrites should always go in the designated section, not in the justification box, to maintain
clarity.

Error: Not understanding Accuracy vs. Instruction Following

Explanation & Examples:
Attempters often conflate "accuracy" with "instruction following," leading to partial
responses marked as correct.

How to avoid:
• Evaluate all aspects of the model’s response, including calculations, logic, and relevance, to
confirm accuracy.
• Check if the model reasonably followed prompt instructions even if it made mistakes,
distinguishing between attempts and outright failures.

Error: Vague Justifications

Explanation & Examples:
Justifications often lack sufficient detail, sometimes failing to address the errors
comprehensively or provide a clear rationale for corrections.

How to avoid:
• Provide concise but comprehensive justifications, focusing specifically on the
nature of the error (e.g., logical or computational) and how the rewrite corrects it.
Brief examples could clarify this.
o Example Justification: “This step contains a logical error because it
assumes that a base change occurs only when the result is prime, which was not
stated in the prompt. The rewrite corrects this by clarifying the conditions.”

Error: Unclear Rewrites

Explanation & Examples:
Unclear rewrites can lead to misunderstandings or new errors, affecting task accuracy.

How to avoid:
• Ensure rewrites are clear, self-contained, and stand alone as correct without needing
additional context.

Error: Missed Calculation Errors

Explanation & Examples:
Missing calculation errors can result in false conclusions or a flawed final answer.

How to avoid:
• Prioritize errors in steps where multiple issues may occur, giving precedence to logical errors
over minor computational mistakes.
• Don’t mark redundant steps as incorrect if they’re accurate. Only mark as an error if the
step misleads or contradicts the solution.

Error: Misunderstanding Calculation vs. Reasoning Errors

Explanation & Examples:
Some Attempters don’t correctly distinguish between calculation and reasoning errors.

How to avoid:
• Make sure to differentiate Error Types:
o Calculation Error: Mistakes in mathematical computation.
▪ Example: Incorrect simplification, like “2 - (3 + 4) = 2 - 3 + 4 = 3”
(should be -5).
o Reasoning Error: Errors in logic that lead to inaccurate conclusions.
▪ Example: Misinterpreting problem constraints, e.g.,
“Assuming Bob could only be 1st or 2nd place, but concluding he
came in 3rd.”
o If a step contains both errors, prioritize reasoning errors for correction, as
these impact the model’s problem-solving approach more broadly.

Prompt Examples
(Do not copy. These are just to give you some inspiration.)

Skill: Biology
Prompt:
Four scientists are working to create a recombinant DNA plasmid
that expresses a specific gene. To achieve this, they want to insert
the "Coding sequence and Terminator" of a gene into an
expression plasmid, using restriction enzymes to cut the plasmid
and the gene, followed by ligation to join the fragments. The
researchers have access to four restriction enzymes (A, B, C, and
D) with different recognition sequences. They each proposed a
strategy for digestion and ligation to achieve this goal.

Plasmid: ->(Ori) -> (Selection Marker) -> (Promoter) -> (C site) ->
(B site) -> (A site) ->

Gene of Interest: (A site) -> (B site) -> (Coding Sequence) -> (A
site) -> (Terminator) -> (D site) -> (C site)

The recognition sequences of restriction enzymes are as follows:

Enzyme A:
CCC|GGG,
GGG|CCC.

Enzyme B:
G|GATC C,
C CTAG|G.

Enzyme C:
G ACGT|C,
C|TGCA G.

Enzyme D:
A|GATC T,
T CTAG|A.

Strategies proposed by the scientists:

Scientist 1: Digestion with Enzyme A followed by ligation.

Scientist 2: Digestion with Enzyme C and B followed by ligation.

Scientist 3: Digestion with Enzyme B and D followed by ligation.

Scientist 4: Digestion with Enzyme C and A followed by ligation.

Which researcher would successfully produce the desired
recombinant plasmid?
A: Scientist 1
B: Scientist 2
C: Scientist 3
D: Scientist 4

Skill: Physics
Prompt:
What is the tunneling probability of an electron with an energy of
$4.5 \text{ eV}$ through a potential barrier with a width of $1.5
\text{ nm}$ and a height of $5 \text{ eV}$?

A: $1.087 \times 10^{-1}$
B: $1.31 \times 10^{-5}$
C: $1.9 \times 10^{-5}$
D: $9.209 \times 10^{-6}$

Skill: Chemistry
Prompt:
Using the following concentration and absorbance data make a
calibration plot. Concentration (ppm): 1.2, 2.3, 5.0, 8.4, 11.8.
Absorbance: 0.15, 0.26, 0.65, 0.99, 1.31. Which sample
transmitted % of light to have concentration of 3.66 ppm?
A. 18%
B. 63%
C. 37%
D. 88%

Skill: Physics
Prompt:
An electron wavefunction is given by $\psi = A \sin(2\pi x/L)$,
where $L=3\times 10^{-8} \, \text{m}$. What are the possible
outcomes of a measurement of the electron's momentum along the
$x$ direction?

A. $\pm56.3 \, \text{eV}/c$
B. $\pm 41.3 \, \text{eV}/c$
C. $\pm24.5 \, \text{eV}/c$
D. $\pm35.6 \, \text{eV}$

Skill: Biology
Prompt:
A female mouse inherited the alleles H, B, t from her mother and h,
b, t from her father.
A male mouse inherited the alleles h, b, T from his mother and h,
b, t from his father.
H and h code for short and long hair respectively. B and b code for
black and grey shed. T and t code for long and short tail.
Genes H and B are linked and have a recombination frequency of
0.25.
If these two mice have progeny together, what are the odds of
them having a short-haired, grey-shed, short-tailed mouse?
A - 12.5 %
B - 18.75 %
C - 6.25 %
D - 25 %

Skill: Biology
Prompt:
You are processing a 6k bp plasmid for cloning. You first digested
it with XbaI, which cleaves it at positions 995, 3000 and 4995. You
then digest it with NspI, which cleaves the plasmid at positions
2005, 3998 and 5995. You analysed the mixture in an agarose gel,
which cannot resolve fragments that are within 30 bp in size. How
many bands will you observe in the gel?
A: 1
B: 2
C: 3
D: 4

Skill: Biology
Prompt:
You have a DNA sequence:
ATGAGATCCGCGAGCGAGCGATGGCGACGA. You have
mutated the 8th base counting from the right to A, and the 7th base
counting from the left to C. What will the sequence translate into?
Note: Stop when encounter stop codon.
A: MRSASERWRR
B: MRASWR
C: MRSAESAWAA
D: MRPASER

Skill: Chemistry
Prompt:
The standard reduction potential for the reduction half-reaction
$Cu^{2+} + 2e^- \rightarrow Cu$ is 0.34 V. If the concentration of
$Cu^{2+}$ is 0.025 M and the concentration of copper is 0.85 M,
what is the cell potential at 298 K using the Nernst equation?
A. 0.26 V
B. 0.29 V
C. 0.31 V
D. 0.34 V

Skill: Chemistry
Prompt:
A $Fe^{3+}$ ion is placed in an octahedral crystal field with a
splitting energy $\Delta _0=1050 cm^{-1}$. Calculate the crystal
field stabilization energy for the ion in this environment. Assume
$P = 1600 cm^{-1}$.
A. $-1800 cm^{-1}$
B. $1350 cm^{-1}$
C. $2500 cm^{-1}$
D. $0 cm^{-1}$

Skill: Physics
Prompt:
What is the ratio of energy of third and fifth excited energy levels in
one dimensional half-harmonic oscillator?
A)$0.652$ B)$0.636$ C)$0.689$ D)$0.640$

Skill: Physics
Prompt:
Find the streamline equation for the velocity field
\(\mathbf{V}=u\mathbf{i}+v\mathbf{j}+w\mathbf{k}\), where \(u=6x\),
\(v=-2y\), and \(w=2z\), passing through the point \((2,2,0)\).
\[
\begin{aligned}&\text{A}.\,\,\,yx^{1/3}=2\sqrt[3]{2},\,z=0\\&\text{B}.\,\,\,xy^{1/3}=3\sqrt[3]{2},\,z=0\\&\text{C}.\,\,\,yx^{-1/3}=3\sqrt[3]{2},\,z=1\\&\text{D}.\,\,\,xy^{-1/3}=2\sqrt[3]{2},\,z=1\end{aligned}
\]

Skill: Biology
Prompt:
You're working with a patient who likely has a genetic disease.
This particular disease is known to be autosomal recessive. She is
adopted so she doesn't know much about her biological family.
What is the risk that the patient's biological grandmother is the
carrier of an abnormal allele that causes the disease?

A. 0%
B. 50%
C. 100%
D. 25%

Skill: Chemistry
Prompt:
A molecule absorbs light at 220 nm. What is the corresponding
energy of this transition?
a. 1.92 eV
b. 4.13 eV
c. 5.64 eV
d. 5.71 eV

Skill: Chemistry
Prompt:
What is the energy required to promote a helium atom confined in
a box of length 1 metre into the first excited state?
A. 2.577x10$^{-41}$ J
B. 2.177x10$^{-41}$ J
C. 2.477x10$^{-41}$ J
D. 2.677x10$^{-41}$ J

Skill: Chemistry
Prompt:
A 25.00 mL sample of 0.070 M acetic acid is titrated with 0.090 M
NaOH. After adding 18.30 mL of NaOH, what is the pH of the
solution? Assume the dissociation constant $K_a$ for acetic acid
is $1.8 \times 10^{-5}$.
A. 4.74
B. 4.95
C. 5.74
D. 5.95
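
To illustrate the kind of multi-step calculation a prompt like the acetic acid titration above demands, here is a hedged worked sketch in Python. It assumes the Henderson-Hasselbalch buffer approximation applies (the titration is before the equivalence point) and that volumes are additive; the variable names are illustrative.

```python
import math

ka = 1.8e-5
mmol_acid = 25.00 * 0.070  # initial acetic acid, in mmol
mmol_base = 18.30 * 0.090  # NaOH added, in mmol

# NaOH converts acetic acid to acetate one-to-one.
mmol_acetate = mmol_base
mmol_acid_left = mmol_acid - mmol_base

# Henderson-Hasselbalch: pH = pKa + log10([A-]/[HA]); the shared volume cancels.
ph = -math.log10(ka) + math.log10(mmol_acetate / mmol_acid_left)
print(round(ph, 2))  # 5.95
```

A common slip in a problem like this is to stop at pH = pKa (4.74), which only holds at the half-equivalence point.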

Grading Rubric
Each dimension is rated 1-2 (Fail), 3 (Okay), or 4-5 (Good/Perfect).

Prompt Rubric
How your prompt will be rated

Dimension: Prompt Requirements
• 1-2 (Fail): Major requirements missing. The prompt does not lead to reasoning or calculation errors or doesn't match the skills.
• 3 (Okay): Meets requirements but has minor issues. The prompt leads to one reasoning or calculation error.
• 4-5 (Good/Perfect): All requirements met. The prompt matches the skills and leads to a reasoning or calculation error.

Dimension: Prompt Clarity & Specificity
• 1-2 (Fail): Major clarity issues; prompt is vague or hard to follow; key details missing.
• 3 (Okay): Mostly clear, but could be interpreted multiple ways or lacks a minor detail.
• 4-5 (Good/Perfect): Clear and specific; all necessary information is included.

Rating Rubric
How your ratings of the Initial Response will be evaluated

Dimension: Correctness of Individual Steps
• 1-2 (Fail): Attempter incorrectly identifies at least one step, and justification is inadequate.
• 3 (Okay): Some steps are correctly identified, but there is minor misjudgment.
• 4-5 (Good/Perfect): All steps are correctly identified.

Dimension: Initial Response Labels
• 1-2 (Fail): Major issue with labeling the response (e.g., incorrect labels for steps).
• 3 (Okay): N/A
• 4-5 (Good/Perfect): All labels are correctly selected; the response is accurately labeled.

Justification Rubric
How reviewers will grade your written ratings justifications

Dimension: Analysis
• 1-2 (Fail): Justification is generic or verdict is skewed.
• 3 (Okay): N/A
• 4-5 (Good/Perfect): Justification is specific and balanced; verdict is reasonable.

Dimension: Support Claims
• 1-2 (Fail): Claims contradict the verdict or are inaccurate; 2+ claims lack evidence.
• 3 (Okay): 1 claim lacks evidence, but the rest are accurate and well-supported.
• 4-5 (Good/Perfect): All claims logically defend the verdict, are accurate, and supported by evidence.

Dimension: Accuracy
• 1-2 (Fail): 1 or more pieces of evidence are inaccurate or misconstrued.
• 3 (Okay): 1 piece of evidence is misconstrued.
• 4-5 (Good/Perfect): All evidence is accurate and not misconstrued.

Rewrite Rubric
How reviewers will grade your rewritten response steps

Dimension: Accuracy
• 1-2 (Fail): 1 or more major factual errors or misleading points.
• 3 (Okay): 2 or more minor factual errors or misleading points.
• 4-5 (Good/Perfect): No major factual errors or misleading points.

Dimension: Rewritten Steps Quality
• 1-2 (Fail): Clearly worse than the model response. Steps should be rewritten but aren't.
• 3 (Okay): About the same quality as the model response.
• 4-5 (Good/Perfect): Step clearly performs better than the model response; rewritten to the correct degree.

Overall Task Quality
How reviewers will grade the Overall Quality of your task

Dimension: Original Work
• 1-2 (Fail): Content contains evidence of chatbot usage or plagiarism.
• 3 (Okay): N/A
• 4-5 (Good/Perfect): No chatbot usage or plagiarism; content is original.

Dimension: Harmful Content
• 1-2 (Fail): Contains harmful content.
• 3 (Okay): N/A
• 4-5 (Good/Perfect): No harmful content is present.

Dimension: Spelling/Grammar/Formatting
• 1-2 (Fail): 4+ minor spelling/grammar/punctuation errors, or major formatting issues.
• 3 (Okay): 1-3 minor errors or 1 egregious spelling error.
• 4-5 (Good/Perfect): No discernible errors; clean formatting.

Dimension: Clarity & Structure
• 1-2 (Fail): Content is hard to follow or unclear.
• 3 (Okay): Minor clarity issues; generally makes sense.
• 4-5 (Good/Perfect): Clear and easy to follow.

Dimension: Repetitiveness/Relevance
• 1-2 (Fail): Contains 3+ repetitive or irrelevant sentences.
• 3 (Okay): 1-2 repetitive or irrelevant sentences.
• 4-5 (Good/Perfect): No unnecessary repetition or irrelevant content.

Reviewers, please navigate to the Reviewer Instructions.

Mail Valley Reasoning


Reviewer Checklist
For this project, the attempter’s task is to:
• Write a prompt to stump the model
• Provide a time estimate for devising a solution to and completing the prompt
• Annotate the initial response
• Rewrite the incorrect steps
• Provide the correct final solution to the prompt

As a reviewer, your task will be to:
• Rate the attempter’s prompt
• Determine if their time estimates are reasonable
• Determine if the attempter annotated the initial response and correctness ratings accurately
• Review and potentially fix their rewritten steps
• Review and potentially revise the final solution provided by the attempter
• Rate the overall quality of the attempter’s task

You will be asked to:


🔴 Reject the task.
🟡 Fix and submit the task.
🟢 Approve the task as is with no changes.
✍️ Provide feedback to the attempter on their work quality.

Whether you reject, fix, or accept the task will be determined by the guidelines in the Reviewer
Checklist below.

Checklist Criteria for Reviewers


Quality of the Task | Action | Checklist Criteria (for Reviewers)

1-2 Poor
Action: Select [Reject] as the Reviewer Verdict, or Fix + Select [Approve] as the Reviewer Verdict
The prompt follows any 1 of these criteria:
- The prompt does not trick the model into making at least 1 reasoning/calculation error
- The prompt is missing necessary details to solve the problem, is unclear and difficult to follow (beyond expected knowledge in the field)
- The prompt does not match the skill/subskill type assigned
- The prompt is impractical or contains an impossible request
- The prompt gives conflicting instructions that cannot be completed at the same time
- The time estimate for a human to complete the prompt is unreasonable
- The prompt, as copied into the “write the question from…” field, does not contain the full prompt with the answer options removed.
- The prompt doesn’t have a single, objectively correct answer, with or without the multiple choice options present (question is opinion based, open ended, has multiple correct answers if you weren’t limited to the answer options, etc).
- The prompt’s question does not make sense without the answer options present (references the answer options, is fill-in-the-blank style, ends with phrasing like “the best explanation is:”). This also precludes answers like “none of the above” or “a and b”.
- The prompt asks more than one question.
- The prompt is not in the 4-answer MCQ format.
The step correctness ratings and first response labels follow these criteria:
- At least 1 response step has been labeled incorrect that should have been labeled correct or vice versa
The rewritten steps follow these criteria:
- The rewrites do not address errors in execution and/or combine multiple steps into one
- The rewritten step is unclear and cannot be understood on its own and it is not properly formatted
- The rewrites contain 1 or more major and/or 2 or more minor factual errors
- The rewrites are worse than the original response and/or only modified when they needed to be completely rewritten
Note: Do not penalize an attempter for making minimal changes to the original response.
The reason given for incorrect steps follows these criteria:
- The reason given does not explain why the step is incorrect
- The reason given does not specify if the step is the final solution to the prompt or not
- The reason given is not specific to the given step
- The reason given overstates the impact of minor issues or understates the significance of major errors
- The reason given lacks sufficient or accurate evidence to support the reason for 2 or more steps
The final answer to the prompt follows these criteria:
- The final answer provided by the contributor is incorrect
- The final answer step is not formatted as “The best answer is” followed by the letter of the answer.
- The final answer step needed to be corrected to “the best answer is” but the “Instruction Following” field wasn’t marked as “no”
Overall sentiment:
- The task will need significant revisions that will take you more than 35 minutes to correct. Provide detailed feedback to the contributor and reject the task.

3 - Adequate
Action: Fix + Select [Approve] as the Reviewer Verdict
The prompt follows these criteria:
- The prompt tricks the model into making at least 1 reasoning/calculation error
- The prompt includes all necessary details to solve the problem and matches the topic/reasoning type assigned
- The prompt is mostly clear, but the request could be interpreted in multiple ways
- The prompt’s request is close to being impractical and the model will not be able to fulfill everything asked, but the prompt is still answerable
- Format may need some reworking
- The prompt difficulty level is labeled correctly
- The time estimate for a human to complete the prompt is reasonable
The step correctness ratings and first response labels follow these criteria:
- The contributor has correctly labeled most of the steps as “correct” or “incorrect”
- The contributor may have labeled the initial response incorrectly (i.e. selected “yes” or “no” inappropriately for instruction following), but this can easily be adjusted
The rewritten steps follow these criteria:
- The rewritten response contains 0 major factual errors and only 1 minor factual error or misleading statement
- The rewritten step would likely perform about the same as the original step
The reason given for incorrect steps follows these criteria:
- All claims logically defend the verdict and are accurate
- 1 claim does not have sufficient evidence to support the reason, but can be easily fixed
The final answer to the prompt follows these criteria:
- The final answer provided by the contributor is correct, but could be clearer
Overall sentiment:
- The task is mostly correct but requires multiple fixes (which take < 25 minutes) before it can be sent to the customer.

4 Excellent
Action: Fix + Select [Approve] as the Reviewer Verdict
The prompt follows these criteria:
- All necessary criteria are met
The step correctness ratings and first response labels follow these criteria:
- The contributor has correctly labeled all steps and response labels
The rewritten steps follow these criteria:
- The rewritten response contains 0 factual errors and 0 misleading statements
- The rewritten step would clearly perform better than the original and is correct
The reason given for incorrect steps follows these criteria:
- The reason given is specific to the steps
- All supporting claims are accurate, contain sufficient evidence, and logically defend the reason given
The final answer to the prompt follows these criteria:
- The final answer provided by the contributor is correct
Overall sentiment:
- The task is solid but requires small refinements before it is sent to the customer.

5 Excellent
Action: Select [Approve] as the Reviewer Verdict
The overall task meets the following criteria:
- All necessary criteria met with 0 changes necessary
Overall sentiment:
- The task is perfect and ready to be sent to the customer.

Rubrics to Reference
The rubrics below are the ones our internal quality team will use to measure the quality of tasks that have been reviewed.

Grading Rubric
All tasks on this project will be reviewed according to the following rubrics:

SFT Prompt

Field | 1-2 (Fail) | 3 (Okay) | 4-5 (Good/Perfect) | Additional Notes

Prompt Requirements
1-2 (Fail): [Major Requirements Missing]
- Prompt does not lead to an error in reasoning in at least one step
- The prompt does not match the topic and/or reasoning type assigned
4-5 (Good/Perfect):
- Prompt leads to an error in reasoning in at least one step
- The prompt matches the topic and reasoning type assigned

Prompt Clarity and Specificity
1-2 (Fail): [Major Clarity / Specificity Issues]
- It's not clear what is being asked, the prompt is extremely difficult to follow, or is overly vague
- Major details are missing that are needed to answer the prompt
3 (Okay): [Minor Clarity / Specificity Issues]
- It's mostly clear what is being asked but the request could reasonably be interpreted multiple ways
4-5 (Good/Perfect):
- There is little to no room for misinterpretation of the specific request
- Prompt has a specific request that doesn't require more than one minor assumption to answer it

Feasibility
1-2 (Fail): [Major Feasibility Issues]
- Prompt contains an impractical or impossible request that can't be answered by an LLM in a single response
- Prompt gives conflicting/contradicting instructions that can't be fulfilled simultaneously (unless specifically instructed to do so)
3 (Okay): [Minor Feasibility Issues]
- Prompt's request is verging on being impractical and the LLM won't be able to completely fulfill everything asked in the prompt, but the prompt is still answerable with concessions
4-5 (Good/Perfect):
- The prompt is completely actionable by an LLM or chatbot
- The prompt contains no conflicting instructions/statements

SFT Response Rubric - Incorrect Step Rewrite

Field | 1-2 (Fail) | 3 (Okay) | 4-5 (Good/Perfect) | Additional Notes

Accuracy
1-2 (Fail):
- [Major Factual Errors] The response contains 1 or more major factual errors or misleading points
- [Minor Factual Errors] The response contains 2 or more minor factual errors or misleading points
3 (Okay):
- The response contains 0 major factual errors.
- [Minor Factual Error] The response contains only one minor factual error or misleading statement.
4-5 (Good/Perfect):
- The response has no factual errors
- The response has no misleading statements
Additional Notes:
A Major error (i.e. factuality, calculations, figures, or references) is an incorrect or misleading statement which is central to the actual subject matter or response fulfillment of the request.
A Minor error is an incorrect or misleading statement which is adjacent to the actual subject matter or response fulfillment of the request.

Rewritten Steps Clearly Worse Than Model Response
1-2 (Fail):
- [Worse than Original Model Response] The rewritten step would clearly perform worse overall across the rubric dimensions
- The rewritten step is not rewritten to the extent that would be expected depending on the degree of correctness (e.g., the step is modified when it should be entirely rewritten)
3 (Okay):
- [Same Quality as Model Response] The updated step would likely perform about the same overall across the rubric dimensions
4-5 (Good/Perfect):
- The updated step would clearly perform better overall across the rubric dimensions
- The rewritten step is rewritten to be correct either as a modification to the original step or a completely new step depending on the degree of correctness
Additional Notes:
*Important Note* - Do not penalize an attempter/tasker for making minimal changes to the original response.
Note to operators: the model we are comparing against should always be state of the art.
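
The Accuracy thresholds in the rubric above reduce to a small decision rule. The following is a minimal Python sketch of that rule, assuming a reviewer has already counted the major and minor errors in a rewritten step. The function name `accuracy_band` and its parameters are hypothetical and are not part of any project tooling.

```python
def accuracy_band(major_errors: int, minor_errors: int) -> str:
    """Map error counts to a rubric band for the Accuracy field.

    Hypothetical helper for illustration only. Thresholds follow the
    rubric text: 1+ major or 2+ minor errors fail, exactly 1 minor
    error is okay, and no errors at all is good/perfect.
    """
    if major_errors < 0 or minor_errors < 0:
        raise ValueError("error counts must be non-negative")
    if major_errors >= 1 or minor_errors >= 2:
        return "1-2 (Fail)"
    if minor_errors == 1:
        return "3 (Okay)"
    return "4-5 (Good/Perfect)"
```

For example, a rewrite with no major errors and a single minor misleading statement would land in the 3 (Okay) band under this reading.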

RLHF Ratings (Step Correctness & Initial Response Annotations)

Field | 1-2 (Fail) | 3 (Okay) | 4-5 (Good/Perfect) | Additional Notes

Correctness of Individual Steps
1-2 (Fail):
- [Major Issues] The contributor has incorrectly identified at least one correct step or incorrect step, and the justification does not adequately support this variance
4-5 (Good/Perfect):
- The contributor has correctly identified all steps that are correct, and all steps that are incorrect
Additional Notes:
Note: Do NOT penalize the model for:
- Correct but badly formatted solutions (e.g., does not use LaTeX; uses a decimal when a fraction would be better)
- Correct but inefficient solutions
- Correct solutions that contain written preambles or summaries
- Correct but irrelevant steps

Initial Response Labels
1-2 (Fail):
- [Major Issue] The contributor has incorrectly labeled the initial response (e.g., marked the response as correct when it was incorrect; marked as yes for instruction following when an explicit instruction was missed)
3 (Okay): N/A
4-5 (Good/Perfect):
- The contributor has correctly labeled the initial response

RLHF Justification - Incorrect Step Rationale

Criteria | 1-2 (Fail) | 3 (Okay) | 4-5 (Good/Perfect) | Additional Notes

Analysis
1-2 (Fail):
- [Generic Justification] The rationale isn't specific to the given step and the responses have material differences
- [Skewed Assessment] Rationale overstates the impact of minor issues or understates the significance of major errors leading to the wrong verdict
3 (Okay): N/A
4-5 (Good/Perfect):
- The rationale is not generic and is not skewed

Support Claims
1-2 (Fail):
- [Claims Contradict Verdict] 1 or more supporting claims does not logically defend the verdict
- [Claims Inaccurate] 1 or more of the claims being made is inaccurate
- [Claims Lack Evidence (2+)] 2 or more claims do not have sufficient evidence within the rationale
3 (Okay):
- All claims logically defend the verdict
- All claims being made are accurate
- [Claim Lacks Evidence (1)] 1 claim does not have sufficient evidence within the rationale
4-5 (Good/Perfect):
- All supporting claims logically defend the verdict
- All claims being made are accurate
- All claims have sufficient evidence within the rationale
Additional Notes:
*NOTE* Some claims are self-evident and require little in the way of stated evidence to support

Accuracy
1-2 (Fail):
- [Evidence Inaccurate] 1 piece of evidence used in the rationale is inaccurate
- [Evidence Misconstrued (2)] 2 pieces of evidence used in the rationale are being misconstrued
3 (Okay):
- All evidence is accurate
- [Evidence Misconstrued (1)] 1 piece of evidence in the rationale is being misconstrued
4-5 (Good/Perfect):
- All evidence is accurate and not misconstrued
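
The Support Claims and Accuracy rows of the justification rubric above can be read as two counting rules. Below is a minimal Python sketch under that reading. The function names and parameters are hypothetical, and treating "1 piece of evidence is inaccurate" and "2 pieces of evidence are misconstrued" as "1 or more" and "2 or more" is an assumption, since the rubric does not say what happens beyond those counts.

```python
def support_claims_band(contradicting: int, inaccurate: int,
                        lacking_evidence: int) -> str:
    """Band for the Support Claims row (hypothetical helper)."""
    # Any contradicting or inaccurate claim fails, as do 2+ unsupported claims.
    if contradicting >= 1 or inaccurate >= 1 or lacking_evidence >= 2:
        return "1-2 (Fail)"
    # Exactly one claim without sufficient evidence is the 'okay' band.
    if lacking_evidence == 1:
        return "3 (Okay)"
    return "4-5 (Good/Perfect)"


def evidence_accuracy_band(inaccurate: int, misconstrued: int) -> str:
    """Band for the Accuracy row of the justification rubric.

    Assumption: counts above the thresholds named in the rubric also fail.
    """
    if inaccurate >= 1 or misconstrued >= 2:
        return "1-2 (Fail)"
    if misconstrued == 1:
        return "3 (Okay)"
    return "4-5 (Good/Perfect)"
```

Keeping the two rows as separate functions mirrors the rubric, which grades claims and evidence independently.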

Core Dimensions (Applicable across all CB generated/edited content)

Criteria 1-2 (Fail) 3 (Okay) 4-5 (Good/ Perfect) Additional Notes


Original Work (QC Only)
1-2 (Fail):
- [Chatbot Usage] The content contains significant evidence of chatbot usage
- [Plagiarism] The content contains direct plagiarism without citation
Note: Evidence of chatbot usage could be excessive usage of fluff / pleasantries, generic responses, lack of human understanding and nuance, etc.
3 (Okay): N/A
4-5 (Good/Perfect):
- The content does not contain significant evidence of chatbot usage
- The content does not contain direct plagiarism without citation
Additional Notes: N/A

Harmful Content
1-2 (Fail):
- [Harmful Content] The content contains any harmful content
3 (Okay): N/A
4-5 (Good/Perfect):
- Content does not contain or ask about harmful content
Additional Notes:
See project specific or general guidelines on what constitutes "harmful content".
Beyond project specific, Scale's safety team enumerates two types of harmful content:
1. Content harms - unsafe text (bigotry, conspiracy theories)
2. Facilitation harms - text that enables unsafe behavior (how to make a bomb)
Be on the lookout for both!

Native Fluency
1-2 (Fail):
- [Lacks Native Fluency] Writing is not that of a native speaker in the specified language and locale.
3 (Okay):
- [Some Fluency Errors] Writing is mostly that of a native speaker, but there is 1 strange phrase that gives you pause.
4-5 (Good/Perfect):
- Native-level writing for specified language and locale.
Additional Notes:
Things to consider include writing style, tone, word choice, verbosity, and awkward sentence structure.
Examples include:
- Word for word translation from another language
- Incorrect word order: In German, saying "Ich habe gekauft das Buch" instead of "Ich habe das Buch gekauft" (I bought the book).
- Overuse of pronouns: In Spanish, saying "Yo voy a la tienda, y yo compro pan" instead of just "Voy a la tienda y compro pan" (I go to the store and buy bread).
- Literal translations of idioms: In French, saying "Il pleut des chats et des chiens" (It's raining cats and dogs) instead of the correct idiom "Il pleut des cordes" (It's raining ropes).
- Formal/informal confusion: In Japanese, using casual language (友達語) when a more formal style (敬語) is appropriate, or vice versa.

Spelling / Grammar / Formatting
1-2 (Fail):
- [Grammar and Punctuation Errors] Has 4 or more spelling (minor), grammar, and punctuation errors
- [Multiple Egregious Spelling Errors] Has 2 or more egregious spelling errors
- [Major Formatting Errors] The response contains broken formatting such as broken formatting for a list or broken markdown
Note: For spelling, grammar, and punctuation, errors accumulate across the entire task. For example, if the prompt had 2 spelling errors and the response had 2 spelling errors, we'd fail the task
3 (Okay):
- [Multiple Minor Errors] Has 3 spelling, grammar and punctuation errors
- [Egregious Spelling Error] Has 1 egregious spelling error
- [Minor Formatting Error] Minor formatting issues such as multiple new lines between content
4-5 (Good/Perfect):
- (5) There are no easily discernible errors
- (4) [Minor Errors (1-2)] minor errors in spelling / grammar / formatting
Additional Notes:
NOTE: Scale these standards to the length of content as appropriate. I.E. if the content is very short (a paragraph or less) you may grade more harshly
"Grammar" as used here constitutes punctuation, syntax, wording, sentence, word choices, etc
An egregious error is something that changes the meaning of what's written or is a completely jumbled spelling or sentence

Clarity / Structure
1-2 (Fail):
- [Major Clarity Issues] Content is extremely difficult to follow or is unclear.
3 (Okay):
- [Minor Clarity Issues] Content makes sense but has some minor clarity issues.
4-5 (Good/Perfect):
- Content is clear and makes sense
Additional Notes:
Consider whether the content can be improved by altering word choice/syntax, sentence structure, or idea organization.
Note: For Reasoning tasks this would require outlining the logical steps required to reach a given conclusion

Repetitiveness / Relevance
1-2 (Fail):
- [Repetitive Content] The content contains unnecessary repetition, having 3 or more sentences that express the exact same idea
- [Not Relevant (4+)] The content contains 4 or more irrelevant sentences
3 (Okay):
- [Not Relevant] The content contains 3 irrelevant sentences
4-5 (Good/Perfect):
- The content does not contain unnecessary repetition, having 2 or fewer sentences that express the same idea
- The content contains 2 or fewer irrelevant sentences
Additional Notes:
NOTE: Scale these standards to the length of content as appropriate. I.E. if the content is very short (a paragraph or less) you may grade more harshly
*Rule of thumb* - You could delete the irrelevant or repetitive material and it would not materially detract from the content
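
The counting rules in the Spelling / Grammar / Formatting and Repetitiveness / Relevance rows above, including the note that errors accumulate across the entire task, can be sketched in Python as follows. This is an illustration under stated assumptions, not project tooling: the function names, the per-section dictionary, and the boolean formatting flags are all hypothetical, and the sketch does not apply the note about grading short content more harshly.

```python
from typing import Mapping


def spelling_grammar_band(minor_errors_by_section: Mapping[str, int],
                          egregious_spelling: int = 0,
                          major_formatting: bool = False,
                          minor_formatting: bool = False) -> str:
    """Band for Spelling / Grammar / Formatting (hypothetical helper).

    Minor errors are summed over every section of the task (for example
    the prompt and the response), per the accumulation note in the rubric.
    """
    total_minor = sum(minor_errors_by_section.values())
    if total_minor >= 4 or egregious_spelling >= 2 or major_formatting:
        return "1-2 (Fail)"
    if total_minor == 3 or egregious_spelling == 1 or minor_formatting:
        return "3 (Okay)"
    # The rubric splits the top band: 1-2 minor errors score 4, none scores 5.
    return "4" if total_minor >= 1 else "5"


def repetitiveness_band(repetitive_sentences: int,
                        irrelevant_sentences: int) -> str:
    """Band for Repetitiveness / Relevance (hypothetical helper)."""
    if repetitive_sentences >= 3 or irrelevant_sentences >= 4:
        return "1-2 (Fail)"
    if irrelevant_sentences == 3:
        return "3 (Okay)"
    return "4-5 (Good/Perfect)"


# The rubric's own example: 2 errors in the prompt plus 2 in the response.
print(spelling_grammar_band({"prompt": 2, "response": 2}))  # prints "1-2 (Fail)"
```

Summing over sections before comparing against the thresholds is what makes the rubric's example fail even though neither section alone reaches 4 errors.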
