Mail Valley Reasoning
Attempter Specifications
Table of Contents
Project Introduction
Helpful Links
Task Steps Overview
Task Specifications
Step 1: Write a Prompt to Stump the Model
The Phrasing of Your Prompt Should Not Reference the Fact That There Are Answer Options Present
Please Make Sure You Copy the Entire Question into the “Write the question from the multiple-choice prompt” Field
Step 2: Annotate the Initial Response
When to Label a Step as LLM
When to Label a Step as Python
Some very brief examples
Step 3: Rewrite Incorrect Steps
Step 4: Assess Response Quality
Appendix
COMMON ERRORS
Prompt Examples
Grading Rubric
Project Introduction
Your job is to create a complex prompt (a multiple-choice question) that causes the model to
make a reasoning or calculation error, rate the model response, and rewrite the incorrect steps
in the response, justifying why the steps were wrong.
REMEMBER!
The goal of this project is to get the Model to fail (make a reasoning or
calculation error).
The more complex your prompts are, the better.
Helpful Links
• Calculator.Net - Offers many useful calculators, including quadratic formula, LCM, GCF,
prime factorization, permutations, combinations, triangles, volume, hex, and much more.
• GeoGebra - Similar to Desmos but with more functionality for geometry. We recommend
trying it outside of tasking hours to get familiar with the tool, as there is a learning curve.
• WolframAlpha - A computational engine that answers questions and solves problems
across various subjects, including math, science, and finance.
• NOTE: If you do not feel comfortable with the skills that you’ve been assigned, please
skip the task.
• Rate each reasoning step within the model’s response according to various
dimensions
3. Rewrite the Incorrect Steps and Justify the Rewrites (See Task Specifications)
• Indicate whether the response was accurate and followed the instructions given in the
prompt
Please take a look at the COMMON ERRORS that we frequently see in this project.
Task Specifications
IMPORTANT:
If you are uncomfortable with the skills provided
PLEASE SKIP THE TASK.
• STEP 1, write a prompt that has exactly one clear objective answer.
• STEP 2, provide 4 possible answers, in a multiple choice format, so that your answer
is EXACTLY ONE of the choices.
o Multiple-choice answers must be of the form “A. ”, such as
A. Choice 1
B. Choice 2
C. Choice 3
D. Choice 4
• DO NOT use variations like A) or (A) or a:
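The format rule above can be checked mechanically before submitting. The following is a minimal sketch, not part of the project tooling: the function name `check_options` and the exact pattern (capital letter A-D, a period, one space, then the choice text) are assumptions made for illustration.

```python
import re

# Assumed pattern: a capital letter A-D, a period, one space, then the choice text.
OPTION_PATTERN = re.compile(r"^[A-D]\. \S.*$")

def check_options(lines):
    """Return a list of problems found in the option lines (empty if none)."""
    problems = []
    if len(lines) != 4:
        problems.append(f"expected 4 options, found {len(lines)}")
    for expected_letter, line in zip("ABCD", lines):
        if not OPTION_PATTERN.match(line):
            problems.append(f"bad format: {line!r}")
        elif line[0] != expected_letter:
            problems.append(f"expected option {expected_letter}, got {line!r}")
    return problems

good = ["A. Choice 1", "B. Choice 2", "C. Choice 3", "D. Choice 4"]
bad = ["A) Choice 1", "(B) Choice 2", "c: Choice 3", "D. Choice 4"]
print(check_options(good))  # no problems reported
print(check_options(bad))   # the first three lines are flagged
```

A check like this only covers formatting; it cannot tell whether exactly one of the four choices is the correct answer.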
Note: Doing STEP 1 and then STEP 2 will help you avoid the following Common
Errors
★ Common Errors ★
★ Remember ★
• Prompts must have a single objective answer that can be reached without
the MC options!
• Prompts with multiple answers are NOT VALID!
• Prompts that request a numerical value usually avoid these issues!
A: $e^{404}$
B: $4e^{404}$
C: $e^{-404}$
D: $4e^{-404}$
Please make sure you copy the entire question (including the context)
into the “Write the question from the multiple-choice prompt” field. The
only thing you will not copy is the multiple-choice options.
EXAMPLE:
Prompt:
Consider a box containing 1 mole of hydrogen atoms (which can be assumed to act as an ideal
gas) at a pressure of 2 atmospheres and a temperature of $20^\circ \text{ C}$. What is the ratio of
the number of atoms in the first excited state to the number of atoms in the ground state?
A. $e^{404}$
B. $4e^{404}$
C. $e^{-404}$
D. $4e^{-404}$
This is what you would write in the text box:
Consider a box containing 1 mole of hydrogen atoms (which can be assumed to act as an ideal
gas) at a pressure of 2 atmospheres and a temperature of $20^\circ \text{ C}$. What is the ratio of
the number of atoms in the first excited state to the number of atoms in the ground state?
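As an illustration of the copy rule in the example above, the sketch below removes the option lines from a prompt and keeps everything else. The helper name `question_only` is hypothetical, and the sketch assumes every option sits on its own line in the “A. ” format; a question line that itself starts with something like “A. ” would be dropped by mistake, so the result should still be read over by hand.

```python
import re

# Assumption: each option is on its own line and starts with "A. " through "D. ".
OPTION_LINE = re.compile(r"^[A-D]\. ")

def question_only(prompt):
    """Drop the answer-option lines and return the rest of the prompt unchanged."""
    kept = [line for line in prompt.splitlines() if not OPTION_LINE.match(line)]
    return "\n".join(kept).strip()

prompt = """What is the ratio of the number of atoms in the first excited state to the number of atoms in the ground state?
A. $e^{404}$
B. $4e^{404}$
C. $e^{-404}$
D. $4e^{-404}$"""

print(question_only(prompt))
```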
To assign a difficulty level to your prompt, answer the following questions to the best of
your estimation:
Estimate (in minutes) the time it takes a human to solve the problem once the correct
approach is identified. (Round up to the nearest multiple of 5): ______________________
Tag a difficulty level based on complexity and the likelihood of model success:
• Easy
o A high school student can solve this without difficulty.
• Medium
o An undergraduate student can solve this without help.
• Hard
o An undergraduate student can solve this but might need to check their notes
or look up a hint.
Step 2: Annotate the Initial Response
The AI Model’s response will be separated into steps for easy analysis.
• Incorrect Calculation - when the step includes a calculation mistake. For example:
If Mary eats half the pizza and Jose eats one third of the pizza then
together they ate ½+⅓ = ⅖ of the pizza.
• Incorrect Reasoning - when the step includes a logical reasoning mistake. For
example:
In the race we know Bob comes in either 1st or 2nd and John comes
in 2nd or 3rd and Abby comes in 3rd then we conclude that Bob
comes in 2nd.
As ice cream sales and drowning rates both increase in the summer,
we conclude that ice cream causes drowning.
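When you are unsure whether a step is actually wrong, it can help to check it independently. The sketch below re-checks the pizza and race examples above, using exact fractions for the arithmetic and a brute-force search over finishing orders for the logic. It is only an illustration of one way to verify a step, not a required procedure.

```python
from fractions import Fraction
from itertools import permutations

# Calculation check: the pizza shares add to 5/6, not 2/5.
eaten = Fraction(1, 2) + Fraction(1, 3)
print(eaten)  # 5/6

# Reasoning check: try every finishing order and keep those that satisfy the clues.
valid_orders = []
for bob, john, abby in permutations([1, 2, 3]):
    if bob in (1, 2) and john in (2, 3) and abby == 3:
        valid_orders.append({"Bob": bob, "John": john, "Abby": abby})
print(valid_orders)  # [{'Bob': 1, 'John': 2, 'Abby': 3}]
```

The only order consistent with the clues has Bob in 1st place, so the conclusion in the race example (Bob in 2nd) is a reasoning error.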
There may be times when a step contains both types of errors: reasoning and
calculation.
In many cases, a reasoning error is more significant, as it takes the model down
the wrong line of reasoning:
REMEMBER:
Your prompt should be complex enough
to cause the model to fail (make a reasoning or calculation error)
in at least one (1) reasoning or calculation step.
If this did not happen, you must return to your prompt and make it more complex.
Go back to the prompt writing box and click on the “retry chat from here” button.
In this part of the task, rewrite any steps that were rated as Incorrect. Each rewrite should:
• Either modify the original step or replace it with a completely new step, depending on the
degree of correctness
• Address any logical errors in the overall problem approach
• Address any errors in execution
• Maintain the same level of detail as other steps in the response
• Be written clearly, using simple language that can be understood by a high school
student
• Include only the information required to solve the problem
• Ensure that the step is self-contained and understandable on its own
• DO NOT introduce any new errors (this is a common mistake)
Next, write a brief but detailed justification for why your rewrite was necessary.
In this part of the task, provide a rationale for any steps that were rated as Incorrect.
• Is this step the final step, and does it contain the solution to the prompt?
o Yes
o No
▪ If yes, the model will stop generating subsequent steps
▪ If no, the model will generate subsequent steps to solve it once you
save the response.
III. PROVIDE FINAL ANSWER IN THE LAST STEP
In this part you should supply the correct answer to your prompt.
• The final step should state: “The best answer is” followed by
the letter of the correct option.
• Do not box the final answer or alter this formatting at all.
Note: There can be statements before this, but the end of the final step should state,
“The best answer is” followed by the letter answer, with no additional formatting.
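The closing format can be checked with a short pattern match. This is a minimal sketch under stated assumptions: the function name `final_answer_letter` is hypothetical, options are assumed to be lettered A-D, and a single trailing period after the letter is tolerated here even though the guidelines do not say whether one is allowed.

```python
import re

# Assumption: options are lettered A-D and a single trailing period is tolerated.
FINAL_ANSWER = re.compile(r"The best answer is ([A-D])\.?\s*$")

def final_answer_letter(final_step):
    """Return the answer letter if the step ends in the required phrase, else None."""
    match = FINAL_ANSWER.search(final_step)
    return match.group(1) if match else None

print(final_answer_letter("So the ratio is $4e^{-404}$. The best answer is D"))  # D
print(final_answer_letter("The best answer is $\\boxed{D}$"))                    # None
print(final_answer_letter("The answer is D."))                                   # None
```

A boxed answer or a reworded closing sentence returns `None`, which corresponds to the cases where the final step needs to be corrected.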
In this final step of the task, you will encounter the following:
Instruction Following
• Does the model follow instructions and understand the specifications of the prompt?
(Note: the model can be inaccurate and still follow instructions.)
o Yes
o No
NOTE: If you corrected the final step for not following the “The best answer
is” format, you must mark the Instruction Following box as “No”.
DEFINITIONS
• Accuracy
o Did the model make mistakes in either its calculation or its reasoning in the
first turn?
• Instruction Following
o Thinking of the model as a child, did the model attempt to solve your prompt
even if it made mistakes along the way?
o NOTE: If you corrected the final step for not following the “The best answer is”
format, you must mark the Instruction Following box as “No”.
Appendix
COMMON ERRORS
COMMON ERRORS in Mail Valley and how to avoid them!
Additionally, rewrites are occasionally placed in the justification box instead of in the
designated rewrite section, causing confusion.
Prompt Examples
(Do not copy. These are just to give you some inspiration.)
Skills Prompt
Plasmid: ->(Ori) -> (Selection Marker) -> (Promoter) -> (C site) ->
(B site) -> (A site) ->
Enzyme B:
G|GATC C,
C CTAG|G.
Enzyme C:
G ACGT|C,
C|TGCA G.
Enzyme D:
A|GATC T,
T CTAG|A.
A. $\pm56.3 \, \text{eV}/c$
C. $\pm24.5 \, \text{eV}/c$
A. 0%
B. 50%
C. 100%
D. 25%
Chemistry
A molecule absorbs light at 220 nm. What is the corresponding
energy of this transition?
A. 1.92 eV
B. 4.13 eV
C. 5.64 eV
D. 5.71 eV
Chemistry
What is the energy required to promote a helium atom confined in
a box of length 1 metre into the first excited state?
A. $2.577 \times 10^{-41}$ J
B. $2.177 \times 10^{-41}$ J
C. $2.477 \times 10^{-41}$ J
D. $2.677 \times 10^{-41}$ J
Chemistry
A 25.00 mL sample of 0.070 M acetic acid is titrated with 0.090 M
NaOH. After adding 18.30 mL of NaOH, what is the pH of the
solution? Assume the dissociation constant $K_a$ for acetic acid
is $1.8 \times 10^{-5}$.
A. 4.74
B. 4.95
C. 5.74
D. 5.95
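Before submitting a prompt, it is worth confirming that exactly one option matches an independent calculation. The sketch below does this for the three Chemistry examples above. The modeling choices are assumptions, not statements from the examples themselves: the photon energy uses $E = hc/\lambda$, the helium atom is treated as a particle in a one-dimensional box with $E_n = n^2 h^2 / (8 m L^2)$, and the titration uses the Henderson-Hasselbalch approximation. All function names are hypothetical.

```python
import math

# Physical constants in SI units (h, c, and e are exact by definition;
# the atomic mass constant is a measured value).
H = 6.62607015e-34          # Planck constant, J*s
C = 299_792_458.0           # speed of light, m/s
E_CHARGE = 1.602176634e-19  # joules per electronvolt
AMU = 1.66053907e-27        # atomic mass constant, kg

def photon_energy_ev(wavelength_m):
    """Photon energy in eV for a given wavelength in metres."""
    return H * C / wavelength_m / E_CHARGE

def box_first_excitation_j(mass_kg, length_m):
    """E_2 - E_1 for a 1D box: 3 h^2 / (8 m L^2), in joules."""
    return 3 * H**2 / (8 * mass_kg * length_m**2)

def buffer_ph(ka, acid_mmol, base_mmol):
    """Henderson-Hasselbalch pH; assumes base_mmol < acid_mmol (before equivalence)."""
    remaining_acid = acid_mmol - base_mmol
    return -math.log10(ka) + math.log10(base_mmol / remaining_acid)

print(round(photon_energy_ev(220e-9), 2))                          # eV
print(f"{box_first_excitation_j(4.002602 * AMU, 1.0):.3e}")        # J
print(round(buffer_ph(1.8e-5, 25.00 * 0.070, 18.30 * 0.090), 2))   # pH
```

Under these assumptions the three results land on 5.64 eV, 2.477e-41 J, and a pH of 5.95, each of which matches exactly one of the listed options.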
Grading Rubric
Each dimension is rated 1-2 (Fail), 3 (Okay), or 4-5 (Good/Perfect).
Prompt Rubric
How your prompt will be rated
Prompt Requirements
• 1-2 (Fail): Major requirements missing. The prompt does not lead to reasoning or
calculation errors or doesn't match the skills.
• 3 (Okay): Meets requirements but has minor issues. The prompt leads to one reasoning or
calculation error.
• 4-5 (Good/Perfect): All requirements met. The prompt matches the skills and leads to a
reasoning or calculation error.
Prompt Clarity & Specificity
• 1-2 (Fail): Major clarity issues; prompt is vague or hard to follow; key details missing.
• 3 (Okay): Mostly clear, but could be interpreted multiple ways or lacks a minor detail.
• 4-5 (Good/Perfect): Clear and specific; all necessary information is included.
Rating Rubric
How your ratings of the Initial Response will be evaluated
Correctness of Individual Steps
• 1-2 (Fail): Attempter incorrectly identifies at least one step, and justification is
inadequate.
• 3 (Okay): Some steps are correctly identified, but there is minor misjudgment.
• 4-5 (Good/Perfect): All steps are correctly identified.
Initial Response Labels
• 1-2 (Fail): Major issue with labeling the response (e.g., incorrect labels for steps).
• 3 (Okay): N/A
• 4-5 (Good/Perfect): All labels are correctly selected; the response is accurately labeled.
Justification Rubric
How reviewers will grade your written ratings justifications
Support Claims
• 1-2 (Fail): Claims contradict the verdict or are inaccurate; 2+ claims lack evidence.
• 3 (Okay): 1 claim lacks evidence, but the rest are accurate and well-supported.
• 4-5 (Good/Perfect): All claims logically defend the verdict, are accurate, and supported by
evidence.
Rewrite Rubric
How reviewers will grade your rewritten response steps
Accuracy
• 1-2 (Fail): 1 or more major factual errors or misleading points.
• 3 (Okay): 2 or more minor factual errors or misleading points.
• 4-5 (Good/Perfect): No major factual errors or misleading points.
Rewritten Steps Quality
• 1-2 (Fail): Clearly worse than the model response. Steps should be rewritten but aren't.
• 3 (Okay): About the same quality as the model response.
• 4-5 (Good/Perfect): Step clearly performs better than the model response; rewritten to the
correct degree.
Overall Task Quality
How reviewers will grade the Overall Quality of your task
Clarity & Structure
• 1-2 (Fail): Content is hard to follow or unclear.
• 3 (Okay): Minor clarity issues; generally makes sense.
• 4-5 (Good/Perfect): Clear and easy to follow.
Whether you reject, fix, or accept the task will be determined by the guidelines in the Reviewer
Checklist below.
Grading Rubric
All tasks on this project will be reviewed according to the following rubrics:
SFT Prompt
Prompt Requirements
• 1-2 (Fail): [Major Requirements Missing]
o Prompt does not lead to an error in reasoning in at least one step
o The prompt does not match the topic and/or reasoning type assigned
• 4-5 (Good/Perfect):
o Prompt leads to an error in reasoning in at least one step
o The prompt matches the topic and reasoning type assigned
Prompt Clarity and Specificity
• 1-2 (Fail): [Major Clarity / Specificity Issues]
o It's not clear what is being asked, the prompt is extremely difficult to follow, or is
overly vague
o Major details are missing that are needed to answer the prompt
• 3 (Okay): [Minor Clarity / Specificity Issues]
o It's mostly clear what is being asked but the request could reasonably be
interpreted multiple ways
• 4-5 (Good/Perfect):
o There is little to no room for misinterpretation of the specific request
o Prompt has a specific request that doesn't require more than one minor
assumption to answer it
Feasibility
• 1-2 (Fail): [Major Feasibility Issues]
o Prompt contains an impractical or impossible request that can't be answered by
an LLM in a single response
o Prompt gives conflicting/contradicting instructions that can't be fulfilled
simultaneously (unless specifically instructed to do so)
• 3 (Okay): [Minor Feasibility Issues]
o Prompt's request is verging on being impractical and the LLM won't be able to
completely fulfill everything asked in the prompt, but the prompt is still
answerable with concessions
• 4-5 (Good/Perfect):
o The prompt is completely actionable by an LLM or chatbot
o The prompt contains no conflicting instructions/statements
Accuracy
• 1-2 (Fail):
o [Major Factual Errors] The response contains 1 or more major factual errors or
misleading points
o [Minor Factual Errors] The response contains 2 or more minor factual errors or
misleading points
• 3 (Okay):
o The response contains 0 major factual errors.
o [Minor Factual Error] The response contains only one minor factual error or
misleading statement.
• 4-5 (Good/Perfect):
o The response has no factual errors
o The response has no misleading statements
• Additional Notes:
o A Major error (i.e. factuality, calculations, figures, or references) is an incorrect or
misleading statement which is central to the actual subject matter or response
fulfillment of the request
o A Minor error is an incorrect or misleading statement which is adjacent to the
actual subject matter or response fulfillment of the request
Rewritten Steps Clearly Worse Than Model Response
• 1-2 (Fail):
o [Worse than Original Model Response] The rewritten step would clearly perform
worse overall across the rubric dimensions
o The rewritten step is not rewritten to the extent that would be expected
depending on the degree of correctness (e.g., the step is modified when it should
be entirely rewritten)
• 3 (Okay):
o [Same Quality as Model Response] The updated step would likely perform about
the same overall across the rubric dimensions
• 4-5 (Good/Perfect):
o The updated step would clearly perform better overall across the rubric
dimensions
o The rewritten step is rewritten to be correct either as modification to the original
step or a completely new step depending on the degree of correctness
• Additional Notes:
o *Important Note* - Do not penalize an attempter/tasker for making minimal
changes to the original response.
o Note to operators: the model we are comparing against should always be state of
the art.
Correctness of Individual Steps
• 1-2 (Fail):
o [Major Issues] The contributor has incorrectly identified at least one correct step
or incorrect step, and the justification does not adequately support this variance
• 4-5 (Good/Perfect):
o The contributor has correctly identified all steps that are correct, and all steps
that are incorrect
• Additional Notes: Do NOT penalize the model for:
o Correct but badly formatted solutions (e.g., does not use LaTeX; uses a decimal
when a fraction would be better)
o Correct but inefficient solutions
o Correct solutions that contain written preambles or summaries
o Correct but irrelevant steps
Initial Response Labels
• 1-2 (Fail):
o [Major Issue] The contributor has incorrectly labeled the initial response (e.g.,
marked the response as correct when it was incorrect; marked as yes for
instruction following when an explicit instruction was missed)
• 3 (Okay): N/A
• 4-5 (Good/Perfect):
o The contributor has correctly labeled the initial response
Analysis
• 1-2 (Fail):
o [Generic Justification] The rationale isn't specific to the given step and the
responses have material differences
o [Skewed Assessment] Rationale overstates the impact of minor issues or
understates the significance of major errors leading to the wrong verdict
• 3 (Okay): N/A
• 4-5 (Good/Perfect):
o The rationale is not generic and is not skewed
Support Claims
• 1-2 (Fail):
o [Claims Contradict Verdict] 1 or more supporting claims does not logically defend
the verdict
o [Claims Inaccurate] 1 or more of the claims being made is inaccurate
o [Claims Lack Evidence (2+)] 2 or more claims do not have sufficient evidence
within the rationale
• 3 (Okay):
o All claims logically defend the verdict
o All claims being made are accurate
o [Claim Lacks Evidence (1)] 1 claim does not have sufficient evidence within the
rationale
• 4-5 (Good/Perfect):
o All supporting claims logically defend the verdict
o All claims being made are accurate
o All claims have sufficient evidence within the rationale
• Additional Notes:
o *NOTE* Some claims are self-evident and require little in way of stated evidence
to support
Evidence Accuracy
• 1-2 (Fail):
o [Evidence Inaccurate] 1 piece of evidence used in the rationale is inaccurate
o [Evidence Misconstrued (2)] 2 pieces of evidence used in the rationale are being
misconstrued
• 3 (Okay):
o All evidence is accurate
o [Evidence Misconstrued (1)] 1 piece of evidence in the rationale is being
misconstrued
• 4-5 (Good/Perfect):
o All evidence is accurate and not misconstrued