Green Wizards
Math Reasoning
Step-by-Step
Task Specifications
Table of Contents
Project Overview
Task Specifications
Step 1b: Keep Prompting the Model until it Produces an Incorrect Response
Step 4: Regenerate
Step 5: Summarize
APPENDIX
LaTeX Guidelines
Rubrics
Helpful Links
Project Overview
Welcome to Green Wizards Math Reasoning Step-by-Step! This project aims to enhance advanced
models by developing and evaluating complex user prompts. For each pair of skills, generate prompts
that integrate the skills in a sophisticated way. Prompts should be challenging enough that [latest
models] are unlikely to produce accurate responses. Model outputs will be annotated to identify and
correct errors.
Key Terminology:
• Prompt: A question or statement designed to elicit a response from the model, incorporating
two complex skills.
• Model Response: The model’s step-by-step answer to the prompt, used for evaluation.
• Annotation: The process of reviewing the model’s response for accuracy, relevance to the
prompt, and proper skill use, while noting any reasoning errors.
• Error Identification: Detecting reasoning flaws or gaps in the model’s application of skills.
• Error Correction: Fixing identified errors and providing clear explanations for the corrections.
• Select two Skills to incorporate into a prompt (you must use both)
• Write a prompt that can stump the model (causing it to make a reasoning error)
• Check the model’s response, determining whether each step is correct or incorrect
• Determine whether the step should be solved using the LLM (text-based reasoning) or Python
(code-based computations).
• If the response does not have any reasoning errors, repeat Step 1.
• Regenerate the response after each correction, restarting the model's train of thought
• Repeat Steps 2-4 until the model arrives at the correct answer
5. SUMMARIZE - Summarize the task using the checkboxes and short responses
• Accuracy - Does the original model response (BEFORE REWRITES) make an error?
• Instruction Following - Does the response AFTER REWRITES follow all prompt instructions?
• Skills Check - Does the response AFTER REWRITES utilize both of the skills selected in step 1?
• First Incorrect Step - What was the first incorrect step with a reasoning error made by the
model?
• Skills Gap - What Skill(s) does the model fail at or not include in the response?
Task Specifications
In this step, you will write a math problem for the model to solve. Problems should adhere to the
specifications below.
Specifications list
1. Math problems must lead to an error in reasoning in at least one of the steps in the model’s
initial response.
2. Math problems must use proper LaTeX formatting for all mathematical expressions.
1. Avoid problems that don't contain the information necessary to solve them
1. Dividing by zero
2. Asking for the square root of a negative number if the domain is real numbers
3. Avoid problems containing terms, concepts, theorems, or lemmas that do not adhere to
mathematical rules
1. A question like “find all the roots of this quadratic x^2 + 5x + 6” is not a valid prompt because it
would result in two answers, -2 and -3
2. A better question to use is “what is the minimum of the roots of this quadratic x^2 + 5x + 6”
because the ground truth final answer is -3 only.
1. Prompts can only ask questions that result in an answer that can be machine verified
1. Prompts must only ask for one question and one answer
Given that $\sin \theta = \frac{1}{3}$, find the value of $\frac{\cos^3 2\theta}{\cos^2 3\theta}$,
considering only the principal value of $\theta$. Check if the resulting expression is a perfect cube, and
identify any degenerate cases for the values of $\theta$.
1. Problems should be complex enough to stump the model but also not contrived, meaning that
the problems should reflect realistic scenarios/asks
Good Example | Given a circle with a radius of r, a square is inscribed inside the circle. Calculate the area of the region outside the square but inside the circle. | This is a good example because it integrates geometry concepts (circle and square) and requires the model to compute the areas of both shapes and then find the difference. It challenges the model to perform multiple steps and apply different geometric formulas, making it complex enough to potentially stump the model.
Bad Example | What is the area of a rectangle with a length of 5 units and a width of 3 units? | This problem is too simple for the task requirements. It only requires a basic application of the area formula for rectangles ($\text{Area} = \text{length} \times \text{width}$) and does not challenge the model's problem-solving abilities. It lacks complexity and is not suitable for evaluating the model’s capacity to handle more advanced tasks.
Bad Example | What is the sum of the first 10 positive integers? Also, find the area of a triangle with a base of 5 units and a height of 7 units. | This problem is unsuitable because it includes two separate math problems in one prompt, which goes against the requirement that a math problem should have only one correct solution. Both problems individually are also too simple.
10. Math problems should not contain any spelling, grammar, or formatting errors
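As a worked illustration of a single, machine-verifiable ground truth, the good example above (a square inscribed in a circle of radius $r$) can be solved as follows. This is a sketch of one derivation; the side length $s$ is a symbol introduced here for illustration and does not appear in the original problem.

```latex
\begin{align*}
\text{Diagonal of the square} &= 2r \\
s &= \frac{2r}{\sqrt{2}} = r\sqrt{2} \\
\text{Area of the square} &= s^2 = 2r^2 \\
\text{Area of the circle} &= \pi r^2 \\
\text{Area outside the square, inside the circle} &= \pi r^2 - 2r^2 = (\pi - 2)\,r^2
\end{align*}
```

The final expression is a single closed form in $r$, which is what makes the problem checkable against one ground-truth answer.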
Feel free to find inspiration in this [EXT] Math R2 SFT - Math Problem Bank
Next, you will solve the math problem and verify the problem satisfies all the constraints above.
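One way to verify that a prompt has exactly one ground-truth answer is to compute it independently. The sketch below checks the quadratic example from the specifications ("what is the minimum of the roots of x^2 + 5x + 6"). The helper name `real_roots` is hypothetical and is not part of any project tooling.

```python
import math


def real_roots(a: float, b: float, c: float) -> list[float]:
    """Return the distinct real roots of a*x^2 + b*x + c = 0 in ascending order."""
    disc = b * b - 4 * a * c
    if disc < 0:
        return []  # no real roots
    sqrt_disc = math.sqrt(disc)
    # A set removes the duplicate when the discriminant is zero.
    return sorted({(-b - sqrt_disc) / (2 * a), (-b + sqrt_disc) / (2 * a)})


roots = real_roots(1, 5, 6)   # the quadratic x^2 + 5x + 6
min_root = min(roots)         # the single value the better prompt asks for

print("all roots:", roots)
print("minimum root:", min_root)
```

Asking for `roots` yields two values, while asking for `min_root` yields one, which is why the second phrasing satisfies the one-question, one-answer requirement.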
Step 1b: Keep Prompting the Model until it Produces an Incorrect Response
Next, you will submit the math problem for the model to solve. If the model answers correctly, or only
makes calculation errors, submit a different problem. Keep submitting problems until the model
produces an incorrect response.
IMPORTANT NOTE:
• Your task is to create a prompt that results in at least one reasoning error.
• If your prompt only produces a calculation error, revise it to introduce a reasoning error before
completing the task.
When you identify a mistake in the model's response, follow these steps:
1. Mark the step as "Incorrect."
• Not Incorrect: Issues like poor LaTeX formatting, valid but inefficient solutions, or suboptimal
expressions that are still mathematically correct should not be labeled as incorrect.
When asked to rate whether each step is done correctly or incorrectly, you will see 4 possible step labels:
Incorrect - LLM is appropriate, Correct - LLM is appropriate, Incorrect - Python is appropriate, Correct -
Python is appropriate. These labels are reviewed in more detail below.
The flow charts below show how to decide which of the 4 step labels to choose:
2. Should this step ideally be solved using the model’s LLM or Python capabilities?
1. This question relates to determining whether a step should be solved using the model's large
language model (LLM) or its Python capabilities (implicit code execution, or ICE).
2. LLMs are better suited for tasks involving reasoning, logic, or proofs. Python is ideal for more
complex calculations or numerical tasks where precision is crucial.
• Select LLM is ideal to solve the problem when the step is similar to:
Essentially when the math problem requires abstract concept, word, or pattern recognition
• Deducing or applying proofs, theorems, propositions, and corollaries where the model relies on
step-by-step reasoning, intuition, and context
• Abstract math, such as measure theory (e.g. prove a certain Lebesgue set is measurable) or
number theory (e.g., determining if an extension of a field has finitely many intermediate fields)
• Select Python is ideal to solve the problem when the step is similar to:
Essentially when the math problem involves a very precise input / output
• Applying number tests that don’t rely on large computation, e.g. calculating a limit or applying a
ratio test or root test
• Precise numerical computation or complex arithmetic, where there are large numbers or multi-
step calculations that might introduce errors when using an LLM (e.g., what is
938193 x 93189318)
• Trigonometry and evaluation of calculus concepts (e.g. find the quadratic roots of x^2 - 7x + 4;
or evaluate cos(x)^2 + sin(x^2) at x = 0)
• Matrix operations or calculations involving vectors (e.g. find a vector orthogonal to v1 = [3,0,2];
or multiply M_1 and M_2 where each is a matrix; or find the determinant of M_3)
• Applied probability and statistics, such as calculating standard deviations or regressions, (e.g.
what is the standard deviation of this dataset; or binomial distribution, such as what is the
probability of getting exactly 8 heads in 15 coin flips, assuming the coin is fair?)
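The Python-appropriate cases listed above share one trait: the result is an exact or high-precision number that is easy to get wrong by hand. A minimal sketch of several of them follows, using only the standard library. The dataset `sample` and the vector `candidate` are hypothetical values chosen for illustration; the other inputs are taken from the examples above.

```python
import math
import statistics

# Large-number arithmetic: Python integers have arbitrary precision.
product = 938193 * 93189318

# Binomial probability: exactly 8 heads in 15 flips of a fair coin.
p_8_heads = math.comb(15, 8) * 0.5 ** 15

# Quadratic roots of x^2 - 7x + 4 via the quadratic formula.
disc = 7 ** 2 - 4 * 1 * 4
quad_roots = ((7 - math.sqrt(disc)) / 2, (7 + math.sqrt(disc)) / 2)

# Orthogonality check: a candidate vector is orthogonal to v1 if the dot product is 0.
v1 = [3, 0, 2]
candidate = [-2, 0, 3]  # hypothetical candidate, not from the guidelines
dot = sum(a * b for a, b in zip(v1, candidate))

# Population standard deviation of a small dataset.
sample = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical dataset for illustration
pop_std = statistics.pstdev(sample)

print(product, p_8_heads, quad_roots, dot, pop_std)
```

Steps like these would typically be labeled "Python is appropriate", whereas the argument for why a formula applies would remain an LLM step.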
In this part of the task, you will provide a rationale for any steps that were rated as Incorrect.
Specifications List
3. Rationales do not require LaTeX. If you do use LaTeX, the LaTeX should follow the style guide.
Rewriting Guidelines
• Maintain Sequential Integrity: Ensure that your rewrite preserves the logical order of the original
reasoning and includes the same level of detail.
• Clarity in Rewriting: Rewrite the step clearly and concisely, using simple language that avoids
jargon. The rewrite should be self-contained and understandable, following a logical sequence of
reasoning.
• Focus on Problem-Solving: Only include information necessary to solve the problem. Do not add
extraneous details like definitions of basic concepts.
Types of Errors
For this project, each response must include at least one reasoning error. If the only error is a calculation
mistake, modify the prompt to create a reasoning error.
Examples of Errors
Example 1: Rectangle Area (Reasoning Error)
• Problem: A rectangle’s length is doubled, and its width is increased by 3 meters. What is the new
area?
• Problem: How much will it cost to rent a car for 3 days and drive 150 miles?
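To make the difference between a reasoning error and a calculation error concrete, the sketch below revisits the rectangle problem from Example 1. The original dimensions (5 m by 4 m) and the specific flawed approach are assumptions made for illustration; the guidelines do not state either.

```python
def new_area_correct(length: float, width: float) -> float:
    """Double the length, add 3 m to the width, then multiply."""
    return (2 * length) * (width + 3)


def new_area_flawed_reasoning(length: float, width: float) -> float:
    """Assumed reasoning error: doubles the old area and adds 3,
    instead of applying each change to its own dimension first."""
    return 2 * (length * width) + 3


# Hypothetical original dimensions in meters (not given in the guidelines).
length, width = 5.0, 4.0

correct = new_area_correct(length, width)
flawed = new_area_flawed_reasoning(length, width)

print("correct new area:", correct)
print("area from flawed reasoning:", flawed)
```

The flawed function sets up the wrong expression, which is a reasoning error. A calculation error would instead use the correct expression, (2 × 5) × (4 + 3), and then evaluate it incorrectly.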
Step 4: Regenerate
• Regenerate the response after each correction, restarting the model's train of thought
• Repeat Steps 2-4 until the model arrives at the correct answer
Step 5: Summarize
In this final step, you should summarize the task using the checkboxes and short responses, according to
the following dimensions:
• Accuracy
• Instruction Following
• Skills Check
• Does the response AFTER REWRITES utilize both of the skills selected in step 1?
• First Incorrect Step
• What was the first incorrect step with a reasoning error made by the model?
• Skills Gap
• What Skill(s) does the model fail at or not include in the response?
APPENDIX
LaTeX Guidelines
• Only ask for help writing the LaTeX, as opposed to asking for help solving the problem.
• Remember that the impetus for this project is that LLMs are often very wrong at math. We don't
want an incorrect solution from an LLM biasing your own solution!
To help catch mistakes and ensure that your responses are accurate in mathematical calculations, we
highly recommend using a dedicated math-solving or calculation verification tool.
• WolframAlpha
• Desmos
• Symbolab
• GeoGebra
By using one of these tools, you can reduce errors and enhance the precision of mathematical answers.
Quillbot:
• Google Chrome
• Microsoft Edge
Grammarly:
• Google Chrome
LanguageTool:
• Safari
• Firefox
• Google Chrome
When to Use LLM | What It Involves | Example
Word problems | Understanding and interpreting natural language scenarios | "If a car travels 100 miles in 2 hours, how fast is it traveling?" The model must understand the wording and context to form a correct solution.
Proofs, theorems, propositions, and corollaries | Logical reasoning required to prove theorems or deductions | Proving that the square root of 2 is irrational through logical steps rather than computational output.

When to Use Python | What It Involves | Example
Algebra or solving for variables | Solving algebraic equations computationally | Solving for x in an equation like x^2 + 4x + 1 = 0; isolating y in the equation xy = y + 1
Function evaluation | Define and evaluate a given function | Compute f(x) = x^2*sin(x) for a range of values.
Numerical computation | Approximations and other intensive numerical computations | Solve e^x = x^4 using Newton’s Method; numerical integration using Simpson’s Method; calculating determinants; hypothesis testing
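Two of the numerical computation examples named above, Newton's Method for e^x = x^4 and Simpson's Method for numerical integration, can be sketched in a few lines. This is an illustrative sketch only: the starting guess `x0=1.5`, the tolerance, the number of subintervals, and the test integrand sin(x) on [0, π] are assumptions chosen for the demonstration.

```python
import math


def newton(f, df, x0: float, tol: float = 1e-12, max_iter: int = 50) -> float:
    """Find a root of f near x0 using Newton's iteration x <- x - f(x)/f'(x)."""
    x = x0
    for _ in range(max_iter):
        step = f(x) / df(x)
        x -= step
        if abs(step) < tol:
            return x
    raise RuntimeError("Newton's method did not converge")


def simpson(f, a: float, b: float, n: int = 100) -> float:
    """Approximate the integral of f on [a, b] with composite Simpson's rule."""
    if n % 2:
        raise ValueError("n must be even")
    h = (b - a) / n
    total = f(a) + f(b)
    for i in range(1, n):
        # Odd interior points get weight 4, even interior points get weight 2.
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3


# Solve e^x = x^4 by finding a root of g(x) = e^x - x^4.
root = newton(lambda x: math.exp(x) - x ** 4,
              lambda x: math.exp(x) - 4 * x ** 3,
              x0=1.5)

# Integrate sin(x) from 0 to pi; the exact value is 2.
area = simpson(math.sin, 0.0, math.pi)

print("root near 1.5:", root)
print("integral of sin on [0, pi]:", area)
```

Note that e^x = x^4 has more than one real solution, so the root returned depends on the starting guess. A prompt built on this equation would need to specify which solution is wanted in order to have a single ground-truth answer.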
Rubrics
Instruction Following / Response Fulfillment | Misses one or more explicit instructions or does not fully address the prompt. | Follows most instructions but may subjectively miss some aspects of fully answering the question. | All explicit instructions are followed, and the response fully addresses the prompt.
Unnecessary Greetings / Pleasantries | Includes unnecessary greetings like "I'd love to help you" or “Anything else I can assist with?” | N/A | No unnecessary greetings or pleasantries at the beginning or end.
Depth / Nuance | Overly simplistic, lacking meaningful detail or depth. | Provides useful information but may need more detail, or includes excessive, distracting detail. | Well-balanced, insightful, with appropriate depth and nuance.
Helpful Links
• Calculator.Net - Includes a number of useful calculators including quadratic formula, LCM, GCF,
prime factorization, permutations, combinations, triangles, volume, hex, and much more.
• GeoGebra - Similar to Desmos but with more functionality for geometry. We recommend trying it
outside of tasking hours to get familiar with the tool, as there is a learning curve.