
Green Wizards

green

Uploaded by

juliemaey727
Copyright
© All Rights Reserved
Green Wizards

Math Reasoning

Step-by-Step

Task Specifications

Table of Contents

Project Overview

Task Attempt Workflow

Task Specifications

Step 1: Write a Prompt (math problem)

Step 1a: Solve the Math Problem

Step 1b: Keep Prompting the Model until it Produces an Incorrect Response

Step 2: Rate the Correctness of Each Step

Step 2a: Write a rationale for incorrect steps

Step 3: Rewrite Incorrect Steps

Step 4: Regenerate

Step 5: Summarize

APPENDIX

LaTeX Guidelines

Rubrics

Helpful Links

Project Overview

Welcome to Green Wizards Math Reasoning Step-by-Step! This project aims to enhance advanced
models by developing and evaluating complex user prompts. For each pair of skills, generate prompts
that integrate the skills in a sophisticated way. Prompts should be challenging enough that [latest
models] are unlikely to produce accurate responses. Model outputs will be annotated to identify and
correct errors.

Key Terminology:

• Prompt: A question or statement designed to elicit a response from the model, incorporating
two complex skills.

• Model Response: The model’s step-by-step answer to the prompt, used for evaluation.
• Annotation: The process of reviewing the model’s response for accuracy, relevance to the
prompt, and proper skill use, while noting any reasoning errors.

• Error Identification: Detecting reasoning flaws or gaps in the model’s application of skills.

• Error Correction: Fixing identified errors and providing clear explanations for the corrections.

• Teaching Material: If necessary, developing educational resources to explain and address
reasoning errors.

Task Attempt Workflow

1. PROMPT - Write and Solve an Elegant Math Problem

• Select two Skills to incorporate into a prompt (you must use both)

• Write a prompt that can stump the model (causing it to make a reasoning error)

• Make sure your problem can be solved

2. RESPONSE - Rate the Correctness of Each Step

• Check the model’s response, determining whether each step is correct or incorrect

• Determine whether the step should be solved using the LLM (text-based reasoning) or Python
(code-based computations).

• If the response does not have any reasoning errors, repeat Step 1.

• Write a brief justification explaining the error

3. REWRITING - Rewrite incorrect steps

• Correct the model's reasoning by rewriting any incorrect steps

• Write a brief justification explaining the changes

4. REGENERATE - Repeat the process until the Final Answer is reached

• Regenerate the response after each correction, restarting the model's train of thought

• Repeat Steps 2-4 until the model arrives at the correct answer

5. SUMMARIZE - Summarize the task using the checkboxes and short responses

• Accuracy - Does the original model response (BEFORE REWRITES) make an error?

• Instruction Following - Does the response AFTER REWRITES follow all prompt instructions?

• Skills Check - Does the response AFTER REWRITES utilize both of the skills selected in step 1?

• First Incorrect Step - What was the first incorrect step with a reasoning error made by the
model?

• Skills Gap - What Skill(s) does the model fail at or not include in the response?
Task Specifications

Step 1: Write a Prompt (math problem)

In this step, you will write a math problem for the model to solve. Problems should adhere to the
specifications below.

Specifications list

1. Math problems must lead to an error in reasoning in at least one of the steps in the model’s
initial response.

2. Math problems must use proper LaTeX formatting for all mathematical expressions.

3. Prompts must be original.

4. Math problems must be solvable

1. Avoid problems that don't contain the information necessary to solve them

2. Avoid problems with impossible scenarios, for example:

1. Dividing by zero

2. Asking for the square root of a negative number if the domain is real numbers

3. Avoid problems containing terms, concepts, theorems, or lemmas that do not adhere to
mathematical rules

5. Math problems must have only one correct solution

1. Avoid problems that have many potential answers

2. Answers must be either a number or a mathematical expression

3. Answers should be readable by a tool such as Python or WolframAlpha

6. There must only be one answer to the prompt

1. A question like “find all the roots of this quadratic x^2 + 5x + 6” is not a valid prompt because it
would result in two answers, -2 and -3

2. A better question to use is “what is the minimum of the roots of this quadratic x^2 + 5x + 6”
because the ground truth final answer is -3 only.

7. Prompts must not ask for proofs

1. Prompts can only ask questions that result in an answer that can be machine verified

2. This is an example of a prompt that would fail:


In a regular hexagon, three diagonals are drawn, intersecting at the center of the hexagon. The diagonals
divide the hexagon into $6$ congruent triangles, $3$ of which are colored red and the remaining $3$ are
colored blue. Prove that there exists a coloring of the triangles such that it is possible to draw a line that
passes through the center of the hexagon and divides it into two parts that are each one color.

8. Prompts must not ask multiple questions in the problem statement

1. Prompts must only ask for one question and one answer

2. This is an example of a prompt that would fail:

Given that $\sin \theta = \frac{1}{3}$, find the value of $\frac{\cos^3 2\theta}{\cos^2 3\theta}$,
considering only the principal value of $\theta$. Check if the resulting expression is a perfect cube, and
identify any degenerate cases for the values of $\theta$.

9. Math problems must be sufficiently complex, but not contrived

1. Problems should be complex enough to stump the model but also not contrived, meaning that
the problems should reflect realistic scenarios/asks

Category: Good Example
Example: Given a circle with a radius of r, a square is inscribed inside the circle. Calculate the area of the
region outside the square but inside the circle.
Explanation: This is a good example because it integrates geometry concepts (circle and square) and
requires the model to compute areas of both shapes and then find the difference. It challenges the model
to perform multiple steps and apply different geometric formulas, making it complex enough to
potentially stump the model.

Category: Bad Example
Example: What is the area of a rectangle with a length of 5 units and a width of 3 units?
Explanation: This problem is too simple for the task requirements. It only requires a basic application of
the area formula for rectangles ($\text{Area} = \text{length} \times \text{width}$) and does not
challenge the model's problem-solving abilities. It lacks complexity and is not suitable for evaluating the
model’s capacity to handle more advanced tasks.

Category: Bad Example
Example: What is the sum of the first 10 positive integers? Also, find the area of a triangle with a base of
5 units and a height of 7 units.
Explanation: This problem is unsuitable because it includes two separate math problems in one prompt,
which goes against the requirement that a math problem should have only one correct solution. Both
problems individually are also too simple.

10. Math problems should not contain any spelling, grammar, or formatting errors
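To illustrate specification 6, the quadratic from that item can be checked by machine. The following is a minimal sketch using only the Python standard library; the helper name `real_roots` is hypothetical and not part of any project tooling. It shows why "find all the roots" yields two values while "the minimum of the roots" yields a single verifiable answer.

```python
import math

def real_roots(a, b, c):
    """Return the distinct real roots of a*x^2 + b*x + c = 0 in ascending order."""
    if a == 0:
        raise ValueError("not a quadratic: a must be nonzero")
    disc = b * b - 4 * a * c
    if disc < 0:
        return []  # no real roots
    sqrt_disc = math.sqrt(disc)
    # A set removes the duplicate when the discriminant is zero.
    return sorted({(-b - sqrt_disc) / (2 * a), (-b + sqrt_disc) / (2 * a)})

roots = real_roots(1, 5, 6)   # x^2 + 5x + 6 has two roots: not a valid single answer
smallest_root = min(roots)    # asking for the minimum gives exactly one answer
print(roots, smallest_root)
```

A prompt whose ground truth is a single value like `smallest_root` is the kind of answer that can be compared directly against a model's final answer.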

Feel free to find inspiration in this [EXT] Math R2 SFT - Math Problem Bank

Step 1a: Solve the Math Problem

Next, you will solve the math problem and verify the problem satisfies all the constraints above.
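As an illustration of what a solved problem might look like, here is a short worked solution to the Good Example from Step 1 (a square inscribed in a circle of radius $r$), written in Single $ LaTeX. It relies on one standard fact: the diagonal of the inscribed square equals the diameter of the circle, $2r$.

```latex
Side of the inscribed square: $s = \frac{2r}{\sqrt{2}} = r\sqrt{2}$
Area of the square: $s^2 = 2r^2$
Area of the region outside the square but inside the circle: $\pi r^2 - 2r^2 = (\pi - 2)r^2$
```

The final answer is a single mathematical expression, which is consistent with the specifications above.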

Step 1b: Keep Prompting the Model until it Produces an Incorrect Response

Next, you will submit the math problem for the model to solve. If the model answers correctly, or only
makes calculation errors, submit a different problem. Keep submitting problems until the model
produces an incorrect response.

IMPORTANT NOTE:

• Your task is to create a prompt that results in at least one reasoning error.

• If your prompt only produces a calculation error, revise it to introduce a reasoning error before
completing the task.

Step 2: Rate the Correctness of Each Step

When you identify a mistake in the model's response, follow these steps:
1. Mark the step as "Incorrect."

2. Explain the mistake and why it is wrong.

3. Rewrite the model’s reasoning with a corrected solution.

4. Select either Reasoning Error or Calculation Error.

Criteria for Incorrect Labeling:

• Incorrect: A clear mathematical error is present.

• Not Incorrect: Issues like poor LaTeX formatting, valid but inefficient solutions, or suboptimal
expressions that are still mathematically correct should not be labeled as incorrect.

When asked to rate whether each step is correct or incorrect, you will see 4 possible step labels:
Incorrect - LLM is appropriate, Correct - LLM is appropriate, Incorrect - Python is appropriate, and
Correct - Python is appropriate. These labels are reviewed in more detail below.

[Figure: two flow charts showing how to choose among the 4 step labels]

Each step label is designed to address two questions:

1. Is the math in the step correct?


1. This is straightforward: verify if the mathematical operations performed are correct and the
model’s output is accurate.

2. Should this step ideally be solved using the model’s LLM or Python capabilities?

1. This question relates to determining whether a step should be solved using the model's large
language model (LLM) or its Python capabilities (implicit code execution, or ICE).

2. LLMs are better suited for tasks involving reasoning, logic, or proofs. Python is ideal for more
complex calculations or numerical tasks where precision is crucial.
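The two questions above combine into the four step labels. As a small sketch (the function name `step_label` and its arguments are hypothetical, used only to show the mapping), the choice could be expressed as:

```python
def step_label(math_is_correct, best_tool):
    """Combine the two questions into one of the four step labels.

    math_is_correct: answer to "Is the math in the step correct?"
    best_tool: "LLM" or "Python", the answer to the second question.
    """
    if best_tool not in ("LLM", "Python"):
        raise ValueError("best_tool must be 'LLM' or 'Python'")
    verdict = "Correct" if math_is_correct else "Incorrect"
    return f"{verdict} - {best_tool} is appropriate"

# Example: a wrong multiplication of two large numbers.
print(step_label(False, "Python"))
```

The two answers are independent: a step can be mathematically correct and still be one that would ideally have been solved with Python.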

When to Use LLM or Python

• Select LLM as the appropriate tool when the step is similar to:

Essentially when the math problem requires abstract concept, word, or pattern recognition

• Word problems that require an understanding of natural language

• Pattern recognition, like simple sequences

• Deducing or applying proofs, theorems, propositions, and corollaries where the model relies on
step-by-step reasoning, intuition, and context

• Abstract math, such as measure theory (e.g. prove a certain Lebesgue set is measurable) or
number theory (e.g., determining if an extension of a field has finitely many intermediate fields)

• Theoretical probability, (e.g. What is P(A)P(B) for independent A and B)

• Select Python as the appropriate tool when the step is similar to:

Essentially when the math problem involves a very precise input / output

• Simple variable-based calculus (e.g. differentiate x^2 + 2x + 7)

• Applying number tests that don’t rely on large computation, e.g. calculating a limit or applying a
ratio test or root test

• Manipulating or isolating quantifiers or variables, where a solution can be calculated (e.g.
isolate y in 2x + 3 = 7y; or substitute y = 3z+3 when x = 10+7y)

• Basic arithmetic with small numbers (under 100, e.g. 25 + 37)

• Algebra or solving for variables (e.g. solve for x in 3x+7=3)

• Precise numerical computation or complex arithmetic, where there are large numbers or multi-
step calculations that might introduce errors when using an LLM (e.g., what is
938193 x 93189318)

• Trigonometry and evaluation of calculus concepts (e.g. find the quadratic roots of x^2 - 7x + 4;
or evaluate cos(x)^2 + sin(x^2) at x = 0)

• Matrix operations or calculations involving vectors (e.g. find a vector orthogonal to v1 = [3,0,2];
or multiply M_1 and M_2 where each is a matrix; or find the determinant of M_3)
• Applied probability and statistics, such as calculating standard deviations or regressions, (e.g.
what is the standard deviation of this dataset; or binomial distribution, such as what is the
probability of getting exactly 8 heads in 15 coin flips, assuming the coin is fair?)

• Distributions or calculations that require calculating areas under a curve, evaluating an
integral, or volume (e.g. what is the probability that a normally distributed variable with mean
100 and standard deviation 15 falls between 90 and 120?; or, “what is the volume of this solid
of revolution (3x+1)^0.5 at these boundaries …”)

• Numerical approximations or evaluating series (e.g. a Monte Carlo approximation, a binomial
distribution, applying Simpson’s rule or a Stirling approximation)
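The Python-appropriate steps in the list above are ones where exact computation matters more than reasoning. A minimal sketch, using only the Python standard library, of three of the examples named in the list (the variable names are illustrative):

```python
import math
from statistics import NormalDist

# Large-number multiplication: Python integers are exact at any size.
product = 938193 * 93189318

# Probability of exactly 8 heads in 15 fair coin flips: C(15, 8) / 2^15.
p_eight_heads = math.comb(15, 8) / 2 ** 15

# P(90 < X < 120) for X ~ Normal(mean=100, sd=15), computed from the CDF.
dist = NormalDist(mu=100, sigma=15)
p_between = dist.cdf(120) - dist.cdf(90)

print(product)
print(round(p_eight_heads, 6))
print(round(p_between, 4))
```

Each of these is a single deterministic computation, which is why a step like this would typically be labeled as one where Python is appropriate.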

Step 2a: Write a rationale for incorrect steps

In this part of the task, you will provide a rationale for any steps that were rated as Incorrect

Specifications List

1. Rationales should clearly explain why the step is incorrect

2. Rationales should use proper spelling and grammar

3. Rationales do not require LaTeX. If you do use LaTeX, the LaTeX should follow the style guide.

4. Rationales should not contain “model”, “AI”, “LLM”, etc.
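Specification 4 can be checked mechanically before submitting a rationale. The sketch below is only an illustration: the word list is an assumption based on the three terms named above (the guideline ends with "etc.", so the real list may be longer), and the function name `forbidden_terms` is hypothetical.

```python
import re

# Assumed word list: only the terms named in the guideline, plus simple plurals.
FORBIDDEN = re.compile(r"\b(models?|AI|LLMs?)\b", re.IGNORECASE)

def forbidden_terms(rationale):
    """Return every forbidden term found in the rationale, in order of appearance."""
    return [match.group(0) for match in FORBIDDEN.finditer(rationale)]

print(forbidden_terms("The model added the sides instead of multiplying them."))
print(forbidden_terms("The sides were added instead of multiplied."))
```

A whole-word match like this would also flag a legitimate mathematical use of "model", so any hit should be reviewed by hand rather than rejected automatically.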

Step 3: Rewrite Incorrect Steps

Rewriting Guidelines

• Maintain Sequential Integrity: Ensure that your rewrite preserves the logical order of the original
reasoning and includes the same level of detail.

• Clarity in Rewriting: Rewrite the step clearly and concisely, using simple language that avoids
jargon. The rewrite should be self-contained and understandable, following a logical sequence of
reasoning.

• Focus on Problem-Solving: Only include information necessary to solve the problem. Do not add
extraneous details like definitions of basic concepts.

Types of Errors

1. Calculation Errors: Simple arithmetic or operational mistakes (e.g., incorrect multiplication).

2. Reasoning Errors: Mistakes in understanding the process or logic behind a solution.

For this project, each response must include at least one reasoning error. If the only error is a calculation
mistake, modify the prompt to create a reasoning error.

Examples of Errors
Example 1: Rectangle Area (Reasoning Error)

• Problem: A rectangle’s length is doubled, and its width is increased by 3 meters. What is the new
area?

• Incorrect Answer: Adds the dimensions instead of multiplying.

• Error: Misunderstanding the formula for area.

Example 2: Car Rental (Calculation Error)

• Problem: How much will it cost to rent a car for 3 days and drive 150 miles?

• Incorrect Answer: Correct understanding, but miscalculates the total cost.

• Error: Simple arithmetic mistake.
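The difference between the two error types can be made concrete with numbers. The examples above do not give dimensions or rental rates, so every value below is a hypothetical assumption chosen only for illustration.

```python
# Example 1 (reasoning error). Hypothetical starting dimensions in meters.
length, width = 4, 5
new_length = 2 * length                    # the length is doubled
new_width = width + 3                      # the width is increased by 3 meters

correct_area = new_length * new_width      # area = length * width
reasoning_error = new_length + new_width   # wrong process: adds the dimensions

# Example 2 (calculation error). Hypothetical rates: $40 per day, $0.25 per mile.
daily_rate, per_mile = 40, 0.25
correct_cost = 3 * daily_rate + 150 * per_mile
calculation_error = 3 * daily_rate + 35.5  # right setup, but 150 * 0.25 was mis-multiplied

print(correct_area, reasoning_error)
print(correct_cost, calculation_error)
```

In the first case the formula itself is wrong, so the step needs a rewritten line of reasoning; in the second the setup is right and only the arithmetic needs fixing.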

Step 4: Regenerate

Repeat the process until Final Answer is reached

• Regenerate the response after each correction, restarting the model's train of thought

• Repeat Steps 2-4 until the model arrives at the correct answer
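The regenerate loop can be sketched in code. Everything here is hypothetical: `generate`, `find_first_error`, and `rewrite_step` stand in for the model, the annotator's rating of each step, and the annotator's rewrite, and the toy model at the bottom exists only to make the sketch runnable.

```python
def correct_until_final(generate, find_first_error, rewrite_step, max_rounds=20):
    """Repeat rate -> rewrite -> regenerate until no incorrect step remains.

    generate(prefix) returns the steps that continue from the given prefix.
    find_first_error(steps) returns the index of the first incorrect step, or None.
    rewrite_step(step) returns the corrected version of an incorrect step.
    """
    steps = generate([])
    for _ in range(max_rounds):
        bad = find_first_error(steps)
        if bad is None:
            return steps                       # correct final answer reached
        prefix = steps[:bad] + [rewrite_step(steps[bad])]
        steps = prefix + generate(prefix)      # restart from the corrected step
    raise RuntimeError("no correct answer within max_rounds")

# Toy stand-ins: the "model" gets step b wrong unless it is already in the prefix.
TARGET = ["a", "b", "c"]

def toy_generate(prefix):
    rest = TARGET[len(prefix):]
    return ["b_wrong" if s == "b" else s for s in rest]

def toy_find_first_error(steps):
    return next((i for i, s in enumerate(steps) if s != TARGET[i]), None)

result = correct_until_final(toy_generate, toy_find_first_error, lambda s: "b")
print(result)
```

Only the first incorrect step is rewritten in each round, because regenerating from the corrected step may change everything that follows it.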

Step 5: Summarize

In this final step, you should summarize the task using the checkboxes and short responses, according to
the following dimensions:

• Accuracy

• Does the original model response (BEFORE REWRITES) make an error?

• Instruction Following

• Does the response AFTER REWRITES follow all prompt instructions?

• Skills Check

• Does the response AFTER REWRITES utilize both of the skills selected in step 1?

• First Incorrect Step

• What was the first incorrect step with a reasoning error made by the model?

• Skills Gap

• What Skill(s) does the model fail at or not include in the response?
APPENDIX

LaTeX Guidelines

Please refer to this guide for the LaTeX formatting guidelines.

All prompts and rewrites should be written in proper Single $ LaTeX.

If you are unfamiliar with LaTeX:

• Refer to the style guides that are linked in the task.

• Feel free to use an LLM to help you. Make sure to:

• Ask it to write the expression in Single $ Latex

• Only ask for help writing the LaTeX, as opposed to asking for help solving the problem.

• Remember that the impetus for this project is that LLMs are often very wrong at math. We don't
want an incorrect solution from an LLM biasing your own solution!
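As a brief illustration of Single $ LaTeX, where each expression is wrapped in one dollar sign on each side, the quadratic prompt from Step 1 might be written as shown below. The linked style guide remains the authority on formatting details, so treat this as a sketch.

```latex
What is the minimum of the roots of the quadratic $x^2 + 5x + 6$?
```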

To help catch mistakes and ensure that your responses are accurate in mathematical calculations, we
highly recommend using a dedicated math-solving or calculation verification tool.

• WolframAlpha

• Great for general math problems, calculus, and symbolic computation.

• Desmos

• Perfect for graphing and exploring functions interactively.

• Symbolab

• Useful for step-by-step solutions in calculus, algebra, and more.

• GeoGebra

• A powerful tool for dynamic geometry, algebra, calculus, and statistics.

By using one of these tools, you can reduce errors and enhance the precision of mathematical answers.

You only need to install one of the following extensions:

Quillbot:

• Google Chrome

• Microsoft Edge

Grammarly:

• Google Chrome

LanguageTool:
• Safari

• Firefox

• Google Chrome

Python vs. LLM Guidance (labeling calculation and reasoning errors)

When to Use LLM

Each entry lists: When to Use LLM, What It Involves, Example.

• Abstract concepts, word, or pattern recognition
What It Involves: Recognizing complex ideas, abstract patterns, or word problems
Example: Identifying trends in a sequence like Fibonacci or interpreting a complex word problem.

• Word problems
What It Involves: Understanding and interpreting natural language scenarios
Example: "If a car travels 100 miles in 2 hours, how fast is it traveling?" The model must understand the
wording and context to form a correct solution.

• Pattern recognition
What It Involves: Recognizing or predicting logical patterns or sequences
Example: Identifying the next number in a sequence, like 2, 4, 8, 16 (powers of 2).

• Proofs, theorems, propositions, and corollaries
What It Involves: Logical reasoning required to prove theorems or deductions
Example: Proving that the square root of 2 is irrational through logical steps rather than computational
output.

• Abstract math
What It Involves: Handling complex, theoretical math not reducible to steps
Example: Proving that a set is measurable using Lebesgue measure or determining properties in
advanced number theory.

• Theoretical probability
What It Involves: Explaining theoretical reasoning in probability scenarios
Example: Explaining the theoretical probability of two independent events, such as "What is P(A and B) if
P(A) = 0.3 and P(B) = 0.5?"

When to Use Python

Each entry lists: When to Use Python, What It Involves, Example(s).

• Basic arithmetic
What It Involves: Performing basic arithmetic operations
Example: Adding, subtracting, multiplying, or dividing numbers under 100, such as 25 + 37.

• Algebra or solving for variables
What It Involves: Solving algebraic equations computationally
Examples: Solving for x in an equation like x^2 + 4x + 1 = 0. Isolating y in the equation xy = y+1.

• Function evaluation
What It Involves: Define and evaluate a given function
Examples: Compute f(x) = x^2*sin(x) for a range of values. Compute subsequent values in a sequence
such as Fibonacci.

• Manipulating or isolating variables
What It Involves: Manipulating equations computationally
Example: Solving for y in an equation like 2x+3=7yx+3, or substituting values into an equation like
y=3z+3w when z=10+7y, w = 10 + 7yx.

• Trigonometry
What It Involves: Solving trigonometric problems computationally
Examples: Computing arctan(1.5). Using the Law of Cosines to find the length of a side in a triangle.

• Computational calculus
What It Involves: Performing calculus operations like differentiation, integration, or limits
Examples: Differentiating functions like x^3 + 7x^2 + 2x. Calculating the area under a probability density
function to determine probabilities. Computing volumes using integrals. Finding Taylor series.

• Applying number tests
What It Involves: Running computational tests to check limits or convergence
Examples: Calculating the limit of a function. Series tests for convergence.

• Numerical computation
What It Involves: Approximations and other intensive numerical computations
Examples: Solve e^x = x^4 using Newton’s Method. Numerical integration using Simpson’s Method.
Monte Carlo simulations.

• Matrix operations or vector calculations
What It Involves: Solving matrix or vector-related problems
Examples: Determining if a set of vectors is linearly independent. Calculating determinants. Solving a
system of equations.

• Applied probability and statistics
What It Involves: Performing statistical or probability calculations
Examples: Determining the standard deviation of a dataset. Calculating the probability of exactly 8
heads in 15 coin flips. Hypothesis testing.
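One of the numerical-computation examples in the table, solving e^x = x^4 with Newton's Method, can be sketched as follows. The starting guess of 1.5, the tolerance, and the function names are assumptions made for illustration; the equation has more than one real solution, and Newton's Method only finds one near the starting guess.

```python
import math

def newton(f, df, x0, tol=1e-12, max_iter=50):
    """Newton's Method: repeat x <- x - f(x)/f'(x) until the update is tiny."""
    x = x0
    for _ in range(max_iter):
        step = f(x) / df(x)
        x -= step
        if abs(step) < tol:
            return x
    raise RuntimeError("Newton's Method did not converge")

def f(x):
    return math.exp(x) - x ** 4        # a root of f solves e^x = x^4

def df(x):
    return math.exp(x) - 4 * x ** 3    # derivative of f

root = newton(f, df, x0=1.5)
print(root, f(root))
```

Iterative approximations like this are hard to carry out reliably by hand or in free text, which is why such steps are typically better suited to Python than to the LLM.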

Reasoning Error Identification Guidelines

Each entry lists: Error category / Error type: Description.

Incorrect Reasoning

• Common Sense / Incorrect Common Sense: The model cites some common sense that is incorrect.

• Logical Reasoning / Gaps in logical reasoning steps: The model skips certain steps important to the
logical flow of the solution. In other words, the model response is missing a crucial step, which results in
a nontrivial logical leap.

• Logical Reasoning / Circular Reasoning: The model's conclusion is simply a restatement of the premise,
offering no new insight or logical progression.

• Logical Reasoning / Inconsistent Reasoning: Providing conflicting or contradictory information within
the same response.

• Analytical Thinking / Incorrect decomposition: Mistake in decomposing the problem into smaller pieces.

• Deductive Reasoning / Unsupported generalizations: Drawing broad conclusions from insufficient
reasoning or insufficient observations.

• Inductive Reasoning / Incorrect Assumption: The model makes an assumption or a general rule that is
incorrect, inappropriate, or unnecessary.

• Comparative Analysis / Faulty Comparison: The model makes comparisons to the wrong targets that do
not help with the core arguments.

• Causal Reasoning / False Causality: The model incorrectly assumes that one event caused another
simply because they occurred together. Correlation doesn't always equal causation.

• Causal Reasoning / Weak Causality: The model concludes an effect from insufficient, inconclusive, or
weak causes.

• Pattern Recognition / Incorrect Pattern: The model finds an incorrect pattern or regularity from some
given observations.

• Statistical Reasoning / Incorrect Conclusion from Statistics: The model uses the statistical tools
correctly, but draws a wrong and unsupported conclusion.

• Statistical Reasoning / Wrong Statistics: The model cites the wrong statistical tool for the current
problem.

• Temporal Reasoning / Faulty temporal reasoning: The model reasons incorrectly based on a
misunderstanding of time.

• Abstract Thinking / Incorrect Abstraction: Proposing a level of abstraction for solving the problem that
is not appropriate or accurate.

Incorrect Formula

• Misstated Formula: Correct name of the formula, theorem, lemma, etc., but wrong form.

• Incorrect usage of the formula: Correct name and form of the formula, theorem, lemma, etc., but wrong
place to use it.

• Wrong corresponding formula: Correct form of the formula, theorem, lemma, etc., but wrong name.

Incorrect Calculation

• Arithmetic Errors: Basic calculation mistakes, value substitution errors, and term simplification errors.

• Order of Operations errors: Incorrect application of PEMDAS/BODMAS, leading to incorrect results.

• Incorrect Rounding: Inaccurate rounding.

• Incorrect Unit Errors: Mishandling units of measurement.

• Incorrect Use of Signs: A sign is missing or the sign of the expression is flipped.

Rubrics

• Prompt Grading Rubric

• This is how your prompts will be graded by Reviewers:

Skill Integration
• 1-2 (Fail): Skills are mentioned superficially and do not play a meaningful role in the response, leading
to incomplete or trivial answers.
• 3 (Okay): Skills are included but not fully leveraged, leading to only moderate or surface-level
application.
• 4-5 (Good/Perfect): Skills are deeply integrated, and the response requires each skill to be applied in a
meaningful and non-trivial way.

Prompt Complexity
• 1-2 (Fail): The prompt is too simple, allowing the model to provide a correct or trivial response without
significant reasoning.
• 3 (Okay): The prompt has some complexity but may still be straightforward or easy to answer.
• 4-5 (Good/Perfect): The prompt is sufficiently complex, requiring significant reasoning and analysis,
preventing a simple or trivial solution.

Prompt Clarity & Specificity
• 1-2 (Fail): The prompt is unclear, overly vague, or missing essential details, making it difficult to follow
or answer accurately.
• 3 (Okay): The prompt is mostly clear but may allow for multiple reasonable interpretations.
• 4-5 (Good/Perfect): The prompt is clear, specific, and leaves little room for misinterpretation, requiring
no more than one minor assumption.

Feasibility
• 1-2 (Fail): The prompt is impractical or impossible to answer within the model's capabilities, or
contains contradictory instructions.
• 3 (Okay): The prompt is verging on impractical, but can still be answered with concessions or partial
fulfillment.
• 4-5 (Good/Perfect): The prompt is fully actionable within the model's capabilities, with no conflicting or
contradictory instructions.

• Response Grading Rubric

• This is how your response process will be graded by Reviewers:

Correction Quality
• 1-2 (Fail): The correction introduces further errors, is vague or irrelevant, and fails to improve the
response’s coherence or accuracy. The revised response performs worse across evaluation criteria.
• 3 (Okay): The correction addresses some mistakes but may leave subtler issues unaddressed. While
clearer, the response remains incomplete or imprecise and performs the same as the original.
• 4-5 (Good/Perfect): The correction is accurate and fully resolves errors, making the response coherent,
logical, and aligned with the prompt. The response improves overall in quality and performance.

Accuracy
• 1-2 (Fail): Contains one or more major factual errors, or multiple minor errors/misleading points.
• 3 (Okay): Contains up to two minor factual errors or misleading statements.
• 4-5 (Good/Perfect): Contains no major factual errors, with only one minor error or misleading
statement.

Instruction Following / Response Fulfillment
• 1-2 (Fail): Misses one or more explicit instructions or does not fully address the prompt.
• 3 (Okay): Follows most instructions but may subjectively miss some aspects of fully answering the
question.
• 4-5 (Good/Perfect): All explicit instructions are followed, and the response fully addresses the prompt.

Unnecessary Greetings / Pleasantries
• 1-2 (Fail): Includes unnecessary greetings like "I'd love to help you" or “Anything else I can assist
with?”
• 3 (Okay): N/A
• 4-5 (Good/Perfect): No unnecessary greetings or pleasantries at the beginning or end.

Depth / Nuance
• 1-2 (Fail): Overly simplistic, lacking meaningful detail or depth.
• 3 (Okay): Provides useful information but may need more detail, or includes excessive, distracting
detail.
• 4-5 (Good/Perfect): Well-balanced, insightful, with appropriate depth and nuance.

Helpful Links

• Calculator.Net - Includes a number of useful calculators including quadratic formula, LCM, GCF,
prime factorization, permutations, combinations, triangles, volume, hex, and much more.

• GeoGebra - Desmos but with more functionality for geometry. Would recommend testing
outside of tasking hours to get familiarity with the tool as there is a learning curve
