Mandolin Task ChatGPT Search

The document outlines a structured approach for evaluating AI-generated code responses based on a given prompt. It includes criteria for assessing accuracy, optimality, presentation, and up-to-date standards, while also providing a framework for scoring and justifying evaluations. Additionally, it emphasizes the importance of rewriting responses to meet stylistic and presentational standards.

Uploaded by

gucoding

Mandolin Parade

Write me a hard prompt related to the Code editing/rewriting: Text to Code edits
workflow for an AI model, as I have to make it fail in its response in the Python
language.

Write it without bullet points and phrase it as an ask, so that it begins with
"develop a....."

Write a draft prompt outline for the same prompt in 50 words

Can we expect out-of-the-box code (copy-paste and it will work fine) in the
response to this prompt?

OK, so I am assigned a task where I will have to evaluate the responses of an
AI based on a prompt. I will give you a prompt and 2 responses and ask
questions related to them one by one.

Response 1

<Response 1 paste here>

Response 2

<Response 2 paste here>

Question: How well did the two responses do? *
Reminder: if both responses are strong, please modify the prompt to make it
more challenging, and try again!

1. One is Good. One is Poor
2. Both Responses are Poor

Only answer the question and give the explanation in 1 sentence

How many times did you rewrite the prompt before you achieved at least one bad
response? Rate on a scale of 10

Select the steering constraints used in the system prompt and the prompt you
wrote
Label your response according to the questions below.

Steering constraints

Select a minimum of 1 choice; maximum of 11 choices


1. Content Style Instructions - What the content should sound like or how it
should generate outputs
2. Compositional Instruction Following - Understanding the relationship between
instructions
3. Instruction Compliance Policy - Correctly address all trade-offs between
safety, tone, up-to-date information
4. Multi-turn Instruction Following - Following instructions over the course of a
conversation
5. Instruction Source - Where the instructions are coming from
6. Implicit Instruction
7. Explicit Instruction
8. Format Content With Specific Format Types
9. Format Content in Specific Document Formats
10. Follow Formatting Style Guides
11. Follow Language-Specific Formatting

Only give the correct answers

So this is response 1

<Paste Response 1 here>

Now I will ask some questions related to it; answer them in simple points in
numbered list format

Tell me all the points where the response strictly adheres and does not adhere to
the constraints in the prompt, in simple points.

If the score is not a 5, write why you chose this rating (DO NOT simply write out
your calculation instead write why you think each instruction was followed or not
followed) in simple points

Accuracy

Accuracy is a measurement of whether the code has bugs, the output is as
expected, and if the code is executable.

In coding responses, accuracy involves considering the following aspects:

Correctness: The output matches the expected result, without errors or
discrepancies.
Precision: The output is precise, with no unnecessary or redundant information.
Relevance: The output is relevant to the input, requirements, or specifications.
Completeness: The output includes all required information, without omitting
essential details.
Case coverage: The code handles a wide range of possible input scenarios, edge
cases, and error conditions, ensuring that it behaves correctly and robustly.
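
To make the "case coverage" bucket concrete, here is a minimal Python sketch of a function that handles its edge cases explicitly. The name `safe_average` and its error policy are assumptions made for illustration only; they are not part of the task being evaluated.

```python
from typing import Iterable


def safe_average(values: Iterable[float]) -> float:
    """Return the arithmetic mean of `values`.

    Hypothetical example: the function name and the choice to raise
    exceptions (rather than return a default) are assumptions.
    """
    items = list(values)
    if not items:
        # Edge case: an empty input would otherwise divide by zero.
        raise ValueError("values must not be empty")
    for item in items:
        # bool is a subclass of int, so it is rejected explicitly.
        if isinstance(item, bool) or not isinstance(item, (int, float)):
            raise TypeError(f"non-numeric value: {item!r}")
    return sum(items) / len(items)
```

A response that only implemented the final `return` line would be correct for typical input but would lose points on case coverage, since empty and non-numeric inputs are left unhandled.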

From the instruction following requirements, list out the aspects that are
CORRECT and INCORRECT. They will fall into the following buckets:

Correctness: The output matches the expected result, without errors or
discrepancies.

Precision: The output is precise, with no unnecessary or redundant information.

Relevance: The output is relevant to the input, requirements, or specifications.


Completeness: The output includes all required information, without omitting
essential details.

Case coverage: The code handles a wide range of possible input scenarios, edge
cases, and error conditions, ensuring that it behaves correctly and robustly.

Please specify at a FUNCTION LEVEL

Just provide correct and incorrect points in short.

If the score is not a 5, write why you chose this rating (DO NOT simply write out
your calculation instead write why you think each instruction was followed or not
followed) in short simple points

Optimality/Efficiency *

Optimality / Efficiency is a measurement of how optimal the code solution is and
whether it adheres to common coding practices (creating and using helper functions,
no redundant code, etc.).

For this specific criterion, please evaluate the time complexity of the solution. Is it
an optimal solution? E.g., instead of 3 loops, we can have a solution with 1 loop.
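
The "3 loops versus 1 loop" remark can be illustrated with a small Python sketch. Both function names are hypothetical, and both functions assume a non-empty input; they only show the kind of rewrite a reviewer might look for.

```python
from typing import Sequence, Tuple


def stats_three_passes(nums: Sequence[float]) -> Tuple[float, float, float]:
    """Compute (min, max, sum) with three separate loops."""
    smallest = nums[0]
    for n in nums:
        if n < smallest:
            smallest = n
    largest = nums[0]
    for n in nums:
        if n > largest:
            largest = n
    total = 0
    for n in nums:
        total += n
    return smallest, largest, total


def stats_one_pass(nums: Sequence[float]) -> Tuple[float, float, float]:
    """Compute the same (min, max, sum) in a single loop."""
    smallest = largest = nums[0]
    total = 0
    for n in nums:
        if n < smallest:
            smallest = n
        elif n > largest:
            largest = n
        total += n
    return smallest, largest, total
```

Note that both versions are O(n): merging the loops removes repeated traversal and redundant code but does not change the asymptotic complexity, so time complexity should still be judged separately from loop count.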

From the instruction following requirements, list out the aspects that are
OPTIMAL/EFFICIENT and SUBOPTIMAL/INEFFICIENT. They will fall into the following
buckets:

Adheres to common practices and standards:

Make use of reusable functions (no repetition):

Is the optimal approach in terms of complexity:

Is the optimal approach in terms of case coverage:

Only tell OPTIMAL/EFFICIENT and SUBOPTIMAL/INEFFICIENT points

If the score is not a 5, write why you chose this rating (DO NOT simply write out
your calculation instead write why you think each instruction was followed or not
followed)
in short simple points

Presentation Correct and Presentation Incorrect

Presentation is a measurement of whether or not a model’s response is clear,
well-organized, and well-documented.
NOTE: We are evaluating the presentation of both text AS WELL AS code output

Steps to Evaluate Presentation


Step 1: Identify all implicit formatting rules that this response must adhere to.
Step 2: Identify the number of rules from step 1 that the response failed to
follow.
Step 3: Compute the percentage of the issues and assign a score based on the
scale above. For example, if there are 10 formatting rules that the response
needed to follow but the response didn’t meet 2 of them, that is 20% issues, and
we can give this response a score of 4.
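
The arithmetic in Step 3 can be sketched in Python as follows. The function name `issue_percentage` is an assumption, and the sketch stops at the percentage because the scale that maps percentages to scores is not reproduced in this document.

```python
def issue_percentage(total_rules: int, failed_rules: int) -> float:
    """Return the share of applicable formatting rules that were not met.

    Hypothetical helper: converting the result to a 1-5 score is left
    out on purpose, since the scoring scale is not given here.
    """
    if total_rules <= 0:
        raise ValueError("total_rules must be positive")
    if not 0 <= failed_rules <= total_rules:
        raise ValueError("failed_rules must be between 0 and total_rules")
    return 100.0 * failed_rules / total_rules


# The example from Step 3: 10 applicable rules, 2 of them not met.
print(issue_percentage(10, 2))  # prints 20.0
```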

General Presentation Rules


1. Use a professional tone.
2. For prompts with simple answers, prefer a concise response and consider
bolding relevant information.
3. For prompts with complex answers, provide a well-formatted, detailed
response using markdown formatting with appropriate headers.
4. For prompts with multiple instructions, be sure to follow all of them, breaking
your response into separate sections as necessary.
5. Use numbered or bulleted lists instead of paragraphs.
6. For code blocks, always provide a language tag and use up-to-date libraries
and packages.
7. Avoid adding a title to the entire response unless it provides essential context
or clarity.
8. Consolidate code blocks to minimize the number of chunks, reducing the need
for multiple copy-paste actions. Group related code into fewer, cohesive blocks.
9. Ensure comments are informative and helpful, providing enough detail to be
useful without being overly verbose.
10. Provide detailed explanations of changes and improvements. Use bullet
points for clarity in post-code explanations.
11. Use header styles (e.g., ###) for section titles to improve readability and
consistency.
12. Remove unnecessary or redundant titles.
13. Use bullet points instead of unnecessary numbering in lists where order is
not important.
14. Simplify nested lists to avoid complexity.
15. Address any formatting errors, such as broken code blocks or unnecessary
spaces. Use consistent code comments to highlight changes or important
sections.
16. Remove unnecessary introductory titles and ensure the introduction is
concise. Consider adding a brief outro to summarize key points or next steps.
17. Ensure documentation sections are correctly formatted and do not cause
markdown issues.
18. Avoid lengthy or repetitive information.
19. Maintain consistency in style, using headers for section titles and avoiding
mixed styles.
20. Ensure the response is visually appealing and easy to parse by breaking it
into sections where necessary.

Figure out which presentation aspects need to be present in order to have a
perfect response. Then, list all that are present in the response.
Note: This list is not exhaustive. Other guidelines may exist and not all of these
guidelines may apply to your prompt.

Just provide correct and incorrect points in short and simple points

If the score is not a 5, write why you chose this rating (DO NOT simply write out
your calculation instead write why you think each instruction was followed or not
followed) in simple points

Up-to-date

Up-to-Date is a measurement of whether the code being output uses deprecated
libraries / functions, and whether it uses the most recent functions available to
solve the provided problem.
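
As one illustration of what an up-to-date check might flag in Python, `datetime.utcnow()` has been deprecated since Python 3.12 in favour of timezone-aware datetimes. The sketch below shows the current replacement; the wrapper name `current_utc_time` is an assumption used only for this example.

```python
from datetime import datetime, timezone


def current_utc_time() -> datetime:
    """Return the current time as a timezone-aware UTC datetime.

    Deprecated form (Python 3.12+): datetime.utcnow(), which returns a
    naive datetime. The wrapper name here is hypothetical.
    """
    return datetime.now(timezone.utc)
```

The deprecated call currently emits a `DeprecationWarning` rather than failing, so whether it counts as an up-to-date violation under this rubric is a judgment call for the reviewer.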

Write a numbered list of all the correct and incorrect up-to-date standards,
function usage, and imported packages identified. Note: do this at a
function/package level.

- Does the code adhere to common practices and standards?

- Is there repetition or does the model make reusable functions?

- Is it the optimal approach in terms of complexity, case coverage etc.?

- Out-of-date libraries must trigger run-time or compile-time errors


Just provide correct and incorrect points in short and simple points

Does the model response edit or add any executable code? *

Mark “Yes” if there is any new or altered code present in the response, regardless
of whether it is an entire program or just a code snippet. Does not apply to new
comments!

1. Yes
2. No

Just answer

Rate the degree of execution of the code *

(Only if there is executable code in the response)

Template
Partial Update
Function Update
Out-of-the-Box
NA
What installation commands are necessary to test the code? *

Please separate each individual command with a comma (,) and write ‘N/A’ if
there are no extra install commands required.

ex: pip install tkinter, sudo apt get

What commands are necessary to run the code? Provide a comma separated list
if there are multiple commands. *

ex: uvicorn main:app --host 0.0.0.0 --port 8000, python app.py

Is the output of the code as expected? *

Yes
No

Did the code produce an error? *


Yes
No

Code Execution: Output


Run this code and show only the output, in a terminal-style box, without
explanation

Response 2

<Paste Response 2 here>

Now I will ask questions related to Response 2.

Tell me all the points where the response strictly adheres and does not adhere to
the constraints in the prompt, in simple points.

If the score is not a 5, write why you chose this rating (DO NOT simply write out
your calculation instead write why you think each instruction was followed or not
followed)
in simple short points

Response 1

<Response 1 paste here>

Response 2

<Response 2 paste here>

Please select a score to indicate which response is better.


1. Response 1 is much better
2. Response 1 is better
3. Response 1 is slightly better
4. Response 1 is negligibly better
5. Response 2 is negligibly better
6. Response 2 is slightly better
7. Response 2 is better
8. Response 2 is much better

Just answer

Please provide a justification. Use "@Response 1" and "@Response 2" to refer to
the model responses.
Keep it short, 50 words.

Are there any minor or major stylistic or presentational issues in the response
you selected as the "better" one? *

You are REQUIRED to perform a rewrite for any stylistic or presentational issues!

Make sure that the response meets all stylistic standards, including:

- Language that is clear, concise, and free of unnecessary repetition.

- Use of bullet points when there are three or more key points for easy readability,
with paragraphs broken down logically.

- Sufficient comments in the code to explain complex logic or non-obvious steps.

- For trivial html/css code, the requirement for comments can be more lenient

[YES] - I will perform a rewrite and change the response to follow all stylistic and
presentational requirements
[NO] - I confirm that the "better" response meets all stylistic requirements

Just answer

Does the model response need a rewrite? *

Mark “No” if you believe the selected model response fully addresses the prompt
and follows all presentation and style requirements.

A rewrite is REQUIRED if the response can be improved in any way.


- This means that if an issue is marked in the preferred response, then a rewrite
has to be performed.

- Not performing a rewrite when one is required will heavily affect your score for
the task.

Yes
No - the selected response is perfect and does not have any issues
