0% found this document useful (0 votes)
52 views

Explaining Competitive-Level Programming Solutions Using LLMs

This paper proposes a method to automatically annotate natural language explanations to programming problem-solution pairs from competitive programming contests. The researchers show that while state-of-the-art LLMs like GPT-3.5 and GPT-4 perform poorly at solving such problems, they can describe and explain solutions well. The method generates a structured explanation for problems containing descriptions and analysis of the key ideas. Explanations are evaluated based on how well they satisfy a human expert and help LLMs solve problems more effectively. Experimental results found that while GPT-3.5 and GPT-4 describe solutions comparably, GPT-4 shows a better understanding of the key ideas.

Uploaded by

Richard Melo
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
52 views

Explaining Competitive-Level Programming Solutions Using LLMs

This paper proposes a method to automatically annotate natural language explanations to programming problem-solution pairs from competitive programming contests. The researchers show that while state-of-the-art LLMs like GPT-3.5 and GPT-4 perform poorly at solving such problems, they can describe and explain solutions well. The method generates a structured explanation for problems containing descriptions and analysis of the key ideas. Explanations are evaluated based on how well they satisfy a human expert and help LLMs solve problems more effectively. Experimental results found that while GPT-3.5 and GPT-4 describe solutions comparably, GPT-4 shows a better understanding of the key ideas.

Uploaded by

Richard Melo
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 14

Explaining Competitive-Level Programming Solutions using LLMs

♥ ♣,♠ ♦ ♥
Jierui Li , Szymon Tworkowski , Yingying Wu and Raymond Mooney
♥ ♣ ♠
The University of Texas at Austin, University of Warsaw, IDEAS NCBR

University of Houston
{jierui, mooney}@cs.utexas.edu, [email protected], [email protected]

Abstract (OpenAI, 2023b). Competitive-level programming


problems epitomize problem-solving strategies for
In this paper, we approach competitive-level algorithmic, mathematical, geometric, and graph-
programming problem-solving as a compos-
theoretic problems. Solving them necessitates un-
ite task of reasoning and code generation. We
arXiv:2307.05337v1 [cs.CL] 11 Jul 2023

propose a novel method to automatically anno- derstanding problems, familiarity with algorithms,
tate natural language explanations to <problem, reasoning skills, creative algorithm development,
solution> pairs. We show that despite poor per- and efficient, robust implementation.
formance in solving competitive-level program-
ming problems, state-of-the-art LLMs exhibit Problem Statement: Sign Swap
a strong capacity in describing and explaining Given an array of n integers a1 , a2 , . . . , an , where ai ≠
solutions. Our explanation generation method- 0, check if you can make this array sorted by using the
ology can generate a structured solution expla- following operation any number of times (possibly zero).
An array is sorted if its elements are arranged in a non-
nation for the problem containing descriptions
decreasing order. Select two indices i and j (1 ≤ i, j ≤ n)
and analysis. To evaluate the quality of the such that ai and aj have different signs. In other words,
annotated explanations, we examine their effec- one must be positive and one must be negative. Swap the
tiveness in two aspects: 1) satisfying the human signs of ai and aj . For example if you select ai = 3 and
programming expert who authored the oracle aj = −2, then they will change to ai = −3 and aj = 2.
solution, and 2) aiding LLMs in solving prob- Solution:
lems more effectively. The experimental results a= [ * map ( i n t , i n p u t ( ) . s p l i t ( ) ) ]
h=sum ( 1 f o r v i n a i f v < 0)
on the CodeContests dataset demonstrate that b = [ a b s ( a [ i ] ) * ( −1 i f i <h e l s e 1 ) f o r
while LLM GPT3.5’s and GPT-4’s abilities in i in range ( len ( a ) ) ]
describing the solution are comparable, GPT-4 p r i n t ( ’ y e s ’ i f s o r t e d ( b ) ==b e l s e ’ no ’ )
shows a better understanding of the key idea Solution Description: Move negative signs to the front of
behind the solution. the array and check if it’s already non-decreasing.
Solution Explanation: Swapping different signs for any
1 Introduction number of times means you can move negative signs ar-
bitrarily. A non-decreasing array must have negative el-
Recent Large Language Models (LLMs) have ements ahead of positive elements and moving negative
signs ahead is the optimal operation can be made.
shown impressive capabilities for various reason-
GPT-3.5 Generated Program:
ing tasks, including multi-hop question answering a = l i s t ( map ( i n t , i n p u t ( ) . s p l i t ( ) ) )
n e g _ c o u n t = sum ( 1 f o r x i n a i f x < 0 )
(Wang et al., 2022a; Lyu et al., 2023), common- pos_count = len ( a ) − neg_count
while True :
sense reasoning (Zelikman et al., 2022), symbolic i f neg_count > pos_count :
for i in range ( len ( a ) ) :
reasoning (Hua and Zhang, 2022), and math word i f a [ i ] < 0 : a [ i ] = −a [ i ] ; n e g _ c o u n t −= 1 ;
p o s _ c o u n t += 1 ; b r e a k
problem-solving (Zhou et al., 2023; Chen et al., e l i f pos_count > neg_count :
2022b). The chain-of-thought prompting method for i in range ( len ( a ) ) :
i f a [ i ] > 0 : a [ i ] = −a [ i ] ; p o s _ c o u n t −= 1 ;
(Wei et al., 2022) explicitly instructs LLMs to gen- e l s e : break
n e g _ c o u n t += 1 ; b r e a k

erate intermediate steps until the final answer is i f a l l ( a [ i ] <= a [ i +1 ] f o r i i n r a n g e ( l e n ( a ) −1) ) : p r i n t ( "
YES" )
reached, enabling the model to decompose prob- e l s e : p r i n t ( "NO" )

lems and solve them step-by-step. Nevertheless, Table 1: An example of a problem and solution from
challenges persist when tackling complex reason- Codeforces with human oracle Solution, Description,
ing tasks, such as competitive-level programming and Explanation. The formatting in the problem is sim-
problems. For instance, even powerful models plified. Program generated by GPT-3.5 is incorrect.
like GPT-4 outperform fewer than 5% of human
competitors in virtual contests from Codeforces Previous works on automatically solving pro-
gramming problems focus on tasks mapping fairly The 7 points can be categorized into Description-
detailed natural language instructions to programs. Level (i.e., Point 1,3,4,5) and Analysis-Level (i.e.,
Li et al. (2022); Ni et al. (2023); Chen et al. (2022a) Point 2,6,7) of the problem and solution. To eval-
verified and selected candidate programs by run- uate the quality of the generated explanations, we
ning them on human-written or automatically gen- examine their effectiveness in two respects: 1) be-
erated test cases. Shojaee et al. (2023); Chen et al. ing positively rated by the human programming
(2023); Schäfer et al. (2023) incorporated execution expert who authored the “oracle” solution, and 2)
feedback as an intermediate step during code gener- aiding LLMs in solving the problems more effec-
ation to enhance the programming ability of LLMs. tively. In the explanation-aided evaluation, Ex-
While these methods yield promising results for plainer generates the explanation given the oracle
fairly straightforward implementation tasks, they solution while Solver generates programs from the
fall short on algorithmic reasoning tasks. explanation. We use GPT-turbo-3.5 and GPT-4 as
Table 1 shows a sample problem from Code- the Explainer and GPT-turbo-3.5 as the Solver in
1
forces. Compared to most instruction-to-program our experiments.
tasks, competitive-level programming, a problem- In the human evaluation, we ask human experts
to-program task, is more challenging. Before im- who are the authors of solutions to score the expla-
plementing the program, one needs to first abstract nations from -2 to 2. They give the explanations
the problem and create a mathematical represen- positive scores averaging 0.81 and 1.30 on GPT-
tation, come up with potential solutions, consider 3.5 Explainer and GPT-4 Explainer respectively.
time and space complexity constraints, and corner With respect to explanation-aided program synthe-
cases, and finally identify a proper problem-solving sis, we find that different points of the generated
strategy. explanations can guide the model to solve the prob-
In order to disentangle reasoning about the prob- lem better, with the solve rate at pass@10 (one of
lem from code implementation, we advocate de- the top 10 generated programs is deemed correct)
composing the process of solving a programming increasing from 6.1% to 42.4% on the CodeCon-
problem. Instead of directly generating a program tests (Li et al., 2022) test set. In addition, we found
given the problem statement, we propose adding that GPT-turbo-3.5 performs significantly worse at
explicit, intermediate reasoning steps in natural Analysis-Level explanations compared to GPT-4.
language. These steps are more aligned with how Both of them can generate high-quality Descrip-
humans typically solve such problems and utilize tions.
the idea of chain-of-thought. The main contributions of this work are:
2
However, while <problem, solution> pairs are
publicly available on the practice websites (Li et al.,
2022), natural language descriptions or explana- 1. We advocate disentangling reasoning about
tions of “how to solve the problem” are hard to the problem and code generation in solving
collect at scale, as it requires additional annotations competitive-level programming problems.
from programming experts. Hence, we propose to 2. We propose a Specific-to-General prompt to
automatically generate explanations using an LLM. automatically generate structured natural lan-
We found that: 1) Given both the problem and a guage explanations for <problem, solution>
human-written program solution, pairs.
LLMs like GPT-3.5 and GPT-4 can describe the
solution in natural language reasonably well; and 2) 3. We demonstrate that this proposed method
Given the automatically generated explanation as a yields convincing explanations that are posi-
hint, LLMs perform better at solving the problem. tively scored by the program authors and serve
Our explanation generation methodology shown as effective hints to better solve the problems.
in Figure 1 employs a hierarchical prompt, request-
ing a step-by-step explanation from detailed code
Though the main focus of this paper is not solv-
analysis to a broader understanding of the solution.
ing competitive-level programming problems. We
1
https://codeforces.com/problemset/problem/
further discuss how such explanations can poten-
1670/A tially be used to improve problem-solving, which
2
A solution here refers to a correct program. we leave as a potential avenue for future research.
Figure 1: The explanation generation and evaluation framework and corresponding prompts (Top). An example of
the full explain prompt (Bottom Left) and model’s output is in Appendix Table 7. The blue points are descriptions
while the grey points are analysis. We give the explanation based on the oracle solution to the instructed solver as a
hint (Bottom Right) to evaluate the quality of the generated explanation.

2 Background competitors compromise readability for fast imple-


mentation. Therefore, solutions are often hard to
2.1 Challenges in Solving and Annotation understand by others without natural language ex-
Competitive-level programming problems (Mirza- planation. Small-scale solution explanations, also
yanov et al., 2020) are more indirect compared to known as editorials, can be found in some blogs on
many code implementation tasks. Reasoning and the Internet, but collecting large-scale editorials in
problem-solving strategies are usually necessary a unified format is still infeasible. In this paper, we
before implementation (Skiena and Revilla, 2003; tackle how to use LLMs to generate silver-standard
Laaksonen, 2020); and they require solutions to explanations automatically, thereby addressing the
be both correct and efficient. While brute-force need for accessible and comprehensive solution
solutions may be feasible for some problems, they explanations in competitive-level programming.
are frequently deemed inadequate due to their high
time and space complexity. Additionally, some 2.2 Problem Formulation
problems may intentionally obscure the key idea We formalize our task with a problem set consisting
behind the solution, presenting more puzzle-like of n problems P = {p1 , p2 , ⋯, pn }; each problem
challenges. pi is a text sequence that describes the following
The challenges in solving competitive-level pro- aspects clearly.
gramming lie in not only the implementation phase
• Problem statement: a natural language de-
but also the reasoning process that precedes it,
scription of the problem, as the first cell in
which has not been adequately addressed by pre-
Table 1.
vious works. Consider the problem in Table 1, if
given a specific instruction, LLMs optimized for • Input/Output: the format and input/output con-
code generation can generate the correct program. straints (e.g. ranges) for the submitted pro-
However, the reasoning process of why it is correct gram s.
is not reflected in the problem or the program solu- • Example: An example of a correct pair of
tion. To bridge the gap between the problem and input/output.
the solution, natural-language-explained solutions • Note (Optional): Explanation of the Example
and reasoning steps can be potentially helpful. input/output.
Annotating explanations on how to solve those
questions can be difficult and time-consuming, Each pi corresponds to a set of oracle human
1 2 t
even for highly skilled programming competitors. solutions Si = {si , si , ⋯, si } where t is the num-
Solutions are written under a time constraint, and ber of total solutions of pi , we then select top k
165 problems in the test set of CodeContests, the
proposed general-to-specific prompt can boost the
solve@10 from 6.1% to 9.1%. Through reasoning
general-to-specific, the LLM can perform a bit bet-
ter at solving programming problems. However,
upon examining the failed cases, we discovered
that for most problems, the model makes a mis-
take at a very early stage, ultimately resulting in a
completely incorrect solution.

3 Method
In the process of problem-solving, a human typ-
ically constructs a solution by progressing from
Figure 2: The Baseline Solver Prompt and General-to- a general idea to a detailed code implementation.
Specific (G2S) Prompt which asks LLMs to follow the However, explaining that solution involves a re-
reasoning steps till it reaches the state of implementa-
verse approach. This entails examining the code
tion.
on a line-by-line basis, interpreting the role of each
function, and then rationalizing the algorithmic
solutions following 2 simple rules: (1) We only steps in relation to the original problem. There-
consider correct human solutions, i.e., those that fore, we design a specific-to-general explanation
have passed the online judge system; (2) Solu- generation method.
tions in Si are ranked according to their program-
ming language and size in bytes, with a preference 3.1 Specific-to-General Solution Explaining
for Python-implemented solutions and shorter pro- Previous works have demonstrated the ability of
grams. LLMs to explain code; therefore, we investigated
All experiments in this paper are zero-shot on generating explanations automatically using an
large language models without fine-tuning. LLM with both the problem and sample solution as
j
input. For a problem-solution pair {pi , si } where
2.3 General-to-Specific Prompting Solver j
j ≤ k, an explanation ei is generated. For each
Before delving into the explanations, we first dis- problem pi , a set of explanations Ei is generated
1 2 k
cuss the general capacity of LLMs to solve those given different solutions {si , si , ⋯, si }.
problems directly from problem to generated so- Although simple prompts such as ’explain the
lution or thinking step-by-step. We note that our solution’ may generate useful explanations, these
methodology requires using instruction-finetuned often lack crucial information and are difficult to
language models (Ouyang et al., 2022), as we pro- evaluate due to their output’s diversity. To tackle
vide zero-shot instructions in our prompt. this issue, we deliberately control aspects of the
We designed a general-to-specific reasoning explanations, requiring them to include a ’problem
chain, which is inspired by humans’ step-by-step summary’ that demonstrates our understanding of
thought processes in solving such problems. As the problem and three levels of ’natural language
shown in Figure 2, we prompt the LLM to start description of the problem,’ illustrating our ability
from a general understanding of the problem and to comprehend the solution from a low-level to a
potential algorithms to use, then gradually transit to high-level perspective. These can be considered
a more specific and detailed level of understanding, as ’Description-level’ explanations. The elements
till finally implementing a program in Python. such as ’used algorithm,’ ’time complexity,’ and
For each problem, we generate k programs ’proof of correctness’ fall under ’Analysis-level’
1 2 k
{gi , gi , ⋯, gi } with LLMs as the k candidates to explanations, showcasing the language model’s
conduct a solve@k evaluation, as defined by Chen overall analysis and understanding of the solution.
et al. (2021). In other words, if any of the generated The method for this specific-to-general explanation
k programs is considered as a correct solution, then prompt is detailed in the left part of 1.
this problem is regarded as solved. Format-guided-generated explanations are clear
When experimenting with GPT-turbo-3.5 on the and structured, thus making it easier to disentan-
Ratings
gle information and evaluate each aspect. In our Dataset total [800, 1000] (1000, 1500] (1500, 2000] (2000, 3600]
experiment, over 99% of explanations contain all CodeContests 165 18.2% 17.0% 20.0% 44.8%
Our Data 50 34% 46% 20% 0%
defined points, with less than 1% skipping some
later points due to the length constraint. Table 2: Difficulty statistics (higher ratings = more diffi-
In addition, thinking from detailed code-level cult) for the dataset. The problems in our dataset exclude
implementation can also provide the intermediate hard problems (rating over 2k), as they exceed the rating
steps in context. The LLM can reach a better gen- of our annotators.
eral understanding of the solution by looking at its
previously generated general descriptions.
Metric We employ pass@k (Chen et al., 2021)
3.2 Explanation Instructed Solver as our evaluation metric for solve rate. For each
problem pi , we sample k programs generated from
In order to evaluate the quality of generated expla-
GPT-3.5 and evaluate them using Solve Rate@k
nations, we design an automatic metric to test how
metric: the percentage of programs that pass all
much it can aid in solving the problem if included
hidden test cases when submitted to Codeforces’
in the instruction. In this setting, we give both the
online judge. We first filter the programs by their
original problem as well as one of Description-level
output on the public test cases before submitting
points to the LLM Solver with the corresponding
them and also measure Pass Public@k: the per-
prompt given in the right part of Figure 1. If a
centage of programs that pass the public test cases
given instruction enables the LLM Solver to solve
given in the examples. The above metrics are ab-
a problem it was previously unable to solve, we
breviated as ‘solve@k’ and ‘public@k’.
consider that instruction to be more informative
than one that does not yield such an outcome.
4.2 Human Evaluation
4 Experiments We measured the quality of LLM-generated expla-
nations using human evaluation. We collect 50
4.1 Experimental Setup <problem, solution> pairs from Codeforces, ensur-
Model We use both GPT-3.5-turbo and GPT- ing that their format remained consistent with those
4 (OpenAI, 2023a,b) as the Explainer for expla- in CodeContests.
3
nation generation. We use GPT-3.5-turbo for all
our experiments as Solver LLM for code genera- Author Likert Scores Recognizing that under-
tion. We will refer to it as GPT-3.5 for simplicity. standing and explaining others’ solutions can be a
The temperature t is set to 0 wherever only one challenging task for programmers, we employed
sample is needed, and 0.2 otherwise. Max-length an annotator-centered evaluation approach. We ex-
of text is set to 4096, and we skipped 0.7% of cases tracted solutions and corresponding problems from
where the max length is exceeded. Codeforces for an expert annotator. The Explainer
then generates an explanation for the annotator’s
Data To ensure the effectiveness and accuracy of solution, which was subsequently scored by the
our results, given that GPT-3.5 may have seen some author of the explained solution. Note that each
<problem, solution> pairs in its training data, we explanation is scored by the author of the solution
use the CodeContests test set as our main dataset being explained.
in this paper. It contains 165 real online contest We generated explanations for 50 problems with
problems from Codeforces, the earliest of which ratings ranging from 800 to 2000, along with their
dates back to Oct 2021, which is after the knowl- corresponding solutions, and provided these ex-
edge cutoff of GPT-3.5 and GPT-4 (Sep. 2021). planations to human experts. They were asked to
Additionally, we extract a small subset of 50 more assign a Likert score from −2 (very poor) to 2 (ex-
recent problems from Codeforces for human eval- cellent).
uation. Table 2 are statistics based on their level- The evaluation consists of ten questions, each
of-difficulty ratings. Problems with ratings over one corresponding to a specific aspect of the ex-
2k are considered very difficult, most of which can planation. We separately assess the quality of the
only be solved by medal-winning competitors. response to each point of our G2S prompt (see Fig-
3
Due to the usage limit of GPT-4, we run larger scale ure 1). Furthermore, we developed three criteria to
experiments only on GPT-3.5-turbo. evaluate various aspects of the overall explanation:
at the one-sentence description, there are ambigu-
ous terms like ‘original array’ or ‘move elements’,
which might mislead the problem-solving if inter-
preted incorrectly. This is due to natural languages’
ambiguous nature compared to programming lan-
guages.
Models exhibit shortcomings when explaining
solution correctness, as they may not comprehen-
Figure 3: Human Likert scores (−2: very poor to 2: ex- sively account for conditions stipulated in the prob-
cellent) evaluating various aspects of the explanations. lem statement. For instance, when explaining Ex-
ample 1, it failed to recognize that “swapping signs
of 2 elements multiple times means moving signs
1. Usefulness: How useful is the explanation as
arbitrarily along the array” is a crucial condition,
guidance to solve the problem?
which is not mentioned explicitly in natural lan-
2. Clearness: How good is the explanation in guage. This highlights a potential limitation in the
terms of describing everything clearly and models’ ability to extract and incorporate essen-
avoiding ambiguity? tial information from various parts of the problem
3. Understanding: How much does the LLM statement when generating explanations.
understand the key idea behind the solution? We also present the full input/output and our
scores for both successful and failed cases in ap-
The average Likert scores over 50 problems are pendix A.
shown in Figure 3. Regarding the scores for the so-
lution descriptions (Step-by-Step Solution Descrip- 4.3 Automatic Metrics: Improving Solutions
tion, Explanation of the Solution, Solution in One We further investigated the ability of generated ex-
Sentence) and usefulness, both GPT-3.5 and GPT-4 planations to improve problem solving. Our funda-
Explainer are positively rated by expert annotators, mental assumption is that if an explanation accu-
with an average of 1.16 and 1.36 respectively. rately describes the algorithm, it should be capable
However, GPT-3.5 receives near zero or nega- of guiding the implementation of a correct solution.
tive scores on questions including why it’s correct, Consequently, we experimented with versions of
clearness, and understanding, showing its inade- the Instructed Solver Prompt in Figure 1, wherein
quate ability to grasp the key idea behind the so- one point in the explanation (i.e., an aspect of the
lution, while GPT-4 performs better (0.68 ∼ 0.88 solution) is provided to the GPT-3.5 Solver as a
score higher) on these aspects. This reveals a clear hint for solving the problem.
difference in the abilities of GPT-3.5 and GPT-4 to We compare it with two baseline solvers that, un-
reason and analyze competitive-level programming like our solver from Figure 1, are not conditioned
solutions. on explanations and only get the problem statement
as an input: zero-shot prompt (denoted as Baseline
Qualitative Analysis We observed several inter- in Table 4) and General to Specific (G2S) “step-by-
esting aspects of the explanations generated by the step” prompt shown in Figure 2. We also check that
models. Models can effectively explain code by explanations do not contain code snippets to ensure
integrating the problem statement and the solution. the solutions are not directly leaked in explana-
Step-by-step descriptions are often more concise tions. However, note that it is still not a completely
than line-by-line code reports by summarizing com- “fair” comparison, since the automatically gener-
plex operations and avoiding re-stating well-known ated ‘silver explanations’ are conditioned on oracle
algorithms (e.g., depth-first-search). solutions.
A sample explanation from GPT-3.5 is given in
Table 3. It describes the solution very well in both Main results For GPT-3.5, we measure pass@k
specific (step-by-step) and general (one-sentence) for k = {1, 5, 10}, but only pass@1 for GPT-4 due
levels. It summarizes the operations of ‘count < 0’ to access limits. To sample k programs, we sample
and ‘multiply -1 or 1’ into ‘negative on the left, k different human solutions for Explainer and then
positive on the right’ and explains if it’s sorted, generate a program for each explanation.
then ‘yes’ otherwise, ‘no’. However, if we look Results are shown in Table 4. Different
1). Brief Problem Summary: Given an array of n integers, ing that they share similar capabilities in terms of
check if it can be sorted by swapping the signs of any two
elements with different signs. Description-level explanations.
2). Used Algorithm: The algorithm used is a simple imple-
mentation of the problem statement. GPT-3.5 Solver
3). Step-by-step Solution Description: solve@1 solve@5 solve@10 public@10
1. Read the input array a. Baseline 1.8 3.6 6.1 13.9
2. Count the number of negative elements in the array and G2S prompt 2.4 5.4 9.1 18.8
store it in h. GPT-3.5 Solver With Silver Explanation
3. Create a new array b by taking the absolute value of each w/ UsedAlg 1.8 (1.2) 4.2 6.1 13.3
element in a and multiplying it by -1 if the index is less than w/ S-by-S 13.3 (15.8) 32.2 42.4 47.9
h, otherwise by 1. w/ Exp-Sol 6.1 (4.8) 17.6 23.6 32.7
4. Check if the sorted array b is equal to the original array b. w/ OneSent 4.2 (4.2) 9.1 13.9 26.1
5. If they are equal, print "yes", otherwise print "no". w/ TC(O(⋅)) 1.8 (2.4) 3.6 6.7 13.3
4). Explanation of the Solution: The solution works by
first counting the number of negative elements in the array. Table 4: Different aspects of the explanation’s effect on
We then create a new array b by taking the absolute value
of each element in a and multiplying it by -1 if the index is improving program generation. Values are percentage
less than h, otherwise by 1. This ensures that all negative % and ‘solve’ and ‘public’ are short for ‘Solve Rate’
elements are on the left side of the array and all positive and ‘Pass Public Tests’. Solve@1 results in parentheses
elements are on the right side. We then check if the sorted are from GPT-4’s generated explanations. The bottom 5
array b is equal to the original array b. If they are equal, it rows correspond to Figure 1’s points 2,3,4,5, and 6 in
means that we can sort the original array by swapping the
signs of any two elements with different signs. the left prompt.
5). Solution in one sentence: The solution works by count-
ing the number of negative elements in the array, creating a
new array with all negative elements on the left and positive Pass Public Tests vs. Solve One observation
elements on the right, and checking if the sorted array is from Table 4 is that solve@10 is significantly less
equal to the original array.
6). Time Complexity: The time complexity of the solution than public@10. For a program that passes the
is O(nlogn) due to the sorting operation. public tests but fails the hidden tests, there are two
7). Proof of correctness (Why this is correct): The solution possibilities: 1) It is incorrect and only applies to a
is correct because it ensures that all negative elements are
on the left side of the array and all positive elements are subset of test data, including the public tests; 2) It
on the right side. This means that we can sort the array by is inefficient. As discussed before, in competitive-
swapping the signs of any two elements with different signs.
The sorted array will be equal to the original array if and
level programming, a “correct” but slow implemen-
only if we can sort the array using the given operation. tation does not count as a solution, as there are
constraints on time and space complexity. There-
Table 3: GPT-3.5’s explanation to the example in Table fore, we further study programs that pass the public
1, which understands and clearly describes the key idea
tests but may fail hidden tests. As shown in Table 5,
behind the solution. Note it comprehends the code from
a detailed level in ‘3)’ and a general level in ‘5)’. blue:
the baseline has 48.9% of its programs rejected by
correct, red: incorrect) the online judge due to inefficiency, indicating that
GPT-3.5 tends to generate inefficient implementa-
tions (e.g., brute force solutions).
Description-level aspects of explanations improve
both the solve rate and pass public rate. The most Solve Wrong Answer TLE Other
Baseline 35.1% 15.6% 48.9% 0%
detailed aspect, Step-by-Step Solution Description
G2S prompt 38.3% 14.1% 47.6% 0%
(S-by-S), which provides a detailed natural lan- w/ UsedAlg 39.1% 18.9% 42.1% 0%
guage description of the implementation, offers the w/ S-by-S 75.6% 11.4% 11.4% 1.6%
most significant benefit to problem-solving, result- w/ Exp-Sol 73.6% 11.9% 11.1% 1.4%
ing in a solve rate @1 that is 7.4 times higher than w/ OneSent 56.6% 27.9% 14.0% 1.5%
the baseline. The impact of Explanation of the Table 5: Final judgement of generated programs that
Solution (Exp-Sol) and Solution in One Sentence pass the public tests. TLE means time limit exceeded,
(OneSent) is comparatively lower due to their con- and other includes memory limit exceeded and runtime
4
cise nature, which offers a less clear path towards error.
the solution. However, providing information on
the algorithms used (UsedAlg) or the expected When provided hints from the solution descrip-
time complexity (TC) does not improve GPT-3.5’s tion, the portion of TLE programs drops signifi-
problem-solving capabilities. 4
This is for all submissions, i.e., one problem might have
The pass@1 results for GPT-4 Explainer are up to k submissions, which is different from the problem-wise
not significantly better than for GPT-3.5, indicat- solve rate.
j
the solutionsi , we generate one explanation
j j
ei , and one corresponding program gi .
2. Sample k explanations: We only take the first
1 2 k
solution si , and sample Ei = {ei , ei , ⋯, ei },
j
for each explanation ei , we generate one cor-
j
responding program gi .
3. Sample k programs: We only sample 1 so-
lution si and one corresponding explana-
tion ei , then we sample k programs Gi =
1 2 k
Figure 4: The aiding effects of 3 levels of Solution De- {gi , gi , ⋯, gi } given the explanations.
scription over different difficulty ratings. The difference
5
in color shows the gain in solve@10.
w/ OneSent public@10 solve@10
Baseline 13.9% 6.1%
Statg1. Sample 10 Human Solution si 26.1% 13.9%
cantly. Although GPT-3.5 may still make mistakes Statg2. Sample 10 Explanation ei 24.8% 12.1%
in some details or fail to consider corner cases even Statg3. Sample 10 Programs gi 18.2% 6.7%

with hints from the explanation, it is better at avoid- Statg3. 10 gi from GPT-4 explanation 18.2% 10.9%

ing inefficient solutions. Table 6: Comparison of sampling strategies. Strategies


Another interesting observation is that the are numbered. Rows 2,3,4 are GPT-3.5 sampling 10
wrong answer rate for one-sentence explanation- solutions/explanations/programs respectively, the last
instructed solving is higher than the baseline. One row is GPT-4 Explainer sampling 10 programs from a
possible explanation is that it is challenging to in- single explanation.
corporate corner case handling in a one-sentence
solution description, which makes GPT-3.5 more Table 6 shows that the first strategy of sampling
likely to implement an almost-correct program. from 10 different human oracle solutions is the
most effective. Additionally, the second strategy of
Difficulty of the problem We further study the
sampling 10 explanations from one oracle solution
aiding effect of three levels of Solution Description
yields better results than sampling 10 programs
on problems of different difficulty ratings. Code-
from one explanation (strategy 3). One potential
forces problems are given ratings, the higher the
reason is that some human solutions may have poor
ratings are, the more challenging the problem is.
readability or employ complex implementations
Individuals who consistently solve problems with
that are hard to follow. By sampling different hu-
ratings of 2000 are in the 93rd percentile of all par-
man oracle solutions, there is a higher likelihood
ticipants. As shown in Figure 4, the solve rate
that explanations based on clear and concise solu-
decreases as the ratings increase and no explana-
tions can serve as better hints. Similarly, sampling
tion can help solve complex problems. However,
diverse explanations can mitigate the issue of mis-
for easier problems, even a one-sentence hint en-
leading, incorrect explanations. We also compared
ables GPT-3.5 to solve approximately 70% of prob-
the explanation quality of GPT-4 (i.e., only as an
lems, compared to the ∼ 30% baseline. Further-
Explainer) and found it to be superior to GPT-3.5
more, hints can effectively help to solve medium-
in the same setting (10.9% vs. 6.7%). We skipped
difficulty problems which were previously unsolv-
other settings due to experimental limitations.
able.
Sampling Strategies In our approach, we gener- 5 Related work
ate k programs and treat all of them as candidates 5.1 Solving Competitive-level programming
without re-ranking, making the sampling strategy problems
crucial. We therefore compared three strategies for
sampling k programs. Early attempts to apply deep learning to solve
competitive-level programming problems (Balog
1. Sample k human solutions: for each pi , we et al., 2017) utilized traditional approaches such
1 2 k
sample Si = {si , si , ⋯, si }, and for each of as SMT solvers and search to generate short pro-
5
The outlier is because there’s only one rating=3100 prob- grams for simple problems. Polosukhin and Ski-
lem in the CodeContests test set. danov (2018) collected a dataset of human-written
problem statements and solutions for Codeforces tions with student-created explanations, finding
problems and introduced sequence model baselines that the LLM-created explanations were easier to
that could solve a small subset of their dataset. understand and more accurate. Chen et al. (2023)
With the advent of Transformers, AlphaCode (Li utilizes self-generated explanations as feedback to
et al., 2022) achieved significant progress in solv- its self-debug. In comparison, our work targets ex-
ing competitive-level programming problems by plaining competitive-level programming problems,
attaining a rating equivalent to the top 54% of par- aiming not only to clarify the code implementation
ticipants on Codeforces by finetuning LLMs in the but also to point out the key idea behind the solu-
problem-to-solution scenario with the CodeCon- tion, its correctness, choice of algorithms, and time
tests dataset collected from Codeforces. Notably, complexity.
AlphaCode requires sampling 1M program candi-
dates per problem to achieve a 29.6% solve rate on 6 Conclusion and Future Work
their test set. Zelikman et al. (2023) improves upon
In this paper, we propose explaining competitive-
AlphaCode by using fewer samples for the same
level programming solutions using LLMs. Given a
level of performance. Our study focuses on ex-
problem and its corresponding human oracle solu-
plaining solutions to problems, rather than directly
tion given, LLMs can generate structured explana-
solving them. To the best of our knowledge, this
tions that are positively scored by human authors.
is the first attempt to explain competitive-level pro-
Our evaluation demonstrates that both GPT-3.5 and
gramming solutions using language models, which
GPT-4 exhibit reasonable capabilities in generating
places a landmark of the reasoning and interpreting
faithful and clear descriptions, which can guide
ability of those models.
another LLM to better solve the problem. GPT-4
5.2 Reasoning with large language models outperforms GPT-3.5 significantly in analyzing the
problem and solution, as well as capturing the key
Wei et al. (2022) has demonstrated that by breaking
ideas behind the solution.
down the reasoning steps through chain-of-thought
Our automatic evaluation metric examines an
(CoT) prompting, LLMs are able to solve challeng-
ideal scenario: when a hint is based on an ora-
ing reasoning problems by following the correct
cle human solution, it effectively guides the LLM
logic step-by-step. This method, along with ma-
to generate improved programs for solving prob-
jority voting, has led to notable advancements in
lems. However, a system should be able to learn
solving high-school-level mathematical problems
from human programming solutions to improve its
(Lewkowycz et al., 2022). Kojima et al. (2022)
own problem-solving on novel problems without
generalize the idea of CoT to zero-shot learning.
guidance from a human solution. This raises the
Another technique that builds upon CoT is the
question: Can the LLM-generated explanations be
self-consistency decoding strategy (Wang et al.,
utilized to improve subsequent problem-solving?
2022b). This approach samples diverse reasoning
Our explanation method can potentially be ap-
paths and selects the most consistent answer, which
plied to annotate large-scale data (e.g., the full
has shown to improve LLMs’ performance on com-
CodeContests training set), yielding thousands of
plex reasoning tasks by embracing multiple ways
silver explanations that can be used to fine-tune
of thinking. Additionally, Parsel (Zelikman et al.,
a reasoning model for competitive-level program-
2023) proposed a framework that focuses on en-
ming problems. This approach could help bridge
hancing LLMs’ hierarchical multi-step reasoning
the long-standing reasoning gap between problem
capabilities, particularly for tasks such as generat-
and program for complex programming problems.
ing complex programs.
Moving forward, we aim to further address solving
5.3 Code Comprehension with LLMs such problems by focusing on enhancing reasoning
for programming problems.
Several existing works have explored generat-
ing code explanations using LLMs. MacNeil
Limitations
et al. (2023) integrated LLM-generated code ex-
planations into an interactive e-book on web soft- One primary limitation of this work is that we ex-
ware development, showing that students found perimented on only one dataset and two LLMs,
the generated explanations helpful. Leinonen namely GPT-3.5 and GPT-4, so it’s unclear whether
et al. (2023) compared LLM-generated explana- our method can generalize well to other LLMs
or problem sources other than Codeforces. Here References
we just assume that the competitive-level program- Matej Balog, Alexander L. Gaunt, Marc Brockschmidt,
ming problems are well defined so the distribution Sebastian Nowozin, and Daniel Tarlow. 2017. Deep-
shift won’t be large between sources. coder: Learning to write programs. In International
Conference on Learning Representations.
Another limitation stems from the annotator-
centered nature of our human evaluation process, Bei Chen, Fengji Zhang, Anh Nguyen, Daoguang Zan,
which prevents us from assessing annotator agree- Zeqi Lin, Jian-Guang Lou, and Weizhu Chen. 2022a.
ment. Individual annotators were only able to score Codet: Code generation with generated tests.
explanations based on their own solutions. While
Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan,
we provided guidelines for assigning scores, the Henrique Pondé de Oliveira Pinto, Jared Kaplan,
evaluation process remains inherently subjective, Harrison Edwards, Yuri Burda, Nicholas Joseph,
and interpretations may vary among different anno- Greg Brockman, Alex Ray, Raul Puri, Gretchen
tators. Krueger, Michael Petrov, Heidy Khlaaf, Girish Sas-
try, Pamela Mishkin, Brooke Chan, Scott Gray,
Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz
Ethics Statement Kaiser, Mohammad Bavarian, Clemens Winter,
Philippe Tillet, Felipe Petroski Such, Dave Cum-
mings, Matthias Plappert, Fotios Chantzis, Eliza-
Our research is driven by the potential benefits beth Barnes, Ariel Herbert-Voss, William Hebgen
of improved problem-solving capabilities and a Guss, Alex Nichol, Alex Paino, Nikolas Tezak, Jie
Tang, Igor Babuschkin, Suchir Balaji, Shantanu Jain,
deeper understanding of programming concepts for
William Saunders, Christopher Hesse, Andrew N.
developers and learners. However, we acknowl- Carr, Jan Leike, Joshua Achiam, Vedant Misra, Evan
edge the ethical implications and potential risks Morikawa, Alec Radford, Matthew Knight, Miles
specific to our work. Brundage, Mira Murati, Katie Mayer, Peter Welinder,
Bob McGrew, Dario Amodei, Sam McCandlish, Ilya
This work focuses on the task of automatic code Sutskever, and Wojciech Zaremba. 2021. Evaluat-
generation, but we emphasize that it is not intended ing large language models trained on code. CoRR,
to replace human efforts in programming. Machine- abs/2107.03374.
generated programs may contain errors or vulner-
Wenhu Chen, Xueguang Ma, Xinyi Wang, and
abilities, and it is crucial to thoroughly verify any William W. Cohen. 2022b. Program of thoughts
AI-generated code snippets before using them. Pro- prompting: Disentangling computation from reason-
viding code explanations should not be seen as ing for numerical reasoning tasks.
an endorsement to blindly trust the generated pro-
Xinyun Chen, Maxwell Lin, Nathanael Schärli, and
grams. Users must carefully understand, verify, Denny Zhou. 2023. Teaching large language models
and examine AI-generated code to ensure its cor- to self-debug.
rectness and safety.
Wenyue Hua and Yongfeng Zhang. 2022. System 1
+ system 2 = better world: Neural-symbolic chain
Acknowledgement of logic reasoning. In Findings of the Association
for Computational Linguistics: EMNLP 2022, pages
601–612, Abu Dhabi, United Arab Emirates. Associ-
This material is based on research that is sup- ation for Computational Linguistics.
ported in part by the Air Force Research Laboratory
(AFRL) and DARPA, for the KAIROS program un- Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yu-
der agreement number FA8750-19-2-1003. The taka Matsuo, and Yusuke Iwasawa. 2022. Large lan-
guage models are zero-shot reasoners. In Advances
views and conclusions contained herein are those in Neural Information Processing Systems.
of the authors and should not be interpreted as
necessarily representing the official policies, either Antti Laaksonen. 2020. Guide to Competitive Program-
expressed or implied, of DARPA, IARPA, NSF, or ming - Learning and Improving Algorithms Through
Contests, Second Edition. Undergraduate Topics in
the U.S. Government. Computer Science. Springer.
We sincerely thank our annotators, especially
smax, for their efforts in this work, and also all Juho Leinonen, Paul Denny, Stephen MacNeil, Sami
Sarsa, Seth Bernstein, Joanne Kim, Andrew Tran,
reviewers for their valuable suggestions on this and Arto Hellas. 2023. Comparing code explana-
paper. tions created by students and large language models.
CoRR, abs/2304.03938.
Aitor Lewkowycz, Anders Andreassen, David Dohan, Max Schäfer, Sarah Nadi, Aryaz Eghbali, and Frank
Ethan Dyer, Henryk Michalewski, Vinay V. Ra- Tip. 2023. Adaptive test generation using a large
masesh, Ambrose Slone, Cem Anil, Imanol Schlag, language model.
Theo Gutman-Solo, Yuhuai Wu, Behnam Neyshabur,
Guy Gur-Ari, and Vedant Misra. 2022. Solving quan- Parshin Shojaee, Aneesh Jain, Sindhu Tipirneni, and
titative reasoning problems with language models. Chandan K. Reddy. 2023. Execution-based code
CoRR, abs/2206.14858. generation using deep reinforcement learning.

Yujia Li, David Choi, Junyoung Chung, Nate Kush- Steven Skiena and Miguel A. Revilla. 2003. Program-
man, Julian Schrittwieser, Ré mi Leblond, Tom Ec- ming challenges: the programming contest training
cles, James Keeling, Felix Gimeno, Agustin Dal manual. SIGACT News, 34:68–74.
Lago, Thomas Hubert, Peter Choy, Cyprien de Mas-
Boshi Wang, Xiang Deng, and Huan Sun. 2022a. Itera-
son d’Autume, Igor Babuschkin, Xinyun Chen, Po-
tively prompt pre-trained language models for chain
Sen Huang, Johannes Welbl, Sven Gowal, Alexey
of thought. In Proceedings of the 2022 Conference on
Cherepanov, James Molloy, Daniel J. Mankowitz,
Empirical Methods in Natural Language Processing,
Esme Sutherland Robson, Pushmeet Kohli, Nando
pages 2714–2730, Abu Dhabi, United Arab Emirates.
de Freitas, Koray Kavukcuoglu, and Oriol Vinyals.
Association for Computational Linguistics.
2022. Competition-level code generation with Al-
phaCode. Science, 378(6624):1092–1097. Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V.
Le, Ed H. Chi, and Denny Zhou. 2022b. Self-
Qing Lyu, Shreya Havaldar, Adam Stein, Li Zhang, consistency improves chain of thought reasoning in
Delip Rao, Eric Wong, Marianna Apidianaki, and language models. CoRR, abs/2203.11171.
Chris Callison-Burch. 2023. Faithful chain-of-
thought reasoning. Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten
Bosma, Ed Chi, Quoc Le, and Denny Zhou. 2022.
Stephen MacNeil, Andrew Tran, Arto Hellas, Joanne Chain of thought prompting elicits reasoning in large
Kim, Sami Sarsa, Paul Denny, Seth Bernstein, and language models. arXiv preprint arXiv:2201.11903.
Juho Leinonen. 2023. Experiences from using code
explanations generated by large language models in Eric Zelikman, Qian Huang, Gabriel Poesia, Noah D.
a web software development e-book. In Proceedings Goodman, and Nick Haber. 2023. Parsel: A (de-
of the 54th ACM Technical Symposium on Computer )compositional framework for algorithmic reasoning
Science Education, Volume 1, SIGCSE 2023, Toronto, with language models.
ON, Canada, March 15-18, 2023, pages 931–937.
ACM. Eric Zelikman, Yuhuai Wu, Jesse Mu, and Noah Good-
man. 2022. Star: Bootstrapping reasoning with rea-
Mike Mirzayanov, Oksana Pavlova, Pavel Mavrin, Ro- soning. Advances in Neural Information Processing
man A. Melnikov, A. S. Plotnikov, V. A. Parfenov, Systems, 35:15476–15488.
and Andrew Stankevich. 2020. Codeforces as an
educational platform for learning programming in Denny Zhou, Nathanael Schärli, Le Hou, Jason Wei,
digitalization. Nathan Scales, Xuezhi Wang, Dale Schuurmans,
Claire Cui, Olivier Bousquet, Quoc Le, and Ed Chi.
Ansong Ni, Srini Iyer, Dragomir Radev, Ves Stoyanov, 2023. Least-to-most prompting enables complex rea-
Wen tau Yih, Sida I. Wang, and Xi Victoria Lin. 2023. soning in large language models.
Lever: Learning to verify language-to-code genera-
tion with execution.

OpenAI. 2023a. ChatGPT: Optimizing Language


Models for Dialogue. https://openai.com/blog/
chatgpt.

OpenAI. 2023b. Gpt-4 technical report.

Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Car-


roll L. Wainwright, Pamela Mishkin, Chong Zhang,
Sandhini Agarwal, Katarina Slama, Alex Ray, John
Schulman, Jacob Hilton, Fraser Kelton, Luke Miller,
Maddie Simens, Amanda Askell, Peter Welinder,
Paul F. Christiano, Jan Leike, and Ryan Lowe. 2022.
Training language models to follow instructions with
human feedback. CoRR, abs/2203.02155.

Illia Polosukhin and Alexander Skidanov. 2018. Neural


program search: Solving programming tasks from
description and examples. CoRR, abs/1802.04335.
A Appendix
Case Study Table 7 presents an example of the
input/output of the model, which contains our
specific-to-general prompt as well as the compar-
ison between GPT3.5 and GPT4 generated expla-
nations. We can see that both GPT-3.5 and GPT-4
describe the problem and solution very well. Both
models correspond the solution logic to the prob-
lem situation correctly. In the analysis of why this
solution is correct, GPT-3.5 and GPT-4 mention the
key idea of increasing IQ backward means saving
IQ for the future.
Table 8 presents an incorrectly explained exam-
ple. LLMs can describe the problem and illustrate
the printing array operation as “moving the last
element to the head of the array”. However, both of
them fail to understand the purpose of placing the
maximum element in the front as they ignore one
crucial condition in the problem: all elements are
strictly positive. Nevertheless, GPT-4 maintains a
better understanding of the problem by noticing the
condition that the given array is already sorted in
non-decreasing order.
Hint: Here’s an accepted solution from an expert programmer.
Can you analyze the code and identify the algorithmic approach
Input You are required to read and try to understand a competitive programming problem statement and interpret its correct
used? Please also describe the steps of the algorithm in
solution. Think carefully about why doing this can correctly solve the problem.
natural language, explaining what every named variable means,
———–
what each step does and how it works. If there is anything
Problem:Doremy is asked to test n contests. Contest i can only be tested on day i. The difficulty of contest i is ai . Initially,
non-obvious in the step, explain why it is done. Please notice
Doremy’s IQ is q. On day i Doremy will choose whether to test contest i or not. She can only test a contest if her current IQ is
that in the algorithm, some conclusions might be first deduced
strictly greater than 0.
from conditions and directly used.
If Doremy chooses to test contest i on day i, the following happens:
if ai > q, Doremy will feel she is not wise enough, so q decreases by 1; otherwise, nothing changes. def solve ( ) :
Doremy wants to test as many contests as possible. Please give Doremy a solution. n , q =map ( i n t , i n p u t ( ) . s p l i t ( ) )
Input A= l i s t ( map ( i n t , i n p u t ( ) . s p l i t ( ) ) )
4 A. r e v e r s e ( )
The input consists of multiple test cases. The first line contains a single integer t(1 ≤ t ≤ 10 ) — the number of test cases. The
q_now = 0
description of the test cases follows.
5 9 record = []
The first line contains two integers n and q (1 ≤ n ≤ 10 , 1 ≤ q ≤ 10 ) — the number of contests and Doremy’s IQ in the f o r a i n A:
beginning. i f a > q_now and q_now <q : # c a n
9
The second line contains n integers a1 , a2 , ⋯, an (1 ≤ ai ≤ 10 ) — the difficulty of each contest. take off
5
It is guaranteed that the sum of n over all test cases does not exceed 10 . r e c o r d . append ( 1 )
Output q_now += 1
For each test case, you need to output a binary string s, where si = 1 if Doremy should choose to test contest i, and si = 0 e l i f a > q_now and q_now >= q :
otherwise. The number of ones in the string should be maximum possible, and she should never test a contest when her IQ is zero r e c o r d . append ( 0 )
or less. e l s e : r e c o r d . append ( 1 )
If there are multiple solutions, you may output any. p r i n t ( " " . join ([ s t r ( _ ) for _ in
Example Input Example Output reversed ( record ) ]) )
5 1
11 11 t=int ( input () )
1 110 for i in range ( t ) :
21 1110 solve ()
12 01111 Analyze the problem and the accepted code of the correct
31 solution. Avoid ambiguity in your description as possible. Each
121 part should be independently interpretable, i.e., each point
42 should be clear to readers who have no access to other points.
1431 Answer in the following Format:
52 1). Brief Problem Summary:
51243 2). Used Algorithm:
Note 3). Step-by-step Solution Description:
In the first test case, Doremy tests the only contest. Her IQ doesn’t decrease. 4). Explanation of the Solution:
In the second test case, Doremy tests both contests. Her IQ decreases by 1 after testing contest 2. 5). Solution in one sentence:
In the third test case, Doremy tests contest 1 and 2. Her IQ decreases to 0 after testing contest 2, so she can’t test contest 3. 6). Time Complexity:
7). Proof of correctness (Why this is correct):

GPT-3.5’s Output:

1). Brief Problem Summary:
Doremy is asked to test n contests. Contest i can only be tested on day i. The difficulty of contest i is ai. Initially, Doremy’s IQ is q. On day i Doremy will choose whether to test contest i or not. She can only test a contest if her current IQ is strictly greater than 0. Doremy wants to test as many contests as possible.
2). Used Algorithm:
Greedy Algorithm
3). Step-by-step Solution Description:
Read the number of test cases.
For each test case:
Read the number of contests and Doremy’s IQ in the beginning.
Read the difficulty of each contest.
Reverse the list of difficulties.
Initialize a variable qnow to 0 and an empty list record.
For each difficulty a in the reversed list of difficulties:
If a is greater than qnow and qnow is less than q, append 1 to the record list and increase qnow by 1.
If a is greater than qnow and qnow is greater than or equal to q, append 0 to the record list.
If a is less than or equal to qnow, append 1 to the record list.
Print the binary string obtained by reversing the record list.
4). Explanation of the Solution:
The solution uses a greedy approach to maximize the number of contests Doremy can test. The idea is to start from the last contest and work backwards. For each contest, if its difficulty is greater than Doremy’s current IQ and her IQ is less than the maximum allowed IQ, she can choose to take off and decrease her IQ by 1. If the difficulty is greater than her IQ and her IQ is already at the maximum allowed IQ, she cannot take off and must skip the contest. If the difficulty is less than or equal to her IQ, she can test the contest. By working backwards, we ensure that Doremy tests as many contests as possible while maintaining her IQ above 0. Likert score: 2
5). Solution in one sentence:
The solution uses a greedy approach to maximize the number of contests Doremy can test by working backwards from the last contest and choosing to test or skip each contest based on its difficulty and Doremy’s current IQ.
6). Time Complexity:
The time complexity of the solution is O(n) for each test case, where n is the number of contests.
7). Proof of correctness (Why this is correct):
The solution is correct because it uses a greedy approach that always chooses the optimal solution at each step. By working backwards from the last contest, we ensure that Doremy tests as many contests as possible while maintaining her IQ above 0. The solution is optimal because if we skip a contest that we could have tested, we lose the opportunity to test it later and we may not be able to test as many contests as possible. Therefore, the solution is correct and optimal. Likert score: 2

GPT-4’s Output:

1). Brief Problem Summary:
Doremy wants to test as many contests as possible without her IQ dropping to 0 or less. The input consists of multiple test cases, each containing the number of contests, Doremy’s initial IQ, and the difficulty of each contest. The output should be a binary string indicating which contests Doremy should test.
2). Used Algorithm:
The algorithm is a greedy algorithm that iterates through the contests in reverse order.
3). Step-by-step Solution Description:
Read the number of test cases T.
For each test case:
Read the number of contests n and Doremy’s initial IQ q.
Read the list of contest difficulties A and reverse it.
Initialize qnow to 0 and create an empty list called record.
Iterate through the reversed list of contest difficulties A:
If the current contest difficulty a is greater than qnow and qnow is less than q, append 1 to the record list and increment qnow by 1.
If the current contest difficulty a is greater than qnow and qnow is greater than or equal to q, append 0 to the record list.
Otherwise, append 1 to the record list.
Print the record list in reverse order as a string.
4). Explanation of the Solution:
The solution works by iterating through the contests in reverse order and deciding whether Doremy should test the contest or not based on her current IQ (qnow) and the contest difficulty. If the contest difficulty is greater than her current IQ and her current IQ is less than her initial IQ, she should test the contest and her IQ will increase. If the contest difficulty is greater than her current IQ and her current IQ is greater than or equal to her initial IQ, she should not test the contest. Otherwise, she should test the contest as her IQ will not decrease. Likert score: 2
5). Solution in one sentence:
The solution is a greedy algorithm that iterates through the contests in reverse order and decides whether Doremy should test the contest based on her current IQ and the contest difficulty.
6). Time Complexity:
The time complexity of the solution is O(n) for each test case, where n is the number of contests.
7). Proof of correctness (Why this is correct):
The algorithm is correct because it ensures that Doremy tests as many contests as possible without her IQ dropping to 0 or less. By iterating through the contests in reverse order, the algorithm can make the optimal decision for each contest based on her current IQ and the contest difficulty. If the contest difficulty is greater than her current IQ and her current IQ is less than her initial IQ, she should test the contest and her IQ will increase, allowing her to test more contests in the future. If the contest difficulty is greater than her current IQ and her current IQ is greater than or equal to her initial IQ, she should not test the contest as her IQ will decrease and she may not be able to test future contests. Otherwise, she should test the contest as her IQ will not decrease, ensuring that she can test as many contests as possible. Likert score: 1

Table 7: A full example of the input prompt with a <problem, solution> pair and GPT’s generated explanation (output). Likert scores are provided for points (4) and (7); all other points are scored 2 for both models, as they describe the problem and solution well. Both models grasp the key idea of working backward and greedily choosing whether to take each contest, and their analysis of when Doremy should take a contest is correct. One minor issue with GPT-4 is that its analysis does not specify that its description ("future", "increase") is from a backward perspective. (See text in color; blue: correct, red: incorrect.)
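For concreteness, the following is a minimal Python sketch, written by us from the step-by-step descriptions above; it is not the expert solution that was shown to the models. The variable names q_now and record mirror the qnow and record that both outputs refer to, and the example call at the end is illustrative rather than taken from the paper.

def solve_one_case(q, difficulties):
    # Scan the contests from the last day backwards; q_now counts how much
    # IQ has been "spent" on hard contests so far in this backward scan.
    q_now = 0
    record = []
    for a in reversed(difficulties):
        if a > q_now and q_now < q:
            record.append(1)   # hard contest, budget left: take it and pay 1 IQ
            q_now += 1
        elif a > q_now:
            record.append(0)   # hard contest, no budget left: skip it
        else:
            record.append(1)   # a <= q_now: testing this contest is free
    return "".join(map(str, reversed(record)))

# Hypothetical example: q = 2, difficulties = [1, 4, 3, 1]
print(solve_one_case(2, [1, 4, 3, 1]))   # prints "1011": three contests are tested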
Input
You are required to read and try to understand a competitive programming problem statement and interpret its correct solution. Think carefully about why doing this can correctly solve the problem.
———–
Problem:
Problem: An array a is called ugly if it contains at least one element which is equal to the sum of all elements before it. If the array is not ugly, it is beautiful.
For example: the array [6, 3, 9, 6] is ugly: the element 9 is equal to 6 + 3; the array [5, 5, 7] is ugly: the element 5 (the second one) is equal to 5; the array [8, 4, 10, 14] is beautiful: 8 ≠ 0, 4 ≠ 8, 10 ≠ 8 + 4, 14 ≠ 8 + 4 + 10, so there is no element which is equal to the sum of all elements before it.
You are given an array a such that 1 ≤ a1 ≤ a2 ≤ ⋯ ≤ an ≤ 100. You have to reorder the elements of a in such a way that the resulting array is beautiful. Note that you are not allowed to insert new elements or erase existing ones, you can only change the order of elements of a. You are allowed to keep the array a unchanged, if it is beautiful.
Input
The first line contains one integer t (1 ≤ t ≤ 2000) — the number of test cases.
Each test case consists of two lines. The first line contains one integer n (2 ≤ n ≤ 50). The second line contains n integers a1, a2, . . . , an (1 ≤ a1 ≤ a2 ≤ ⋯ ≤ an ≤ 100).
Output
For each test case, print the answer as follows: if it is impossible to reorder the elements of a in such a way that it becomes beautiful, print NO; otherwise, in the first line, print YES. In the second line, print n integers — any beautiful array which can be obtained from a by reordering its elements. If there are multiple such arrays, print any of them.
Example Input
4
4
3 3 6 6
2
10 10
5
1 2 3 4 5
3
1 4 4
Example Output
YES
3 6 3 6
NO
YES
2 4 1 5 3
YES
1 4 4

Hint: Here’s an accepted solution from an expert programmer. Can you analyze the code and identify the algorithmic approach used? Please also describe the steps of the algorithm in natural language, explaining what every named variable means, what each step does and how it works. If there is anything non-obvious in the step, explain why it is done. Please notice that in the algorithm, some conclusions might be first deduced from conditions and directly used.

t = int(input())
for _ in range(t):
    __ = input()
    a = list(map(int, input().split()))
    if a[0] == a[-1]:
        print("NO")
    else:
        print("YES")
        print(a[-1], end=" ")
        print(*a[:-1])

Analyze the problem and the accepted code of the correct solution. Avoid ambiguity in your description as possible. Each part should be independently interpretable, i.e., each point should be clear to readers who have no access to other points.
Answer in the following Format:
1). Brief Problem Summary:
2). Used Algorithm:
3). Step-by-step Solution Description:
4). Explanation of the Solution:
5). Solution in one sentence:
6). Time Complexity:
7). Proof of correctness (Why this is correct):

GPT-3.5’s Output:

1). Brief Problem Summary: Given an array, reorder its elements in such a way that the resulting array is beautiful (i.e., no element is equal to the sum of all elements before it). If it is impossible to reorder the elements to obtain a beautiful array, print "NO".
2). Used Algorithm: Greedy algorithm Likert score: -2
3). Step-by-step Solution Description:
a. Read the number of test cases, t.
b. For each test case:
i. Read the size of the array n and the array a.
ii. Check if the first and last elements of the array are equal. If they are, print "NO" and move to the next test case.
iii. Otherwise, print "YES".
iv. Reorder the array by moving the last element to the beginning of the array.
v. Print the reordered array.
4). Explanation of the Solution: The solution uses a greedy approach to reorder the array. If the first and last elements of the array are equal, it is impossible to reorder the array to obtain a beautiful array. Otherwise, we can move the last element to the beginning of the array to obtain a beautiful array. This is because the last element is not equal to the sum of all elements before it, and moving it to the beginning ensures that no element is equal to the sum of all elements before it. Likert score: -1
5). Solution in One Sentence: Reorder the array by moving the last element to the beginning of the array to obtain a beautiful array, unless the first and last elements of the array are equal.
6). Time Complexity: O(n), where n is the size of the array.
7). Proof of Correctness: If the first and last elements of the array are equal, it is impossible to reorder the array to obtain a beautiful array. Otherwise, moving the last element to the beginning of the array ensures that no element is equal to the sum of all elements before it, since the last element is not equal to the sum of all elements before it. Therefore, the solution is correct. Likert score: -2

GPT-4’s Output:

1). Brief Problem Summary: The problem requires us to reorder the elements of an array to make it beautiful, which means no element in the array should be equal to the sum of all elements before it. We are given multiple test cases, and for each test case, we must determine if it’s possible to reorder the array to make it beautiful or not.
2). Used Algorithm: Greedy algorithm Likert score: -2
3). Step-by-step Solution Description:
a. Read the number of test cases, t.
b. For each test case:
i. Read the number of elements in the array, n.
ii. Read the elements of the array, a.
iii. Check if the first and last elements of the array are equal. If they are, print "NO".
iv. If not, print "YES" and print the reordered array by placing the last element of the array first followed by the remaining elements in their original order.
4). Explanation of the Solution: The given solution relies on the fact that if the first and last elements of the sorted array are equal, it’s impossible to reorder the elements in such a way to form a beautiful array. This is because the first element will always be equal to the sum of all elements before it, which is an empty sum, i.e., zero. So, in this case, we print "NO". If the first and last elements are not equal, we can reorder the array to make it beautiful by placing the last element (the largest element) first and keeping the remaining elements in their original order. This ensures that the largest element will never be equal to the sum of all elements before it, as there are no elements before it. Since the array is sorted, the remaining elements will also not satisfy the ugly condition. Likert score: -2
5). Solution in One Sentence: For each test case, print "NO" if the first and last elements of the sorted array are equal, otherwise print "YES" and reorder the array by placing the largest element first and keeping the remaining elements in their original order.
6). Time Complexity: O(t ∗ n) where t is the number of test cases and n is the number of elements in the array.
7). Proof of Correctness: As explained in point 4, the solution is correct because it checks the condition that the first and last elements of the array are equal or not. If they are equal or not. If they are equal, it’s impossible to make the array beautiful. If they are not equal, the largest element is placed first, ensuring the ugly condition is not satisfied by any element in the reordered array. Since the array is sorted, the remaining elements will also not satisfy the ugly condition. Likert score: -1

Table 8: The lemon-picking example. Likert scores are provided for points (2), (4) and (7); all other points are scored 2 for both models. Unlike GPT-3.5, GPT-4 notices that the input array is already sorted in non-decreasing order in the problem statement, but it still fails to explain why moving the largest element to the first position makes the array beautiful, because it ignores the crucial condition that all elements are strictly positive. (See text in bold; blue: correct, red: incorrect.)
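To spell out the step both explanations leave implicit, here is a small self-contained Python check, written by us for this discussion and separate from the expert solution quoted in the prompt. Because every element is at least 1 and the input is sorted, once the largest element is placed first the running prefix sum is strictly larger than any remaining element, so no later element can equal the sum of the elements before it. The names is_beautiful and reorder are ours.

import random

def is_beautiful(arr):
    # True iff no element equals the sum of all elements before it.
    prefix = 0
    for x in arr:
        if x == prefix:
            return False
        prefix += x
    return True

def reorder(a):
    # a is sorted non-decreasingly with elements in [1, 100], as in the problem.
    if a[0] == a[-1]:
        return None              # all elements equal: the second one is always "ugly"
    return [a[-1]] + a[:-1]      # largest element first, the rest in their original order

# Randomized sanity check of the positivity argument above.
for _ in range(1000):
    n = random.randint(2, 8)
    a = sorted(random.randint(1, 100) for _ in range(n))
    b = reorder(a)
    if b is not None:
        assert is_beautiful(b), (a, b)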
