Explaining Competitive-Level Programming Solutions Using LLMs

Jierui Li♥, Szymon Tworkowski♣♠, Yingying Wu♦ and Raymond Mooney♥
♥The University of Texas at Austin, ♣University of Warsaw, ♠IDEAS NCBR, ♦University of Houston
{jierui, mooney}@cs.utexas.edu, [email protected], [email protected]
Abstract

We propose a novel method to automatically annotate natural language explanations to <problem, solution> pairs. We show that despite poor performance in solving competitive-level programming problems, state-of-the-art LLMs exhibit a strong capacity in describing and explaining solutions. Our explanation generation methodology can generate a structured solution explanation for the problem containing descriptions and analysis. To evaluate the quality of the annotated explanations, we examine their effectiveness in two aspects: 1) satisfying the human programming expert who authored the oracle solution, and 2) aiding LLMs in solving problems more effectively. The experimental results on the CodeContests dataset demonstrate that while GPT-3.5's and GPT-4's abilities in describing the solution are comparable, GPT-4 shows a better understanding of the key idea behind the solution.

1 Introduction

Recent Large Language Models (LLMs) have shown impressive capabilities for various reasoning tasks, including multi-hop question answering (Wang et al., 2022a; Lyu et al., 2023), commonsense reasoning (Zelikman et al., 2022), symbolic reasoning (Hua and Zhang, 2022), and math word problem-solving (Zhou et al., 2023; Chen et al., 2022b). The chain-of-thought prompting method (Wei et al., 2022) explicitly instructs LLMs to generate intermediate steps until the final answer is reached, enabling the model to decompose problems and solve them step-by-step. Nevertheless, challenges persist when tackling complex reasoning tasks, such as competitive-level programming problems. For instance, even powerful models like GPT-4 outperform fewer than 5% of human competitors in virtual contests from Codeforces. Solving such problems requires understanding problems, familiarity with algorithms, reasoning skills, creative algorithm development, and efficient, robust implementation.

Problem Statement: Sign Swap
Given an array of n integers a1, a2, . . . , an, where ai ≠ 0, check if you can make this array sorted by using the following operation any number of times (possibly zero). An array is sorted if its elements are arranged in a non-decreasing order. Select two indices i and j (1 ≤ i, j ≤ n) such that ai and aj have different signs. In other words, one must be positive and one must be negative. Swap the signs of ai and aj. For example, if you select ai = 3 and aj = −2, then they will change to ai = −3 and aj = 2.

Solution:
a = [*map(int, input().split())]
h = sum(1 for v in a if v < 0)
b = [abs(a[i]) * (-1 if i < h else 1) for i in range(len(a))]
print('yes' if sorted(b) == b else 'no')

Solution Description: Move negative signs to the front of the array and check if it's already non-decreasing.

Solution Explanation: Swapping different signs for any number of times means you can move negative signs arbitrarily. A non-decreasing array must have negative elements ahead of positive elements, and moving negative signs ahead is the optimal operation that can be made.

GPT-3.5 Generated Program:
a = list(map(int, input().split()))
neg_count = sum(1 for x in a if x < 0)
pos_count = len(a) - neg_count
while True:
    if neg_count > pos_count:
        for i in range(len(a)):
            if a[i] < 0: a[i] = -a[i]; neg_count -= 1; pos_count += 1; break
    elif pos_count > neg_count:
        for i in range(len(a)):
            if a[i] > 0: a[i] = -a[i]; pos_count -= 1; neg_count += 1; break
    else: break
if all(a[i] <= a[i+1] for i in range(len(a)-1)): print("YES")
else: print("NO")

Table 1: An example of a problem and solution from Codeforces with the human oracle Solution, Description, and Explanation. The formatting of the problem is simplified. The program generated by GPT-3.5 is incorrect.
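To make the oracle's idea in Table 1 easier to follow, the following is a lightly commented, self-contained version of the same approach written as a function; the function name and the small sanity checks are ours, not part of the dataset.

def can_sort_by_sign_swaps(a):
    """Sign Swap (Table 1): swapping the signs of one positive and one negative
    element moves a negative sign to a new position, so the negative signs can
    be placed on any positions of the same count. The best arrangement puts
    all negative signs at the front; the array is solvable iff that
    arrangement is non-decreasing."""
    h = sum(1 for v in a if v < 0)  # number of negative signs available
    b = [abs(a[i]) * (-1 if i < h else 1) for i in range(len(a))]  # negatives moved to the front
    return b == sorted(b)

# Small sanity checks (ours):
assert can_sort_by_sign_swaps([3, -2])        # becomes [-3, 2]
assert can_sort_by_sign_swaps([1, 2, -3])     # becomes [-1, 2, 3]
assert not can_sort_by_sign_swaps([2, 1])     # no negative sign to move; [2, 1] stays unsorted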
Previous works on automatically solving programming problems focus on tasks mapping fairly detailed natural language instructions to programs. Li et al. (2022); Ni et al. (2023); Chen et al. (2022a) verified and selected candidate programs by running them on human-written or automatically generated test cases. Shojaee et al. (2023); Chen et al. (2023); Schäfer et al. (2023) incorporated execution feedback as an intermediate step during code generation to enhance the programming ability of LLMs. While these methods yield promising results for fairly straightforward implementation tasks, they fall short on algorithmic reasoning tasks.

Table 1 shows a sample problem from Codeforces¹. Compared to most instruction-to-program tasks, competitive-level programming, a problem-to-program task, is more challenging. Before implementing the program, one needs to first abstract the problem and create a mathematical representation, come up with potential solutions, consider time and space complexity constraints and corner cases, and finally identify a proper problem-solving strategy.

In order to disentangle reasoning about the problem from code implementation, we advocate decomposing the process of solving a programming problem. Instead of directly generating a program given the problem statement, we propose adding explicit, intermediate reasoning steps in natural language. These steps are more aligned with how humans typically solve such problems and utilize the idea of chain-of-thought.

However, while <problem, solution²> pairs are publicly available on the practice websites (Li et al., 2022), natural language descriptions or explanations of "how to solve the problem" are hard to collect at scale, as this requires additional annotations from programming experts. Hence, we propose to automatically generate explanations using an LLM. We found that: 1) given both the problem and a human-written program solution, LLMs like GPT-3.5 and GPT-4 can describe the solution in natural language reasonably well; and 2) given the automatically generated explanation as a hint, LLMs perform better at solving the problem.

Our explanation generation methodology, shown in Figure 1, employs a hierarchical prompt, requesting a step-by-step explanation from detailed code analysis to a broader understanding of the solution. The 7 points can be categorized into the Description-Level (i.e., Points 1, 3, 4, 5) and the Analysis-Level (i.e., Points 2, 6, 7) of the problem and solution. To evaluate the quality of the generated explanations, we examine their effectiveness in two respects: 1) being positively rated by the human programming expert who authored the "oracle" solution, and 2) aiding LLMs in solving the problems more effectively. In the explanation-aided evaluation, the Explainer generates the explanation given the oracle solution while the Solver generates programs from the explanation. We use GPT-turbo-3.5 and GPT-4 as the Explainer and GPT-turbo-3.5 as the Solver in our experiments.

In the human evaluation, we ask human experts who are the authors of solutions to score the explanations from -2 to 2. They give the explanations positive scores averaging 0.81 and 1.30 for the GPT-3.5 Explainer and GPT-4 Explainer respectively. With respect to explanation-aided program synthesis, we find that different points of the generated explanations can guide the model to solve the problem better, with the solve rate at pass@10 (one of the top 10 generated programs is deemed correct) increasing from 6.1% to 42.4% on the CodeContests (Li et al., 2022) test set. In addition, we found that GPT-turbo-3.5 performs significantly worse at Analysis-Level explanations compared to GPT-4. Both of them can generate high-quality Descriptions.

The main contributions of this work are:

1. We advocate disentangling reasoning about the problem and code generation in solving competitive-level programming problems.

2. We propose a Specific-to-General prompt to automatically generate structured natural language explanations for <problem, solution> pairs.

3. We demonstrate that this proposed method yields convincing explanations that are positively scored by the program authors and serve as effective hints to better solve the problems.

Though the main focus of this paper is not solving competitive-level programming problems, we further discuss how such explanations can potentially be used to improve problem-solving, which we leave as a potential avenue for future research.

¹https://codeforces.com/problemset/problem/1670/A
²A solution here refers to a correct program.
Figure 1: The explanation generation and evaluation framework and the corresponding prompts (Top). An example of the full explain prompt (Bottom Left) and the model's output is given in Appendix Table 7. The blue points are descriptions while the grey points are analyses. We give the explanation based on the oracle solution to the instructed solver as a hint (Bottom Right) to evaluate the quality of the generated explanation.
Figure 2: The Baseline Solver Prompt and the General-to-Specific (G2S) Prompt, which asks LLMs to follow the reasoning steps till it reaches the state of implementation.

Solutions are selected following 2 simple rules: (1) we only consider correct human solutions, i.e., those that have passed the online judge system; (2) solutions in S_i are ranked according to their programming language and size in bytes, with a preference for Python-implemented solutions and shorter programs. All experiments in this paper are zero-shot on large language models without fine-tuning.

2.3 General-to-Specific Prompting Solver

Before delving into the explanations, we first discuss the general capacity of LLMs to solve those problems directly from problem to generated solution or thinking step-by-step. We note that our methodology requires using instruction-finetuned language models (Ouyang et al., 2022), as we provide zero-shot instructions in our prompt.

We designed a general-to-specific reasoning chain, which is inspired by humans' step-by-step thought processes in solving such problems. As shown in Figure 2, we prompt the LLM to start from a general understanding of the problem and potential algorithms to use, then gradually transition to a more specific and detailed level of understanding, till finally implementing a program in Python.

For each problem, we generate k programs {g_i^1, g_i^2, ⋯, g_i^k} with LLMs as the k candidates to conduct a solve@k evaluation, as defined by Chen et al. (2021). In other words, if any of the generated k programs is considered a correct solution, then this problem is regarded as solved. When experimenting with GPT-turbo-3.5 on the
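As a minimal sketch of the solve@k criterion defined above (not the authors' evaluation code), assume we already have a per-program correctness verdict for each of the k candidates; the helper name below is ours.

def solve_at_k(verdicts_per_problem):
    """solve@k: a problem counts as solved if ANY of its k generated programs
    is judged correct; return the fraction of solved problems.

    verdicts_per_problem: list over problems, each entry a list of k booleans.
    """
    solved = sum(1 for verdicts in verdicts_per_problem if any(verdicts))
    return solved / len(verdicts_per_problem)

# Toy example (ours): 3 problems, k = 3 candidate programs each.
verdicts = [
    [False, True, False],   # solved by the second candidate
    [False, False, False],  # unsolved
    [True, True, False],    # solved
]
print(solve_at_k(verdicts))  # 2/3, i.e. about 0.67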
3 Method

In the process of problem-solving, a human typically constructs a solution by progressing from a general idea to a detailed code implementation. However, explaining that solution involves a reverse approach. This entails examining the code on a line-by-line basis, interpreting the role of each function, and then rationalizing the algorithmic steps in relation to the original problem. Therefore, we design a specific-to-general explanation generation method.

3.1 Specific-to-General Solution Explaining

Previous works have demonstrated the ability of LLMs to explain code; therefore, we investigated generating explanations automatically using an LLM with both the problem and a sample solution as input. For a problem-solution pair {p_i, s_i^j} where j ≤ k, an explanation e_i^j is generated. For each problem p_i, a set of explanations E_i is generated given different solutions {s_i^1, s_i^2, ⋯, s_i^k}.

Although simple prompts such as 'explain the solution' may generate useful explanations, these often lack crucial information and are difficult to evaluate due to their output's diversity. To tackle this issue, we deliberately control aspects of the explanations, requiring them to include a 'problem summary' that demonstrates our understanding of the problem and three levels of 'natural language description of the problem,' illustrating our ability to comprehend the solution from a low-level to a high-level perspective. These can be considered 'Description-level' explanations. Elements such as 'used algorithm,' 'time complexity,' and 'proof of correctness' fall under 'Analysis-level' explanations, showcasing the language model's overall analysis and understanding of the solution. The method for this specific-to-general explanation prompt is detailed in the left part of Figure 1.

Format-guided generated explanations are clear and structured, thus making it easier to disentangle information and evaluate each aspect. In our experiment, over 99% of explanations contain all defined points, with less than 1% skipping some later points due to the length constraint.

In addition, thinking from the detailed code-level implementation can also provide the intermediate steps in context. The LLM can reach a better general understanding of the solution by looking at its previously generated general descriptions.
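To make the structure of this prompt concrete, here is a rough sketch of how a specific-to-general prompt with the seven points could be assembled; the surrounding instruction strings are paraphrases (the verbatim prompt appears in Figure 1 and Table 8), and the function name is ours.

# The seven requested points, from problem restatement and detailed description
# (Description-level: 1, 3, 4, 5) up to higher-level analysis (Analysis-level: 2, 6, 7).
ANSWER_FORMAT = [
    "1). Brief Problem Summary:",
    "2). Used Algorithm:",
    "3). Step-by-step Solution Description:",
    "4). Explanation of the Solution:",
    "5). Solution in one sentence:",
    "6). Time Complexity:",
    "7). Proof of correctness (Why this is correct):",
]

def build_explain_prompt(problem_statement: str, solution_code: str) -> str:
    """Assemble a specific-to-general explanation prompt from a
    <problem, solution> pair (a sketch, not the verbatim prompt)."""
    return "\n".join([
        "Read the competitive programming problem below and the accepted solution.",
        "Problem:",
        problem_statement,
        "Accepted solution:",
        solution_code,
        "Analyze the code and answer in the following format:",
        *ANSWER_FORMAT,
    ])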
3.2 Explanation Instructed Solver

In order to evaluate the quality of generated explanations, we design an automatic metric to test how much they can aid in solving the problem if included in the instruction. In this setting, we give both the original problem as well as one of the Description-level points to the LLM Solver, with the corresponding prompt given in the right part of Figure 1. If a given instruction enables the LLM Solver to solve a problem it was previously unable to solve, we consider that instruction to be more informative than one that does not yield such an outcome.
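A minimal sketch of how one Description-level point can be injected as a hint follows; query_llm and judge are placeholders for the chat-completion call and the test-based correctness check, and the wording paraphrases the prompt in the right part of Figure 1.

def build_instructed_solver_prompt(problem_statement: str, hint_point: str) -> str:
    """Problem statement plus one Description-level point of the generated
    explanation, used as a hint (a paraphrase, not the verbatim prompt)."""
    return "\n".join([
        "Solve the following competitive programming problem in Python.",
        "Problem:",
        problem_statement,
        "Hint (one aspect of a known correct solution):",
        hint_point,
        "Write a complete program that reads from standard input and prints the answer.",
    ])

def is_more_informative(hint_point, problem, solved_without_hint, query_llm, judge):
    """An instruction counts as informative if it lets the Solver crack a
    problem it previously could not solve."""
    program = query_llm(build_instructed_solver_prompt(problem, hint_point))
    return (not solved_without_hint) and judge(program)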
4 Experiments

4.1 Experimental Setup

Model We use both GPT-3.5-turbo and GPT-4 (OpenAI, 2023a,b) as the Explainer for explanation generation³. We use GPT-3.5-turbo as the Solver LLM for code generation in all our experiments. We will refer to it as GPT-3.5 for simplicity. The temperature t is set to 0 wherever only one sample is needed, and 0.2 otherwise. The max length of text is set to 4096, and we skipped the 0.7% of cases where the max length is exceeded.

³Due to the usage limit of GPT-4, we run larger-scale experiments only on GPT-3.5-turbo.

Data To ensure the effectiveness and accuracy of our results, given that GPT-3.5 may have seen some <problem, solution> pairs in its training data, we use the CodeContests test set as our main dataset in this paper. It contains 165 real online contest problems from Codeforces, the earliest of which dates back to Oct 2021, which is after the knowledge cutoff of GPT-3.5 and GPT-4 (Sep. 2021). Additionally, we extract a small subset of 50 more recent problems from Codeforces for human evaluation. Table 2 shows statistics based on their level-of-difficulty ratings. Problems with ratings over 2k are considered very difficult, most of which can only be solved by medal-winning competitors.

                       Ratings
Dataset        total   [800, 1000]   (1000, 1500]   (1500, 2000]   (2000, 3600]
CodeContests   165     18.2%         17.0%          20.0%          44.8%
Our Data       50      34%           46%            20%            0%

Table 2: Difficulty statistics (higher ratings = more difficult) for the dataset. The problems in our dataset exclude hard problems (rating over 2k), as they exceed the rating of our annotators.

Metric We employ pass@k (Chen et al., 2021) as our evaluation metric for the solve rate. For each problem p_i, we sample k programs generated from GPT-3.5 and evaluate them using the Solve Rate@k metric: the percentage of programs that pass all hidden test cases when submitted to Codeforces' online judge. We first filter the programs by their output on the public test cases before submitting them, and also measure Pass Public@k: the percentage of programs that pass the public test cases given in the examples. The above metrics are abbreviated as 'solve@k' and 'public@k'.
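The public-test filtering step can be pictured as follows; this is only a local illustration with helper names of our choosing, since the hidden-test judgement in the paper comes from Codeforces' online judge.

import subprocess

def passes_public_tests(program_path, public_tests):
    """Run a candidate program on the public example tests only.
    public_tests: list of (input_text, expected_output_text) pairs."""
    for input_text, expected_output in public_tests:
        try:
            result = subprocess.run(
                ["python3", program_path],
                input=input_text,
                capture_output=True,
                text=True,
                timeout=10,  # generous local limit; the online judge enforces the real one
            )
        except subprocess.TimeoutExpired:
            return False
        if result.stdout.strip() != expected_output.strip():
            return False
    return True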
4.2 Human Evaluation

We measured the quality of LLM-generated explanations using human evaluation. We collect 50 <problem, solution> pairs from Codeforces, ensuring that their format remained consistent with those in CodeContests.

Author Likert Scores Recognizing that understanding and explaining others' solutions can be a challenging task for programmers, we employed an annotator-centered evaluation approach. We extracted solutions and corresponding problems from Codeforces for an expert annotator. The Explainer then generates an explanation for the annotator's solution, which was subsequently scored by the author of the explained solution. Note that each explanation is scored by the author of the solution being explained.

We generated explanations for 50 problems with ratings ranging from 800 to 2000, along with their corresponding solutions, and provided these explanations to human experts. They were asked to assign a Likert score from −2 (very poor) to 2 (excellent).

The evaluation consists of ten questions, each one corresponding to a specific aspect of the explanation. We separately assess the quality of the response to each point of our G2S prompt (see Figure 1). Furthermore, we developed three criteria to evaluate various aspects of the overall explanation:
1. Usefulness: How useful is the explanation as guidance to solve the problem?

2. Clearness: How good is the explanation in terms of describing everything clearly and avoiding ambiguity?

3. Understanding: How much does the LLM understand the key idea behind the solution?

Figure 3: Human Likert scores (−2: very poor to 2: excellent) evaluating various aspects of the explanations.

The average Likert scores over 50 problems are shown in Figure 3. Regarding the scores for the solution descriptions (Step-by-Step Solution Description, Explanation of the Solution, Solution in One Sentence) and usefulness, both the GPT-3.5 and GPT-4 Explainer are positively rated by expert annotators, with averages of 1.16 and 1.36 respectively.

However, GPT-3.5 receives near-zero or negative scores on questions including why it's correct, clearness, and understanding, showing its inadequate ability to grasp the key idea behind the solution, while GPT-4 performs better (0.68 ∼ 0.88 score higher) on these aspects. This reveals a clear difference in the abilities of GPT-3.5 and GPT-4 to reason about and analyze competitive-level programming solutions.

Qualitative Analysis We observed several interesting aspects of the explanations generated by the models. Models can effectively explain code by integrating the problem statement and the solution. Step-by-step descriptions are often more concise than line-by-line code reports by summarizing complex operations and avoiding re-stating well-known algorithms (e.g., depth-first search).

A sample explanation from GPT-3.5 is given in Table 3. It describes the solution very well at both the specific (step-by-step) and general (one-sentence) levels. It summarizes the operations of 'count < 0' and 'multiply -1 or 1' into 'negative on the left, positive on the right' and explains that if it's sorted, then 'yes', otherwise 'no'. However, if we look at the one-sentence description, there are ambiguous terms like 'original array' or 'move elements', which might mislead the problem-solving if interpreted incorrectly. This is due to natural language's ambiguous nature compared to programming languages.

Models exhibit shortcomings when explaining solution correctness, as they may not comprehensively account for conditions stipulated in the problem statement. For instance, when explaining Example 1, it failed to recognize that "swapping signs of 2 elements multiple times means moving signs arbitrarily along the array" is a crucial condition, which is not mentioned explicitly in natural language. This highlights a potential limitation in the models' ability to extract and incorporate essential information from various parts of the problem statement when generating explanations.

We also present the full input/output and our scores for both successful and failed cases in Appendix A.

1). Brief Problem Summary: Given an array of n integers, check if it can be sorted by swapping the signs of any two elements with different signs.
2). Used Algorithm: The algorithm used is a simple implementation of the problem statement.
3). Step-by-step Solution Description:
1. Read the input array a.
2. Count the number of negative elements in the array and store it in h.
3. Create a new array b by taking the absolute value of each element in a and multiplying it by -1 if the index is less than h, otherwise by 1.
4. Check if the sorted array b is equal to the original array b.
5. If they are equal, print "yes", otherwise print "no".
4). Explanation of the Solution: The solution works by first counting the number of negative elements in the array. We then create a new array b by taking the absolute value of each element in a and multiplying it by -1 if the index is less than h, otherwise by 1. This ensures that all negative elements are on the left side of the array and all positive elements are on the right side. We then check if the sorted array b is equal to the original array b. If they are equal, it means that we can sort the original array by swapping the signs of any two elements with different signs.
5). Solution in one sentence: The solution works by counting the number of negative elements in the array, creating a new array with all negative elements on the left and positive elements on the right, and checking if the sorted array is equal to the original array.
6). Time Complexity: The time complexity of the solution is O(nlogn) due to the sorting operation.
7). Proof of correctness (Why this is correct): The solution is correct because it ensures that all negative elements are on the left side of the array and all positive elements are on the right side. This means that we can sort the array by swapping the signs of any two elements with different signs. The sorted array will be equal to the original array if and only if we can sort the array using the given operation.

Table 3: GPT-3.5's explanation of the example in Table 1, which understands and clearly describes the key idea behind the solution. Note it comprehends the code at a detailed level in '3)' and at a general level in '5)'. (blue: correct, red: incorrect)

4.3 Automatic Metrics: Improving Solutions

We further investigated the ability of generated explanations to improve problem solving. Our fundamental assumption is that if an explanation accurately describes the algorithm, it should be capable of guiding the implementation of a correct solution. Consequently, we experimented with versions of the Instructed Solver Prompt in Figure 1, wherein one point in the explanation (i.e., an aspect of the solution) is provided to the GPT-3.5 Solver as a hint for solving the problem.

We compare it with two baseline solvers that, unlike our solver from Figure 1, are not conditioned on explanations and only get the problem statement as input: the zero-shot prompt (denoted as Baseline in Table 4) and the General-to-Specific (G2S) "step-by-step" prompt shown in Figure 2. We also check that explanations do not contain code snippets to ensure the solutions are not directly leaked in the explanations. However, note that it is still not a completely "fair" comparison, since the automatically generated 'silver explanations' are conditioned on oracle solutions.

Main results For GPT-3.5, we measure pass@k for k = {1, 5, 10}, but only pass@1 for GPT-4 due to access limits. To sample k programs, we sample k different human solutions for the Explainer and then generate a program for each explanation.
Results are shown in Table 4. Different Description-level aspects of the explanations improve both the solve rate and the pass-public rate. The most detailed aspect, the Step-by-Step Solution Description (S-by-S), which provides a detailed natural language description of the implementation, offers the most significant benefit to problem-solving, resulting in a solve rate@1 that is 7.4 times higher than the baseline. The impact of the Explanation of the Solution (Exp-Sol) and the Solution in One Sentence (OneSent) is comparatively lower due to their concise nature, which offers a less clear path towards the solution. However, providing information on the algorithms used (UsedAlg) or the expected time complexity (TC) does not improve GPT-3.5's problem-solving capabilities.

The pass@1 results for the GPT-4 Explainer are not significantly better than for GPT-3.5, indicating that they share similar capabilities in terms of Description-level explanations.

GPT-3.5 Solver
                solve@1       solve@5   solve@10   public@10
Baseline        1.8           3.6       6.1        13.9
G2S prompt      2.4           5.4       9.1        18.8
GPT-3.5 Solver With Silver Explanation
w/ UsedAlg      1.8 (1.2)     4.2       6.1        13.3
w/ S-by-S       13.3 (15.8)   32.2      42.4       47.9
w/ Exp-Sol      6.1 (4.8)     17.6      23.6       32.7
w/ OneSent      4.2 (4.2)     9.1       13.9       26.1
w/ TC(O(⋅))     1.8 (2.4)     3.6       6.7        13.3

Table 4: Different aspects of the explanation's effect on improving program generation. Values are percentages (%), and 'solve' and 'public' are short for 'Solve Rate' and 'Pass Public Tests'. Solve@1 results in parentheses are from GPT-4's generated explanations. The bottom 5 rows correspond to Figure 1's points 2, 3, 4, 5, and 6 in the left prompt.

Pass Public Tests vs. Solve One observation from Table 4 is that solve@10 is significantly less than public@10. For a program that passes the public tests but fails the hidden tests, there are two possibilities: 1) it is incorrect and only applies to a subset of the test data, including the public tests; 2) it is inefficient. As discussed before, in competitive-level programming, a "correct" but slow implementation does not count as a solution, as there are constraints on time and space complexity. Therefore, we further study programs that pass the public tests but may fail hidden tests. As shown in Table 5, the baseline has 48.9% of its programs rejected by the online judge due to inefficiency, indicating that GPT-3.5 tends to generate inefficient implementations (e.g., brute-force solutions).

                Solve    Wrong Answer   TLE     Other
Baseline        35.1%    15.6%          48.9%   0%
G2S prompt      38.3%    14.1%          47.6%   0%
w/ UsedAlg      39.1%    18.9%          42.1%   0%
w/ S-by-S       75.6%    11.4%          11.4%   1.6%
w/ Exp-Sol      73.6%    11.9%          11.1%   1.4%
w/ OneSent      56.6%    27.9%          14.0%   1.5%

Table 5: Final judgement of generated programs that pass the public tests⁴. TLE means time limit exceeded, and Other includes memory limit exceeded and runtime error.

⁴This is for all submissions, i.e., one problem might have up to k submissions, which is different from the problem-wise solve rate.
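For intuition about how a submission that passes the public tests can still land in the Wrong Answer or TLE columns above, here is a rough local re-judging sketch; the real verdicts in Table 5 come from Codeforces' online judge, and the helper below and its time limit are ours.

import subprocess

def local_verdict(program_path, test_input, expected_output, time_limit_s=2.0):
    """Return 'OK', 'Wrong Answer', 'TLE', or 'Runtime Error' for one hidden test.
    A mock of what an online judge does; real per-problem limits differ."""
    try:
        result = subprocess.run(
            ["python3", program_path],
            input=test_input,
            capture_output=True,
            text=True,
            timeout=time_limit_s,
        )
    except subprocess.TimeoutExpired:
        return "TLE"              # a correct-looking but slow program still fails
    if result.returncode != 0:
        return "Runtime Error"
    if result.stdout.strip() != expected_output.strip():
        return "Wrong Answer"
    return "OK"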
When provided hints from the solution description, the portion of TLE programs drops significantly. Although GPT-3.5 may still make mistakes in some details or fail to consider corner cases even with hints from the explanation, it is better at avoid-

1. Sample k solutions: for each sampled human solution s_i^j, we generate one explanation e_i^j and one corresponding program g_i^j.

2. Sample k explanations: We only take the first solution s_i^1, and sample E_i = {e_i^1, e_i^2, ⋯, e_i^k}; for each explanation e_i^j, we generate one corresponding program g_i^j.

3. Sample k programs: We only sample 1 solution s_i and one corresponding explanation e_i, then we sample k programs G_i = {g_i^1, g_i^2, ⋯, g_i^k} given the explanations.

Figure 4: The aiding effects of 3 levels of Solution Description over different difficulty ratings. The difference in color shows the gain in solve@10.

w/ OneSent                                 public@10   solve@10
Baseline                                   13.9%       6.1%
Statg1. Sample 10 Human Solutions s_i      26.1%       13.9%
Statg2. Sample 10 Explanations e_i         24.8%       12.1%
Statg3. Sample 10 Programs g_i             18.2%       6.7%
Statg3*. 10 g_i from GPT-4 explanation     18.2%       10.9%
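The three strategies differ only in which stage of the pipeline the ten samples are drawn from. A schematic sketch, with explain() and solve() as placeholders for the Explainer and Solver calls, assuming this reading of the strategy labels above:

def sample_programs(problem, human_solutions, explain, solve, k=10, strategy=1):
    """Schematic view of the three sampling strategies compared above.

    explain(problem, solution) -> one explanation
    solve(problem, explanation) -> one program
    Strategy 1: k human solutions, one explanation and one program each.
    Strategy 2: one human solution, k sampled explanations, one program each.
    Strategy 3: one human solution, one explanation, k sampled programs.
    """
    if strategy == 1:
        return [solve(problem, explain(problem, s)) for s in human_solutions[:k]]
    if strategy == 2:
        s = human_solutions[0]
        return [solve(problem, explain(problem, s)) for _ in range(k)]
    if strategy == 3:
        e = explain(problem, human_solutions[0])
        return [solve(problem, e) for _ in range(k)]
    raise ValueError("strategy must be 1, 2, or 3")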
References

Yujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, Rémi Leblond, Tom Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, Thomas Hubert, Peter Choy, Cyprien de Masson d'Autume, Igor Babuschkin, Xinyun Chen, Po-Sen Huang, Johannes Welbl, Sven Gowal, Alexey Cherepanov, James Molloy, Daniel J. Mankowitz, Esme Sutherland Robson, Pushmeet Kohli, Nando de Freitas, Koray Kavukcuoglu, and Oriol Vinyals. 2022. Competition-level code generation with AlphaCode. Science, 378(6624):1092–1097.

Qing Lyu, Shreya Havaldar, Adam Stein, Li Zhang, Delip Rao, Eric Wong, Marianna Apidianaki, and Chris Callison-Burch. 2023. Faithful chain-of-thought reasoning.

Stephen MacNeil, Andrew Tran, Arto Hellas, Joanne Kim, Sami Sarsa, Paul Denny, Seth Bernstein, and Juho Leinonen. 2023. Experiences from using code explanations generated by large language models in a web software development e-book. In Proceedings of the 54th ACM Technical Symposium on Computer Science Education, Volume 1, SIGCSE 2023, Toronto, ON, Canada, March 15-18, 2023, pages 931–937. ACM.

Mike Mirzayanov, Oksana Pavlova, Pavel Mavrin, Roman A. Melnikov, A. S. Plotnikov, V. A. Parfenov, and Andrew Stankevich. 2020. Codeforces as an educational platform for learning programming in digitalization.

Ansong Ni, Srini Iyer, Dragomir Radev, Ves Stoyanov, Wen-tau Yih, Sida I. Wang, and Xi Victoria Lin. 2023. Lever: Learning to verify language-to-code generation with execution.

Steven Skiena and Miguel A. Revilla. 2003. Programming challenges: the programming contest training manual. SIGACT News, 34:68–74.

Boshi Wang, Xiang Deng, and Huan Sun. 2022a. Iteratively prompt pre-trained language models for chain of thought. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 2714–2730, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.

Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V. Le, Ed H. Chi, and Denny Zhou. 2022b. Self-consistency improves chain of thought reasoning in language models. CoRR, abs/2203.11171.

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Ed Chi, Quoc Le, and Denny Zhou. 2022. Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903.

Eric Zelikman, Qian Huang, Gabriel Poesia, Noah D. Goodman, and Nick Haber. 2023. Parsel: A (de)compositional framework for algorithmic reasoning with language models.

Eric Zelikman, Yuhuai Wu, Jesse Mu, and Noah Goodman. 2022. STaR: Bootstrapping reasoning with reasoning. Advances in Neural Information Processing Systems, 35:15476–15488.

Denny Zhou, Nathanael Schärli, Le Hou, Jason Wei, Nathan Scales, Xuezhi Wang, Dale Schuurmans, Claire Cui, Olivier Bousquet, Quoc Le, and Ed Chi. 2023. Least-to-most prompting enables complex reasoning in large language models.
Table 7: A full example of the input prompt, a <problem, solution> pair, and GPT's generated explanation (output). Likert scores are provided for points (4) and (7); all other points are scored 2 for both models as they describe the problem and solution well. Both models get the key idea to work backward and choose to take the contest greedily. Their analysis of when Doremy should take a contest is correct. One minor issue with GPT-4 is that its analysis does not specify that its description (future, increase) is from a backward perspective. (see text in color, blue: correct, red: incorrect)

Input
You are required to read and try to understand a competitive programming problem statement and interpret its correct solution. Think carefully about why doing this can correctly solve the problem.
———–
Problem:
An array a is called ugly if it contains at least one element which is equal to the sum of all elements before it. If the array is not ugly, it is beautiful.
For example: the array [6, 3, 9, 6] is ugly: the element 9 is equal to 6 + 3; the array [5, 5, 7] is ugly: the element 5 (the second one) is equal to 5; the array [8, 4, 10, 14] is beautiful: 8 ≠ 0, 4 ≠ 8, 10 ≠ 8 + 4, 14 ≠ 8 + 4 + 10, so there is no element which is equal to the sum of all elements before it.
You are given an array a such that 1 ≤ a1 ≤ a2 ≤ ⋯ ≤ an ≤ 100. You have to reorder the elements of a in such a way that the resulting array is beautiful. Note that you are not allowed to insert new elements or erase existing ones, you can only change the order of elements of a. You are allowed to keep the array a unchanged, if it is beautiful.
Input
The first line contains one integer t (1 ≤ t ≤ 2000) — the number of test cases. Each test case consists of two lines. The first line contains one integer n (2 ≤ n ≤ 50). The second line contains n integers a1, a2, . . . , an (1 ≤ a1 ≤ a2 ≤ ⋯ ≤ an ≤ 100).
Output
For each test case, print the answer as follows: if it is impossible to reorder the elements of a in such a way that it becomes beautiful, print NO; otherwise, in the first line, print YES. In the second line, print n integers — any beautiful array which can be obtained from a by reordering its elements. If there are multiple such arrays, print any of them.
Example Input:
4
4
3 3 6 6
2
10 10
5
1 2 3 4 5
3
1 4 4
Example Output:
YES
3 6 3 6
NO
YES
2 4 1 5 3
YES
1 4 4

Hint: Here's an accepted solution from an expert programmer. Can you analyze the code and identify the algorithmic approach used? Please also describe the steps of the algorithm in natural language, explaining what every named variable means, what each step does and how it works. If there is anything non-obvious in the step, explain why it is done. Please notice that in the algorithm, some conclusions might be first deduced from conditions and directly used.

t = int(input())
for _ in range(t):
    __ = input()
    a = list(map(int, input().split()))
    if a[0] == a[-1]:
        print("NO")
    else:
        print("YES")
        print(a[-1], end=" ")
        print(*a[:-1])

Analyze the problem and the accepted code of the correct solution. Avoid ambiguity in your description as possible. Each part should be independently interpretable, i.e., each point should be clear to readers who have no access to other points.
Answer in the following Format:
1). Brief Problem Summary:
2). Used Algorithm:
3). Step-by-step Solution Description:
4). Explanation of the Solution:
5). Solution in one sentence:
6). Time Complexity:
7). Proof of correctness (Why this is correct):

Table 8: The lemon-picking example. Likert scores are provided for points (2), (4), and (7); all other points are scored 2 for both models. Unlike GPT-3.5, GPT-4 notices that the input array is already sorted in non-decreasing order in the problem statement, but it still fails to explain why moving the largest element to the first position makes the array beautiful, ignoring the crucial condition that all elements are strictly positive. (see text in bold - blue: correct, red: incorrect)
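The condition the caption refers to can be checked mechanically: in the YES case the input is non-decreasing, strictly positive, and not all-equal, so placing the largest element first makes the first prefix sum strictly larger than the (smaller) second element, and every later prefix sum exceeds the maximum itself, so no element can match the sum before it. A small self-contained checker (our code, using the samples from the problem statement) makes this concrete.

def is_beautiful(arr):
    """No element may equal the sum of all elements before it."""
    prefix = 0
    for x in arr:
        if x == prefix:
            return False
        prefix += x
    return True

def reorder(a):
    """The accepted solution's strategy: if all elements are equal the answer
    is NO; otherwise output the largest element first, then the rest in the
    original non-decreasing order (a is given sorted)."""
    if a[0] == a[-1]:
        return None  # every order starts x, x, ... and the second element equals the prefix sum x
    return [a[-1]] + a[:-1]

# Examples from the problem statement:
assert not is_beautiful([6, 3, 9, 6])    # 9 == 6 + 3
assert not is_beautiful([5, 5, 7])       # the second 5 == 5
assert is_beautiful([8, 4, 10, 14])

# Sample test cases: the reordering is beautiful whenever it exists.
assert reorder([10, 10]) is None
for a in ([3, 3, 6, 6], [1, 2, 3, 4, 5], [1, 4, 4]):
    assert is_beautiful(reorder(a))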