When Do Program-of-Thought Works for Reasoning?

Zhen Bi♠♢, Ningyu Zhang♠♢*, Yinuo Jiang♠♢, Shumin Deng♣,
Guozhou Zheng♠♢♡, Huajun Chen♠♢♡*

♠ Zhejiang University  ♢ Zhejiang University - Ant Group Joint Laboratory of Knowledge Graph
♡ Donghai Laboratory  ♣ NUS-NCS Joint Lab, National University of Singapore
{bizhen zju, zhangningyu, 3200100732, guozhou, huajunsir}@zju.edu.cn, [email protected]

arXiv:2308.15452v6 [cs.CL] 18 Dec 2023

Abstract

In the realm of embodied artificial intelligence, the reasoning capabilities of Large Language Models (LLMs) play a pivotal role. Although there are effective methods, such as program-of-thought prompting for LLMs, which use programming language to tackle complex reasoning tasks, the specific impact of code data on the improvement of reasoning capabilities remains under-explored. To address this gap, we propose the complexity-impacted reasoning score (CIRS), which combines structural and logical attributes, to measure the correlation between code and reasoning abilities. Specifically, we use the abstract syntax tree to encode structural information and calculate logical complexity by considering the difficulty and the cyclomatic complexity. Through an empirical analysis, we find that not all code data of varying complexity can be learned or understood by LLMs. An optimal level of complexity is critical to the improvement of reasoning abilities by program-aided prompting. We then design an auto-synthesizing and stratifying algorithm and apply it to instruction generation for mathematical reasoning and code data filtering for code generation tasks. Extensive results demonstrate the effectiveness of our proposed approach. Code will be integrated into the EasyInstruct framework¹.

[Figure 1: an example math word problem ("A restaurant has 3 chefs. Chef A worked for 8 hours, Chef B worked for 6.5 hours, and Chef C worked for 9.25 hours. How many minutes did the chefs work in total?") solved with a Python code rationale, annotated with its logic and structure (IF-or-ELSE branches and operators) and the resulting Complexity-Impacted Reasoning Score, asking "What's the crucial factor for reasoning?"]

Figure 1: We leverage code structure to analyze what kind of data is crucial for the reasoning abilities of LLMs.
1 Introduction

Large language models (LLMs) (OpenAI 2023; Anil et al. 2023) have emerged as a general-purpose problem-solving methodology for embodied artificial intelligence. In the realm of embodied AI, the reasoning capabilities of LLMs play a pivotal role, especially when agents need to comprehend the semantic intricacies of their environment for effective control (Chen et al. 2022b; Huang et al. 2022, 2023; Wang et al. 2023a). Recent approaches (Chen et al. 2022a; Gao et al. 2022; Cheng et al. 2023), which we term program-of-thought, leverage programming language as a superior prompting mechanism for complex reasoning tasks. In contrast to chain-of-thought prompting (Wei et al. 2022), program-of-thought prompting disentangles a problem into executable code segments and addresses it step by step. However, the correlation between programming language utilization and the improvement in reasoning ability for LLMs is under-studied. The essential question still remains: when does program-of-thought prompting work for reasoning?²

* Corresponding Author.
¹ https://github.com/zjunlp/EasyInstruct
² In this work, we use mathematical reasoning tasks for verification, which is a typical problem for complex reasoning tasks.

In this work, we propose the Complexity-Impacted Reasoning Score (CIRS), a comprehensive metric for the relationship between code reasoning steps and their impact on LLMs' reasoning capacities. We postulate that programming languages hold distinct advantages because of (1) their superior modeling of intricate structures compared to serialized natural language, and (2) their inherent procedure-oriented logic, which assists in addressing multi-step reasoning problems. We posit that our metric should evaluate code complexity from both structural and logical perspectives.

Specifically, we use the abstract syntax tree (AST) to calculate the structural complexity of code reasoning steps (rationales). To retain all the structural information in the AST, which is represented as a tree, our approach leverages three AST indicators (node count, node type, depth), which provide a comprehensive understanding of code structures. Meanwhile, inspired by Halstead (Halstead 1977) and McCabe
(McCabe 1976)'s theory, we design a method to calculate logical complexity by integrating code difficulty and cyclomatic complexity. Thus, the operators, operands and control flow of the code can be taken into account, and we can explicitly compute the logical complexity inherent in the code.

Through an empirical analysis with our proposed CIRS, we find that not all code data of varying complexity can be learned and understood by LLMs, and that current LLMs have a limited understanding of symbolic knowledge like code. Code blocks with low complexity contain insufficient knowledge, while those with high complexity could be too difficult for LLMs to learn. Consequently, only code data with an optimal level of complexity (structure & logic), neither too simple nor too intricate, contributes to the effective enhancement of LLMs' reasoning abilities.

Then, we propose an auto-synthesizing and stratifying algorithm that can automatically generate and filter out the data with the most effective reasoning ability. We apply our algorithm to two scenarios: (1) guiding instruction generation for mathematical reasoning tasks, and (2) filtering code data for code generation tasks. Compared to baseline models, our proposed method achieves favorable results in mathematical reasoning and shows effectiveness on code generation tasks.

In this paper, our contributions are as follows:
• We propose a novel method to measure reasoning complexity for code data, termed CIRS. Our approach, which evaluates code data from both structural and logical perspectives, can accurately gauge the correlation between code complexity and its reasoning ability.
• We empirically analyze the impact of varying complexities, identifying an optimal level of code, which is learnable for LLMs, as the pivotal factor in the reasoning abilities of program-of-thought prompting.
• We design an auto-synthesizing and stratifying algorithm and apply our approach to both instruction generation for mathematical reasoning and code data filtering for code generation tasks. Extensive results demonstrate the validity of our proposed perspective.

2 Background

Code large language models have demonstrated remarkable capabilities in various tasks such as commonsense reasoning (Madaan et al. 2022), information extraction (Wang, Li, and Ji 2022), mathematical reasoning (Imani, Du, and Shrivastava 2023), robotics manipulation (Huang et al. 2023) and embodied learning agents (Wang et al. 2023a). Generally, code LLMs with larger model parameters are more effective than vanilla LLMs for reasoning. We find that even though Codex (Chen et al. 2021) and GPT-3.5 (Brown et al. 2020) have the same number of parameters, Codex, which is pre-trained on a code corpus, performs better than GPT-3 on problems such as arithmetic reasoning and structured prediction tasks. Intriguingly, training on code data not only enables code understanding but may also foster reasoning ability.

Inspired by Chen et al. (2022a); Gao et al. (2022), we formalize multi-step reasoning tasks using code-format chains of thought. For program-of-thought prompting, given the input for the reasoning problem Q, we aim to maximize the likelihood of the answer A as p(A|Q):

p(A|Q) = p(A|Q, R_c) · p(R_c|Q)   (1)

where R_c is the code solution that will be generated. We enhance the effectiveness of solving multi-step reasoning problems by using code prompts as intermediate steps.
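For illustration only, the following sketch builds a 3-shot program-of-thought prompt in the spirit of this formalization: the model is asked to produce a Python rationale R_c for a new question Q, whose execution then yields the answer A. The demonstration problems and wording are ours, not the exact prompt used in the paper.

```python
# Illustrative 3-shot program-of-thought prompt construction.
# The demonstration (question, code-rationale) pairs below are toy examples.
DEMONSTRATIONS = [
    ("A worker earns 15 dollars per hour and works 8 hours. How much does he earn?",
     "wage = 15 * 8\nprint(wage)"),
    ("A car travels 60 km per hour for 2.5 hours. How far does it travel?",
     "distance = 60 * 2.5\nprint(distance)"),
    ("A box holds 12 eggs. How many eggs are in 7 boxes?",
     "eggs = 12 * 7\nprint(eggs)"),
]

def build_pot_prompt(question: str) -> str:
    """Concatenate in-context (question, code-rationale) pairs and the new question."""
    blocks = [f"Question: {q}\nSolution (Python):\n{code}" for q, code in DEMONSTRATIONS]
    blocks.append(f"Question: {question}\nSolution (Python):")
    return "\n\n".join(blocks)

print(build_pot_prompt("A restaurant has 3 chefs who worked 8, 6.5, and 9.25 hours. "
                       "How many minutes did they work in total?"))
```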
3 Complexity-Impacted Reasoning Score

To measure the reasoning ability of the code rationale R_c, we define the complexity-impacted reasoning score as the product of the structural complexity Score_SC and the logical complexity Score_LC:

Score(R_c) = Score_SC(R_c) × Score_LC(R_c)   (2)

Structural Complexity To calculate the structural complexity, we measure the structural complexity of the Abstract Syntax Tree (AST). We design a simple yet effective method by selecting three indicators that can provide a comprehensive understanding of structural information. Therefore, we define Score_SC as follows:

Score_SC(R_c) = Sigmoid(f(x_Node, x_Type, x_Depth))   (3)

where x_Node, x_Type and x_Depth are the features of node count, node types and tree depth in the AST of the code rationale R_c. We first use the function f to apply Z-score normalization to the accumulated data of each feature, and then we aggregate the overall information by mean pooling. Next, we apply the Sigmoid function to transform the result into the range of 0 to 1. The benefit of doing this is to preserve the distribution characteristics of each feature and to avoid being influenced by extreme statistical values, whether exceptionally large or small (a minimal sketch of this computation follows the indicator list below). The detailed explanations of the three indicators are as follows:

• Node Count. The number of nodes reflects the size of the code. Generally, more nodes indicate higher complexity. But node count alone cannot comprehensively measure code complexity, because a large piece of code with a simple structure might be easier to understand than a smaller piece with a complex structure.
• Node Types. Node types help identify the structural elements present in the code, such as conditional statements, loops, and function calls. Different node types play different roles in the code and contribute differently to its complexity. Therefore, tracking the quantity of various node types can enhance our understanding of the structural complexity of the code.
• Tree Depth. The depth of the AST reflects the level of nesting in the code. A greater tree depth may imply more complex control flow and logic, making the code harder to understand. Note that depth alone is not a sufficient measurement criterion either: a shallow tree with multiple simple branches might be easier to comprehend than a deep tree with a few complex branches.
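To make the structural measure concrete, below is a minimal Python sketch of Equation (3) under stated assumptions: x_Type is taken to be the number of distinct node types, the Z-score statistics are estimated from the synthesized corpus, and the three normalized features are mean-pooled and passed through a sigmoid. The helper names (ast_features, structural_score) are ours, not the released implementation.

```python
import ast
import math
from statistics import mean, pstdev

def ast_features(code: str):
    """Return (node_count, distinct_node_types, tree_depth) for a code rationale."""
    tree = ast.parse(code)
    nodes = list(ast.walk(tree))
    node_count = len(nodes)
    node_types = len({type(n).__name__ for n in nodes})

    def depth(node):
        children = list(ast.iter_child_nodes(node))
        return 1 + max((depth(c) for c in children), default=0)

    return node_count, node_types, depth(tree)

def structural_score(code: str, corpus_stats):
    """Score_SC = sigmoid(mean of z-scored AST features), as in Eq. (3)."""
    feats = ast_features(code)
    zs = [(x - mu) / (sigma or 1.0) for x, (mu, sigma) in zip(feats, corpus_stats)]
    pooled = sum(zs) / len(zs)               # mean pooling over the three features
    return 1.0 / (1.0 + math.exp(-pooled))   # sigmoid maps the result into (0, 1)

# Usage: estimate per-feature (mean, std) over the code corpus first.
corpus = ["x = 1\nprint(x)", "for i in range(3):\n    if i % 2:\n        print(i)"]
cols = list(zip(*[ast_features(c) for c in corpus]))
stats = [(mean(col), pstdev(col)) for col in cols]
print(structural_score("total = (8 + 6.5 + 9.25) * 60\nprint(total)", stats))
```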
[Figure 2: overview of the pipeline: templates filled with seed problems (AQuA, GSM8K, MultiArith, ASDiv, SVAMP) are used to synthesize new question/code-solution pairs; CIRS (Score = Score_SC · Score_LC) measures their complexity distribution for auto-stratification; the filtered mid-range-complexity data is used for CIRS-guided instruction generation (training LLaMA, evaluated in-distribution on ASDiv, GSM8K, MultiArith, SVAMP and AQuA and out-of-distribution on MATH and BIG-Bench Hard) and for CIRS-based code filtering on Code Alpaca.]
Figure 2: We utilize complexity-impacted reasoning score (CIRS) to measure the complexity of code reasoning steps. We first
synthesize data and employ CIRS to analyze the complexity distribution of the code reasoning data. Then, we analyze and split
the data into three different subsets. Next, we validate the performance on different model parameters. Finally, we leverage the
auto-synthesizing and stratifying algorithm and evaluate its performance on the filtered data with the most effective complexity.

Logical Complexity We define the code logical complexity Score_LC by integrating the difficulty D and the cyclomatic complexity V, inspired by Halstead Complexity Metrics (Halstead 1977) and McCabe's Cyclomatic Complexity (McCabe 1976):

Score_LC(R_c) = Sigmoid(D(R_c) × V(R_c))   (4)

where the difficulty D(R_c) denotes the difficulty of solving the problem and V(R_c) denotes the cyclomatic complexity of the rationale R_c. To represent the effort required to comprehend the program, the difficulty D(R_c) is defined as:

D(R_c) = (n_1 / 2) · (N_2 / n_2)   (5)

where n_1 denotes the number of distinct operators, N_2 denotes the total number of operands, and n_2 denotes the number of distinct operands in the code rationale R_c. In this formula, the term (n_1 / 2) represents the average complexity of the operators, while the term (N_2 / n_2) represents the average complexity of the operands.

To account for the complexity of logical loops (code control flow), we define the cyclomatic complexity V(R_c) as:

V(R_c) = E − N + 2   (6)

where E denotes the number of edges and N the number of nodes in the control flow graph of the code. We employ the Sigmoid function to constrain the values of the code logical complexity. There is a significant correlation between potential program errors and high cyclomatic complexity. We note that high cyclomatic complexity indicates that the program has complex judgement logic, potentially leading to lower quality; code with high cyclomatic complexity can be difficult to test and maintain. Generally, by integrating the difficulty and the cyclomatic complexity, the complexity of the operators, operands, and control flow of the code can all be taken into account. Next, we conduct an experimental analysis to empirically study the rationality of our method.
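As a concrete reference, the following is a simplified Python sketch of the quantities in Equations (4)-(6). The operator/operand split is an approximation of the Halstead definitions, and cyclomatic complexity is estimated as one plus the number of decision points, which equals E − N + 2 for a single-exit control-flow graph; none of this is the authors' released implementation.

```python
import ast
import math

_DECISION_NODES = (ast.If, ast.For, ast.While, ast.IfExp, ast.BoolOp, ast.ExceptHandler)

def halstead_difficulty(code: str) -> float:
    """D = (n1 / 2) * (N2 / n2) over a rough operator/operand split."""
    operators, operands = set(), []
    for node in ast.walk(ast.parse(code)):
        if isinstance(node, (ast.operator, ast.boolop, ast.cmpop, ast.unaryop)):
            operators.add(type(node).__name__)      # distinct operators: n1
        elif isinstance(node, ast.Name):
            operands.append(node.id)                # operand occurrences: N2
        elif isinstance(node, ast.Constant):
            operands.append(repr(node.value))
    n1, N2, n2 = len(operators), len(operands), len(set(operands)) or 1
    return (n1 / 2) * (N2 / n2)

def cyclomatic_complexity(code: str) -> int:
    """V approximated as 1 + number of branching constructs in the AST."""
    return 1 + sum(isinstance(n, _DECISION_NODES) for n in ast.walk(ast.parse(code)))

def logical_score(code: str) -> float:
    """Score_LC = sigmoid(D * V), as in Eq. (4)."""
    z = halstead_difficulty(code) * cyclomatic_complexity(code)
    return 1.0 / (1.0 + math.exp(-z))

print(logical_score("total = (8 + 6.5 + 9.25) * 60\nprint(total)"))
```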
4 Experimental settings

In order to conduct an unbiased evaluation of all model performances, we use zero-shot and few-shot settings for evaluation. In the zero-shot setting, we directly present mathematical problems to the model for solution generation, without any demonstrations in the input. In the few-shot setting, we choose 3-shot evaluation, where we select three in-context examples with rationales. Our criterion for evaluation is that an answer is considered ultimately correct only if the code executor's answer is correct.

In Section 5, we conduct an empirical analysis of the variations across different model sizes and complexities in the zero-shot setting. We construct our own test dataset because there are no publicly available benchmarks up until now. Model evaluation is performed on ASDiv (Miao, Liang, and Su 2020), GSM8K (Cobbe et al. 2021), MultiArith (Roy and Roth 2015), and SVAMP (Patel, Bhattamishra, and Goyal 2021), with 500 instances randomly chosen from each original test set to form the new test sets. We chose gpt-3.5-turbo as the main benchmark model and accuracy (Acc) as our evaluation metric.
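To illustrate the evaluation criterion stated above (a prediction counts as correct only if executing its code yields the gold answer), a minimal sketch is given below; the exec-based runner and the numeric tolerance are our own assumptions rather than the authors' evaluation code.

```python
import io
import contextlib

def execute_rationale(code: str) -> str:
    """Run a generated Python rationale and capture what it prints."""
    buf = io.StringIO()
    try:
        with contextlib.redirect_stdout(buf):
            exec(code, {})           # isolated globals; no sandboxing in this sketch
    except Exception:
        return ""                    # programs that fail to execute count as wrong
    return buf.getvalue().strip()

def is_correct(code: str, gold: float, tol: float = 1e-4) -> bool:
    out = execute_rationale(code)
    try:
        return abs(float(out) - gold) < tol
    except ValueError:
        return False

print(is_correct("print((8 + 6.5 + 9.25) * 60)", 1425.0))  # True
```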
In Section 6, we train the model based on LLaMA-7B (Version 1.0) (Touvron et al. 2023). Vicuna (Chiang et al. 2023) and Falcon (Almazrouei et al. 2023) are selected as the main comparison models, and accuracy (Acc) is again chosen as the evaluation metric. Apart from the datasets used in the in-distribution setting, the model's performance is also evaluated on MATH (Hendrycks et al. 2021) and BigBench-Hard (Suzgun et al. 2022) in the out-of-distribution setting. It should be noted that we only choose level-1 problems in MATH, and we utilize the algorithmic and multi-step arithmetic reasoning tasks in BIG-Bench Hard. The detailed experimental setup is shown in the supplementary material.

5 Empirical Analysis

In this section, we empirically analyze the impact of different forms of code data. Specifically, we synthesize a totally new dataset and manually partition it using our CIRS in Section 5.1. In Section 5.2, we discuss the impact of code data with different complexities on the reasoning abilities of LLMs. Then we analyze the characteristics of code data with varying complexities in Section 5.3. Finally, we conduct further ablation analyses in Sections 5.4 and 5.5.

5.1 Data synthesizing

Seed source    Seed size   Data size
AQuA           97,467      10,160
GSM8K          7,644       12,812
MultiArith     600         12,163
ASDiv          2,306       13,554
SVAMP          3,954       12,441
ALL                        61,130

Table 1: Statistics of seeds and the generated data size.

To fairly explore the impact of variations in complexity scores, it is necessary to avoid errors caused by the dataset itself and to generate entirely new forms of code data. The sources of seed data include the training sets of GSM8K (Cobbe et al. 2021), MultiArith (Roy and Roth 2015), ASDiv (Miao, Liang, and Su 2020), SVAMP (Patel, Bhattamishra, and Goyal 2021) and AQuA (Ling et al. 2017). As shown in Table 1, we have synthesized over 60,000 samples from five seed datasets. For each dataset, we generate approximately 10,000 samples. We choose as many datasets as possible to ensure the diversity of mathematical problems.

Then, we design a pipeline that can automatically generate a high-quality code corpus by leveraging ChatGPT. As shown in Figure 2, we apply a template to define the format and then let the API continuously rewrite new questions and their corresponding code-format solutions. In the construction of templates, we randomly select three problems from the seed datasets each time. Next, we automatically filter out the generations that do not conform to Python syntax standards, which results in a collection of high-quality mathematical problems. For all generated data, we randomly sampled 10% and verified its correctness by manual checks and automated validation with GPT-4, ensuring accuracy within a reasonable margin of error.
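To make the pipeline above concrete, here is a minimal sketch of the template-filling and syntax-filtering steps; the template wording follows Figure 2, while `call_chat_api` and the shape of the returned record are placeholders rather than the authors' actual generation code.

```python
import ast
import random

TEMPLATE = (
    "Below are some mathematical problems. Can you rewrite new problems that "
    "are similar to them? Then you should write python code to solve the new "
    "problem.\n\n{examples}\n\nNew Question:\nNew Solution:"
)

def build_prompt(seed_problems):
    """Fill the template with three randomly chosen seed problems."""
    examples = "\n\n".join(random.sample(seed_problems, 3))
    return TEMPLATE.format(examples=examples)

def is_valid_python(solution: str) -> bool:
    """Keep only generations whose code parses as valid Python."""
    try:
        ast.parse(solution)
        return True
    except SyntaxError:
        return False

def synthesize(seed_problems, call_chat_api, n_samples):
    # call_chat_api is assumed to return {"question": ..., "code": ...}.
    corpus = []
    for _ in range(n_samples):
        generation = call_chat_api(build_prompt(seed_problems))
        if is_valid_python(generation["code"]):
            corpus.append(generation)
    return corpus
```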
After obtaining well-generated code data, we utilize CIRS (Section 3) and manually split the data into different subsets based on the analysis of the code complexity distribution. We put the visualized results in the supplement. Based on the complexity scores, we name the partitioned subsets low (lower-score samples), medium (medium-score samples) and high (high-score samples).

5.2 Impacts of different complexity scores

To compare the impact of different code complexities on the reasoning capability of LLMs, we train three models based on LLaMA (Version 1.0) from 7 billion to 65 billion parameters. We randomly select 1,700 instances from each subset (low, medium, high) to build the training and validation datasets for fair comparisons. Results are shown in Figure 3.

(1) An optimal level of code is crucial to the reasoning abilities of program-of-thought prompting. From the results across the four datasets, we note that the model performs optimally when the complexity of the code data is in the mid-range. This suggests that learnable symbolic language is crucial to the reasoning abilities of program-aided prompting. The reasoning behind this is that data of overly simplistic complexity is too simple for LLMs, leading to less noticeable effects. Conversely, when the complexity escalates significantly, the logical semantics and nested structures become difficult to comprehend or learn, which can adversely impact the reasoning capabilities of LLMs.

(2) The larger the number of parameters, the more significant the gain in the LLM's reasoning capabilities. It is evident that as the model size increases from 7 billion to 65 billion parameters, the effectiveness of its reasoning capability improves. In fact, after fine-tuning, most 65-billion-parameter models can achieve results comparable to those of gpt-3.5-turbo. This suggests that having a sufficient number of parameters is crucial for substantial reasoning capabilities in language models. Furthermore, when the language model is large enough, the difference in results across complexities is minimal. This indicates that LLMs with vast parameters are more receptive to symbolic data and inherently have the potential to yield strong reasoning capabilities.

(3) Current LLMs have limitations in their understanding capabilities for reasoning. We observe that when data complexity is extremely high, the performance of LLMs tends to decrease. This reflects an inherent limit to the reasoning capabilities of large language models. We argue that: (1) The current architecture of LLMs (such as decoder-only LLMs) has a limited ability to understand complex knowledge, which also restricts the emergence of their reasoning capabilities. The prerequisite for large models to demonstrate powerful reasoning abilities is their ability to comprehend the structures and logical knowledge embedded in complex data. Therefore, it is necessary to explore model structures with stronger reasoning abilities in future research. (2) Further enhancement in reasoning power requires reliance on external tools. The scope of reasoning problems is quite broad, covering not only mathematical reasoning but also commonsense or more complex logical reasoning tasks. Therefore, relying solely on the LLM itself is not enough to resolve all issues at once; the assistance of more powerful external tools is required.
[Figure 3: grouped bar charts of accuracy for the low, medium, and high complexity subsets at 7, 13, 30, and 65 billion parameters on ASDiv, GSM8K, MultiArith, and SVAMP, with gpt-3.5-turbo shown as a reference.]

Figure 3: Evaluation performance on the GSM8K, MultiArith, ASDiv and SVAMP datasets. We train three models (low, medium, high) whose datasets contain the same number of samples for fair comparison. We use Accuracy (%) as the evaluation metric.

[Figure 4: three example problems with code rationales of increasing CIRS score, labeled "Textual, fewer programming", "Simple, direct programming", and "Complex programming".]

Figure 4: As the CIRS score increases, there is a greater presence of logical and structural information in the code.

5.3 The characteristics of different CIRS scores

In Figure 4, we investigate the characteristics of different CIRS scores. The different subsets of CIRS scores exhibit distinct structural and logical differences. Inspired by (Haladyna 1997; Conklin 2005) and AoPS³, we also find that the results at different complexity scores correspond to the cognitive level of difficulty of reasoning problems.

• Textual, minimal programming. Samples with lower CIRS scores contain little structural information. Although they do contain some intermediary reasoning processes, these are primarily represented in flat textual descriptions. These samples typically correspond to simpler problems with insufficient structure and logic.
• Simple but direct programming. As the CIRS score of the code reasoning steps increases, the presence of programming language with simple logical semantics and structures also escalates. These samples typically involve simple and straightforward logical operations.
• Complex programming. Samples with exceedingly high scores contain substantial amounts of structural function definitions or reasoning processes, which suggests the presence of numerous complex conditional statements and function structures. These samples are typically highly challenging mathematical problems.

³ https://artofproblemsolving.com/
[Figure 5: for each trained model (CIRS-low, CIRS-medium, CIRS-high), the share of its predictions whose reasoning steps fall into the low, medium, high, or invalid CIRS categories, together with the accuracy within each category.]

Figure 5: Ablation analysis for different code complexities. We use CIRS to measure the predictions of each model and divide them into four categories (low, medium, high and invalid). The overbraces mark the percentage of output predictions and the arrows denote the prediction result (accuracy %) of each category. The results show that the effectiveness of complexity data is not due to the frequency of data occurrence.

[Figure 6: accuracy of training with code rationales (Rationale-Code) versus textual rationales (Rationale-Textual) on ASDiv, GSM8K, MultiArith and SVAMP.]

Figure 6: Comparison of textual and code rationales. We use Accuracy (%) as the evaluation metric. Training with code data demonstrates a clear advantage on all datasets.

5.4 Excluding the effect of the complexity distribution itself

To negate the potential skew from the data distribution itself, such as enhanced performance on mid-range data due to its higher frequency of occurrence, we conduct a more in-depth analysis of the evaluation results at different complexity scores. We use the 7B models trained in Section 5.2 and conduct tests on 2,000 samples with the three models (CIRS-low, CIRS-medium, CIRS-high). It should be noted that we use CIRS to measure the output reasoning steps of each model and divide them into four categories (low, medium, high and invalid). From the results in Figure 5, we find that CIRS-medium generates the highest number of valid predicted outputs in all three distributions (17.8%, 61.1%, 9.3%). We also observe that CIRS-medium demonstrates high accuracy (53.4, 46.1, 47.3) in all three distributions. The accuracy of the predictions in each distribution is independent of the quantity of training data. Therefore, we can conclude that the effectiveness of complexity data is not due to the frequency of data occurrence.

5.5 Ablation analysis for textual rationales

To verify the effect of code versus textual rationales, we substitute the code-format solving process with textual rationales using the same datasets. We sample 1,700 instances of code data within the mid-range complexity and simultaneously construct a dataset that uses textual rationales. We train both models based on LLaMA-7B. As shown in Figure 6, the code dataset demonstrates a clear advantage on all four datasets. This is because code inherently encapsulates logical semantics and structural information. Another reason is that code can be executed by external interpreters, so solutions with code are superior to flattened textual information.

6 CIRS for Improving the Reasoning Ability

In this section, we describe our auto-synthesizing and stratifying algorithm in Section 6.1. Then we apply CIRS to instruction generation for mathematical reasoning and to code data filtering for code generation tasks in Sections 6.2 and 6.3.

6.1 Auto-Synthesizing and Stratifying

Based on the processing steps in Section 5, we formalize the whole procedure into a pipeline method for automatic data generation and stratification. The auto-synthesizing and stratifying algorithm is described in Algorithm 1 (a rough Python sketch follows the algorithm). We first do the template T filling by calling APIs and obtain the synthesized dataset D. Then we calculate the distribution of complexity for all synthesized data by CIRS and obtain the threshold set J. Next, we design a threshold-based k-means clustering method that automatically partitions the dataset according to complexity characteristics. Finally, we apply our proposed algorithm in two scenarios to enhance the reasoning abilities of LLMs.

Algorithm 1: Auto-Synthesizing and Stratifying
Require: T: Template, K: Number of clusters, J: Threshold set
Ensure: C: Cluster assignments
1: Dataset D ← template T filling by leveraging API
2: Threshold J ← threshold set generated by CIRS
3: Initialize C with random initial cluster assignments
4: repeat
5:   Clear all clusters
6:   for each data point x in D do
7:     Find the nearest centroid ci in C to x
8:     Assign x to cluster ci
9:   end for
10:  for each cluster ci in C do
11:    Recalculate centroid ci as the mean of all points assigned to ci
12:  end for
13:  Remove clusters from C if the average distance to their centroid is not in J
14: until no more updates or maximum iterations reached
15: return C
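As a rough illustration of Algorithm 1, the sketch below runs a 1-D k-means over CIRS scores and keeps only the clusters whose average distance to their centroid falls inside the threshold set J (step 13). The interval used for J here is illustrative; the paper derives it from the empirical CIRS distribution.

```python
import random

def kmeans_1d(scores, k, iters=100):
    """Plain 1-D k-means over CIRS scores."""
    centroids = random.sample(scores, k)
    assignments = [0] * len(scores)
    for _ in range(iters):
        assignments = [min(range(k), key=lambda i: abs(s - centroids[i]))
                       for s in scores]
        new_centroids = []
        for i in range(k):
            members = [s for s, a in zip(scores, assignments) if a == i]
            new_centroids.append(sum(members) / len(members) if members else centroids[i])
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, assignments

def stratify(samples, cirs_scores, k=3, j=(0.0, 0.15)):
    """Keep clusters whose average distance to their centroid lies in J
    (step 13 of Algorithm 1); the interval J here is illustrative."""
    scores = list(cirs_scores)
    centroids, assignments = kmeans_1d(scores, k)
    kept = []
    for i in range(k):
        members = [(x, s) for x, s, a in zip(samples, scores, assignments) if a == i]
        if not members:
            continue
        avg_dist = sum(abs(s - centroids[i]) for _, s in members) / len(members)
        if j[0] <= avg_dist <= j[1]:
            kept.extend(x for x, _ in members)
    return kept
```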
6.2 Usage 1: CIRS-guided Instruction Generation

From the analysis in Section 5, we know that the model trained on code data of optimal complexity exhibits the best reasoning capabilities. Therefore, we employ Algorithm 1 to filter more data from the source dataset, specifically targeting the mid-range complexity, to train an enhanced reasoning model. In total, we collect 40,000 data samples to train a more powerful language model for reasoning. Results are shown in Table 2. In the in-distribution setting, we find that the trained model outperforms Vicuna and Falcon. To eliminate the influence of the data distribution, we directly test the model's performance in the out-of-distribution setting.
Mathematical Reasoning

                              In-Distribution                         Out-of-Distribution
Models          Parameters    ASDiv   GSM8K   MultiArith   SVAMP      MATH    BigBench-Hard†

Zero-shot, Answer-only Prompting
Falcon*         7B            14.7    3.6     6.0          5.6        4.8     19.0
Vicuna          7B            35.8    8.6     16.4         33.0       14.5    25.3
gpt-3.5-turbo   /             74.4    65.8    84.8         71.0       70.1    37.3
CIRS (LLaMA)    7B            69.2    40.4    97.2         70.2       38.6    37.7

Few-shot, Chain-of-thought Prompting
Falcon          7B            7.9     3.0     5.4          4.6        2.9     23.2
Vicuna          7B            34.9    9.1     17.2         32.0       17.2    35.3
gpt-3.5-turbo   /             80.6    61.4    44.8         71.6       68.5    50.1
CIRS (LLaMA)    7B            65.4    37.6    96.0         69.4       39.2    36.3

Table 2: Results of mathematical reasoning tasks. † We choose algorithmic and multi-step arithmetic reasoning tasks in BIG-Bench Hard. * Here we use Falcon-Instruct, which is fine-tuned on instruction datasets.

Our model performs best (among models with the same number of parameters) in both zero-shot and few-shot prompting. It is worth noting that our approach demonstrates comparable effectiveness to ChatGPT on BigBench-Hard in the zero-shot setting. On the MATH dataset, we notice that our model still outperforms the baseline models, but it is much worse than ChatGPT, which is due to the limitations of the code data itself.

6.3 Usage 2: CIRS-based Code Filtering

Models               Parameters   Acc.
Alpaca               7B           24.0
Code-LLaMA           7B           50.0
Code (CIRS)-LLaMA    7B           55.0

Table 3: Results of CIRS-based code filtering tasks.

To validate the effectiveness of our approach on code-related tasks, we use Algorithm 1 to filter a batch of code instruction data. We first split Code Alpaca (Chaudhary 2023) into train and test datasets. We leverage the whole train dataset to train LLaMA-7B; the resulting model is Code-LLaMA. For a fair comparison, we filter the train dataset to obtain a subset of higher-quality code instructions and train Code (CIRS)-LLaMA on the filtered data. The results illustrate that Code (CIRS)-LLaMA demonstrates effective performance on pure code generation tasks. We can conclude that optimized structures and logical semantics are most beneficial for LLMs' reasoning abilities.

7 Related Work

Program-aided Prompting Program-of-thoughts prompting (Chen et al. 2022a) delegates computation steps to an external language interpreter, and (Gao et al. 2022) generates programs as the intermediate reasoning steps. (Cheng et al. 2023) is a neural-symbolic framework that maps the task input to a program. Similarly, (Hu et al. 2023) is a neural symbolic prompting method for complex reasoning tasks. Some methods, such as (Wang, Li, and Ji 2022; Li et al. 2023; Bi et al. 2023), leverage code prompting for information extraction tasks. Madaan et al. (2022) frames the task of structured commonsense reasoning as code generation. (Zhu et al. 2023) distills LLMs into specialized, compact models for reasoning tasks by program-aided prompting.

Reasoning with Large Language Models Research on reasoning abilities is a core issue in NLP (Qiao et al. 2023; Huang and Chang 2022; Zhao et al. 2023). The success of LLMs has progressively led to a series of breakthroughs in various tasks or domains (Imani, Du, and Shrivastava 2023; Yang et al. 2022; Zhang et al. 2022; Chen et al. 2023). Some research studies (Gendron et al. 2023; Liu et al. 2023; Varshney et al. 2023; Yuan et al. 2023; Schwartz et al. 2020) focus on analyzing the capabilities of large models themselves. (Wang et al. 2023b) improves LLMs' reasoning abilities by fine-tuning with an alignment paradigm. More and more research efforts (Fu et al. 2023b; Mukherjee et al. 2023) are being devoted to unveiling the origin of a model's reasoning abilities or to enhancing the capability of smaller models. Some works (Wiegreffe, Marasovic, and Smith 2021; Xie et al. 2023) generate rationales to enhance model interpretability. To measure reasoning capabilities, (Fu et al. 2023c) propose a selection scheme based on complexity prompting, and (Fu et al. 2023a) is an open-source evaluation suite that measures LLMs' multi-step reasoning performance. Different from previous work, ours is the first to analyze the reasoning capabilities of large language models from the perspective of code data.

8 Discussion and Conclusion

What kind of data format is crucial for LLMs' reasoning abilities? We explore the reasoning abilities of program-of-thought prompting, and the results indicate that code data with an optimal level of complexity, characterized by certain logical and structural qualities, is the key factor. Code data is efficient because it is inherently semi-structured and abundant in the natural world. We can prove that: (1) The local structural properties of the data are crucial for improving reasoning abilities, which aligns with (Prystawski and Goodman
2023). The logical coherence or a certain amount of knowledge circuitry inherent in the data is necessary. (2) Overly complex structural information and logic are 'too difficult to learn' for LLMs. The experimental results of this work demonstrate that knowledge of optimal complexity is most effective because it is learnable for most large language models. Meanwhile, we also find that as the number of parameters in language models increases, their understanding of complex knowledge also improves.

In this work, we introduce CIRS to measure the relation between code reasoning steps and reasoning abilities. Considering both structural and logical attributes of code data, we use the AST to encode structural information and capture logical complexity through difficulty and cyclomatic complexity. Through an empirical analysis, we find that an optimal level of code complexity plays a crucial role in the reasoning abilities of program-of-thought prompting. We develop an auto-synthesizing and stratifying algorithm that we apply to mathematical reasoning and code generation tasks. Extensive results prove the effectiveness of the proposed method. In the future, we will expand this work to more scenarios such as commonsense or logical reasoning tasks and train powerful reasoning models with low computational cost.

Acknowledgements

We would like to express gratitude to the anonymous reviewers for their kind comments. This work was supported by the National Natural Science Foundation of China (No. 62206246), the Fundamental Research Funds for the Central Universities (226-2023-00138), Zhejiang Provincial Natural Science Foundation of China (No. LGG22F030011), Ningbo Natural Science Foundation (2021J190), Yongjiang Talent Introduction Programme (2021A-156-G), CCF-Baidu Open Fund, and Information Technology Center and State Key Lab of CAD&CG, Zhejiang University, and NUS-NCS Joint Laboratory (A-0008542-00-00).

References

Almazrouei, E.; Alobeidli, H.; Alshamsi, A.; Cappelli, A.; Cojocaru, R.; Debbah, M.; Goffinet, E.; Heslow, D.; Launay, J.; Malartic, Q.; Noune, B.; Pannier, B.; and Penedo, G. 2023. Falcon-40B: an open large language model with state-of-the-art performance.
Anil, R.; Dai, A. M.; Firat, O.; Johnson, M.; Lepikhin, D.; Passos, A.; Shakeri, S.; Taropa, E.; Bailey, P.; Chen, Z.; Chu, E.; Clark, J. H.; Shafey, L. E.; and et al. 2023. PaLM 2 Technical Report. arXiv:2305.10403.
Bi, Z.; Chen, J.; Jiang, Y.; Xiong, F.; Guo, W.; Chen, H.; and Zhang, N. 2023. CodeKGC: Code Language Model for Generative Knowledge Graph Construction. CoRR, abs/2304.09048.
Brown, T. B.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; Agarwal, S.; Herbert-Voss, A.; Krueger, G.; Henighan, T.; Child, R.; Ramesh, A.; Ziegler, D. M.; Wu, J.; Winter, C.; Hesse, C.; Chen, M.; Sigler, E.; Litwin, M.; Gray, S.; Chess, B.; Clark, J.; Berner, C.; McCandlish, S.; Radford, A.; Sutskever, I.; and Amodei, D. 2020. Language Models are Few-Shot Learners. In Larochelle, H.; Ranzato, M.; Hadsell, R.; Balcan, M.; and Lin, H., eds., Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual.
Chaudhary, S. 2023. Code Alpaca: An Instruction-following LLaMA model for code generation. https://github.com/sahil280114/codealpaca.
Chen, M.; Tworek, J.; Jun, H.; Yuan, Q.; de Oliveira Pinto, H. P.; Kaplan, J.; Edwards, H.; Burda, Y.; Joseph, N.; Brockman, G.; Ray, A.; Puri, R.; Krueger, G.; Petrov, M.; Khlaaf, H.; Sastry, G.; Mishkin, P.; Chan, B.; Gray, S.; Ryder, N.; Pavlov, M.; Power, A.; Kaiser, L.; Bavarian, M.; Winter, C.; Tillet, P.; Such, F. P.; Cummings, D.; Plappert, M.; Chantzis, F.; Barnes, E.; Herbert-Voss, A.; Guss, W. H.; Nichol, A.; Paino, A.; Tezak, N.; Tang, J.; Babuschkin, I.; Balaji, S.; Jain, S.; Saunders, W.; Hesse, C.; Carr, A. N.; Leike, J.; Achiam, J.; Misra, V.; Morikawa, E.; Radford, A.; Knight, M.; Brundage, M.; Murati, M.; Mayer, K.; Welinder, P.; McGrew, B.; Amodei, D.; McCandlish, S.; Sutskever, I.; and Zaremba, W. 2021. Evaluating Large Language Models Trained on Code. CoRR, abs/2107.03374.
Chen, W.; Ma, X.; Wang, X.; and Cohen, W. W. 2022a. Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks. CoRR, abs/2211.12588.
Chen, X.; Zhang, N.; Xie, X.; Deng, S.; Yao, Y.; Tan, C.; Huang, F.; Si, L.; and Chen, H. 2022b. KnowPrompt: Knowledge-aware Prompt-tuning with Synergistic Optimization for Relation Extraction. In Laforest, F.; Troncy, R.; Simperl, E.; Agarwal, D.; Gionis, A.; Herman, I.; and Médini, L., eds., WWW '22: The ACM Web Conference 2022, Virtual Event, Lyon, France, April 25 - 29, 2022, 2778–2788. ACM.
Chen, Z.; Zhang, W.; Huang, Y.; Chen, M.; Geng, Y.; Yu, H.; Bi, Z.; Zhang, Y.; Yao, Z.; Song, W.; Wu, X.; Yang, Y.; Chen, M.; Lian, Z.; Li, Y.; Cheng, L.; and Chen, H. 2023. Tele-Knowledge Pre-training for Fault Analysis. arXiv:2210.11298.
Cheng, Z.; Xie, T.; Shi, P.; Li, C.; Nadkarni, R.; Hu, Y.; Xiong, C.; Radev, D.; Ostendorf, M.; Zettlemoyer, L.; Smith, N. A.; and Yu, T. 2023. Binding Language Models in Symbolic Languages. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net.
Chiang, W.-L.; Li, Z.; Lin, Z.; Sheng, Y.; Wu, Z.; Zhang, H.; Zheng, L.; Zhuang, S.; Zhuang, Y.; Gonzalez, J. E.; Stoica, I.; and Xing, E. P. 2023. Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality.
Cobbe, K.; Kosaraju, V.; Bavarian, M.; Hilton, J.; Nakano, R.; Hesse, C.; and Schulman, J. 2021. Training Verifiers to Solve Math Word Problems. CoRR, abs/2110.14168.
Conklin, J. 2005. A taxonomy for learning, teaching, and assessing: A revision of Bloom's taxonomy of educational objectives complete edition.
Fu, Y.; Ou, L.; Chen, M.; Wan, Y.; Peng, H.; and Khot, T. 2023a. Chain-of-Thought Hub: A Continuous Effort to Measure Large Language Models' Reasoning Performance. CoRR, abs/2305.17306.
Fu, Y.; Peng, H.; Ou, L.; Sabharwal, A.; and Khot, T. 2023b. Specializing Smaller Language Models towards Multi-Step Reasoning. CoRR, abs/2301.12726.
Fu, Y.; Peng, H.; Sabharwal, A.; Clark, P.; and Khot, T. 2023c. Complexity-Based Prompting for Multi-step Reasoning. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net.
Gao, L.; Madaan, A.; Zhou, S.; Alon, U.; Liu, P.; Yang, Y.; Callan, J.; and Neubig, G. 2022. PAL: Program-aided Language Models. CoRR, abs/2211.10435.
Gendron, G.; Bao, Q.; Witbrock, M.; and Dobbie, G. 2023. Large Language Models Are Not Abstract Reasoners. CoRR, abs/2305.19555.
Haladyna, T. M. 1997. Writing Test Items to Evaluate Higher Order Thinking. ERIC.
Halstead, M. H. 1977. Elements of Software Science (Operating and programming systems series). Elsevier Science Inc.
Hendrycks, D.; Burns, C.; Kadavath, S.; Arora, A.; Basart, S.; Tang, E.; Song, D.; and Steinhardt, J. 2021. Measuring Mathematical Problem Solving With the MATH Dataset. In Vanschoren, J.; and Yeung, S., eds., Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks 1, NeurIPS Datasets and Benchmarks 2021, December 2021, virtual.
Hu, Y.; Yang, H.; Lin, Z.; and Zhang, M. 2023. Code Prompting: a Neural Symbolic Method for Complex Reasoning in Large Language Models. CoRR, abs/2305.18507.
Huang, J.; and Chang, K. C. 2022. Towards Reasoning in Large Language Models: A Survey. CoRR, abs/2212.10403.
Huang, W.; Wang, C.; Zhang, R.; Li, Y.; Wu, J.; and Fei-Fei, L. 2023. VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models. CoRR, abs/2307.05973.
Huang, W.; Xia, F.; Xiao, T.; Chan, H.; Liang, J.; Florence, P.; Zeng, A.; Tompson, J.; Mordatch, I.; Chebotar, Y.; Sermanet, P.; Jackson, T.; Brown, N.; Luu, L.; Levine, S.; Hausman, K.; and Ichter, B. 2022. Inner Monologue: Embodied Reasoning through Planning with Language Models. In Liu, K.; Kulic, D.; and Ichnowski, J., eds., Conference on Robot Learning, CoRL 2022, 14-18 December 2022, Auckland, New Zealand, volume 205 of Proceedings of Machine Learning Research, 1769–1782. PMLR.
Imani, S.; Du, L.; and Shrivastava, H. 2023. MathPrompter: Mathematical Reasoning using Large Language Models. CoRR, abs/2303.05398.
Li, P.; Sun, T.; Tang, Q.; Yan, H.; Wu, Y.; Huang, X.; and Qiu, X. 2023. CodeIE: Large Code Generation Models are Better Few-Shot Information Extractors. CoRR, abs/2305.05711.
Ling, W.; Yogatama, D.; Dyer, C.; and Blunsom, P. 2017. Program Induction by Rationale Generation: Learning to Solve and Explain Algebraic Word Problems. In Barzilay, R.; and Kan, M., eds., Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, ACL 2017, Vancouver, Canada, July 30 - August 4, Volume 1: Long Papers, 158–167. Association for Computational Linguistics.
Liu, X.; Yin, D.; Zhang, C.; Feng, Y.; and Zhao, D. 2023. The Magic of IF: Investigating Causal Reasoning Abilities in Large Language Models of Code. CoRR, abs/2305.19213.
Madaan, A.; Zhou, S.; Alon, U.; Yang, Y.; and Neubig, G. 2022. Language Models of Code are Few-Shot Commonsense Learners. CoRR, abs/2210.07128.
McCabe, T. J. 1976. A Complexity Measure. IEEE Trans. Software Eng., 2(4): 308–320.
Miao, S.; Liang, C.; and Su, K. 2020. A Diverse Corpus for Evaluating and Developing English Math Word Problem Solvers. In Jurafsky, D.; Chai, J.; Schluter, N.; and Tetreault, J. R., eds., Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020, 975–984. Association for Computational Linguistics.
Mukherjee, S.; Mitra, A.; Jawahar, G.; Agarwal, S.; Palangi, H.; and Awadallah, A. H. 2023. Orca: Progressive Learning from Complex Explanation Traces of GPT-4. CoRR, abs/2306.02707.
OpenAI. 2023. GPT-4 Technical Report. arXiv:2303.08774.
Patel, A.; Bhattamishra, S.; and Goyal, N. 2021. Are NLP Models really able to Solve Simple Math Word Problems? In Toutanova, K.; Rumshisky, A.; Zettlemoyer, L.; Hakkani-Tür, D.; Beltagy, I.; Bethard, S.; Cotterell, R.; Chakraborty, T.; and Zhou, Y., eds., Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2021, Online, June 6-11, 2021, 2080–2094. Association for Computational Linguistics.
Prystawski, B.; and Goodman, N. D. 2023. Why think step-by-step? Reasoning emerges from the locality of experience. CoRR, abs/2304.03843.
Qiao, S.; Ou, Y.; Zhang, N.; Chen, X.; Yao, Y.; Deng, S.; Tan, C.; Huang, F.; and Chen, H. 2023. Reasoning with Language Model Prompting: A Survey. In ACL. The Association for Computational Linguistics.
Roy, S.; and Roth, D. 2015. Solving General Arithmetic Word Problems. In Màrquez, L.; Callison-Burch, C.; Su, J.; Pighin, D.; and Marton, Y., eds., Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, EMNLP 2015, Lisbon, Portugal, September 17-21, 2015, 1743–1752. The Association for Computational Linguistics.
Schwartz, R.; Stanovsky, G.; Swayamdipta, S.; Dodge, J.; and Smith, N. A. 2020. The Right Tool for the Job: Matching Model and Instance Complexities. arXiv:2004.07453.
Suzgun, M.; Scales, N.; Schärli, N.; Gehrmann, S.; Tay, Y.; Chung, H. W.; Chowdhery, A.; Le, Q. V.; Chi, E. H.; Zhou, D.; and Wei, J. 2022. Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them. CoRR, abs/2210.09261.
Touvron, H.; Lavril, T.; Izacard, G.; Martinet, X.; Lachaux,
M.; Lacroix, T.; Rozière, B.; Goyal, N.; Hambro, E.; Azhar,
F.; Rodriguez, A.; Joulin, A.; Grave, E.; and Lample, G.
2023. LLaMA: Open and Efficient Foundation Language
Models. CoRR, abs/2302.13971.
Varshney, N.; Parmar, M.; Patel, N.; Handa, D.; Sarkar, S.;
Luo, M.; and Baral, C. 2023. Can NLP Models Correctly
Reason Over Contexts that Break the Common Assump-
tions? CoRR, abs/2305.12096.
Wang, G.; Xie, Y.; Jiang, Y.; Mandlekar, A.; Xiao, C.; Zhu,
Y.; Fan, L.; and Anandkumar, A. 2023a. Voyager: An
Open-Ended Embodied Agent with Large Language Mod-
els. CoRR, abs/2305.16291.
Wang, P.; Li, L.; Chen, L.; Song, F.; Lin, B.; Cao, Y.; Liu, T.;
and Sui, Z. 2023b. Making Large Language Models Better
Reasoners with Alignment. arXiv:2309.02144.
Wang, X.; Li, S.; and Ji, H. 2022. Code4Struct: Code Gener-
ation for Few-Shot Structured Prediction from Natural Lan-
guage. CoRR, abs/2210.12810.
Wei, J.; Wang, X.; Schuurmans, D.; Bosma, M.; Ichter, B.;
Xia, F.; Chi, E. H.; Le, Q. V.; and Zhou, D. 2022. Chain-
of-Thought Prompting Elicits Reasoning in Large Language
Models. In NeurIPS.
Wiegreffe, S.; Marasovic, A.; and Smith, N. A. 2021. Mea-
suring Association Between Labels and Free-Text Ratio-
nales. In Moens, M.; Huang, X.; Specia, L.; and Yih,
S. W., eds., Proceedings of the 2021 Conference on Em-
pirical Methods in Natural Language Processing, EMNLP
2021, Virtual Event / Punta Cana, Dominican Republic, 7-
11 November, 2021, 10266–10284. Association for Compu-
tational Linguistics.
Xie, Y.; Kawaguchi, K.; Zhao, Y.; Zhao, X.; Kan, M.; He, J.;
and Xie, Q. 2023. Decomposition Enhances Reasoning via
Self-Evaluation Guided Decoding. CoRR, abs/2305.00633.
Yang, Z.; Qin, J.; Chen, J.; Lin, L.; and Liang, X. 2022. Log-
icSolver: Towards Interpretable Math Word Problem Solv-
ing with Logical Prompt-enhanced Learning. In Goldberg,
Y.; Kozareva, Z.; and Zhang, Y., eds., Findings of the Asso-
ciation for Computational Linguistics: EMNLP 2022, Abu
Dhabi, United Arab Emirates, December 7-11, 2022, 1–13.
Association for Computational Linguistics.
Yuan, Z.; Yuan, H.; Li, C.; Dong, G.; Tan, C.; and Zhou, C.
2023. Scaling Relationship on Learning Mathematical Rea-
soning with Large Language Models. arXiv:2308.01825.
Zhang, H.; Zhang, Y.; Li, L. E.; and Xing, E. P. 2022. The
Impact of Symbolic Representations on In-context Learning
for Few-shot Reasoning. CoRR, abs/2212.08686.
Zhao, W. X.; Zhou, K.; Li, J.; Tang, T.; Wang, X.; Hou, Y.;
Min, Y.; Zhang, B.; Zhang, J.; Dong, Z.; Du, Y.; Yang, C.;
Chen, Y.; Chen, Z.; Jiang, J.; Ren, R.; Li, Y.; Tang, X.; Liu,
Z.; Liu, P.; Nie, J.; and Wen, J. 2023. A Survey of Large
Language Models. CoRR, abs/2303.18223.
Zhu, X.; Qi, B.; Zhang, K.; Long, X.; and Zhou, B. 2023. PaD: Program-aided Distillation Specializes Large Models in Reasoning. CoRR, abs/2305.13888.
