Figure 2: We utilize the complexity-impacted reasoning score (CIRS) to measure the complexity of code reasoning steps. We first synthesize data and employ CIRS to analyze the complexity distribution of the code reasoning data. Then, we analyze and split the data into three different subsets. Next, we validate the performance across different model sizes. Finally, we leverage the auto-synthesizing and stratifying algorithm and evaluate its performance on the filtered data with the most effective complexity.
Logical Complexity We define the code logical complexity ScoreLC by integrating the difficulty D and the cyclomatic complexity V, inspired by Halstead Complexity Metrics (Halstead 1977) and McCabe's Cyclomatic Complexity (McCabe 1976):

ScoreLC(Rc) = Sigmoid(D(Rc) × V(Rc))    (4)

where the difficulty D(Rc) denotes the difficulty of solving the problem and V(Rc) denotes the cyclomatic complexity of the rationale Rc. To represent the effort required to comprehend the program, the difficulty D(Rc) is defined as:

D(Rc) = (n1 / 2) · (N2 / n2)    (5)

where n1 denotes the number of distinct operators, N2 denotes the total number of operands, and n2 denotes the number of distinct operands in the code rationale Rc. In this formula, the term (n1/2) represents the average complexity of the operators, while the term (N2/n2) represents the average complexity of the operands.

To account for the complexity of logical loops (code control flow), we define the cyclomatic complexity V(Rc) as:

V(Rc) = E − N + 2    (6)

where E denotes the number of edges and N denotes the number of nodes in the control flow graph of the code. We employ the Sigmoid function to constrain the values of the code logical complexity. There is a significant correlation between potential program errors and high cyclomatic complexity: high cyclomatic complexity indicates that the program contains complex judgement logic, potentially leading to lower quality, and such code can be difficult to test and maintain. Generally, by integrating the difficulty and the cyclomatic complexity, the complexity of the operators, operands, and control flow of the code can all be taken into account.
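To make the definitions above concrete, the following is a minimal sketch of how ScoreLC could be computed for a Python rationale. It is not the authors' released implementation: it approximates the Halstead counts by treating AST operator nodes and calls as operators and names and constants as operands, and it estimates V(Rc) by counting branch points instead of building an explicit control-flow graph.

    import ast
    import math

    def logical_complexity(code: str) -> float:
        """Sketch of ScoreLC = Sigmoid(D * V) for a Python rationale (Eqs. 4-6)."""
        tree = ast.parse(code)
        operators, operands = [], []
        decision_points = 0
        for node in ast.walk(tree):
            # Operators: arithmetic, boolean, comparison and unary operators, plus calls.
            if isinstance(node, (ast.operator, ast.boolop, ast.unaryop, ast.cmpop)):
                operators.append(type(node).__name__)
            elif isinstance(node, ast.Call):
                operators.append("call")
            # Operands: variable names and literal constants.
            elif isinstance(node, ast.Name):
                operands.append(node.id)
            elif isinstance(node, ast.Constant):
                operands.append(repr(node.value))
            # Branch points used to approximate V(Rc) = E - N + 2.
            if isinstance(node, (ast.If, ast.For, ast.While, ast.IfExp,
                                 ast.ExceptHandler, ast.BoolOp)):
                decision_points += 1
        n1 = len(set(operators))            # distinct operators
        N2 = len(operands)                  # total operands
        n2 = max(len(set(operands)), 1)     # distinct operands
        difficulty = (n1 / 2) * (N2 / n2)   # Eq. (5)
        cyclomatic = decision_points + 1    # proxy for Eq. (6)
        return 1 / (1 + math.exp(-difficulty * cyclomatic))  # Eq. (4)

Under this sketch the product D(Rc) × V(Rc) is non-negative, so the score lies in [0.5, 1), with richer operator usage and deeper branching pushing it toward 1.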
Next, we conduct an experimental analysis to empirically study the rationality of our method.

4 Experimental settings

In order to conduct an unbiased evaluation of all model performances, we use zero-shot and few-shot settings for evaluation. For the zero-shot setting, we directly present mathematical problems to the model for solution generation, without any demonstrations in the input. For the few-shot setting, we use 3-shot evaluation, where we select three in-context examples with rationales. Our criterion for evaluation is that an answer is considered ultimately correct only if the code executor's answer is correct.
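The grading harness is not spelled out here, but the criterion above can be realized with a few lines of Python. The sketch below assumes the generated rationale is a self-contained program that stores its result in a variable named answer; the variable name and the numeric tolerance are illustrative choices, not a specification from the paper.

    def is_correct(generated_code: str, gold_answer: float, tol: float = 1e-4) -> bool:
        """Execution-based grading: run the rationale and compare its answer."""
        namespace: dict = {}
        try:
            exec(generated_code, namespace)    # run the model-written program
            predicted = float(namespace["answer"])
        except Exception:                      # syntax errors, crashes, missing variable
            return False
        return abs(predicted - gold_answer) < tol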
In Section 5, we conduct an empirical analysis of the variations across different model sizes and complexities in the zero-shot setting. We construct our own test dataset because there are no publicly available benchmarks to date. Model evaluation is performed on AsDiv (Miao, Liang, and Su 2020), GSM8K (Cobbe et al. 2021), MultiArith (Roy and Roth 2015), and SVAMP (Patel, Bhattamishra, and Goyal 2021), with 500 instances randomly chosen from each original test set to form the new test sets. We choose gpt-3.5-turbo as the main benchmark model and accuracy (Acc) as our evaluation metric.

In Section 6, we train the model based on LLaMA-7B (Version 1.0) (Touvron et al. 2023). Vicuna (Chiang et al. 2023) and Falcon (Almazrouei et al. 2023) are selected as the main comparison models, and accuracy (Acc) is again chosen as the evaluation metric. Apart from the datasets used in the in-distribution setting, the model's performance is also evaluated on MATH (Hendrycks et al. 2021) and BigBench-Hard (Suzgun et al. 2022) in the out-of-distribution setting. It should be noted that we only choose level-1 problems in MATH, and we utilize the algorithmic and multi-step arithmetic reasoning tasks in BIG-Bench Hard. The detailed experimental setup is shown in the supplementary material.

5 Empirical Analysis

In this section, we empirically analyze the impact of different forms of code data. Specifically, we synthesize a completely new dataset and manually partition it using our CIRS in Section 5.1. In Section 5.2, we discuss the impact of code data with different complexities on the reasoning abilities of LLMs. Then we analyze the characteristics of code data with varying complexities in Section 5.3. Finally, we conduct more ablation analysis in Sections 5.4 and 5.5.

5.1 Data synthesizing

To fairly explore the impact of the variations in different complexity scores, it is necessary to avoid errors caused by the dataset itself and to generate entirely new forms of code data. The sources of seed data include the training sets of GSM8K (Cobbe et al. 2021), MultiArith (Roy and Roth 2015), ASDiv (Miao, Liang, and Su 2020), SVAMP (Patel, Bhattamishra, and Goyal 2021) and AQuA (Ling et al. 2017). As shown in Table 1, we have synthesized over 60,000 samples from five seed datasets, generating approximately 10,000 samples for each. We choose as many datasets as possible to ensure the diversity of mathematical problems.

Seed source    Seed size    Data size
AQuA              97,467       10,160
GSM8K              7,644       12,812
MultiArith           600       12,163
ASDiv              2,306       13,554
SVAMP              3,954       12,441
ALL                    –       61,130

Table 1: Statistics of seeds and the generated data size.

Then, we design a pipeline that can automatically generate a high-quality code corpus by leveraging ChatGPT. As shown in Figure 2, we apply a template to define the format and then allow the API to continuously rewrite new questions and their corresponding code-format solutions. In the construction of templates, we randomly select three problems from the seed datasets each time. Next, we automatically filter out the generations that do not conform to Python syntax standards, which results in a collection of high-quality mathematical problems. For all generated data, we randomly sample 10% and verify its correctness by manual checks and automated validation with GPT-4, ensuring the accuracy stays within a reasonable margin of error.
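The syntax check in this pipeline can be as simple as attempting to parse each generation, as in the sketch below; the "code" field name of each sample is an assumed convention, not part of the paper's data format.

    import ast

    def keep_syntactically_valid(samples: list[dict]) -> list[dict]:
        """Drop generations whose code does not conform to Python syntax."""
        valid = []
        for sample in samples:
            try:
                ast.parse(sample["code"])   # raises SyntaxError on malformed code
                valid.append(sample)
            except SyntaxError:
                continue
        return valid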
After obtaining the well-generated code data, we utilize CIRS (Section 3) and manually split the data into different subsets based on the analysis of the code complexity distribution. We put the visualized results in the supplement. Based on the different complexity scores, we name the partitioned subsets low (lower-score samples), medium (medium-score samples) and high (high-score samples).
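The stratification step can be sketched as simple threshold-based bucketing; the two cut-points below are placeholders, since the boundaries in the paper come from the analysis of the empirical score distribution rather than fixed values, and the precomputed "cirs" field is an assumed convention.

    def stratify_by_cirs(samples: list[dict],
                         low_cut: float, high_cut: float) -> dict[str, list[dict]]:
        """Bucket samples into low / medium / high subsets by their CIRS score."""
        subsets = {"low": [], "medium": [], "high": []}
        for sample in samples:
            score = sample["cirs"]          # assumes a precomputed CIRS score per sample
            if score < low_cut:
                subsets["low"].append(sample)
            elif score < high_cut:
                subsets["medium"].append(sample)
            else:
                subsets["high"].append(sample)
        return subsets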
5.2 Impacts of different complexity scores

To compare the impact of different code complexities on the reasoning capability of LLMs, we train three models based on LLaMA (Version 1.0) with 7 billion to 65 billion parameters. We randomly select 1,700 instances from each subset (low, medium, high) to build the training and validation datasets for fair comparisons. Results are shown in Figure 3.

(1) An optimal level of code complexity is crucial to the reasoning abilities of program-of-thought prompting. From the results across the four datasets, we note that the model performs optimally when the complexity of the code data is in the mid-range. This suggests that learnable symbolic language is crucial to the reasoning abilities of program-aided prompting. The reasoning behind this is that data with overly simplistic complexity is too simple for LLMs, leading to less noticeable effects. Conversely, when the complexity escalates significantly, the logical semantics and nested structures become difficult to comprehend or learn, which could adversely impact the reasoning capabilities of LLMs.

(2) The larger the number of parameters, the more significant the gain in the LLM's reasoning capabilities. It is evident that as the model size increases from 7 billion to 65 billion parameters, the effectiveness of its reasoning capability improves. In fact, after fine-tuning, most 65-billion-parameter models can achieve results comparable to those of gpt-3.5-turbo. This suggests that having a sufficient number of parameters is crucial for substantial reasoning capabilities in language models. Furthermore, when the language model is large enough, the differences in results across various complexities are minimal. This indicates that LLMs with vast parameters are more amenable to symbolic data and inherently have the potential to yield strong reasoning capabilities.

(3) Current LLMs have limitations in their understanding capabilities for reasoning. We observe that when data complexity is extremely high, the performance of LLMs tends to decrease. This reflects an inherent limit to the reasoning capabilities of large language models. We argue that: (1) The current architecture of LLMs (such as decoder-only LLMs) has limited ability to understand complex knowledge, which also restricts the emergence of their reasoning capabilities. The prerequisite for large models to demonstrate powerful reasoning abilities is their ability to comprehend the structures and logical knowledge embedded in complex data. Therefore, it is necessary to explore model structures with stronger reasoning abilities in future research. (2) Further enhancement in reasoning power requires reliance on external tools. The scope of reasoning problems is quite broad, covering not only mathematical reasoning but also commonsense and more complex logical reasoning tasks. Therefore, relying solely on the LLM itself is not enough to resolve all issues at once; the assistance of more powerful external tools is required.
Figure 3: Evaluation performance on the GSM8K, MultiArith, ASDiv and SVAMP datasets. We train three models (low, medium, high) whose datasets contain the same number of samples for fair comparison. We use Accuracy (%) as the evaluation metric.
Figure 4: As the CIRS score increases, there is a greater presence of logical and structural information in the code.
5.3 The characteristics of different CIRS scores

In Figure 4, we investigate the characteristics of different CIRS scores. The different subsets of CIRS scores exhibit distinct structural and logical differences. Inspired by (Haladyna 1997; Conklin 2005) and AoPS (https://artofproblemsolving.com/), we also find that the results for different complexity scores correspond to the cognitive level of difficulty of reasoning problems.

• Textual, minimal programming. Samples with lower CIRS scores contain little structural information. Although they do contain some intermediary reasoning processes, these are primarily represented in flat textual descriptions. These samples typically correspond to simpler problems that are structurally and logically insufficient.

• Simple but direct programming. As the CIRS score of the code reasoning steps increases, the presence of programming language with simple logical semantics and structures also escalates. These samples typically involve simple and straightforward logical operations.

• Complex programming. Samples with exceedingly high scores contain substantial amounts of structural function definitions or reasoning processes, which suggests the presence of numerous complex conditional statements and function structures. These samples are typically highly challenging mathematical problems.

5.4 Excluding the effect of the complexity distribution itself

To negate the potential skew from the data distribution itself, such as enhanced performance in the mid-range data due to
Table 2: Results of mathematical reasoning tasks. † We choose algorithmic and multi-step arithmetic reasoning tasks in BIG-
Bench Hard. *Here we use Falcon-Instruct which is fine-tuned on instruction datasets.
Our model performs best (with the same number of parameters) in both zero-shot and few-shot prompting. It is worth noting that our approach demonstrates effectiveness comparable to ChatGPT on BigBench-Hard in the zero-shot setting. For the MATH dataset, we notice that our model still outperforms the baseline models, but it remains much worse than ChatGPT, which is due to the limitations of the code data itself.

6.3 Usage 2: CIRS-based Code Filtering

Models               Parameters    Acc.
Alpaca                   7B        24.0
Code-LLaMA               7B        50.0
Code (CIRS)-LLaMA        7B        55.0

Table 3: Results of CIRS-based code filtering tasks.

To validate the effectiveness of our approach in code-related tasks, we use Algorithm 1 to filter a batch of code instruction data. We first split Code Alpaca (Chaudhary 2023) into a train and a test dataset. We leverage the whole train dataset to train LLaMA-7B; the trained model is Code-LLaMA. For a fair comparison, we filter the train dataset to obtain a subset of much higher-quality code instructions and train Code (CIRS)-LLaMA on the filtered data. The results in Table 3 illustrate that Code (CIRS)-LLaMA demonstrates effective performance in pure code generation tasks. We can conclude that optimized structures and logical semantics are most beneficial for the LLM's reasoning abilities.
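The filtering step can be sketched as keeping only the instructions whose complexity score falls in the empirically best band. This covers only the selection half of Algorithm 1 (synthesis and threshold selection are omitted), the scorer is passed in (for example, the logical-complexity sketch from Section 3), and the "code" field name is illustrative.

    from typing import Callable

    def filter_instructions(samples: list[dict],
                            scorer: Callable[[str], float],
                            band: tuple[float, float]) -> list[dict]:
        """Keep code instructions whose complexity score lies in the target band."""
        lo, hi = band
        kept = []
        for sample in samples:
            try:
                score = scorer(sample["code"])
            except SyntaxError:
                continue                    # drop unparsable instructions
            if lo <= score <= hi:
                kept.append(sample)
        return kept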
7 Related Work

Program-aided Prompting Program-of-thoughts (Chen et al. 2022a) prompting delegates computation steps to an external language interpreter, and (Gao et al. 2022) generates programs as the intermediate reasoning steps. (Cheng et al. 2023) is a neural-symbolic framework that maps the task input to a program. Similarly, (Hu et al. 2023) is a neural symbolic prompting method for complex reasoning tasks. Some methods such as (Wang, Li, and Ji 2022; Li et al. 2023; Bi et al. 2023) leverage code prompting for information extraction tasks. Madaan et al. (2022) frames the task of structured commonsense reasoning as code generation, and (Zhu et al. 2023) distills LLMs into specialized, compact models for reasoning tasks by program-aided prompting.

Reasoning with Large Language Models Research on reasoning abilities is a core issue in NLP (Qiao et al. 2023; Huang and Chang 2022; Zhao et al. 2023). The success of LLMs has progressively led to a series of breakthroughs in various tasks and domains (Imani, Du, and Shrivastava 2023; Yang et al. 2022; Zhang et al. 2022; Chen et al. 2023). Some studies (Gendron et al. 2023; Liu et al. 2023; Varshney et al. 2023; Yuan et al. 2023; Schwartz et al. 2020) focus on analyzing the capabilities of large models themselves, and (Wang et al. 2023b) improves LLMs' reasoning abilities by fine-tuning with an alignment paradigm. More and more research efforts (Fu et al. 2023b; Mukherjee et al. 2023) are being devoted to unveiling the origin of a model's reasoning abilities or to enhancing the capability of smaller models. Some works (Wiegreffe, Marasovic, and Smith 2021; Xie et al. 2023) generate rationales to enhance model interpretability. To measure reasoning capabilities, (Fu et al. 2023c) propose a selection scheme based on complexity prompting, and (Fu et al. 2023a) is an open-source evaluation suite that measures LLMs' multi-step reasoning performance. Different from previous work, ours is the first to analyze the reasoning capabilities of large language models through code data.

8 Discussion and Conclusion

What kind of data format is crucial for an LLM's reasoning abilities? We explore the reasoning abilities of program-of-thought prompting, and the results indicate that code data with an optimal level of complexity, characterized by certain logical and structural qualities, is the key factor. Code data is efficient because it is inherently semi-structured and abundant in the natural world. We can show that: (1) The local structural properties of the data are crucial for improving reasoning abilities, which aligns with (Prystawski and Goodman 2023).
The logical coherence, or a certain amount of knowledge circuitry inherent in the data, is necessary. (2) Overly complex structural information and logic are 'too difficult to learn' for LLMs. The experimental results of this work demonstrate that knowledge of optimal-level complexity is most effective because it is learnable for most large language models. Meanwhile, we also find that as the number of parameters in language models increases, their understanding of complex knowledge also improves.

In this work, we introduce CIRS to measure the relation between code reasoning steps and reasoning abilities. By considering both the structural and logical attributes of code data, we use the AST to encode the structural information and encode the structural features via difficulty and cyclomatic complexity. Through an empirical analysis, we find that code of optimal complexity plays a crucial role in the reasoning abilities of program-of-thought prompting. We develop an auto-synthesizing and stratifying algorithm that applies to mathematical reasoning and code generation tasks. Extensive results prove the effectiveness of the proposed method. In the future, we will expand this work to more scenarios, such as commonsense or logical reasoning tasks, and train powerful reasoning models with low computational cost.

Acknowledgements

We would like to express gratitude to the anonymous reviewers for their kind comments. This work was supported by the National Natural Science Foundation of China (No. 62206246), the Fundamental Research Funds for the Central Universities (226-2023-00138), Zhejiang Provincial Natural Science Foundation of China (No. LGG22F030011), Ningbo Natural Science Foundation (2021J190), Yongjiang Talent Introduction Programme (2021A-156-G), CCF-Baidu Open Fund, and the Information Technology Center and State Key Lab of CAD&CG, Zhejiang University, and NUS-NCS Joint Laboratory (A-0008542-00-00).

References

Almazrouei, E.; Alobeidli, H.; Alshamsi, A.; Cappelli, A.; Cojocaru, R.; Debbah, M.; Goffinet, E.; Heslow, D.; Launay, J.; Malartic, Q.; Noune, B.; Pannier, B.; and Penedo, G. 2023. Falcon-40B: an open large language model with state-of-the-art performance.
Anil, R.; Dai, A. M.; Firat, O.; Johnson, M.; Lepikhin, D.; Passos, A.; Shakeri, S.; Taropa, E.; Bailey, P.; Chen, Z.; Chu, E.; Clark, J. H.; Shafey, L. E.; and et al. 2023. PaLM 2 Technical Report. arXiv:2305.10403.
Bi, Z.; Chen, J.; Jiang, Y.; Xiong, F.; Guo, W.; Chen, H.; and Zhang, N. 2023. CodeKGC: Code Language Model for Generative Knowledge Graph Construction. CoRR, abs/2304.09048.
Brown, T. B.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; Agarwal, S.; Herbert-Voss, A.; Krueger, G.; Henighan, T.; Child, R.; Ramesh, A.; Ziegler, D. M.; Wu, J.; Winter, C.; Hesse, C.; Chen, M.; Sigler, E.; Litwin, M.; Gray, S.; Chess, B.; Clark, J.; Berner, C.; McCandlish, S.; Radford, A.; Sutskever, I.; and Amodei, D. 2020. Language Models are Few-Shot Learners. In Larochelle, H.; Ranzato, M.; Hadsell, R.; Balcan, M.; and Lin, H., eds., Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual.
Chaudhary, S. 2023. Code Alpaca: An Instruction-following LLaMA model for code generation. https://github.com/sahil280114/codealpaca.
Chen, M.; Tworek, J.; Jun, H.; Yuan, Q.; de Oliveira Pinto, H. P.; Kaplan, J.; Edwards, H.; Burda, Y.; Joseph, N.; Brockman, G.; Ray, A.; Puri, R.; Krueger, G.; Petrov, M.; Khlaaf, H.; Sastry, G.; Mishkin, P.; Chan, B.; Gray, S.; Ryder, N.; Pavlov, M.; Power, A.; Kaiser, L.; Bavarian, M.; Winter, C.; Tillet, P.; Such, F. P.; Cummings, D.; Plappert, M.; Chantzis, F.; Barnes, E.; Herbert-Voss, A.; Guss, W. H.; Nichol, A.; Paino, A.; Tezak, N.; Tang, J.; Babuschkin, I.; Balaji, S.; Jain, S.; Saunders, W.; Hesse, C.; Carr, A. N.; Leike, J.; Achiam, J.; Misra, V.; Morikawa, E.; Radford, A.; Knight, M.; Brundage, M.; Murati, M.; Mayer, K.; Welinder, P.; McGrew, B.; Amodei, D.; McCandlish, S.; Sutskever, I.; and Zaremba, W. 2021. Evaluating Large Language Models Trained on Code. CoRR, abs/2107.03374.
Chen, W.; Ma, X.; Wang, X.; and Cohen, W. W. 2022a. Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks. CoRR, abs/2211.12588.
Chen, X.; Zhang, N.; Xie, X.; Deng, S.; Yao, Y.; Tan, C.; Huang, F.; Si, L.; and Chen, H. 2022b. KnowPrompt: Knowledge-aware Prompt-tuning with Synergistic Optimization for Relation Extraction. In Laforest, F.; Troncy, R.; Simperl, E.; Agarwal, D.; Gionis, A.; Herman, I.; and Médini, L., eds., WWW '22: The ACM Web Conference 2022, Virtual Event, Lyon, France, April 25 - 29, 2022, 2778–2788. ACM.
Chen, Z.; Zhang, W.; Huang, Y.; Chen, M.; Geng, Y.; Yu, H.; Bi, Z.; Zhang, Y.; Yao, Z.; Song, W.; Wu, X.; Yang, Y.; Chen, M.; Lian, Z.; Li, Y.; Cheng, L.; and Chen, H. 2023. Tele-Knowledge Pre-training for Fault Analysis. arXiv:2210.11298.
Cheng, Z.; Xie, T.; Shi, P.; Li, C.; Nadkarni, R.; Hu, Y.; Xiong, C.; Radev, D.; Ostendorf, M.; Zettlemoyer, L.; Smith, N. A.; and Yu, T. 2023. Binding Language Models in Symbolic Languages. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net.
Chiang, W.-L.; Li, Z.; Lin, Z.; Sheng, Y.; Wu, Z.; Zhang, H.; Zheng, L.; Zhuang, S.; Zhuang, Y.; Gonzalez, J. E.; Stoica, I.; and Xing, E. P. 2023. Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality.
Cobbe, K.; Kosaraju, V.; Bavarian, M.; Hilton, J.; Nakano, R.; Hesse, C.; and Schulman, J. 2021. Training Verifiers to Solve Math Word Problems. CoRR, abs/2110.14168.
Conklin, J. 2005. A taxonomy for learning, teaching, and assessing: A revision of Bloom's taxonomy of educational objectives complete edition.
Fu, Y.; Ou, L.; Chen, M.; Wan, Y.; Peng, H.; and Khot, T. 2023a. Chain-of-Thought Hub: A Continuous Effort to Measure Large Language Models' Reasoning Performance. CoRR, abs/2305.17306.
Fu, Y.; Peng, H.; Ou, L.; Sabharwal, A.; and Khot, T. 2023b. Specializing Smaller Language Models towards Multi-Step Reasoning. CoRR, abs/2301.12726.
Fu, Y.; Peng, H.; Sabharwal, A.; Clark, P.; and Khot, T. 2023c. Complexity-Based Prompting for Multi-step Reasoning. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net.
Gao, L.; Madaan, A.; Zhou, S.; Alon, U.; Liu, P.; Yang, Y.; Callan, J.; and Neubig, G. 2022. PAL: Program-aided Language Models. CoRR, abs/2211.10435.
Gendron, G.; Bao, Q.; Witbrock, M.; and Dobbie, G. 2023. Large Language Models Are Not Abstract Reasoners. CoRR, abs/2305.19555.
Haladyna, T. M. 1997. Writing Test Items to Evaluate Higher Order Thinking. ERIC.
Halstead, M. H. 1977. Elements of Software Science (Operating and programming systems series). Elsevier Science Inc.
Hendrycks, D.; Burns, C.; Kadavath, S.; Arora, A.; Basart, S.; Tang, E.; Song, D.; and Steinhardt, J. 2021. Measuring Mathematical Problem Solving With the MATH Dataset. In Vanschoren, J.; and Yeung, S., eds., Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks 1, NeurIPS Datasets and Benchmarks 2021, December 2021, virtual.
Hu, Y.; Yang, H.; Lin, Z.; and Zhang, M. 2023. Code Prompting: a Neural Symbolic Method for Complex Reasoning in Large Language Models. CoRR, abs/2305.18507.
Huang, J.; and Chang, K. C. 2022. Towards Reasoning in Large Language Models: A Survey. CoRR, abs/2212.10403.
Huang, W.; Wang, C.; Zhang, R.; Li, Y.; Wu, J.; and Fei-Fei, L. 2023. VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models. CoRR, abs/2307.05973.
Huang, W.; Xia, F.; Xiao, T.; Chan, H.; Liang, J.; Florence, P.; Zeng, A.; Tompson, J.; Mordatch, I.; Chebotar, Y.; Sermanet, P.; Jackson, T.; Brown, N.; Luu, L.; Levine, S.; Hausman, K.; and Ichter, B. 2022. Inner Monologue: Embodied Reasoning through Planning with Language Models. In Liu, K.; Kulic, D.; and Ichnowski, J., eds., Conference on Robot Learning, CoRL 2022, 14-18 December 2022, Auckland, New Zealand, volume 205 of Proceedings of Machine Learning Research, 1769–1782. PMLR.
Imani, S.; Du, L.; and Shrivastava, H. 2023. MathPrompter: Mathematical Reasoning using Large Language Models. CoRR, abs/2303.05398.
Li, P.; Sun, T.; Tang, Q.; Yan, H.; Wu, Y.; Huang, X.; and Qiu, X. 2023. CodeIE: Large Code Generation Models are Better Few-Shot Information Extractors. CoRR, abs/2305.05711.
Ling, W.; Yogatama, D.; Dyer, C.; and Blunsom, P. 2017. Program Induction by Rationale Generation: Learning to Solve and Explain Algebraic Word Problems. In Barzilay, R.; and Kan, M., eds., Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, ACL 2017, Vancouver, Canada, July 30 - August 4, Volume 1: Long Papers, 158–167. Association for Computational Linguistics.
Liu, X.; Yin, D.; Zhang, C.; Feng, Y.; and Zhao, D. 2023. The Magic of IF: Investigating Causal Reasoning Abilities in Large Language Models of Code. CoRR, abs/2305.19213.
Madaan, A.; Zhou, S.; Alon, U.; Yang, Y.; and Neubig, G. 2022. Language Models of Code are Few-Shot Commonsense Learners. CoRR, abs/2210.07128.
McCabe, T. J. 1976. A Complexity Measure. IEEE Trans. Software Eng., 2(4): 308–320.
Miao, S.; Liang, C.; and Su, K. 2020. A Diverse Corpus for Evaluating and Developing English Math Word Problem Solvers. In Jurafsky, D.; Chai, J.; Schluter, N.; and Tetreault, J. R., eds., Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020, 975–984. Association for Computational Linguistics.
Mukherjee, S.; Mitra, A.; Jawahar, G.; Agarwal, S.; Palangi, H.; and Awadallah, A. H. 2023. Orca: Progressive Learning from Complex Explanation Traces of GPT-4. CoRR, abs/2306.02707.
OpenAI. 2023. GPT-4 Technical Report. arXiv:2303.08774.
Patel, A.; Bhattamishra, S.; and Goyal, N. 2021. Are NLP Models really able to Solve Simple Math Word Problems? In Toutanova, K.; Rumshisky, A.; Zettlemoyer, L.; Hakkani-Tür, D.; Beltagy, I.; Bethard, S.; Cotterell, R.; Chakraborty, T.; and Zhou, Y., eds., Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2021, Online, June 6-11, 2021, 2080–2094. Association for Computational Linguistics.
Prystawski, B.; and Goodman, N. D. 2023. Why think step-by-step? Reasoning emerges from the locality of experience. CoRR, abs/2304.03843.
Qiao, S.; Ou, Y.; Zhang, N.; Chen, X.; Yao, Y.; Deng, S.; Tan, C.; Huang, F.; and Chen, H. 2023. Reasoning with Language Model Prompting: A Survey. In ACL. The Association for Computational Linguistics.
Roy, S.; and Roth, D. 2015. Solving General Arithmetic Word Problems. In Màrquez, L.; Callison-Burch, C.; Su, J.; Pighin, D.; and Marton, Y., eds., Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, EMNLP 2015, Lisbon, Portugal, September 17-21, 2015, 1743–1752. The Association for Computational Linguistics.
Schwartz, R.; Stanovsky, G.; Swayamdipta, S.; Dodge, J.; and Smith, N. A. 2020. The Right Tool for the Job: Matching Model and Instance Complexities. arXiv:2004.07453.
Suzgun, M.; Scales, N.; Schärli, N.; Gehrmann, S.; Tay, Y.; Chung, H. W.; Chowdhery, A.; Le, Q. V.; Chi, E. H.; Zhou, D.; and Wei, J. 2022. Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them. CoRR, abs/2210.09261.
Touvron, H.; Lavril, T.; Izacard, G.; Martinet, X.; Lachaux, M.; Lacroix, T.; Rozière, B.; Goyal, N.; Hambro, E.; Azhar, F.; Rodriguez, A.; Joulin, A.; Grave, E.; and Lample, G. 2023. LLaMA: Open and Efficient Foundation Language Models. CoRR, abs/2302.13971.
Varshney, N.; Parmar, M.; Patel, N.; Handa, D.; Sarkar, S.; Luo, M.; and Baral, C. 2023. Can NLP Models Correctly Reason Over Contexts that Break the Common Assumptions? CoRR, abs/2305.12096.
Wang, G.; Xie, Y.; Jiang, Y.; Mandlekar, A.; Xiao, C.; Zhu, Y.; Fan, L.; and Anandkumar, A. 2023a. Voyager: An Open-Ended Embodied Agent with Large Language Models. CoRR, abs/2305.16291.
Wang, P.; Li, L.; Chen, L.; Song, F.; Lin, B.; Cao, Y.; Liu, T.; and Sui, Z. 2023b. Making Large Language Models Better Reasoners with Alignment. arXiv:2309.02144.
Wang, X.; Li, S.; and Ji, H. 2022. Code4Struct: Code Generation for Few-Shot Structured Prediction from Natural Language. CoRR, abs/2210.12810.
Wei, J.; Wang, X.; Schuurmans, D.; Bosma, M.; Ichter, B.; Xia, F.; Chi, E. H.; Le, Q. V.; and Zhou, D. 2022. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. In NeurIPS.
Wiegreffe, S.; Marasovic, A.; and Smith, N. A. 2021. Measuring Association Between Labels and Free-Text Rationales. In Moens, M.; Huang, X.; Specia, L.; and Yih, S. W., eds., Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021, Virtual Event / Punta Cana, Dominican Republic, 7-11 November, 2021, 10266–10284. Association for Computational Linguistics.
Xie, Y.; Kawaguchi, K.; Zhao, Y.; Zhao, X.; Kan, M.; He, J.; and Xie, Q. 2023. Decomposition Enhances Reasoning via Self-Evaluation Guided Decoding. CoRR, abs/2305.00633.
Yang, Z.; Qin, J.; Chen, J.; Lin, L.; and Liang, X. 2022. LogicSolver: Towards Interpretable Math Word Problem Solving with Logical Prompt-enhanced Learning. In Goldberg, Y.; Kozareva, Z.; and Zhang, Y., eds., Findings of the Association for Computational Linguistics: EMNLP 2022, Abu Dhabi, United Arab Emirates, December 7-11, 2022, 1–13. Association for Computational Linguistics.
Yuan, Z.; Yuan, H.; Li, C.; Dong, G.; Tan, C.; and Zhou, C. 2023. Scaling Relationship on Learning Mathematical Reasoning with Large Language Models. arXiv:2308.01825.
Zhang, H.; Zhang, Y.; Li, L. E.; and Xing, E. P. 2022. The Impact of Symbolic Representations on In-context Learning for Few-shot Reasoning. CoRR, abs/2212.08686.
Zhao, W. X.; Zhou, K.; Li, J.; Tang, T.; Wang, X.; Hou, Y.; Min, Y.; Zhang, B.; Zhang, J.; Dong, Z.; Du, Y.; Yang, C.; Chen, Y.; Chen, Z.; Jiang, J.; Ren, R.; Li, Y.; Tang, X.; Liu, Z.; Liu, P.; Nie, J.; and Wen, J. 2023. A Survey of Large Language Models. CoRR, abs/2303.18223.
Zhu, X.; Qi, B.; Zhang, K.; Long, X.; and Zhou, B. 2023. PaD: Program-aided Distillation Specializes Large Models in Reasoning. CoRR, abs/2305.13888.