0% found this document useful (0 votes)
274 views22 pages

From System 1 To System 2: A Survey of Reasoning Large Language Models

This survey explores the transition from fast, intuitive System 1 reasoning to slower, deliberate System 2 reasoning in Large Language Models (LLMs). It highlights advancements in reasoning LLMs, which demonstrate expert-level performance in complex tasks, and discusses their construction, features, and benchmarking. The document aims to provide a comprehensive overview of the field, fostering innovation and tracking developments in reasoning LLMs.

Uploaded by

Guest One
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
274 views22 pages

From System 1 To System 2: A Survey of Reasoning Large Language Models

This survey explores the transition from fast, intuitive System 1 reasoning to slower, deliberate System 2 reasoning in Large Language Models (LLMs). It highlights advancements in reasoning LLMs, which demonstrate expert-level performance in complex tasks, and discusses their construction, features, and benchmarking. The document aims to provide a comprehensive overview of the field, fostering innovation and tracking developments in reasoning LLMs.

Uploaded by

Guest One
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 22

JOURNAL OF LATEX CLASS FILES, JANUARY 2025 1

From System 1 to System 2: A Survey of


Reasoning Large Language Models
Zhong-Zhi Li∗ , Duzhen Zhang∗ , Ming-Liang Zhang§ , Jiaxin Zhang§ , Zengyan Liu§ , Yuxuan Yao§ ,
Haotian Xu, Junhao Zheng, Pei-Jie Wang, Xiuyi Chen, Yingying Zhang, Fei Yin, Jiahua Dong,
Zhijiang Guo† , Le Song† , Cheng-Lin Liu† ID , Fellow, IEEE

Abstract—Achieving human-level intelligence requires refining the transition from the fast, intuitive System 1 to the slower, more
deliberate System 2 reasoning. While System 1 excels in quick, heuristic decisions, System 2 relies on logical reasoning for more
accurate judgments and reduced biases. Foundational Large Language Models (LLMs) excel at fast decision-making but lack the depth
arXiv:2502.17419v1 [cs.AI] 24 Feb 2025

for complex reasoning, as they have not yet fully embraced the step-by-step analysis characteristic of true System 2 thinking. Recently,
reasoning LLMs like OpenAI’s o1/o3 and DeepSeek’s R1 have demonstrated expert-level performance in fields such as mathematics
and coding, closely mimicking the deliberate reasoning of System 2 and showcasing human-like cognitive abilities. This survey begins
with a brief overview of the progress in foundational LLMs and the early development of System 2 technologies, exploring how their
combination has paved the way for reasoning LLMs. Next, we discuss how to construct reasoning LLMs, analyzing their features, the
core methods enabling advanced reasoning, and the evolution of various reasoning LLMs. Additionally, we provide an overview of
reasoning benchmarks, offering an in-depth comparison of the performance of representative reasoning LLMs. Finally, we explore
promising directions for advancing reasoning LLMs and maintain a real-time GitHub Repository to track the latest developments. We
hope this survey will serve as a valuable resource to inspire innovation and drive progress in this rapidly evolving field.

Index Terms—Slow-thinking, Large Language Models, Human-like Reasoning, Decision Making in AI, AGI

1 I NTRODUCTION tasks, it is prone to cognitive biases, especially in complex or


uncertain situations, leading to judgment errors. In contrast,
“Don’t teach. Incentivize.”
System 2 relies on logical reasoning and systematic thinking,
—Hyung Won Chung, OpenAI resulting in more accurate and rational decisions [8]–[11]. By
mitigating the biases of System 1, System 2 provides a more
A CHIEVING human-level intelligence requires refining
the transition from System 1 to System 2 reasoning
[1]–[5]. Dual-system theory suggests that human cognition
refined approach to problem-solving [12]–[15].
The development of foundational Large Language Mod-
operates through two modes: System 1, which is fast, auto- els (LLMs)1 has marked a major milestone in Artificial
matic, and intuitive, enabling quick decisions with minimal Intelligence (AI). Models such as GPT-4o [16] and DeepSeek-
effort, and System 2, which is slower, more analytical, and v3 [17] have demonstrated impressive capabilities in text
deliberate [6], [7]. While System 1 is efficient for routine generation, language translation, and a variety of perception
tasks [18]–[28]. These models, trained on extensive datasets
Version: v1 (major update on February 23, 2025) and utilizing advanced algorithms, excel in understanding

Core contribution. § Significant contribution. † Corresponding author. and generating human-like responses. However, despite
Duzhen Zhang, Jiahua Dong, and Le Song are with the Mo-
hamed bin Zayed University of Artificial Intelligence, Abu Dhabi,
their impressive achievements, foundational LLMs operate
UAE (E-mail: [email protected]; [email protected]; in a manner similar to System 1 reasoning, relying on fast,
[email protected]). heuristic-driven decision-making. While they perform ex-
Zhong-Zhi Li, Pei-Jie Wang, Xiuyi Chen, Fei Yin, and Cheng-Lin Liu ceptionally well in providing rapid responses, they often fall
are with the Institute of Automation, Chinese Academy of Sciences,
Beijing, China (E-mail: [email protected]; [email protected], short in scenarios requiring deep, logical analysis and preci-
[email protected]; [email protected]; [email protected]). sion in complex reasoning tasks. This limitation becomes
Ming-Liang Zhang is with the AiShiWeiLai AI Research, Beijing, China (E- especially clear in situations involving intricate problem-
mail: [email protected]).
Zengyan Liu, Yuxuan Yao, and Zhijiang Guo is with City University of
solving, logical analysis, or nuanced understanding, where
Hong Kong and the Hong Kong University of Science and Technology these models do not yet match human cognitive abilities.
(Guangzhou), Guangzhou, China (E-mail: [email protected]; In contrast, reasoning LLMs represent a significant ad-
[email protected]; [email protected]). vancement in the evolution of language models. Models
Jiaxin Zhang is with the University of Strathclyde, Glasgow, UK (E-mail:
[email protected]).
Haotian Xu is with the Xiaohongshu Inc, Beijing, China (E-mail: xuhao- 1. In this paper, “reasoning” refers to answering questions involving
[email protected]). complex, multi-step processes with intermediate steps. Foundational
Yingying Zhang is with the East China Normal University, Shanghai, China LLMs: LLMs with basic reasoning abilities, handling simple or single-
(E-mail: [email protected]). step tasks. Reasoning LLMs: LLMs that excel in complex tasks like cod-
Junhao Zheng is with the South China University of Technology, Guangzhou, ing and mathematical proofs, incorporating a “thinking” process–tasks
China (E-mail: [email protected]). that foundational LLMs struggle with.
JOURNAL OF LATEX CLASS FILES, JANUARY 2025 2

Fig. 1. The recent timeline of reasoning LLMs, covering core methods and the release of open-source and closed-source reproduction projects.

like OpenAI’s o1/o3 [29], [30] and DeepSeek’s R1 [31] are of reasoning LLMs. As illustrated in Figure 2, this survey is
designed to emulate the slower, more deliberate reason- organized as follows:
ing associated with System 2 thinking. Unlike foundational 1) Section 2 offers a concise overview of the progress in
LLMs, reasoning LLMs are equipped with mechanisms for foundational LLMs (Section 2.1) and the early develop-
processing information step-by-step, allowing them to make ment of key System 2 technologies, including symbolic
more accurate and rational decisions. This shift from fast- logic systems (Section 2.2), Monte Carlo Tree Search
thinking, intuitive processes to more methodical, reasoning- (MCTS) (Section 2.3), and Reinforcement Learning (RL)
driven models enables reasoning LLMs to tackle complex (Section 2.4), highlighting how their combination has
tasks, such as advanced mathematics [32]–[37], logical rea- paved the way for reasoning LLMs.
soning [38]–[44], and multimodal reasoning [45]–[47], with 2) Section 3 introduces reasoning LLMs and outlines their
expert-level performance, exhibiting human-like cognitive construction process. Specifically, Section 3.1 presents
abilities. As a result, reasoning LLMs are increasingly seen the characteristics of reasoning LLMs from two per-
as capable of achieving the deep, logical thinking needed spectives: output behavior (Section 3.1.1) and training
for tasks that were once considered beyond AI’s reach. The dynamics (Section 3.1.2), emphasizing their differences
recent timeline of reasoning LLMs is presented in Figure 1. from foundational LLMs. Section 3.2 identifies the core
methods necessary for achieving advanced reasoning
1.1 Structure of the Survey capabilities, focusing on five aspects: Structure Search
This survey offers a comprehensive overview of the key con- (Section 3.2.1), Reward Modeling (Section 3.2.2), Self
cepts, methods, and challenges involved in the development Improvement (Section 3.2.3), Macro Action (Section
JOURNAL OF LATEX CLASS FILES, JANUARY 2025 3

§2.1 Foundational LLMs

§2.2 Symbolic Logic Systems


§2 Foundations of Reasoning LLMs
§2.3 Monte Carlo Tree Search

§2.4 Reinforcement Learning


§3.1.1 Output Behaviour
§3.1 Feature Analysis
§3.1.2 Training Dynamics

§3.2.1 Structure Search

§3.2.2 Reward Modeling


Organizational §3 Blueprinting Reasoning LLMs §3.2 Core Method §3.2.3 Self Improvement
Structure
§3.3 Reasoning LLMs Evolution §3.2.4 Macro Action

§4.1 Benchmark Categories §3.2.5 Reinforcement Fine-Tuning

§4 Benchmarking Reasoning LLMs §4.2 Evaluation Metrics

§4.3 Performance Comparison

Fig. 2. The primary organizational structure of the survey.

3.2.4), and Reinforcement Fine-Tuning (Section 3.2.5). specifically in combination with foundational LLMs–a
Each section delves into the specific characteristics of crucial aspect often overlooked in previous works.
these methods and introduces representative reasoning 3) We present a more thorough and inclusive summary of
LLMs for each approach. Section 3.3 traces the evolu- the core methods necessary for constructing reasoning
tionary stages of reasoning LLMs. LLMs, including but not limited to RL.
3) Section 4 evaluates representative reasoning LLMs.
Specifically, Section 4.1 reviews current mainstream
reasoning benchmarks, covering both plain text and 2 F OUNDATIONS OF R EASONING LLM S
multimodal benchmarks across various task types. Sec-
In this section, we provide a concise overview of the
tion 4.2 outlines the current evaluation metrics, while
progress in foundational LLMs and the early development
Section 4.3 analyzes and compares the performance of
of key System 2 technologies, highlighting critical advance-
mainstream reasoning LLMs with their foundational
ments that, when combined with foundational LLMs, have
counterparts based on these benchmarks.
paved the way for reasoning LLMs. These advancements
4) Section 5 highlights the limitations of existing reasoning
include symbolic logic systems, MCTS, and RL.
LLMs and outlines several promising future develop-
ment directions for these models.
5) Finally, we conclude the paper in Section 6 and provide 2.1 Foundational LLMs
a real-time tracking GitHub Repository to monitor the
latest developments in the field. The development of foundational LLMs saw significant
advancements with the introduction of pretrained Trans-
We hope this survey serves as a valuable resource, fostering formers [18] in 2018-2019, notably through BERT [19] and
innovation and progress in this rapidly evolving domain. GPT [21]. These models leveraged unsupervised pretrain-
ing on vast text corpora, followed by fine-tuning for task-
1.2 Contribution of the Survey specific applications. This approach enabled them to de-
velop a broad language understanding before specializing
Recently, several analyses and replications of specific tech- in tasks such as sentiment analysis, entity recognition, and
nical approaches have been conducted [48]–[55], yet there question answering. BERT’s bidirectional context processing
remains a lack of systematic analysis and organization. improved word understanding, while GPT excelled in text
Research [56] has focused only on slow-thinking methods generation with its unidirectional design.
during testing. Meanwhile, studies [57]–[59] have primarily The release of GPT-2 [22] in 2019, with 1.5 billion param-
concentrated on training or achieving reasoning LLMs, often eters, marked a significant leap in generative performance,
from the perspective of RL. though it also raised ethical concerns. GPT-3 [23], with
Our survey distinguishes itself from and contributes to 175 billion parameters, further demonstrated the power
the existing literature in the following ways: of unsupervised pretraining, excelling in few-shot learning
1) Rather than focusing on a single technical approach, we and performing well across a wide range of NLP tasks. In
offer a comprehensive overview of the key concepts, subsequent years, multimodal models like CLIP [60] and
methods, and challenges involved in reasoning LLMs. DALL-E [61] emerged, integrating text and visual inputs.
2) We summarize the key advancements of early System 2 These models enabled new tasks, such as generating images
and how they have paved the way for reasoning LLMs, from text, and enhanced human-computer interaction.
JOURNAL OF LATEX CLASS FILES, JANUARY 2025 4

By 2023-2024, models such as GPT-4 [62], LLaMA [25], reasoning tasks, as hierarchical planning allows models to
and LLaVA [27] demonstrated advanced capabilities in rea- make high-level decisions before delving into specific prob-
soning, contextual understanding, and multimodal reason- lem details, mirroring symbolic logic’s structured approach.
ing, processing both text and images. The evolution of foun-
dational LLMs has revolutionized AI, enabling more sophis- 2.3 Monte Carlo Tree Search
ticated applications in language comprehension, problem-
solving, and human-machine collaboration. MCTS is a simulation-based search algorithm for decision-
Summary: The development of foundational LLMs has making and planning [79]. It constructs a search tree
progressed from pretrained transformers like BERT to mul- through four steps: Selection, which chooses the child node
timodal models such as GPT-4, enhancing language un- with the highest priority using the UCB1 formula:

derstanding, text generation, and image processing. This wi ln N
UCB1 = +c , (1)
advancement has led to significant breakthroughs in AI, ni ni
improving language comprehension, problem-solving, and
where wi is the total reward of node i, ni is its visit count, N
human-computer interaction. Building on deep learning
is the parent node’s visit count, and c balances exploration
advancements [63]–[66], foundational LLMs can learn exten-
and exploitation. Expansion adds new nodes, Simulation per-
sive world knowledge and semantic relationships from vast
forms random rollouts to evaluate them, and Backpropagation
textual or multimodal data. This enables them to exhibit
updates node statistics. MCTS has been widely used in tasks
emergent capabilities such as In-Context Learning (ICL)
such as optimizing strategies in board games like Go [80]
[67], prompt engineering [68], and Chain-of-Thought (CoT)
and in robotic path planning, where it helps robots navigate
reasoning [2], significantly enhancing their adaptability and
dynamic environments effectively [81].
creative problem-solving abilities.
Summary: MCTS has played a crucial role in the develop-
Despite this progress, foundational LLMs operate simi-
ment of reasoning LLMs, particularly in Structural Search
larly to System 1 reasoning, relying on fast, heuristic-driven
(Section 3.2.1). By simulating potential future reasoning
decision-making and lacking the step-by-step analysis char-
paths and backpropagating estimated rewards, MCTS helps
acteristic of System 2. However, their developments lay
foundational LLMs efficiently identify the most promising,
a solid foundation for future reasoning LLMs–especially
high-reward paths. This process mirrors human-like plan-
when integrated with the following early System 2 technolo-
ning, where future consequences of decisions are considered
gies. This combination paves the way for more versatile,
before taking action. By dynamically exploring multiple
flexible, and human-like reasoning models.
reasoning trajectories, MCTS enables models to avoid get-
ting stuck in suboptimal paths, making it easier to navigate
2.2 Symbolic Logic Systems
complex decision spaces. This integration has significantly
Symbolic logic systems mark the earliest phase of AI, utiliz- enhanced the ability of LLMs to handle intricate and dy-
ing rules and logical principles to represent knowledge and namic reasoning problems, such as those requiring long-
draw conclusions [69], [70]. They are particularly effective in term planning or multi-step logical inferences. It has al-
structured domains, where formal logic ensures precision. lowed LLMs to make more strategic and informed decisions,
Prolog, a logic programming language based on first- improving their overall performance in tasks that involve
order logic, allows users to define facts, rules, and reason nuanced reasoning and strategic exploration.
through queries. It has been pivotal in symbolic reasoning
systems, especially in NLP and expert systems [71]–[73].
Logic-based systems like Prolog employ propositional and 2.4 Reinforcement Learning
predicate logic for formal reasoning [74], [75]. From the RL is a type of machine learning where an agent learns to
1960s to the early 1980s, this approach dominated AI, with make decisions by interacting with an environment and re-
systems like IBM’s LISP [76] for symbolic computation and ceiving feedback in the form of rewards, aiming to maximize
Resolution Theorem Provers [77] for automated reasoning. cumulative rewards over time [82]. Early breakthroughs in
In the 1970s, Marvin Minsky introduced Frames, which or- RL, such as Q-learning [83] and DQNs [84], revolutionized
ganized knowledge into structured frameworks, influencing the field by enabling the handling of complex state spaces
both expert systems and cognitive science [78]. using Deep Neural Networks (DNNs) [85]. These methods
Summary: Symbolic logic systems were pivotal milestones paved the way for scaling RL to real-world tasks, where
in early AI development. Based on formal logic, they ex- traditional tabular approaches fell short. The advent of deep
celled in well-defined problems, particularly in structured RL marked a significant step forward, combining the power
environments. However, they also exposed the limitations of deep learning with RL to process high-dimensional in-
of rigid, rule-based systems. Despite these constraints, sym- puts, such as images and unstructured data.
bolic logic remains foundational to the progress of AI. A landmark achievement in deep RL was AlphaGo,
Recent advancements in reasoning LLMs have greatly which demonstrated RL’s potential by defeating a world
enhanced the emulation of human-like System 2 cogni- champion in the complex game of Go through self-play
tive processes through sophisticated thought architectures, [86]. This success highlighted deep RL’s ability to thrive
known as Macro Action frameworks (Section 3.2.4). By in environments with large, continuous action spaces and
combining symbolic templates or rules with foundational uncertainty. Building on this, AlphaZero advanced the ap-
LLMs, macro actions have significantly improved their rea- proach by mastering multiple board games—chess, Go, and
soning capabilities. Integrating macro actions into founda- Shogi—using self-play, MCTS, and DNNs [87]. AlphaZero’s
tional LLMs has transformed their ability to handle complex ability to learn entirely from scratch, without prior human
JOURNAL OF LATEX CLASS FILES, JANUARY 2025 5

Traditional Reasoning Models Reasoning LLMs

Large-scale Probabilistic
Rules, Logical Deterministic
Deduction Rigid
Training Corpus Reasoning
Approach Data-driven & Probabilistic Reasoning
Rule-based & Symbolic Logic
PPO / DPO / GRPO
Errors from Explicitly Learn Policy Feedbacks from
Encoded Adaptability Take Action
Human Experts Environment
Knowledge & Learning
Manual Knowledge Engineering Self-Improvement & Adaptive Reasoning

Structured Following Logical


Problem- Tree Search like Self-correct
Frameworks Premises Solving MCTS, ToT Mistakes
Strategy
Linear & Deductive Reasoning Exploratory & Multi-path Reasoning
Finance Medical Law
or or Generality
& Scalability
Math & Code & Chat
Small Scale & Domain-Specific
Scalable & Generalize across Tasks

Fig. 3. A comprehensive comparison of traditional reasoning models and reasoning LLMs. Reasoning LLMs offer significant advantages over
traditional models in areas such as training approaches, adaptability and learning, problem-solving strategies, and generality and scalability.

knowledge, showcased RL’s power in environments requir- models. The integration of RL has led to significant advance-
ing long-term strategy and planning. ments in reasoning LLMs, as demonstrated by DeepSeek-R1
AlphaStar further expanded the boundaries of deep RL [31], offering more flexible and efficient solutions.
by excelling in the real-time strategy game StarCraft II.
Unlike board games, StarCraft II presents dynamic, partially
observable environments and demands multi-step, real- 3 B LUEPRINTING R EASONING LLM S
time decision-making [88]. AlphaStar’s success in this do- In this section, we first analyze the features of reasoning
main demonstrated deep RL’s capacity to adapt to complex LLMs from both output behavior and training dynamics
decision-making scenarios that require both strategic plan- perspectives. We then provide a detailed overview of the
ning and tactical execution. These advancements in RL and core methods that enable their advanced reasoning capa-
deep RL have greatly expanded AI’s potential, transitioning bilities. Finally, we summarize the evolution of reasoning
from well-defined, static environments to dynamic, complex LLMs. A comprehensive comparison of traditional reason-
settings that demand continuous learning and adaptation. ing models and reasoning LLMs is shown in Figure 3.
Summary: Deep RL has proven highly effective in solving
complex decision-making tasks. AlphaGo exemplifies this
by learning strategies through self-play and defeating the 3.1 Analysis of the Features of Reasoning LLMs
world champion in Go. This self-play concept laid the foun- 3.1.1 Output Behaviour Perspective
dation for Self Improvement technology (Section 3.2.3) in Explore and Planning Structure: Recent empirical studies
reasoning LLMs, both relying on continuous feedback and have revealed that reasoning LLMs demonstrate a strong
adjustments to optimize strategies. tendency for exploratory behavior in their output structures,
In RL, reward shaping has been crucial, especially for especially when compared to models such as WizardMath
multi-step reasoning tasks [89]. By adjusting the reward [90] and DeepSeekMath [91], which primarily rely on con-
signal to provide more granular feedback during intermedi- ventional CoT reasoning approaches. This exploratory be-
ate steps, it helps agents navigate complex decision-making havior is evident in the formulation of novel hypotheses
paths. This concept inspired the development of Reward and the pursuit of alternative solution paths. Research by
Modeling (Section 3.2.2), particularly the process reward [49] suggests that slow-thinking models engage in a la-
model, in reasoning LLMs. This model offers step-by-step tent generative process, particularly noticeable during the
supervision to identify and correct errors in the reasoning prediction of subsequent tokens. This claim is supported
process. By mimicking human reasoning, the process re- by [31], which observes that similar behaviors naturally
ward model ensures more robust and interpretable results, arise during RL scale training. Furthermore, the Quiet-STaR
especially in tasks like mathematical problem-solving and framework [92] introduces an auxiliary pre-training phase
code generation, where step-by-step evaluation is critical. focused on next-token prediction, highlighting the critical
Moreover, RL itself is a powerful tool for reasoning role of internal deliberation and exploratory mechanisms
LLMs (Section 3.2.5). With a reward mechanism, RL guides prior to content generation. Collectively, these findings un-
foundational LLMs to find optimal solutions, especially in derscore the complex and dynamic nature of reasoning
dynamic reasoning problems. Its simplicity and efficiency processes in advanced LLMs, emphasizing the interaction
make RL invaluable for training and optimizing reasoning between exploration and structured reasoning within their
LLMs, enhancing the intelligence and self-evolution of AI operational frameworks.
JOURNAL OF LATEX CLASS FILES, JANUARY 2025 6

Verification and Check Structure: Analysis of OpenAI’s Reinforcement


o1 [29] and o3 [30] models indicates that their reasoning Fine-Tuning
frameworks incorporate both macro-level actions for long-
term strategic planning and micro-level actions, including
“Wait”, “Hold on”, “Alternatively”, and “Let’s pause”. These
micro actions facilitate meticulous verification and itera-
tive checking processes, ensuring precision in task execu- Structure Macro
tion. Such a dual-layered approach underscores the mod- Search Action
els’ capacity to balance overarching goals with granular,
Reasoning LLMs
detail-oriented operations, thereby enhancing their overall
functionality and reliability. To emulate this characteristic,
Marco-o1 [93], during the MCTS process for constructing
Long-CoT, assigns each tree node the state of “Wait! Maybe
I made some mistakes! I need to rethink from scratch”, thereby
facilitating the reflective nature of Long-CoT. Huatuo-o1 [94] Slow-thinking
employs a multi-agent framework to address the issue of
incorrect CoT generation during validation. This is achieved
by incorporating a prompt with “Backtracking” and “Correc-
tion” functionalities, which enables the correction process.
Longer Inference Length & Time: Recent research [49]–
Reward Self
[52] indicates that reasoning LLMs often generate outputs
exceeding 2000 tokens to tackle complex problems in coding Modeling Improvement
and mathematics. However, this extended output length can
sometimes lead to overthinking, where the model spends Fig. 4. The core methods enabling reasoning LLMs.
excessive time on a problem without necessarily improving
the solution. Studies [49] highlight that while autoregressive
generation and Classic CoT can effectively solve simpler
problems, they struggle with more complex tasks. Research Sparse Training Method: Contrary to conventional wisdom,
[95], [96] shows that in multimodal domains, many prob- the development of effective reasoning LLMs does not re-
lems demand careful observation, comparison, and delib- quire extensive datasets or dense reward signals. For exam-
eration. Additionally, Search-o1 [97] suggests that slow- ple, STILL2 [51] demonstrated impressive performance us-
thinking mechanisms are particularly beneficial in areas ing only 5,000 distilled samples, while Sky-T1 [99] achieved
requiring external knowledge or where potential knowledge performance parity with QwQ [98] using just 17,000 Long-
conflicts arise. In medical scenarios, complex problems, such CoT samples. Similarly, RedStar [54] achieved exceptional
as those requiring test-time scaling techniques, demonstrate results across both textual and multimodal tasks with only
significant improvements [52]. 4,000 core LongCoT samples. In comparison to simple CoT,
Overly Cautious & Simple Problem Trap: Currently, rea- Slow-thinking Supervised Fine-Tuning (SFT) data exhibits
soning LLMs have demonstrated strong performance in do- remarkable sample efficiency, often delivering comparable
mains such as competitive-level mathematics [31], [54], [98], results with just 1/100th of the sample size. Additionally,
[99], complex coding [100], medical question answering [52], research [103] emphasizes the significant training potential
[94], and multilingual translation [93], [101]. These scenarios of online RL scaling algorithms, suggesting that non-dense
require the model to perform fine-grained analysis of the RL supervision and even rule-based reward structures are
problem and execute careful logical reasoning based on sufficient for achieving high performance.
the given conditions. Interestingly, even for straightforward Parameter Characteristic: Training LLMs for slow-thinking,
problems like “2+3=?”, reasoning LLMs can exhibit over- as characterized by the LongCoT approach, results in rel-
confidence or uncertainty. Recent research [102] notes that atively uniform gradient norms across different layers. In
o1-like models tend to generate multiple solution rounds for contrast, fast-thinking, exemplified by the simplified CoT
easier math problems, often exploring unnecessary paths. method, generates larger gradient magnitudes in the earlier
This behavior contrasts with the lack of diverse exploratory layers, along with significant variability in gradient norms
actions for simpler questions, indicating a potential ineffi- across layers. Empirical evidence suggests that larger mod-
ciency in the model’s reasoning process. els, particularly those exceeding 30 billion parameters, are
more compatible with reasoning LLMs training due to their
3.1.2 Training Dynamic Perspective enhanced capacity for complex reasoning. Additionally, ex-
Amazing Data Efficiency: Unlike traditional approaches periments conducted by RedStar [54] show that the benefits
that focus on expanding instruction sets with uniformly of data scaling vary across model sizes, with scaling effects
distributed difficulty levels, Studies [52], [54] suggest that being more pronounced and effective in larger models. This
constructing Slow-thinking CoT datasets with a focus on finding is supported by Deepseek-R1’s research [31], which
hard samples leads to better generalization in fields like demonstrates that a 670-billion-parameter model achieves
medicine and mathematics. This approach diverges from performance metrics closely approximating those of the o1
the conventional practice of collecting diverse and evenly benchmark, highlighting the scalability advantages of larger
distributed instruction datasets. architectures in advanced reasoning tasks.
JOURNAL OF LATEX CLASS FILES, JANUARY 2025 7

TABLE 1
Summary of Structure Search method.

Category Reasoning LLMs Characteristic


Reasoning Steps as Nodes RAP [14], ORM [104], Forest-of-Thought [105] Actions represent intermediate reasoning steps.
Actions

Token-level Decisions CodeTree [106], SPaR [107], TreeBoN [108] Actions involve generating tokens.
Task-specific Structures CWM [109], LLM-MCTS [110] Actions are domain-specific.
Correction and Exploration RethinkMCTS [111], MCTSr [112] Actions emphasize revisiting, refining, or backtracking to improve previous reasoning steps.
Outcome-based Rewards MC-NEST [113] Correctness or validity of the final outcome.
Rewards

Stepwise Evaluations RAP [14], SRA-MCTS [114] Rewards are assigned at intermediate steps.
Self-evaluation Mechanisms SPaR [107], TreeBoN [108], MindStar [115] Rewards rely on the model’s own confidence.
Domain-specific Criteria LLM-MCTS [110], SR-MCTS [116] Rewards are tailored to specific tasks.
Iterative Preference Learning LLaMA-Berry [117], Marco-o1 [93], ReST-MCTS* [118] Rewards derive from comparing multiple solutions.

3.2 Core Method evaluated and refined. In fields like instructional alignment,
In this section, we provide an overview of the core methods frameworks such as SPaR [107] and Marco-o1 [93] leverage
that drive the advanced reasoning capabilities of reasoning MCTS to refine responses and align reasoning trajectories
LLMs, as shown in Figure 4. These include Structure Search, with human preferences or desired outcomes. Additionally,
Reward Modeling, Self Improvement, Macro Action, and task-specific implementations like HuatuoGPT-o1 [94] un-
Reinforcement Fine-Tuning. We also highlight representa- derscore MCTS’s crucial role in navigating highly special-
tive reasoning LLMs for each method. ized domains, such as medical reasoning, where accuracy
and robustness are paramount.
3.2.1 Structure Search MCTS also enables models to go beyond single-pass
Reasoning LLMs aim to achieve high accuracy and depth reasoning methods, such as CoT or Tree-of-Thought, by
in solving complex problems by emulating the deliberate incorporating mechanisms to revisit, critique, and refine
and methodical nature of human reasoning. However, de- reasoning steps dynamically [111], [121]. This iterative ca-
spite recent advancements, current foundational LLMs face pability is essential for tackling tasks with vast decision
inherent limitations when addressing intricate reasoning spaces or those requiring long-term planning, where ear-
tasks. These limitations arise from their lack of an inter- lier decisions can significantly impact final outcomes. By
nal world model to simulate environmental states, their allowing LLMs to simulate, evaluate, and refine multiple
inability to predict the long-term outcomes of reasoning reasoning paths, MCTS introduces a level of adaptability
paths, and their failure to iteratively refine reasoning steps and strategic exploration that traditional approaches lack.
based on future states or rewards [8]. As a result, these As shown by AlphaZero-like tree-search [104] and Search-o1
shortcomings hinder foundational LLMs from effectively [97], MCTS enables reasoning LLMs to not only achieve bet-
balancing exploration and exploitation in vast reasoning ter performance on specific tasks but also exhibit enhanced
spaces, creating challenges in tasks that require multi-step generalization capabilities across diverse domains.
reasoning, such as complex mathematics, logical inference, The integration of MCTS into LLMs depends on defining
or strategic decision-making [119]. actions and rewards to guide reasoning path exploration
MCTS, a powerful search and optimization algorithm, and assess quality. As shown in Table 1, we classify the
effectively addresses these challenges by providing a struc- actions in prior work into four categories:
tured framework to explore and evaluate reasoning paths 1) Reasoning Steps as Nodes: Actions represent inter-
systematically. It operates by constructing a reasoning tree, mediate reasoning steps or decisions, such as select-
where each node represents a reasoning state, and ac- ing rules, applying transformations, or generating sub-
tions expand the tree by considering potential next steps. questions [14], [104], [105], [119].
Through the simulation of future states and the iterative 2) Token-level Decisions: Actions involve generating to-
backpropagation of estimated rewards, MCTS allows foun- kens or sequences (e.g., the next word, phrase, or code
dational LLMs to efficiently identify high-reward reasoning snippet) [106]–[108], [122].
paths, mirroring human planning processes. This approach 3) Task-specific Structures: Actions are domain-specific,
aligns with the core principles of reasoning LLMs, where such as moving blocks in blocksworld, constructing
thorough analysis and deliberate exploration are essential geometry in geometry problem-solving, or modifying
for generating well-reasoned outputs. Recent methods, such workflows in task planning [109], [110], [123].
as RAP [14], enhance foundational LLMs by integrating 4) Self-correction and Exploration: Actions focus on re-
MCTS with a world model, enabling the system to itera- visiting, refining, or backtracking to improve previous
tively refine intermediate reasoning steps and improve fu- reasoning steps [111], [112], [124].
ture predictions. Similarly, Forest-of-Thought [105] utilizes Additionally, as shown in Table 1, we classify the reward
MCTS to dynamically explore multiple reasoning trajecto- design into five categories:
ries, revisiting flawed paths and refining outcomes. 1) Outcome-based Rewards: Rewards focus on the cor-
The application of MCTS in reasoning tasks extends rectness or validity of the final outcome or solution,
beyond traditional problem-solving to highly specialized including the validation of reasoning paths or task
domains. For example, frameworks like SRA-MCTS [114] success [113], [119], [123].
and MC-NEST [120] showcase the utility of MCTS in tack- 2) Stepwise Evaluations: Rewards are assigned at inter-
ling technical challenges such as code generation and math- mediate steps based on the quality of each step or its
ematical reasoning, where intermediate steps are iteratively contribution toward the final outcome [14], [104], [114].
JOURNAL OF LATEX CLASS FILES, JANUARY 2025 8

TABLE 2
Summary of Reward Modeling method.

Model Refinement
Category Methods Data Source Applications Characteristic
Strategy Learning
DIVERSE [127] Prompting Fine-tuning SFT Multiple Reasoning Tasks Weighted Voting Verifier
MATH-SHEPHERD [128] Sampling Feedback-guided SFT & RL Math Reasoning Correctness Score Assignment
ORM AutoPSV [129] Prompting Feedback-guided SFT Math / Commonsense Reasoning Automated Process Supervision
Implicit PRMs [130] Sampling Fine-tuning SFT & RL Math Reasoning Obtaining PRM from ORM
OVM [131] Sampling Feedback-guided SFT Math Reasoning Guided Decoding
ReST-MCTS∗ [132] Sampling Self-training SFT & RL Multiple Reasoning Tasks MCTS and Self-training
OmegaPRM [133] MCTS with Binary Search Feedback-guided SFT Math Reasoning Divide-and-Conquer MCTS
MCTS
ReARTeR [134] Sampling Feedback-guided SFT & RL QA Retrieval-Augmented Generation
Consensus Filtering [135] MCTS Data Construction Feedback-guided SFT Math Reasoning Consensus Filtering Mechanism
ORPS [136] Sampling Feedback-guided SFT Code Generation Supervising Outcome Refinement
PRM Step-DPO [137] Sampling Feedback-guided SFT & RL Math Reasoning Step-wise Preference Pairs
AdaptiveStep [138] Response Dividing Feedback-guided SFT Math Reasoning, Code Generation Dividing Reasoning Steps

3) Self-evaluation Mechanisms: Rewards rely on the


model’s own confidence or self-assessment (e.g., like- O
lihood, next-word probability, or confidence scores) R q s1 s2 … sn a
[107], [108], [115]. M
4) Domain-specific Criteria: Rewards are tailored to spe-
cific tasks, such as symmetry and complexity in ge- r
ometry or alignment with human preferences in text
generation [110], [116], [123].
5) Iterative Preference Learning: Rewards are derived P
from comparing multiple solutions or reasoning paths, R q s1 s2 … sn a
guiding learning dynamically [93], [117], [118]. Question M
Summary: Despite its advantages, structure search-based Reasoning steps
(i.e., MCTS) reasoning LLMs often suffer from substantial r1 r2 rn ra
Answer
computational overhead due to the large number of sim- Reward
ulations required. This makes them less suitable for tasks
that demand real-time decision-making or operate under
resource constraints [125]. Additionally, the effectiveness Fig. 5. The comparison between ORM and PRM for assessing a com-
plete solution trajectory. ORM only provides a single reward based on
of MCTS is highly dependent on well-designed reward the correctness of the final answer, while PRM evaluates the quality of
mechanisms and action definitions, which can vary signif- each reasoning step throughout the process.
icantly across different domains, thus posing challenges to
its generalizability [126].
mathematical problems, its benefits have recently driven
3.2.2 Reward Modeling applications in other fields. For instance, ORPS [136] uti-
Two primary training paradigms are used to tackle multi- lizes PRM to address complex code generation challenges,
step reasoning tasks: outcome supervision and process su- while Step-DPO [137] combines process supervision with
pervision. Outcome supervision emphasizes the correctness the Direct Preference Optimization (DPO) algorithm [143]
of the final answer at a higher level of granularity, and the to improve long-chain mathematical reasoning. A summary
resulting model is referred to as the Outcome Reward Model of Reward Modeling method is presented in Table 2.
(ORM) [32], [139]. In contrast, process supervision provides Summary: Despite the advantages of PRMs, they present
step-by-step labels for the solution trajectory, evaluating several challenges. The primary difficulty is obtaining pro-
the quality of each reasoning step. The resulting model is cess supervision-labeled data, which is often both costly and
known as the Process Reward Model (PRM) [37], [140], time-consuming. To address concerns related to scalability,
[141]. The main distinction between ORM and PRM is efficiency, and accuracy, researchers have explored vari-
illustrated in Figure 5. ous automated annotation methods. For example, MATH-
PRM offers significant advantages [128], [142] in com- SHEPHERD [128] utilizes the correctness of the final an-
plex reasoning tasks for several key reasons. First, it pro- swer to define the quality of intermediate steps based on
vides fine-grained, step-wise supervision, allowing for the their potential to lead to the correct outcome, automating
identification of specific errors within a solution path. This the step-wise data collection process. ReST-MCTS∗ [132]
feature is especially valuable for RL and automated error combines process reward guidance with MCTS to generate
correction. Second, PRM closely mirrors human reasoning higher-quality reasoning traces through extensive rollouts.
behavior, which relies on accurate intermediate steps to Similarly, OmegaPRM [133] employs the MCTS framework
reach correct conclusions. Unlike ORM, PRM avoids situ- while introducing a divide-and-conquer algorithm for auto-
ations where incorrect reasoning can still lead to a correct fi- mated process supervision data generation. Another novel
nal answer, thus ensuring more robust and interpretable rea- approach involves using ORM to train a PRM. Yuan et
soning. While PRM has primarily been applied to complex al. [130] propose training a PRM implicitly by leveraging
JOURNAL OF LATEX CLASS FILES, JANUARY 2025 9

TABLE 3
Summary of Self Improvement method.

Model Refinement
Stage Methods Data Source Applications
Feedback Strategy
STaR [144] Few-shot Language Model SFT QA, Arithmetic Reasoning
Quiet-STaR [92] Token-level Exploration Language Model RL QA, Arithmetic Reasoning
V-STaR [145] Sampling Verifier SFT Arithmetic Reasoning, Code Generation
B-STaR [146] Sampling Reward Model SFT Arithmetic Reasoning, Code Generation
rStar-Math [147] MCTS Data Construction Reward Model SFT Arithmetic Reasoning
Training ReST [148] Sampling Reward Model RL Machine Translation
ReST-EM [149] Sampling Language Model EM for RL Arithmetic Reasoning, Code Generation
ReST-MCTS* [132] Sampling Reward Model SFT, RL Reasoning
ENVISIONS [150] Sampling Environment Guided SFT Web Agents, Reasoning
RISE [151] Sampling Reward Function Weighted SFT Arithmetic Reasoning
STIC [152] Few-shot Language Model SFT Vision Language Model Tasks
SIRLC [153] Question Answeing Language Model RL Reasoning, Translation, Summary
AlpacaFarm [154] Existing Data Language Model SFT None (Intrinsic Evaluation)
Self-Refine [155] Independent of Training Data Language Model Few-shot Demonstration Code Generation, Sentiment Reversal, Acronym Generation
Self-Check [156] Independent of Training Data Language Model Step Check QA, Arithmetic Reasoning
Inference
CRITIC [157] Independent of Training Data Language Model External Tools QA, Arithmetic Reasoning, Detoxification
ROSE [158] Independent of Training Data Language Model Distributed Prompt Safety, Knowledge
Self-Verification [159] Independent of Training Data Language Model Re-Ranking Arithmetic Reasoning
SelfEval-Decoding [160] Independent of Training Data Language Model Beam Search Aritnmetic/Symbolic Reasoning
IPS [161] Independent of Training Data Language Model Constrained Decoding Dialogue
Control-DAG [162] Independent of Training Data Language Model Constrained Decoding Dialogue, Open-domain Generation
Look-Back [163] Independent of Training Data Language Model Contrastive Decoding Alleviating Repetitions
LeCo [164] Independent of Training Data Language Model Constrained Decoding QA, Reasoning

ORM training on cheaper datasets, under mild reward trajectories. Quiet-STaR [92] explores at the token level, in-
parameterization assumptions. They also provide theoret- troducing concepts like meta-tokens and non-myopic loss to
ical guarantees for the performance of this implicit PRM, enhance supervision. Additionally, ReST-MCTS* [132] and
demonstrating its practicality and cost-effectiveness. rStar-Math [147] generate training data through MCTS.
In addition to data collection, PRMs face challenges Improvement strategies also exhibit significant diversity.
related to trustworthiness [134], categorized as follows: For instance, STaR and its derivatives, such as V-STaR [?]
1) Lack of Explanations: Current PRMs often generate and B-STaR [146], combine filtering with SFT. ReST and its
scores for reasoning steps without sufficient explana- variants typically introduce innovative reward calculation
tions, limiting interpretability and hindering their use- methods to enhance RL training for policy models. RISE
fulness in refining reasoning during test-time. [151] incorporates external feedback, recording rewards and
2) Bias in Training Data: Data collection methods, such refining responses through distillation during the improve-
as MCTS, tend to introduce distributional biases, as- ment process. Notably, rStar-Math [147] demonstrates that
signing disproportionately higher scores to the majority small models have achieved System 2 reflective capabilities
of questions. As a result, PRMs struggle to effectively through self-evolving training approaches.
identify erroneous reasoning steps. Test-time self improvement leverages the consistency
3) Early-Step Bias: PRMs show lower accuracy in pre- of a model’s internal knowledge to correct hallucinations
dicting rewards for earlier reasoning steps compared to during inference. These approaches can be categorized
those closer to the final answer. This issue stems from into three main types: methods that refine answers using
the increased randomness and uncertainty associated prompts [155], [156], approaches that utilize external tools
with the initial steps in the reasoning process. [157], and techniques that leverage logits without the need
for external tools or prompts [163], [164].
3.2.3 Self Improvement
Reasoning LLMs exemplify a progression from weak to 3.2.4 Macro Action
strong supervision, while traditional CoT fine-tuning faces Recent advancements in LLMs have driven progress in emu-
challenges in scaling effectively. Self improvement, using the lating human-like System 2 cognitive processes via sophisti-
model’s exploration capabilities for self-supervision, gradu- cated thought architectures, often referred to as macro action
ally enhances LLMs performance in tasks such as translation frameworks. These structured reasoning systems go beyond
[148], mathematical reasoning [144], [149], and multimodal traditional token-level autoregressive generation by intro-
perception [152]. This approach fosters exploration and ap- ducing hierarchical cognitive phases, such as strategic plan-
plication within reasoning LLMs [147], [165]. A summary of ning, introspective verification, and iterative refinement.
Self Improvement method is presented in Table 3. This approach not only enhances the depth of reasoning but
Training-based self improvement in LLMs can be cat- also broadens the solution space, enabling more robust and
egorized based on exploration and improvement strate- diverse problem-solving pathways. A summary of Macro
gies. The exploration phase focuses on data collection to Action method is presented in Table 4.
facilitate subsequent training improvements, with notable We classify the progress of macro action into two aspects:
variations in approach. STaR [144] uses few-shot examples 1) Test-time Scaling through Macro Action Operational-
for data gathering, while ReST [148], ReST-EM [149], and ization: Recent research identifies two key method-
ENVISIONS [150] rely on multiple samplings of complete ologies for improving reasoning performance during
JOURNAL OF LATEX CLASS FILES, JANUARY 2025 10

TABLE 4
Summary of Macro Action method.

Action Attribute
Methods Usage Representative Action
Action Source Action Number Learning Reflection Modality
Self-Check [166] Verification Human-Designed 4 ICL ✓ T Target Extraction, Information Collection, Step Regeneration, Result Comparsion
LeMa [167] Synthetic Data Human-Designed 3 ICL & SFT ✓ T Incorrect Step Recognition, Explanation, Correct Solution:
REFINER [168] Verification/Exploration Human-Designed 2 ICL & SFT ✓ T Critic, Generate
HiICL-MCTS [169] Exploration Human-Designed 5 ICL ✓ T System Analysis, One-Step Thought, Divide and Conquer, ..., Self-Reflection and Refinement
SUPERCORRECT [170] Distill In-Context Learning Dynamic SFT & RL ✗ T –
ReasonFlux [171] Synthetic Data/Exploration Human-Designed ∼500 ICL & SFT & RL ✗ T –
rStar [172] Exploration Human-Designed 5 ICL & RL ✓ T One-step thought, Propose Next Sub-question & Answer, ..., Rephrase question
LLaMA-Berry [173] Exploration Human-Designed 2 ICL & RL ✓ T Reflection, Error Re-correction
Huatuo-o1 [94] Synthetic Data Human-Designed 4 ICL & SFT ✓ T Backtracking, Exploring New Paths, Verification, Correction
Marco-o1 [93] Verification Human-Designed 1 ICL & SFT ✓ T Reflection
BoT [174] Exploration In-Context Learning Dynamic ICL ✗ T Solving Quadratic Equation, Array Sorting, ..., Search Algorithms)
rStar-Math [147] Exploration In-Context Learning 1 ICL & RL ✓ T Python comment
Mulberry [175] Synthetic Data In-Context Learning 1 ICL & SFT ✓ T Reflection
LLaVA-CoT [176] Synthetic Data/Exploration Human-Designed 4 SFT ✗ I T Summary, Caption, Reasoning, Conclusion
LLaMAV-o1 [177] Verification/Exploration Human-Designed 4173 Curriculum Learning ✓ I T Detailed Caption Generation, Logical Reasoning, ... Final Answer Generation
AtomThink [178] Synthetic Data/Exploration In-Context Learning >100 SFT & RL ✓ I T Variable Definition, Calculations, Graphs Analysis , ..., Verification
RedStar [54] Distill Human-Designed 2 SFT ✓ I T Wait, Alternately
Auto-CoT [179] Exploration In-Context Learning 2 ICL ✗ T Question clustering, Demonstration Sampling
PoT [180] Verification In-Context Learning 1 ICL ✗ T Code language conversion
PAL [181] Verification In-Context Learning 1 ICL ✗ T Code language conversion
Decomposed Prompt [182] Exploration Human-Designed 3 ICL ✗ T Peoblem Split, Subproblem Solving, Answer Merge
Least-to-Most [183] Exploration Human-Designed 2 ICL ✗ T Problem Decomposition, Subproblem Solving

inference and test-time scaling. HiICL-MCTS [169] em- for specific domains or complex tasks. Unlike general SFT,
ploys a deliberate search through seed data to gener- RFT focuses on optimizing the model’s reasoning process by
ate action-chain templates consisting of macro actions, using a reward mechanism to guide the model’s evolution,
thereby facilitating an action-chain-guided approach to thereby enhancing its reasoning capabilities and accuracy.
test-time reasoning. ReasonFlux [171] utilizes an iter- The core of RFT lies in improving the model’s performance
ative test-time scaling framework, harnessing external in a specific domain with minimal high-quality training
high-level thought templates to iteratively refine and data [188], an appropriate reward model [189], and a stable
update the current CoT. optimization process [190]. A summary of RFT method is
2) Macro Action-Enhanced Data Synthesis Paradigms: A presented in Table 5.
key application of macro actions in complex reasoning DeepSeek-R1 [31], which employs a verifier reward-
is in the synthesis of reasoning data. In data synthesis based strategy, has shown significant performance improve-
and training frameworks, macro action architectures ments compared to traditional methods like SoS [191]. Key
enhance reasoning diversity and generalization. Recent advantages include:
research has shown that integrating or synthesizing a 1) Simplified Training Pipeline: RL supervision stream-
CoT process with macro actions within the reasoning lines data construction and training processes, eliminat-
sequence can significantly improve the data efficiency ing the need for complex stepwise search mechanisms.
of the reasoning chain. For instance, LLaVA-CoT [176] 2) Enhanced Scalability: Online RL training facilitates
enhances CoT data synthesis by externalizing interme- efficient scaling on large datasets, particularly for com-
diate reasoning steps across multiple modalities. Atom- plex reasoning tasks.
Think [178] generates the AMATH-SFT dataset using 3) Emergent Properties: DeepSeek-R1 [31] demonstrates
a structured g1 prompt [184], achieving superior per- unique emergent capabilities, such as Long-CoT reason-
formance on long-horizon reasoning tasks compared to ing, which are difficult to achieve through SFT alone.
traditional CoT approaches. CoAct [185] introduces a Despite its strengths, RFT faces the following challenges:
dual-agent collaborative reasoning framework, where 1) Unclear Mechanism behind Reasoning: The underly-
a global planning agent executes overarching macro- ing mechanisms driving the reasoning improvements in
actions, while a local execution agent carries out specific DeepSeek-R1 remain poorly understood. For example,
sub-actions within those broader actions. while DeepSeek-R1 exhibits emergent properties (e.g.,
Macro actions also play a crucial role in enhancing Self- “Emergent Length Increasing”, “Aha moments”), stud-
improvement frameworks. rStar-Math [147] utilizes high- ies such as [219] suggest that capabilities like Long-CoT
level deliberate search through Code-augmented CoT, gen- might already exist in the base model, rather than solely
erating diverse and reliable solutions while achieving proac- emerging from RL training. Furthermore, performance
tive search capabilities. Satori [186] integrates CoT with gains observed in smaller models (e.g., Qwen-Math-
RL, incorporating “<reflect>”-style macro actions to diversify 2B/7B [220]) occur without noticeable “Aha moments”,
exploration and alleviate policy saturation in online RL en- complicating causal interpretations.
vironments. Huatuo-o1 [94] combines hierarchical planning 2) Reward Model Saturation: Many existing RL algo-
with domain-specific knowledge bases to improve medical rithms face reward model saturation, typically mani-
reasoning. Additionally, ReasonFlux [171] dynamically re- fested as exploration collapse after around 100 train-
configures reasoning templates (e.g., breaking down calcu- ing steps. Although DeepSeek-R1 alleviates this issue
lus problems into symbolic and numeric phases) to align through specialized reward formatting, methods like
with the problem structure. ReFT [189] and Satori [186] propose alternating sam-
pling and SFT distillation to combat reward hacking
3.2.5 Reinforcement Fine-Tuning and exploration collapse.
Reinforcement Fine-Tuning (RFT) [187] is an innovative 3) Unstable Long-CoT Generation: Long reasoning
technique recently introduced by OpenAI, designed to en- chains generated by RFT are prone to instability, includ-
able developers and engineers to fine-tune existing models ing context overflow, failure to return final answers, and
JOURNAL OF LATEX CLASS FILES, JANUARY 2025 11

TABLE 5
Summary of RFT method.

Model Attribute Incentivize Attibute


Methods Application & Benchmark
Foundational LLMs Modality Reward Type Algorithm Learning Incentivize Sample
Reason RFT Project
DeepSeek-R1-Zero [31] DeepSeek-V3 T Rule-Outome-Reward GPRO RL 800K Multiple Tasks
DeepSeek-R1 [31] DeepSeek-V3 T Rule-Outcome-Reward GPRO RL & SFT 800K Multiple Tasks
Kimi v1.5 [192] – I T Rule-Outcome-Reward PPO∗ RL & SFT – Multiple Tasks
ReFT [189] Galactica, CodeLLama T Rule-Outcome-Reward PPO∗ RL & SFT 3k/7k/8k/15k GSM8k/SVAMP/MathQA
RFTT [193] LLaMA-3-3/8B-Instruct,Qwen-2.5-7B-Instruct T Rule-Outcome-Reward Reinforce++ RL & SFT 1.2K Multiple Math Task
Satori [186] Qwen-2.5-Math-7B T Rule-Outcome-Reward PPO RL & SFT 66K Multiple Math Task
QCLASS [194] Llama-2-7B-Chat T Process-Reward QNet RL & SFT 1.9K/1.5K/3.3K WebShop, ALFWorld, SciWorld
PRIME [195] Qwen2.5-Math-7B T Rule-Process-Outcome-Reward PPO RL & SFT 150K Math, Code Tasks
DeepScaleR [196] DeepSeek-R1-Distill-Qwen-1.5B T Rule-Outcome-Reward Iteratively GPRO RL 40K Multiple Math Task
PURE [197] Qwen2.5-Math-7B T Rule-Process-Outcome-Reward PPO+RLOO RL 8K Multiple Math Task
SimpleRL [103] Qwen2.5-Math-7B T Rule-Outcome-Reward PPO RL 8K Multiple Math Task
Open-R1 [198] Qwen2.5-1.5B-Instruct T Rule-Outcome-Reward GPRO RL & SFT 8K Multiple Math, Code Task
TinyZero [199] Qwen2.5-0.5B/3B T Rule-Outcome-Reward GPRO RL – CountDown Task
Ota-Zero [200] Qwen-2.5-Series, DeepSeek-Series, Rho, Llama-3.x T Rule-Outcome-Reward GRPO RL 0.5K CountDown Task
Ota [201] RHO-1b/Qwen2.5-3B T Rule-Outcome-Reward GPRO/PPO RL 7.5K GSM8K
LIMR [202] Qwen-Math-7B T Rule-Outcome-Reward PPO RL 1.3K Multiple Math Task
Critic-RL [203] Qwen2.5-Coder-32B T Rule-Outcome-Reward GPRO∗ RL & SFT 18.8K Multiple Code Task
Logic-R1 [204] Qwen2.5-7B-Instruct-1M T Rule-Outcome-Reward REINFORCE++∗ RL 5K Multiple Math, Logic Task
Online-DPO-R1 [205] Qwen2.5-MATH-7B T Rule-Outcome-Reward DPO RL& SFT 207.5K Multiple Math Task
OpenReason-Zero [206] Qwen-2.5-7B/32B T Rule-Outcome-Reward PPO RL 57K Multiple Math Task, GPQA, MMLU
RLHF-V [207] OmniLMM-12B I T Process-Reward DDPO RL 1.4K Multiple Tasks
RLAIF [208] PaLM 2 Extra-Small T Rule-Outome-Reward RLAIF RL – Summary and Conversation Generation
MM-RLHF [209] LLaVA-onevision-7B I T V Process-Reward MM-DPO RL 120K MM-RLHF-RewardBench/SafetyBench
Align-DS-V [210] LLaVA-v1.5-7B,Qwen2-VL I T V Process-Reward PPO, DPO RL & SFT 200K Align-Anything, Eval-Anything
R1V [211] Qwen2-VL,Qwen2.5-VL I T Rule-Outome-Reward GRPO RL 70K/70K/8K Multiple Tasks
VLM-R1 [212] Qwen2.5-VL I T Rule-Outome-Reward GRPO RL 120K Multiple Tasks
LMM-R1 [213] Qwen2.5-VL I T Rule-Outome-Reward PPO/RLOO RL 8K Multiple Tasks
Open-R1-Video [214] Qwen2-VL-7B I T V Rule-Outome-Reward GRPO RL 4K Multiple Tasks
Easy-R1 [215] Qwen2.5-VL I T Rule-Outome-Reward GRPO RL 3K Multiple Tasks
Analysis RFT Project
Demystify-LongCoT [216] Llama-3.1-8B, Qwen2.5 -7B-Math T Rule-Outcome-Reward PPO/Reinforce++ RL & SFT 7.5K Multiple Math, MMLU
RLHF-Scale [217] GLM4-9B T Process-Reward PPO RL 11K Multiple Tasks
MD-CoT [218] – – Process-Reward PPO RL – –

sensitivity to reward shaping [102]. For instance, meth- CoT) arise from RL optimization or are latent traits
ods like [216] inadvertently introduce cosine reward of the base model. Systematic studies on reward de-
functions, which degrade performance with increased sign principles (e.g., sparse vs. dense rewards, multi-
iterations. O1-Prune [221] uses post-hoc length pruning objective balancing) should be conducted to avoid un-
techniques [192] (via RL/SFT) to stabilize outputs. intended behaviors such as reward hacking.
Future directions for RFT may include several exciting Summary: RFT presents a promising direction for advanc-
and innovative advancements, such as: ing LLMs reasoning, as evidenced by DeepSeek-R1 [31].
However, challenges such as reward saturation, unstable
1) Efficient and Stable RL Frameworks: There is a need
long reasoning chains, and unclear emergent mechanisms
to develop more robust RL algorithms that prevent re-
require urgent attention. Future efforts should prioritize
ward saturation and exploration collapse. [216] reveals
algorithmic innovation, scalable prompt synthesis, and the-
that REINFORCE++ [222] underperforms when com-
oretical grounding to fully unlock the potential of RL-driven
bined with KL divergence regularization, suggesting
reasoning LLMs.
the need for alternative methods. Future work should
revisit classic RL algorithms in the context of modern
LLMs training to optimize both stability and efficiency. 3.3 Evolutionary of Reasoning LLMs
2) Scaling RFT: Current RL-Supervise models rely on
curated, verifiable prompts selected from large-scale The evolution of reasoning LLMs has progressed by several
datasets. Future research should focus on synthesizing distinct stages, with various strategies developed to over-
high-quality, diverse prompts to improve generaliza- come the limitations of direct autoregressive inference and
tion. [217] shows that merely scaling policy/reward build more advanced slow-thinking reasoning architectures.
models or increasing sample sizes results in diminish- In the early stages, reasoning LLMs primarily focused on
ing returns, while expanding the scope of PRM and enhancing pre-trained LLMs with external reasoning algo-
R1 training data holds greater promise. Hybrid ap- rithms, without altering the underlying model parameters.
proaches, such as combining RL with SFT or curriculum Approaches such as Tree of Thoughts [223] and Reasoning
learning, should be explored to enhance scalability. via Planning [14] utilized LLMs-driven Breadth-First Search,
3) Controlling Long-CoT Stability: Adaptive reward Depth-First Search, and MCTS [79], [105], [108], [224] to
shaping mechanisms are needed to balance reasoning simulate human-like reasoning processes. These methods
length, coherence, and answer correctness. Techniques represented reasoning as tree or graph traversals, where
such as O1-Prune [221] demonstrate the value of post- intermediate reasoning states were depicted as nodes, and
hoc length regularization, but dynamic in-training con- various reasoning strategies produced distinct reasoning
trols are necessary. Hierarchical RL frameworks should paths. The final decision was made through additional vot-
be investigated to decompose long reasoning chains ing mechanisms [3] or Monte Carlo-based value estimation
into manageable sub-tasks, reducing instability. to identify the optimal path.
4) Theoretical and Empirical Analysis: It is essential to However, these externalized slow-reasoning approaches
clarify the relationship between RL training and the introduced several challenges:
capabilities of the base model. For instance, it should 1) Limited Exploration Space: The search-based methods
be determined whether emergent properties (e.g., Long- required predefined constraints on the breadth, depth,
JOURNAL OF LATEX CLASS FILES, JANUARY 2025 12

TABLE 6
Statistics of benchmarks for reasoning LLMs.

Domain Benchmark Venue Language Size Level


AIME 2024 [226] - English 30 Competition
MATH-500 [37] ICLR 2024 English 500 Competition
Math
AMC 2023 [227] - English 30 Competition
Olympiad Bench [228] ACL 2024 English/Chinese 8,476 Competition
Codeforces - English - Expert
Code SWE-bench [229] ICLR 2024 English 2,294 Expert
LiveCodeBench [230] ArXiv 2024 English - Expert
GPQA Diamond [231] COLM 2024 English 448 University
Science
MMLU-Pro [232] NeurIPS 2024 English 12,032 Hybrid
WebShop [233] NeurIPS 2022 English 1,600 Hybrid
WebArena [234] ICLR 2024 English 812 Hybrid
Agent
SciWorld [235] EMNLP 2022 English 7,200 Hybrid
TextCraft [236] NAACL 2024 English 200 Hybrid
JAMA Clinical Challenge [237] NAACL 2025 English 1,524 Expert
Medicine Medbullets [237] NAACL 2025 English 308 Expert
MedQA [238] ArXiv 2020 English/Chinese 61,097 Expert
MMMU [239] CVPR 2024 English 11,500 Hybrid
MathVista [240] ICLR 2024 English 6,141 Middle School
Multimodality MathVision [241] NeurIPS 2024 English 3,040 Middle/High School
CMMaTH [242] COLING 2025 English/Chinese 23,856 Middle/High School
PGPS9K [243] IJCAI 2023 English 9,023 Middle School

and granularity of the search space, which often re- Kimi-k1.5 [192], have demonstrated the potential of RL to
stricted the LLM’s exploration to a narrow reasoning enhance models like DeepSeek-V3 [17], resulting in the
space. Furthermore, the reasoning strategies across dif- emergence of complex behaviors such as long reasoning
ferent child nodes of the same parent node frequently chains, reflective reasoning, and advanced planning ca-
lacked sufficient diversity, further limiting exploration. pabilities. Remarkably, these sophisticated behaviors were
2) Limited Experience Sharing: Exploration experiences achieved through simple RL scaling. SimpleRL [103] sought
and reasoning information across different paths to replicate these capabilities using a streamlined pipeline
could only be assessed based on reward models and minimal codebase, while R1V [211] explored the devel-
or self-consistency among outcomes. Additionally, opment of multimodal reasoning models based on multi-
search-based methods significantly increased compu- modal foundation architectures.
tational overhead, relying on reward models such as Summary: The evolution of reasoning LLMs has shifted
PRM/ORM for tree pruning or speculative decoding from externally augmented reasoning to internally embed-
techniques to accelerate inference. ded reasoning. Recent developments emphasize the poten-
To overcome these limitations, subsequent models such as tial of RL-based scaling to unlock advanced capabilities.
rSTaR [172], LLaMAV-o1 [177], HiICL-MCTS [169], Mul-
berry [175], g1 [184], and Thinking-Claude [225] introduced
4 B ENCHMARKING R EASONING LLM S
richer action spaces. These enhanced action spaces offered
high-level planning cues, broadening the model’s explo- The development of a robust benchmark is crucial for doc-
ration scope and enabling more comprehensive structured umenting the advancements in reasoning LLMs capabilities
search processes. However, this approach necessitated care- and for identifying promising research directions for future
ful design of the action spaces to ensure their effectiveness. progress. Here, we review the benchmarks from three key
With the introduction of models like o1 [29] and QwQ aspects: categories, evaluation metrics, and performance
[98], external reasoning paradigms were internalized within comparisons, while offering our reflections and insights.
the LLM’s context. These models initially performed ex-
ploratory macro-planning to generate an initial reasoning 4.1 Benchmark Categories
path, followed by contextual exploration of alternative
paths. Through mechanisms like “Rethink” and “Verifica- We categorize reasoning benchmarks by task type, which
tion”, these models produced extended reasoning chains. can be broadly divided into math, code, scientific, agent,
To replicate this internalized capability, STILL-1 [224] lin- medical, and multimodal reasoning. The detailed statistics
earized tree search outputs into long reasoning chains with for these benchmarks are presented in Table 6.
attributes such as “Rethink”, “Wait”, and “Explore New
Path”. Similarly, STILL-2 [53] and sky-T1 [99] synthesized 4.1.1 Benchmark Introduction
long reasoning chains using distillation techniques. How- 1) Math Problems: We document the current popular
ever, the linearized reasoning chains derived from search- competition-level mathematical benchmarks to show-
based methods struggled to match the quality of those case the capabilities of reasoning LLMs, including
produced by distillation approaches. AIME 2024 [226], MATH-500 [37], AMC 2023 [227], and
Recent advancements, including DeepSeek-R1 [31] and Olympiad Bench [228].
JOURNAL OF LATEX CLASS FILES, JANUARY 2025 13

Task Types Technical Proposals Reasoning Paradigms

ORM, PRM Solution 1

Code RM@k, Best-of-N, …… Outcome


Math Solution 2 Efficiency
• Elo
• Percentile • Pass@k Self-Consistency
• …… • Cons@k
Greedy Decoding, Beam Search, Solution 3
• …… Process
Major@k, …….
Science Efficiency
Solution k
• Exact Match RL
• Accuracy Cumulative Reward, ……
• …… Sample Efficiency, ……
Conclusion

Fig. 6. Various evaluation metrics of reasoning LLMs divided by task types, technical proposals, and reasoning paradigms.

2) Code Problems: Code problems requires solid founda- capturing the complexities and subtleties of human-like
tion and high logical thinking to evaluate the reasoning reasoning. Furthermore, it is essential to address the issue
ability of reasoning LLMs such as Codeforces, SWE- of data leakage in evaluation processes [247]. Ensuring the
bench [229], and LiveCodeBench [230]. confidentiality and neutrality of evaluation data is critical to
3) Scientific Problems: Scientific benchmarks, i.e., GPQA preserving the integrity and reliability of benchmark results.
Diamond [231] and MMLU-Pro [232], involve multi-
domains reasoning about chemistry, biology, and
4.2 Evaluation Metrics
physics, which requires extensive knowledge accumu-
lation and integrated reasoning. Depending on task types, technical proposals, and rea-
4) Agent Reasoning: Realistic tasks often involve complex soning paradigms, various evaluation metrics have been
planning and tool usage, leading to the creation of agent introduced for reasoning LLMs as shown in Figure 6. These
reasoning benchmarks [244]. For example, WebShop metrics are designed to more accurately assess the model’s
[233] and WebArena [234] focus on web operations, performance in handling complex reasoning tasks, ensuring
while SciWorld [235] and TextCraft [236] are centered that both the quality and coherence of the generated solu-
around scientific research. tions are effectively measured.
5) Medical Reasoning: Medicine fundamentally involves
complex reasoning, spanning tasks from diagnostic de- 4.2.1 Task Types
cision making to treatment planning. Benchmarks of In terms of benchmark categories, mathematical reasoning
JAMA Clinical Challenge [237], Medbullets [237], and typically uses two main metrics: Pass@k and Cons@k. The
MedQA [238] offer model measurements that mimic the Pass@k metric evaluates the model’s ability to generate a
doctor’s disease diagnosis. correct solution within k attempts, measuring the likelihood
6) Multimodal Reasoning: Multimodal reasoning, such of success within a limited number of tries. On the other
as benchmarks of MMMU [239] and MathVista [240], hand, Cons@k assesses whether the model consistently pro-
requires cross-modal thinking in combination with duces correct or logically coherent solutions, highlighting
text and images. Especially for those visual-centered the stability and reliability of its reasoning capabilities. For
problems, in benchmarks MathVision [241], MathVerse code tasks, the key metrics are Elo and Percentile, both
[245], CMMaTH [242], and PGPS9K [243], put forward of which measure the relative skill in generating correct
higher requirements for reasoning LLMs. code compared to other models or human programmers.
In scientific tasks, evaluation generally employs Exact Match
4.1.2 Summary (EM) and Accuracy for fill-in-the-blank and multiple-choice
The field of LLMs has advanced rapidly in recent years, questions, respectively. The EM metric judges whether the
with benchmark performance consistently improving. Sim- model’s output exactly matches the expected solution, while
ple reasoning benchmarks, such as GSM8K [32], MATH-500 Accuracy measures the proportion of correct answers out of
[37], and ScienceQA [246], have approached performance the total number of questions.
saturation. Recent studies on reasoning LLMs [54], [147]
show that models designed for long reasoning chains do not 4.2.2 Technical Proposals
significantly outperform those designed for shorter chains Based on technical routes, the schemes with ORM or PRM
on these benchmarks. This highlights the urgent need to often leverage RM@k and Best-of-N two evaluation indica-
establish new benchmarks that more effectively assess the tors. RM@k measures whether the reward model can rank
reasoning capabilities of reasoning LLMs. Moreover, current the good answer higher in the top k candidates according
benchmarks are limited, focusing mainly on solid reasoning to reward score, and Best-of-N chooses the solution with
tasks. Soft reasoning benchmarks, lacking explicitly defined highest score from N generated reasoning trajectories. Meth-
correct answers, offer a more nuanced evaluation, better ods for self-consistency are evaluated using Greedy Decoding,
JOURNAL OF LATEX CLASS FILES, JANUARY 2025 14

TABLE 7
Performance of Different Models, including Basic LLMs and Reasoning LLMs, on Plain Text Benchmarks. The red denotes the highest result, and
the blue denotes the second highest result.

Math Code General


Model
AIME 2024 MATH-500 LiveCodeBench Codeforces SWE Verified MMLU GPQA-Diamond
(Pass@1) (Pass@1) (Pass@1-CoT) (Percentile) (Resolved) (Pass@1) (Pass@1)
Basic LLMs

GPT-4o [16] 9.3 74.6 34.2 23.6 38.8 87.2 49.9


Claude-3.5-Sonnet [248] 16.0 78.3 33.8 20.3 50.8 88.3 65.0
Gemini-2.0-Pro [249] - 91.8 36.0 - - 86.5 64.7
Deepseek-V3 [17] 39.2 90.2 36.2 58.7 42.0 88.5 59.1
Eurus-2-7B-PRIME [195] 26.7 79.2 - - - - -
InternLM3-8B-Instruct [250] 20.0 83.0 - - - 76.6 37.4
rStar-Math-7B [147] 46.7 81.6 - - - 82.7 54.9
STILL-2-32B [53] 46.7 90.2 - - - - -
Reasoning LLMs

Redstar-code-math [54] 53.3 91.2 - - - - -


Search-o1 [97] 56.7 86.4 33.0 - - - 63.6
QwQ [98] 50.0 90.6 41.9 62.0 - - 54.5
s1-32B [251] 56.7 93.0 - - - - 59.6
OpenAI o1-mini [252] 63.6 90.0 53.8 93.4 41.6 85.2 60.0
LIMO-32B [253] 57.1 94.8 - - - - 66.7
Kimi k1.5 long-CoT [192] 77.5 96.2 62.5 94.0 - - -
DeepSeek-R1 [31] 79.8 97.3 65.9 96.3 49.2 90.8 71.5
OpenAI-o1 [29] 79.2 96.4 63.4 96.6 48.9 91.8 75.7
OpenAI o3-mini [30] 87.3 97.9 84.6 - 49.3 86.9 79.7

TABLE 8 prehensive assessment framework that considers various


Performance of Models, including Basic LLMs and Reasoning LLMs, on aspects of the reasoning process in view of the large in-
Multimodal Benchmarks. The red denotes the highest result, and the
blue denotes the second highest result. ference computation consumption. Current popular eval-
uation frameworks, such as LMMs-Eval [255], OpenCom-
Model MMMU Mathvista Mathvision Olympiadbench
pass [256], and PRMBench [257], lack efficiency and their
GPT-4o [16] 69.1 63.8 30.4 25.9 metrics do not adequately account for the computational
Reasoning LLMs Basic LLMs

Claude-3.5-Sonnet [248] 70.4 65.3 35.6 - and temporal efficiency of the reasoning process. To address
Gemini 2.0 Pro [249] 72.7 - - -
LLaVA-CoT [176] - 54.8 - - these shortcomings, we highly recommend exploring more
QvQ-72B-preview [254] 70.3 71.4 35.9 20.4
Kimi k1.5 long-CoT [192] 70.0 74.9 - - efficient proxy tasks as potential solutions. By identifying
OpenAI-o1 [29] 77.3 71.0 - -
and utilizing tasks that better capture the nuances of long
reasoning chains, we can develop more robust and effective
Beam Search, and Major@k. Greedy Decoding and Beam Search evaluation metrics to enhance the overall assessment frame-
control the randomness of the inference process by limiting work, ensuring that it not only measures the accuracy of the
the sampling range. Major@k selects the solution with the final output but also evaluates the efficiency and coherence
most consistent results from k candidate solutions. In RL, of the reasoning process throughout.
metrics reflect both performance in achieving desired out-
comes and the efficiency of the learning process. For exam- 4.3 Performance Comparison
ple, Cumulative Reward measures the total reward received In this section, we compare the performance of different rea-
by the agent over time, while Sample Efficiency assesses the soning LLMs and their corresponding foundational LLMs
efficiency of the agent’s sample usage during learning. on plain text benchmarks, such as math and code problems,
as well as on multimodal benchmarks. The comprehensive
4.2.3 Reasoning Paradigms real-time leaderboard is available on this website.
For reasoning paradigm of the multi-turn solution gener-
ation in reasoning LLMs, Outcome Efficiency and Process 4.3.1 Performance on Plain Text Benchmarks
Efficiency [102] are proposed recently to evaluate the effi- As shown in Table 7, reasoning LLMs, such as DeepSeek-R1
ciency of long thinking specifically. Outcome Efficiency metric [31] and OpenAI-o1/o3 [29], [30], demonstrate exceptional
empirically evaluates how effectively later solutions con- performance across a wide range of tasks, including math,
tribute to accuracy improvements, formulating as the ratio coding, and other general tasks. These models achieve high
of efficient tokens that contribute to reaching the correct scores on multiple plain-text benchmarks, such as AIME
answer, to all output tokens. Process Efficiency metric eval- 2024, MATH-500, and LiveCodeBench, showcasing their
uates the contribution of later solutions to solution diversity robust text-based reasoning abilities. In contrast, founda-
empirically, concretely representing as the ratio of tokens of tional LLMs, like GPT-4o [62], Claude-3.5-Sonnet [248], and
distinct solutions to all solution tokens. These two indicators DeepSeek-V3 [17], generally perform less effectively than
reveal to the overthinking issue of existing reasoning LLMs reasoning LLMs, particularly in math and coding tasks (e.g.,
to simple problems certainly. AIME 2024 and Codeforces). For example, OpenAI-o1 out-
performs GPT-4o by 69.9% and 73% on these tasks, respec-
4.2.4 Summary tively. Moreover, DeepSeek-R1, based on the DeepSeek-V3
Most of the existing evaluation metrics are judged according architecture, surpasses its predecessor on all benchmarks,
to the final answer. It is imperative to develop a com- further highlighting the advantages of the reasoning LLMs.
JOURNAL OF LATEX CLASS FILES, JANUARY 2025 15

4.3.2 Performance on Multimodal Benchmarks deliberate and thorough reasoning, fast-thinking systems
As shown in Table 8, reasoning LLMs continue to excel rely on prior knowledge for quick responses. Despite efforts
in multimodal tasks. OpenAI-o1 [29] performs strongly in such as the System 1-2 switcher [95], speculative decoding
vision tasks, achieving the highest score of 77.3% on MMMU [258]–[260], and interactive continual learning [261], inte-
and outperforming its corresponding foundational LLM, grating both modes of thinking remains challenging. This
GPT-4o [62], by 7.2% on MathVista. However, the perfor- often leads to inefficiencies in domain-specific tasks and
mance improvement in multimodal tasks is less pronounced underutilized strengths in more complex scenarios.
compared to text-only tasks. This can be attributed in part Future research should focus on developing adaptive
to the limitations of current multimodal reasoning LLM switching mechanisms, joint training frameworks, and co-
techniques, as well as the lack of sufficient datasets to fully evolution strategies to harmonize the efficiency of fast-
assess the multimodal capabilities of reasoning LLMs. thinking systems with the precision of reasoning LLMs.
Achieving this balance is crucial for advancing the field and
4.3.3 Summary creating more versatile AI systems.
In summary, reasoning LLMs show strong performance
across both plain text and multimodal benchmarks, partic- 5.3 Reasoning LLMs For Science
ularly excelling in math and coding tasks, where they out- Reasoning LLMs play a crucial role in scientific research
perform foundational LLMs by a large margin. Although the [262], enabling deep, structured analysis that goes beyond
improvement in multimodal tasks is not as pronounced as in the heuristic-based fast-thinking models. Their value be-
text-only tasks, reasoning LLMs still surpass their counter- comes especially clear in fields that demand complex rea-
parts, highlighting their potential for processing both image soning, such as medicine and mathematics. In medicine,
and text data. These results emphasize the versatility and particularly in differential diagnosis and treatment plan-
effectiveness of reasoning LLMs across a broad spectrum of ning, reasoning LLMs (e.g., inference-time scaling) enhance
reasoning tasks, with potential for further advancements in AI’s step-by-step reasoning, improving diagnostic accuracy
multimodal reasoning techniques. where traditional scaling methods fall short [52]. In mathe-
matics, approaches like FunSearch [263] incorporate slow-
5 C HALLENGES & F UTURE D IRECTIONS thinking principles to push beyond previous discoveries,
showcasing the potential of AI-human collaboration.
Despite the rapid advancements in reasoning LLMs, several
Beyond these fields, reasoning LLMs can foster advance-
challenges persist, limiting their generalizability and practi-
ments in physics, engineering, and computational biology
cal applicability. This section outlines these challenges and
by refining model formulation and hypothesis testing. In-
highlights potential research directions to address them.
vesting in reasoning LLMs research not only bridges the
gap between AI’s computational power and human-like
5.1 Efficient Reasoning LLMs analytical depth but also paves the way for more reliable,
While reasoning LLMs excel at solving complex problems interpretable, and groundbreaking scientific discoveries.
via extended inference, their reliance on long autoregressive
reasoning within large-scale architectures presents signif- 5.4 Deep Integration of Neural and Symbolic Systems
icant efficiency challenges. For example, many problems
Despite significant advancements in reasoning LLMs, their
on platforms like Codeforces require over 10,000 tokens of
limited transparency and interpretability restrict their per-
reasoning, resulting in high latency. As noted in [102], even
formance in more complex real-world reasoning tasks. The
when a reasoning LLM identifies the correct solution early,
reliance on large-scale data patterns and lack of clear rea-
it often spends considerable time verifying its reasoning.
soning pathways makes it challenging to handle intricate
Recent reports, such as Deepseek-R1 [31], suggest that self-
or ambiguous problems effectively. Early symbolic logic
improvement via RL is more effective in larger models,
systems, while less adaptable, offered better explainability
while smaller-scale large language models (SLMs) (e.g., 3B
and clearer reasoning steps, leading to more reliable perfor-
and 7B models as explored by [103] and [199], [216]) struggle
mance in such cases.
to match performance in slow-thinking reasoning tasks.
A promising future direction is the deep integration
Future research should focus on two key areas: (1) in-
of neural and symbolic systems. Google’s AlphaGeome-
tegrating external reasoning tools to enable early stopping
try [264] and AlphaGeometry2 [265] combine reasoning
and verification mechanisms, thus improving the efficiency
LLMs with symbolic engines, achieving breakthroughs in
of long inference chains, and (2) exploring strategies to
the International Olympiad in Mathematics (IMO). In par-
implement slow-thinking reasoning capabilities in SLMs
ticular, AlphaGeometry2 utilizes the Gemini-based model
without sacrificing performance.
[249], [266], [267] and a more efficient symbolic engine, im-
proving performance by reducing rule sets and enhancing
5.2 Collaborative Slow & Fast-thinking Systems key concept handling. The system now covers a broader
A key challenge in reasoning LLMs is the loss of fast- range of geometric concepts, including locus theorems and
thinking capabilities, which results in inefficiencies when linear equations. A new search algorithm and knowledge-
simple tasks require unnecessary deep reasoning. Unlike sharing mechanism accelerate the process. This system
humans, who fluidly switch between fast (System 1) and solved 84% of IMO geometry problems (2000-2024), sur-
slow (System 2) thinking, current reasoning LLMs struggle passing gold medalists’ averages. In contrast, reasoning
to maintain this balance. While reasoning LLMs ensure LLMs like OpenAI-o1 [29] failed to solve any problems. The
JOURNAL OF LATEX CLASS FILES, JANUARY 2025 16

integration of neural and symbolic systems offers a balanced 6 C ONCLUSION


approach, improving both adaptability and interpretability, This paper presents a comprehensive survey that advances
with vast potential for complex real-world reasoning tasks research on reasoning LLMs. We begin with an overview of
beyond mathematical geometry problems. the progress in foundational LLMs and key early System 2
technologies, including symbolic logic, MCTS, and RL, ex-
5.5 Multilingual Reasoning LLMs ploring how each, when combined with foundational LLMs,
Current reasoning LLMs perform well in high-resource lan- has paved the way for reasoning LLMs. We then provide
guages like English and Chinese, demonstrating strong ca- a detailed feature analysis of the latest reasoning LLMs,
pabilities in tasks such as translation and various reasoning examining the core methods that enable their advanced rea-
tasks [93], [101]. These models excel in environments where soning capabilities and highlighting representative models.
large-scale data and diverse linguistic resources are avail- Through a review of mainstream reasoning benchmarks and
able. However, their performance in low-resource languages performance comparisons, we offer valuable insights into
remains limited [268], facing challenges related to data spar- the current state of the field. Looking ahead, we identify
sity, stability, safety, and overall performance. These issues promising research directions and continue to track devel-
hinder the effectiveness of reasoning LLMs in languages opments via our real-time GitHub Repository. This survey
that lack substantial linguistic datasets and resources. aims to inspire innovation and foster progress in the rapidly
Future research should prioritize overcoming the chal- evolving field of reasoning LLMs.
lenges posed by data scarcity and cultural biases in low-
resource languages. Innovations such as parameter shar-
ing across reasoning LLMs and the incremental injection
R EFERENCES
of domain-specific knowledge could help mitigate these [1] W. Hua and Y. Zhang, “System 1+ system 2= better world:
Neural-symbolic chain of logic reasoning,” in Findings of the
challenges, enabling faster adaptation of slow-thinking ca-
Association for Computational Linguistics: EMNLP 2022, 2022, pp.
pabilities to a broader range of languages. This would 601–612.
not only enhance the effectiveness of reasoning LLMs in [2] J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V. Le,
these languages but also ensure more equitable access to D. Zhou et al., “Chain-of-thought prompting elicits reasoning in
large language models,” Advances in neural information processing
advanced AI technologies. systems, vol. 35, pp. 24 824–24 837, 2022.
[3] X. Wang, J. Wei, D. Schuurmans, Q. V. Le, E. H. Chi, S. Narang,
5.6 Multimodal Reasoning LLMs A. Chowdhery, and D. Zhou, “Self-Consistency Improves Chain
of Thought Reasoning in Language Models,” in The Eleventh
Extending slow-thinking reasoning capabilities from text- International Conference on Learning Representations, 2023.
based domains to multimodal contexts remains a significant [4] D. Zhou, N. Schärli, L. Hou, J. Wei, N. Scales, X. Wang, D. Schuur-
challenge, especially in tasks requiring fine-grained percep- mans, C. Cui, O. Bousquet, Q. V. Le et al., “Least-to-Most Prompt-
ing Enables Complex Reasoning in Large Language Models,” in
tion [96]. While approaches like Virgo [269] have attempted The Eleventh International Conference on Learning Representations,
to distill text-based slow-thinking reasoning into multi- 2023.
modal LLMs, their performance improvements in tasks [5] E. Zelikman, Y. Wu, J. Mu, and N. D. Goodman, “STaR: Self-
taught reasoner bootstrapping reasoning with reasoning,” in
such as MathVision [241], which demand detailed visual Proc. the 36th International Conference on Neural Information Pro-
understanding, have been marginal. cessing Systems, vol. 1126, 2024.
Key research directions include developing hierarchical [6] J. S. B. Evans, “Heuristic and analytic processes in reasoning,”
reasoning LLMs that enable fine-grained cross-modal un- British Journal of Psychology, vol. 75, no. 4, pp. 451–468, 1984.
[7] D. Kahneman, “Maps of bounded rationality: Psychology for
derstanding and generation, tailored to the unique charac- behavioral economics,” American economic review, vol. 93, no. 5,
teristics of modalities such as audio, video, and 3D data. pp. 1449–1475, 2003.
[8] J. Huang and K. C.-C. Chang, “Towards Reasoning in Large
Language Models: A Survey,” in Findings of the Association for
5.7 Safe Reasoning LLMs Computational Linguistics: ACL 2023, 2023, pp. 1049–1065.
The rapid development of reasoning LLMs like OpenAI-o1 [9] S. Qiao, Y. Ou, N. Zhang, X. Chen, Y. Yao, S. Deng, C. Tan,
F. Huang, and H. Chen, “Reasoning with Language Model
[29] and DeepSeek-R1 [31] has led to the rise of superintelli- Prompting: A Survey,” in Proceedings of the 61st Annual Meeting
gent models capable of continuous self-evolution. However, of the Association for Computational Linguistics (Volume 1: Long
this progress brings challenges in safety and control. RL, a Papers), 2023, pp. 5368–5393.
key training method, introduces risks such as reward hack- [10] B. Wang, S. Min, X. Deng, J. Shen, Y. Wu, L. Zettlemoyer, and
H. Sun, “Towards Understanding Chain-of-Thought Prompting:
ing, generalization failures, and language mixing, which An Empirical Study of What Matters,” in Proceedings of the 61st
can lead to harmful outcomes. Ensuring the safety of such Annual Meeting of the Association for Computational Linguistics
systems like DeepSeek-R1 is urgent. While RL enhances (Volume 1: Long Papers), 2023, pp. 2717–2739.
[11] O. Shaikh, H. Zhang, W. Held, M. Bernstein, and D. Yang, “On
reasoning, its uncontrollable nature raises concerns about Second Thought, Let’s Not Think Step by Step! Bias and Toxicity
safely guiding these models. SFT addresses some issues but in Zero-Shot Reasoning,” in Proceedings of the 61st Annual Meeting
is not a complete solution. A hybrid approach combining of the Association for Computational Linguistics (Volume 1: Long
RL and SFT is needed to reduce harmful outputs while Papers), 2023, pp. 4454–4470.
[12] H. Shao, S. Qian, H. Xiao, G. Song, Z. Zong, L. Wang, Y. Liu, and
maintaining model effectiveness [270]. H. Li, “Visual cot: Advancing multi-modal language models with
As these models surpass human cognitive capabilities, a comprehensive dataset and benchmark for chain-of-thought
ensuring their safe, responsible, and transparent use is cru- reasoning,” in The Thirty-eight Conference on Neural Information
cial. This requires ongoing research to develop methods for Processing Systems Datasets and Benchmarks Track, 2024.
[13] Z. Zhang, A. Zhang, M. Li, and A. Smola, “Automatic Chain of
controlling and guiding their actions, thereby balancing AI Thought Prompting in Large Language Models,” in The Eleventh
power with ethical decision-making. International Conference on Learning Representations, 2023.
JOURNAL OF LATEX CLASS FILES, JANUARY 2025 17

[14] S. Hao, Y. Gu, H. Ma, J. Hong, Z. Wang, D. Wang, and Annual Meeting of the Association for Computational Linguistics
Z. Hu, “Reasoning with Language Model is Planning with World (Volume 1: Long Papers), 2023, pp. 4471–4485.
Model,” in Proceedings of the 2023 Conference on Empirical Methods [36] P. Lu, L. Qiu, K.-W. Chang, Y. N. Wu, S.-C. Zhu, T. Rajpurohit,
in Natural Language Processing, 2023, pp. 8154–8173. P. Clark, and A. Kalyan, “Dynamic Prompt Learning via Policy
[15] Y. Zhang, “Meta prompting for agi systems,” arXiv preprint Gradient for Semi-structured Mathematical Reasoning,” in The
arXiv:2311.11482, 2023. Eleventh International Conference on Learning Representations, 2023.
[16] OpenAI, “Hello GPT-4o,” May 2024. [Online]. Available: [37] H. Lightman, V. Kosaraju, Y. Burda, H. Edwards, B. Baker, T. Lee,
https://openai.com/index/hello-gpt-4o/ J. Leike, J. Schulman, I. Sutskever, and K. Cobbe, “Let’s Verify
[17] A. Liu, B. Feng, B. Xue, B. Wang, B. Wu, C. Lu, C. Zhao, C. Deng, Step by Step,” in The Twelfth International Conference on Learning
C. Zhang, C. Ruan et al., “Deepseek-v3 technical report,” arXiv Representations, 2024.
preprint arXiv:2412.19437, 2024. [38] F. Yao, C. Tian, J. Liu, Z. Zhang, Q. Liu, L. Jin, S. Li, X. Li,
[18] A. Vaswani, “Attention is all you need,” Advances in Neural and X. Sun, “Thinking like an expert: Multimodal hypergraph-
Information Processing Systems, 2017. of-thought (hot) reasoning to boost foundation modals,” arXiv
[19] J. Devlin, M. Chang, K. Lee, and K. Toutanova, “BERT: Pre- preprint arXiv:2308.06207, 2023.
training of Deep Bidirectional Transformers for Language Under- [39] Y. Yao, Z. Li, and H. Zhao, “Beyond Chain-of-Thought, Effec-
standing,” in Proceedings of the 2019 Conference of the North Ameri- tive Graph-of-Thought Reasoning in Language Models,” arXiv
can Chapter of the Association for Computational Linguistics: Human preprint arXiv:2305.16582, 2023.
Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, [40] Y. Wen, Z. Wang, and J. Sun, “Mindmap: Knowledge graph
June 2-7, 2019, Volume 1 (Long and Short Papers), 2019, pp. 4171– prompting sparks graph of thoughts in large language models,”
4186. arXiv preprint arXiv:2308.09729, 2023.
[20] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, [41] B. Lei, C. Liao, C. Ding et al., “Boosting logical reasoning in
M. Lewis, L. Zettlemoyer, and V. Stoyanov, “RoBERTa: A large language models through a new framework: The graph of
Robustly Optimized BERT Pretraining Approach,” CoRR, vol. thought,” arXiv preprint arXiv:2308.08614, 2023.
abs/1907.11692, 2019. [42] M. Jin, Q. Yu, D. Shu, H. Zhao, W. Hua, Y. Meng, Y. Zhang, and
[21] A. Radford, “Improving language understanding by generative M. Du, “The impact of reasoning step length on large language
pre-training,” 2018. models,” arXiv preprint arXiv:2401.04925, 2024.
[22] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever [43] M. Besta, N. Blach, A. Kubicek, R. Gerstenberger, M. Podstawski,
et al., “Language models are unsupervised multitask learners,” L. Gianinazzi, J. Gajda, T. Lehmann, H. Niewiadomski, P. Nyczyk
OpenAI blog, vol. 1, no. 8, p. 9, 2019. et al., “Graph of thoughts: Solving elaborate problems with
[23] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhari- large language models,” in Proceedings of the AAAI Conference on
wal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell et al., Artificial Intelligence, vol. 38, no. 16, 2024, pp. 17 682–17 690.
“Language models are few-shot learners,” Advances in neural [44] P. Cheng, T. Hu, H. Xu, Z. Zhang, Y. Dai, L. Han, and N. Du, “Self-
information processing systems, vol. 33, pp. 1877–1901, 2020. playing Adversarial Language Game Enhances LLM Reasoning,”
[24] L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, arXiv preprint arXiv:2404.10642, 2024.
P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray et al., “Train- [45] H. You, R. Sun, Z. Wang, L. Chen, G. Wang, H. Ayyubi, K.-
ing language models to follow instructions with human feed- W. Chang, and S.-F. Chang, “IdealGPT: Iteratively Decomposing
back,” Advances in neural information processing systems, vol. 35, Vision and Language Reasoning via Large Language Models,”
pp. 27 730–27 744, 2022. in Findings of the Association for Computational Linguistics: EMNLP
[25] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, 2023, 2023, pp. 11 289–11 303.
T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar et al., [46] P. Wu and S. Xie, “V?: Guided Visual Search as a Core Mechanism
“Llama: Open and efficient foundation language models,” arXiv in Multimodal LLMs,” in Proceedings of the IEEE/CVF Conference
preprint arXiv:2302.13971, 2023. on Computer Vision and Pattern Recognition, 2024, pp. 13 084–
[26] W. X. Zhao, K. Zhou, J. Li, T. Tang, X. Wang, Y. Hou, Y. Min, 13 094.
B. Zhang, J. Zhang, Z. Dong et al., “A survey of large language [47] Z. Chen, R. Sun, W. Liu, Y. Hong, and C. Gan, “GENOME: Gener-
models,” arXiv preprint arXiv:2303.18223, 2023. ative Neuro-Symbolic Visual Reasoning by Growing and Reusing
[27] H. Liu, C. Li, Q. Wu, and Y. J. Lee, “Visual Instruction Tuning,” in Modules,” in International Conference on Learning Representations,
Thirty-seventh Conference on Neural Information Processing Systems, 2024.
2023. [48] S. Wu, Z. Peng, X. Du, T. Zheng, M. Liu, J. Wu, J. Ma, Y. Li, J. Yang,
[28] D. Zhang, Y. Yu, J. Dong, C. Li, D. Su, C. Chu, and D. Yu, W. Zhou et al., “A Comparative Study on Reasoning Patterns of
“MM-LLMs: Recent Advances in MultiModal Large Language OpenAI’s o1 Model,” arXiv preprint arXiv:2410.13639, 2024.
Models,” in Findings of the Association for Computational Linguis- [49] V. Xiang, C. Snell, K. Gandhi, A. Albalak, A. Singh, C. Blagden,
tics, ACL 2024, Bangkok, Thailand and virtual meeting, August 11- D. Phung, R. Rafailov, N. Lile, D. Mahan et al., “Towards System
16, 2024. Association for Computational Linguistics, 2024, pp. 2 Reasoning in LLMs: Learning How to Think With Meta Chain-
12 401–12 430. of-Though,” arXiv preprint arXiv:2501.04682, 2025.
[29] OpenAI, “Learning to reason with LLMs,” Septem- [50] Y. Qin, X. Li, H. Zou, Y. Liu, S. Xia, Z. Huang, Y. Ye, W. Yuan,
ber 2024. [Online]. Available: https://openai.com/index/ H. Liu, Y. Li et al., “O1 Replication Journey: A Strategic Progress
learning-to-reason-with-llms/ Report–Part 1,” arXiv preprint arXiv:2410.18982, 2024.
[30] ——, “OpenAI o3-mini,” January 2025. [Online]. Available: [51] Z. Huang, H. Zou, X. Li, Y. Liu, Y. Zheng, E. Chern, S. Xia, Y. Qin,
https://openai.com/index/openai-o3-mini/ W. Yuan, and P. Liu, “O1 Replication Journey–Part 2: Surpassing
[31] D. Guo, D. Yang, H. Zhang, J. Song, R. Zhang, R. Xu, Q. Zhu, O1-preview through Simple Distillation, Big Progress or Bitter
S. Ma, P. Wang, X. Bi et al., “DeepSeek-R1: Incentivizing Rea- Lesson?” arXiv preprint arXiv:2411.16489, 2024.
soning Capability in LLMs via Reinforcement Learning,” arXiv [52] Z. Huang, G. Geng, S. Hua, Z. Huang, H. Zou, S. Zhang, P. Liu,
preprint arXiv:2501.12948, 2025. and X. Zhang, “O1 Replication Journey–Part 3: Inference-time
[32] K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, Scaling for Medical Reasoning,” arXiv preprint arXiv:2501.06458,
M. Plappert, J. Tworek, J. Hilton, R. Nakano et al., “Train- 2025.
ing verifiers to solve math word problems,” arXiv preprint [53] Y. Min, Z. Chen, J. Jiang, J. Chen, J. Deng, Y. Hu, Y. Tang,
arXiv:2110.14168, 2021. J. Wang, X. Cheng, H. Song, W. X. Zhao, Z. Liu, Z. Wang, and
[33] T. Kojima, S. S. Gu, M. Reid, Y. Matsuo, and Y. Iwasawa, “Large J.-R. Wen, “Imitate, Explore, and Self-Improve: A Reproduction
language models are zero-shot reasoners,” Advances in neural Report on Slow-thinking Reasoning Systems,” arXiv preprint
information processing systems, vol. 35, pp. 22 199–22 213, 2022. arXiv:2412.09413, 2024.
[34] Y. Liu, A. Singh, C. D. Freeman, J. D. Co-Reyes, and P. J. Liu, [54] H. Xu, X. Wu, W. Wang, Z. Li, D. Zheng, B. Chen, Y. Hu,
“Improving large language model fine-tuning for solving math S. Kang, J. Ji, Y. Zhang et al., “RedStar: Does Scaling Long-
problems,” arXiv preprint arXiv:2310.10047, 2023. CoT Data Unlock Better Slow-Reasoning Systems?” arXiv preprint
[35] X. Zhu, J. Wang, L. Zhang, Y. Zhang, Y. Huang, R. Gan, J. Zhang, arXiv:2501.11284, 2025.
and Y. Yang, “Solving Math Word Problems via Cooperative [55] Z. Zeng, Q. Cheng, Z. Yin, B. Wang, S. Li, Y. Zhou, Q. Guo,
Reasoning induced Language Models,” in Proceedings of the 61st X. Huang, and X. Qiu, “Scaling of Search and Learning: A
JOURNAL OF LATEX CLASS FILES, JANUARY 2025 18

Roadmap to Reproduce o1 from Reinforcement Learning Per- [80] S. Gelly and D. Silver, “Monte-Carlo tree search and rapid action
spective,” arXiv preprint arXiv:2412.14135, 2024. value estimation in computer Go,” Artificial Intelligence, vol. 175,
[56] Y. Ji, J. Li, H. Ye, K. Wu, J. Xu, L. Mo, and M. Zhang, “Test- no. 11, pp. 1856–1875, 2011.
time Computing: from System-1 Thinking to System-2 Thinking,” [81] M. Świechowski, K. Godlewski, B. Sawicki, and J. Mańdziuk,
arXiv preprint arXiv:2501.02497, 2025. “Monte Carlo tree search: A review of recent modifications and
[57] M. Besta, J. Barth, E. Schreiber, A. Kubicek, A. Catarino, R. Ger- applications,” Artificial Intelligence Review, vol. 56, no. 3, pp. 2497–
stenberger, P. Nyczyk, P. Iff, Y. Li, S. Houliston et al., “Reasoning 2562, 2023.
Language Models: A Blueprint,” arXiv preprint arXiv:2501.11223, [82] R. S. Sutton, A. G. Barto et al., Reinforcement learning: An introduc-
2025. tion. MIT press Cambridge, 1998, vol. 1, no. 1.
[58] Y. Zhang, S. Mao, T. Ge, X. Wang, A. de Wynter, Y. Xia, W. Wu, [83] C. J. Watkins and P. Dayan, “Q-learning,” Machine learning, vol. 8,
T. Song, M. Lan, and F. Wei, “LLM as a Mastermind: A Survey of pp. 279–292, 1992.
Strategic Reasoning with Large Language Models,” arXiv preprint [84] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G.
arXiv:2404.01230, 2024. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski
[59] F. Xu, Q. Hao, Z. Zong, J. Wang, Y. Zhang, J. Wang, X. Lan, et al., “Human-level control through deep reinforcement learn-
J. Gong, T. Ouyang, F. Meng et al., “Towards Large Reasoning ing,” nature, vol. 518, no. 7540, pp. 529–533, 2015.
Models: A Survey of Reinforced Reasoning with Large Language [85] R. R. Torrado, P. Bontrager, J. Togelius, J. Liu, and D. Perez-
Models,” arXiv preprint arXiv:2501.09686, 2025. Liebana, “Deep reinforcement learning for general video game
[60] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agar- ai,” in 2018 IEEE Conference on Computational Intelligence and
wal, G. Sastry, A. Askell, P. Mishkin, J. Clark et al., “Learning Games (CIG). IEEE, 2018, pp. 1–8.
transferable visual models from natural language supervision,” [86] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van
in International conference on machine learning. PMLR, 2021, pp. Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershel-
8748–8763. vam, M. Lanctot et al., “Mastering the game of Go with deep
[61] A. Ramesh, M. Pavlov, G. Goh, S. Gray, C. Voss, A. Radford, neural networks and tree search,” nature, vol. 529, no. 7587, pp.
M. Chen, and I. Sutskever, “Zero-shot text-to-image generation,” 484–489, 2016.
in International conference on machine learning. Pmlr, 2021, pp. [87] D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang,
8821–8831. A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton et al., “Mastering
the game of go without human knowledge,” nature, vol. 550, no.
[62] OpenAI, “GPT-4 Technical Report,” 2023.
7676, pp. 354–359, 2017.
[63] G. E. Hinton, S. Osindero, and Y.-W. Teh, “A fast learning [88] O. Vinyals, I. Babuschkin, W. M. Czarnecki, M. Mathieu,
algorithm for deep belief nets,” Neural computation, vol. 18, no. 7, A. Dudzik, J. Chung, D. H. Choi, R. Powell, T. Ewalds,
pp. 1527–1554, 2006. P. Georgiev et al., “Grandmaster level in StarCraft II using multi-
[64] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A.-r. Mohamed, N. Jaitly, agent reinforcement learning,” nature, vol. 575, no. 7782, pp. 350–
A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath et al., “Deep 354, 2019.
neural networks for acoustic modeling in speech recognition: [89] A. Y. Ng, D. Harada, and S. Russell, “Policy invariance under
The shared views of four research groups,” IEEE Signal processing reward transformations: Theory and application to reward shap-
magazine, vol. 29, no. 6, pp. 82–97, 2012. ing,” in Icml, vol. 99. Citeseer, 1999, pp. 278–287.
[65] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classi- [90] H. Luo, Q. Sun, C. Xu, P. Zhao, J. Lou, C. Tao, X. Geng, Q. Lin,
fication with deep convolutional neural networks,” Advances in S. Chen, and D. Zhang, “Wizardmath: Empowering mathemat-
neural information processing systems, vol. 25, 2012. ical reasoning for large language models via reinforced evol-
[66] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” nature, vol. instruct,” arXiv preprint arXiv:2308.09583, 2023.
521, no. 7553, pp. 436–444, 2015. [91] Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, X. Bi, H. Zhang,
[67] Q. Dong, L. Li, D. Dai, C. Zheng, J. Ma, R. Li, H. Xia, J. Xu, Z. Wu, M. Zhang, Y. Li, Y. Wu et al., “Deepseekmath: Pushing the limits
B. Chang et al., “A survey on in-context learning,” in Proceedings of mathematical reasoning in open language models,” arXiv
of the 2024 Conference on Empirical Methods in Natural Language preprint arXiv:2402.03300, 2024.
Processing, 2024, pp. 1107–1128. [92] E. Zelikman, G. Harik, Y. Shao, V. Jayasiri, N. Haber, and N. D.
[68] J. White, Q. Fu, S. Hays, M. Sandborn, C. Olea, H. Gilbert, Goodman, “Quiet-star: Language models can teach themselves
A. Elnashar, J. Spencer-Smith, and D. C. Schmidt, “A prompt to think before speaking,” arXiv preprint arXiv:2403.09629, 2024.
pattern catalog to enhance prompt engineering with chatgpt,” [93] Y. Zhao, H. Yin, B. Zeng, H. Wang, T. Shi, C. Lyu, L. Wang,
arXiv preprint arXiv:2302.11382, 2023. W. Luo, and K. Zhang, “Marco-o1: Towards open reasoning mod-
[69] C. I. Lewis, C. H. Langford, and P. Lamprecht, Symbolic logic. els for open-ended solutions,” arXiv preprint arXiv:2411.14405,
Dover publications New York, 1959, vol. 170. 2024.
[70] R. Carnap, Introduction to symbolic logic and its applications. [94] J. Chen, Z. Cai, K. Ji, X. Wang, W. Liu, R. Wang, J. Hou, and
Courier Corporation, 2012. B. Wang, “Huatuogpt-o1, towards medical complex reasoning
[71] A. Colmerauer, “An introduction to Prolog III,” Communications with llms,” arXiv preprint arXiv:2412.18925, 2024.
of the ACM, vol. 33, no. 7, pp. 69–90, 1990. [95] G. Sun, M. Jin, Z. Wang, C.-L. Wang, S. Ma, Q. Wang, Y. N. Wu,
[72] W. F. Clocksin and C. S. Mellish, Programming in PROLOG. Y. Zhang, and D. Liu, “Visual agents as fast and slow thinkers,”
Springer Science & Business Media, 2003. arXiv preprint arXiv:2408.08862, 2024.
[96] H. Wei, Y. Yin, Y. Li, J. Wang, L. Zhao, J. Sun, Z. Ge, and
[73] K. R. Apt et al., From logic programming to Prolog. Prentice Hall
X. Zhang, “Slow Perception: Let’s Perceive Geometric Figures
London, 1997, vol. 362.
Step-by-step,” arXiv preprint arXiv:2412.20631, 2024.
[74] M. P. Singh, A. S. Rao, and M. P. Georgeff, Formal methods in DAI: [97] X. Li, G. Dong, J. Jin, Y. Zhang, Y. Zhou, Y. Zhu, P. Zhang,
Logic-based representation and reasoning. MIT Press Cambridge, and Z. Dou, “Search-o1: Agentic search-enhanced large reasoning
1999, vol. 8. models,” arXiv preprint arXiv:2501.05366, 2025.
[75] R. G. Jeroslow, “Computation-oriented reductions of predicate to [98] Q. Team, “QwQ: Reflect Deeply on the Boundaries of
propositional logic,” Decision Support Systems, vol. 4, no. 2, pp. the Unknown,” November 2024. [Online]. Available: https:
183–197, 1988. //qwenlm.github.io/blog/qwq-32b-preview/
[76] J. McCarthy, “History of LISP,” in History of programming lan- [99] N. Team, “Sky-T1: Train your own O1 preview model
guages, 1978, pp. 173–185. within $450,” https://novasky-ai.github.io/posts/sky-t1, 2025,
[77] L. Bachmair and H. Ganzinger, “Resolution Theorem Proving.” accessed: 2025-01-09.
Handbook of automated reasoning, vol. 1, no. 02, 2001. [100] Y. Zhang, S. Wu, Y. Yang, J. Shu, J. Xiao, C. Kong, and
[78] M. Minsky et al., “A framework for representing knowledge,” J. Sang, “o1-coder: an o1 replication for coding,” arXiv preprint
1974. arXiv:2412.00154, 2024.
[79] C. B. Browne, E. Powley, D. Whitehouse, S. M. Lucas, P. I. [101] J. Wang, F. Meng, Y. Liang, and J. Zhou, “DRT-o1: Optimized
Cowling, P. Rohlfshagen, S. Tavener, D. Perez, S. Samothrakis, Deep Reasoning Translation via Long Chain-of-Thought,” arXiv
and S. Colton, “A survey of monte carlo tree search methods,” preprint arXiv:2412.17498, 2024.
IEEE Transactions on Computational Intelligence and AI in games, [102] X. Chen, J. Xu, T. Liang, Z. He, J. Pang, D. Yu, L. Song, Q. Liu,
vol. 4, no. 1, pp. 1–43, 2012. M. Zhou, Z. Zhang et al., “Do NOT Think That Much for 2
JOURNAL OF LATEX CLASS FILES, JANUARY 2025 19

+ 3=? On the Overthinking of o1-Like LLMs,” arXiv preprint [124] H. Jiang, Y. Ma, C. Ding, K. Luan, and X. Di, “Towards
arXiv:2412.21187, 2024. Intrinsic Self-Correction Enhancement in Monte Carlo Tree
[103] W. Zeng, Y. Huang, W. Liu, K. He, Q. Liu, Z. Ma, and J. He, Search Boosted Reasoning via Iterative Preference Learning,”
“7B Model and 8K Examples: Emerging Reasoning with Re- 2024. [Online]. Available: https://arxiv.org/abs/2412.17397
inforcement Learning is Both Effective and Efficient,” https: [125] H. Xu, “No Train Still Gain. Unleash Mathematical Reasoning of
//hkust-nlp.notion.site/simplerl-reason, 2025, notion Blog. Large Language Models with Monte Carlo Tree Search Guided
[104] Z. Wan, X. Feng, M. Wen, S. M. McAleer, Y. Wen, W. Zhang, and by Energy Function,” CoRR, vol. abs/2309.03224, 2023.
J. Wang, “Alphazero-like tree-search can guide large language [126] M. Kemmerling, D. Lütticke, and R. H. Schmitt, “Beyond games:
model decoding and training,” in Forty-first International Confer- a systematic review of neural Monte Carlo tree search applica-
ence on Machine Learning, 2024. tions,” Appl. Intell., vol. 54, no. 11-12, pp. 1020–1046, 2024.
[105] Z. Bi, K. Han, C. Liu, Y. Tang, and Y. Wang, “Forest-of-Thought: [127] Y. Li, Z. Lin, S. Zhang, Q. Fu, B. Chen, J.-G. Lou, and W. Chen,
Scaling Test-Time Compute for Enhancing LLM Reasoning,” “Making large language models better reasoners with step-aware
CoRR, vol. abs/2412.09078, 2024. verifier,” arXiv preprint arXiv:2206.02336, 2022.
[106] J. Li, H. Le, Y. Zhou, C. Xiong, S. Savarese, and D. Sahoo, [128] P. Wang, L. Li, Z. Shao, R. Xu, D. Dai, Y. Li, D. Chen, Y. Wu, and
“CodeTree: Agent-guided Tree Search for Code Generation with Z. Sui, “Math-shepherd: Verify and reinforce llms step-by-step
Large Language Models,” CoRR, vol. abs/2411.04329, 2024. without human annotations,” in Proceedings of the 62nd Annual
[107] J. Cheng, X. Liu, C. Wang, X. Gu, Y. Lu, D. Zhang, Y. Dong, Meeting of the Association for Computational Linguistics (Volume 1:
J. Tang, H. Wang, and M. Huang, “SPaR: Self-Play with Tree- Long Papers), 2024, pp. 9426–9439.
Search Refinement to Improve Instruction-Following in Large [129] J. Lu, Z. Dou, W. Hongru, Z. Cao, J. Dai, Y. Feng, and Z. Guo,
Language Models,” CoRR, vol. abs/2412.11605, 2024. “Autopsv: Automated process-supervised verifier,” in The Thirty-
[108] J. Qiu, Y. Lu, Y. Zeng, J. Guo, J. Geng, H. Wang, K. Huang, Y. Wu, eighth Annual Conference on Neural Information Processing Systems,
and M. Wang, “TreeBoN: Enhancing Inference-Time Alignment 2024.
with Speculative Tree-Search and Best-of-N Sampling,” CoRR, [130] L. Yuan, W. Li, H. Chen, G. Cui, N. Ding, K. Zhang, B. Zhou,
vol. abs/2410.16033, 2024. Z. Liu, and H. Peng, “Free process rewards without process
[109] N. Dainese, M. Merler, M. Alakuijala, and P. Marttinen, “Gener- labels,” arXiv preprint arXiv:2412.01981, 2024.
ating Code World Models with Large Language Models Guided [131] F. Yu, A. Gao, and B. Wang, “OVM, Outcome-supervised Value
by Monte Carlo Tree Search,” CoRR, vol. abs/2405.15383, 2024. Models for Planning in Mathematical Reasoning,” in Findings of
[110] Z. Zhao, W. S. Lee, and D. Hsu, “Large Language Models the Association for Computational Linguistics: NAACL 2024, 2024,
as Commonsense Knowledge for Large-Scale Task Planning,” pp. 858–875.
in Advances in Neural Information Processing Systems 36: Annual [132] D. Zhang, S. Zhoubian, Z. Hu, Y. Yue, Y. Dong, and J. Tang, “Rest-
Conference on Neural Information Processing Systems 2023, NeurIPS mcts*: Llm self-training via process reward guided tree search,”
2023, New Orleans, LA, USA, December 10 - 16, 2023, 2023. arXiv preprint arXiv:2406.03816, 2024.
[111] Q. Li, W. Xia, K. Du, X. Dai, R. Tang, Y. Wang, Y. Yu, and [133] L. Luo, Y. Liu, R. Liu, S. Phatale, H. Lara, Y. Li, L. Shu, Y. Zhu,
W. Zhang, “RethinkMCTS: Refining Erroneous Thoughts in L. Meng, J. Sun et al., “Improve Mathematical Reasoning in
Monte Carlo Tree Search for Code Generation,” CoRR, vol. Language Models by Automated Process Supervision,” arXiv
abs/2409.09584, 2024. preprint arXiv:2406.06592, 2024.
[112] D. Zhang, X. Huang, D. Zhou, Y. Li, and W. Ouyang, “Accessing [134] Z. Sun, Q. Wang, W. Yu, X. Zang, K. Zheng, J. Xu, X. Zhang,
GPT-4 level Mathematical Olympiad Solutions via Monte Carlo S. Yang, and H. Li, “ReARTeR: Retrieval-Augmented Rea-
Tree Self-refine with LLaMa-3 8B,” CoRR, vol. abs/2406.07394, soning with Trustworthy Process Rewarding,” arXiv preprint
2024. arXiv:2501.07861, 2025.
[113] G. Rabby, F. Keya, P. Zamil, and S. Auer, “MC-NEST – Enhancing [135] Z. Zhang, C. Zheng, Y. Wu, B. Zhang, R. Lin, B. Yu,
Mathematical Reasoning in Large Language Models with a D. Liu, J. Zhou, and J. Lin, “The lessons of developing pro-
Monte Carlo Nash Equilibrium Self-Refine Tree,” 2024. [Online]. cess reward models in mathematical reasoning,” arXiv preprint
Available: https://arxiv.org/abs/2411.15645 arXiv:2501.07301, 2025.
[114] B. Xu, Y. Lin, Y. Li, and Y. Gao, “SRA-MCTS: Self-driven Rea- [136] Z. Yu, W. Gu, Y. Wang, Z. Zeng, J. Wang, W. Ye, and S. Zhang,
soning Augmentation with Monte Carlo Tree Search for Code “Outcome-Refining Process Supervision for Code Generation,”
Generation,” CoRR, vol. abs/2411.11053, 2024. arXiv preprint arXiv:2412.15118, 2024.
[115] J. Kang, X. Z. Li, X. Chen, A. Kazemi, and B. Chen, “MindStar: [137] X. Lai, Z. Tian, Y. Chen, S. Yang, X. Peng, and J. Jia, “Step-dpo:
Enhancing Math Reasoning in Pre-trained LLMs at Inference Step-wise preference optimization for long-chain reasoning of
Time,” CoRR, vol. abs/2405.16265, 2024. llms,” arXiv preprint arXiv:2406.18629, 2024.
[116] P. Kadam, “GPT-Guided Monte Carlo Tree Search for Sym- [138] Y. Liu, J. Lu, Z. Chen, C. Qu, J. K. Liu, C. Liu, Z. Cai, Y. Xia,
bolic Regression in Financial Fraud Detection,” CoRR, vol. L. Zhao, J. Bian et al., “AdaptiveStep: Automatically Divid-
abs/2411.04459, 2024. ing Reasoning Step through Model Confidence,” arXiv preprint
[117] D. Zhang, J. Wu, J. Lei, T. Che, J. Li, T. Xie, X. Huang, S. Zhang, arXiv:2502.13943, 2025.
M. Pavone, Y. Li, W. Ouyang, and D. Zhou, “LLaMA-Berry: [139] F. Yu, A. Gao, and B. Wang, “Outcome-supervised veri-
Pairwise Optimization for O1-like Olympiad-Level Mathematical fiers for planning in mathematical reasoning,” arXiv preprint
Reasoning,” CoRR, vol. abs/2410.02884, 2024. arXiv:2311.09724, 2023.
[118] D. Zhang, S. Zhoubian, Y. Yue, Y. Dong, and J. Tang, “ReST- [140] J. Uesato, N. Kushman, R. Kumar, F. Song, N. Siegel, L. Wang,
MCTS*: LLM Self-Training via Process Reward Guided Tree A. Creswell, G. Irving, and I. Higgins, “Solving math word prob-
Search,” CoRR, vol. abs/2406.03816, 2024. lems with process-and outcome-based feedback,” arXiv preprint
[119] Y. Xie, A. Goyal, W. Zheng, M. Kan, T. P. Lillicrap, K. Kawaguchi, arXiv:2211.14275, 2022.
and M. Shieh, “Monte Carlo Tree Search Boosts Reasoning via [141] Y. Li, Z. Lin, S. Zhang, Q. Fu, B. Chen, J.-G. Lou, and W. Chen,
Iterative Preference Learning,” CoRR, vol. abs/2405.00451, 2024. “Making language models better reasoners with step-aware veri-
[120] G. Rabby, F. Keya, P. Zamil, and S. Auer, “MC-NEST - Enhanc- fier,” in Proceedings of the 61st Annual Meeting of the Association for
ing Mathematical Reasoning in Large Language Models with Computational Linguistics (Volume 1: Long Papers), 2023, pp. 5315–
a Monte Carlo Nash Equilibrium Self-Refine Tree,” CoRR, vol. 5333.
abs/2411.15645, 2024. [142] Z. Wu, Y. Hu, W. Shi, N. Dziri, A. Suhr, P. Ammanabrolu, N. A.
[121] J. Y. Koh, S. McAleer, D. Fried, and R. Salakhutdinov, “Tree Search Smith, M. Ostendorf, and H. Hajishirzi, “Fine-grained human
for Language Model Agents,” CoRR, vol. abs/2407.01476, 2024. feedback gives better rewards for language model training,”
[122] J. Liu, A. Cohen, R. Pasunuru, Y. Choi, H. Hajishirzi, Advances in Neural Information Processing Systems, vol. 36, pp.
and A. Celikyilmaz, “Don’t throw away your value model! 59 008–59 033, 2023.
Generating more preferable text with Value-Guided Monte- [143] R. Rafailov, A. Sharma, E. Mitchell, C. D. Manning, S. Ermon,
Carlo Tree Search decoding,” 2024. [Online]. Available: and C. Finn, “Direct preference optimization: Your language
https://arxiv.org/abs/2309.15028 model is secretly a reward model,” Advances in Neural Information
[123] C. Zhang, J. Song, S. Li, Y. Liang, Y. Ma, W. Wang, Y. Zhu, and Processing Systems, vol. 36, 2024.
S. Zhu, “Proposing and solving olympiad geometry with guided [144] E. Zelikman, Y. Wu, J. Mu, and N. D. Goodman, “STaR: Boot-
tree search,” CoRR, vol. abs/2412.10673, 2024. strapping Reasoning With Reasoning,” in Advances in Neural
JOURNAL OF LATEX CLASS FILES, JANUARY 2025 20

Information Processing Systems 35: Annual Conference on Neural Linguistics: EMNLP 2023, Singapore, December 6-10, 2023. Asso-
Information Processing Systems 2022, NeurIPS 2022, New Orleans, ciation for Computational Linguistics, 2023, pp. 2550–2575.
LA, USA, November 28 - December 9, 2022, 2022. [160] Y. Xie, K. Kawaguchi, Y. Zhao, X. Zhao, M.-Y. Kan, J. He, and
[145] A. Hosseini, X. Yuan, N. Malkin, A. Courville, A. Sordoni, Q. Xie, “Self-Evaluation Guided Beam Search for Reasoning,”
and R. Agarwal, “V-STaR: Training Verifiers for Self-Taught 2023. [Online]. Available: https://arxiv.org/abs/2305.00633
Reasoners,” 2024. [Online]. Available: https://arxiv.org/abs/ [161] Y. Yao, H. Wu, Q. Xu, and L. Song, “Fine-grained Conversational
2402.06457 Decoding via Isotropic and Proximal Search,” 2023. [Online].
[146] W. Zeng, Y. Huang, L. Zhao, Y. Wang, Z. Shan, and Available: https://arxiv.org/abs/2310.08130
J. He, “B-STaR: Monitoring and Balancing Exploration and [162] J. Chen, W. Lin, J. Mei, and B. Byrne, “Control-DAG: Constrained
Exploitation in Self-Taught Reasoners,” 2024. [Online]. Available: Decoding for Non-Autoregressive Directed Acyclic T5 using
https://arxiv.org/abs/2412.17256 Weighted Finite State Automata,” 2024. [Online]. Available:
[147] X. Guan, L. L. Zhang, Y. Liu, N. Shang, Y. Sun, Y. Zhu, F. Yang, https://arxiv.org/abs/2404.06854
and M. Yang, “rStar-Math: Small LLMs Can Master Math [163] N. Xu, C. Zhou, A. Celikyilmaz, and X. Ma, “Look-back
Reasoning with Self-Evolved Deep Thinking,” 2025. [Online]. Decoding for Open-Ended Text Generation,” 2023. [Online].
Available: https://arxiv.org/abs/2501.04519 Available: https://arxiv.org/abs/2305.13477
[148] Ç. Gülçehre, T. L. Paine, S. Srinivasan, K. Konyushkova, [164] Y. Yao, H. Wu, Z. Guo, B. Zhou, J. Gao, S. Luo, H. Hou, X. Fu, and
L. Weerts, A. Sharma, A. Siddhant, A. Ahern, M. Wang, C. Gu, L. Song, “Learning From Correctness Without Prompting Makes
W. Macherey, A. Doucet, O. Firat, and N. de Freitas, “Rein- LLM Efficient Reasoner,” CoRR, vol. abs/2403.19094, 2024.
forced Self-Training (ReST) for Language Modeling,” CoRR, vol. [165] T. Anthony, Z. Tian, and D. Barber, “Thinking Fast and Slow
abs/2308.08998, 2023. with Deep Learning and Tree Search,” 2017. [Online]. Available:
[149] A. Singh, J. D. Co-Reyes, R. Agarwal, A. Anand, P. Patil, https://arxiv.org/abs/1705.08439
X. Garcia, P. J. Liu, J. Harrison, J. Lee, K. Xu, A. Parisi, A. Kumar, [166] N. Miao, Y. W. Teh, and T. Rainforth, “Selfcheck: Using llms to
A. Alemi, A. Rizkowsky, A. Nova, B. Adlam, B. Bohnet, zero-shot check their own step-by-step reasoning,” arXiv preprint
G. Elsayed, H. Sedghi, I. Mordatch, I. Simpson, I. Gur, J. Snoek, arXiv:2308.00436, 2023.
J. Pennington, J. Hron, K. Kenealy, K. Swersky, K. Mahajan, [167] S. An, Z. Ma, Z. Lin, N. Zheng, J.-G. Lou, and W. Chen, “Learn-
L. Culp, L. Xiao, M. L. Bileschi, N. Constant, R. Novak, ing from mistakes makes llm better reasoner,” arXiv preprint
R. Liu, T. Warkentin, Y. Qian, Y. Bansal, E. Dyer, B. Neyshabur, arXiv:2310.20689, 2023.
J. Sohl-Dickstein, and N. Fiedel, “Beyond Human Data: Scaling [168] Z. Li, X. Hu, A. Liu, K. Zheng, S. Huang, and H. Xiong, “Refiner:
Self-Training for Problem-Solving with Language Models,” 2024. Restructure retrieval content efficiently to advance question-
[Online]. Available: https://arxiv.org/abs/2312.06585 answering capabilities,” arXiv preprint arXiv:2406.11357, 2024.
[150] F. Xu, Q. Sun, K. Cheng, J. Liu, Y. Qiao, and Z. Wu, “Interac- [169] J. Wu, M. Feng, S. Zhang, F. Che, Z. Wen, and J. Tao, “Beyond ex-
tive Evolution: A Neural-Symbolic Self-Training Framework For amples: High-level automated reasoning paradigm in in-context
Large Language Models,” CoRR, vol. abs/2406.11736, 2024. learning via mcts,” arXiv preprint arXiv:2411.18478, 2024.
[151] Y. Qu, T. Zhang, N. Garg, and A. Kumar, “Recursive [170] L. Yang, Z. Yu, T. Zhang, M. Xu, J. E. Gonzalez, B. Cui, and S. Yan,
Introspection: Teaching Language Model Agents How to Self- “Supercorrect: Supervising and correcting language models with
Improve,” 2024. [Online]. Available: https://arxiv.org/abs/2407. error-driven insights,” arXiv preprint arXiv:2410.09008, 2024.
18219 [171] L. Yang, Z. Yu, B. Cui, and M. Wang, “ReasonFlux: Hierarchical
[152] Y. Deng, P. Lu, F. Yin, Z. Hu, S. Shen, Q. Gu, J. Zou, K.-W. Chang, LLM Reasoning via Scaling Thought Templates,” 2025. [Online].
and W. Wang, “Enhancing Large Vision Language Models Available: https://arxiv.org/abs/2502.06772
with Self-Training on Image Comprehension,” 2024. [Online]. [172] Z. Qi, M. Ma, J. Xu, L. L. Zhang, F. Yang, and M. Yang, “Mutual
Available: https://arxiv.org/abs/2405.19716 reasoning makes smaller llms stronger problem-solvers,” arXiv
[153] J. Pang, P. Wang, K. Li, X. Chen, J. Xu, Z. Zhang, and Y. Yu, preprint arXiv:2408.06195, 2024.
“Language Model Self-improvement by Reinforcement Learning [173] D. Zhang, J. Wu, J. Lei, T. Che, J. Li, T. Xie, X. Huang, S. Zhang,
Contemplation,” in The Twelfth International Conference on Learn- M. Pavone, Y. Li et al., “Llama-berry: Pairwise optimization for
ing Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. o1-like olympiad-level mathematical reasoning,” arXiv preprint
OpenReview.net, 2024. arXiv:2410.02884, 2024.
[154] Y. Dubois, C. X. Li, R. Taori, T. Zhang, I. Gulrajani, J. Ba, [174] L. Yang, Z. Yu, T. Zhang, S. Cao, M. Xu, W. Zhang, J. E. Gonzalez,
C. Guestrin, P. Liang, and T. B. Hashimoto, “AlpacaFarm: A and B. Cui, “Buffer of Thoughts: Thought-Augmented Reasoning
Simulation Framework for Methods that Learn from Human with Large Language Models,” arXiv preprint arXiv:2406.04271,
Feedback,” in Advances in Neural Information Processing Systems 2024.
36: Annual Conference on Neural Information Processing Systems [175] H. Yao, J. Huang, W. Wu, J. Zhang, Y. Wang, S. Liu, Y. Wang,
2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023, Y. Song, H. Feng, L. Shen et al., “Mulberry: Empowering mllm
2023. with o1-like reasoning and reflection via collective monte carlo
[155] A. Madaan, N. Tandon, P. Gupta, S. Hallinan, L. Gao, S. Wiegr- tree search,” arXiv preprint arXiv:2412.18319, 2024.
effe, U. Alon, N. Dziri, S. Prabhumoye, Y. Yang, S. Gupta, [176] G. Xu, P. Jin, L. Hao, Y. Song, L. Sun, and L. Yuan, “LLaVA-o1:
B. P. Majumder, K. Hermann, S. Welleck, A. Yazdanbakhsh, and Let Vision Language Models Reason Step-by-Step,” arXiv preprint
P. Clark, “Self-Refine: Iterative Refinement with Self-Feedback,” arXiv:2411.10440, 2024.
in Advances in Neural Information Processing Systems 36: Annual [177] O. Thawakar, D. Dissanayake, K. More, R. Thawkar, A. Heakl,
Conference on Neural Information Processing Systems 2023, NeurIPS N. Ahsan, Y. Li, M. Zumri, J. Lahoud, R. M. Anwer
2023, New Orleans, LA, USA, December 10 - 16, 2023, 2023. et al., “LlamaV-o1: Rethinking Step-by-step Visual Reasoning in
[156] N. Miao, Y. W. Teh, and T. Rainforth, “SelfCheck: Using LLMs LLMs,” arXiv preprint arXiv:2501.06186, 2025.
to Zero-Shot Check Their Own Step-by-Step Reasoning,” in The [178] K. Xiang, Z. Liu, Z. Jiang, Y. Nie, R. Huang, H. Fan, H. Li,
Twelfth International Conference on Learning Representations, ICLR W. Huang, Y. Zeng, J. Han et al., “AtomThink: A Slow Think-
2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024. ing Framework for Multimodal Mathematical Reasoning,” arXiv
[157] Z. Gou, Z. Shao, Y. Gong, Y. Shen, Y. Yang, N. Duan, and W. Chen, preprint arXiv:2411.11930, 2024.
“CRITIC: Large Language Models Can Self-Correct with Tool- [179] Z. Zhang, A. Zhang, M. Li, and A. Smola, “Automatic chain of
Interactive Critiquing,” in The Twelfth International Conference on thought prompting in large language models. arxiv 2022,” arXiv
Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, preprint arXiv:2210.03493.
2024. OpenReview.net, 2024. [180] W. Chen, X. Ma, X. Wang, and W. W. Cohen, “Program of
[158] Q. Zhong, L. Ding, J. Liu, B. Du, and D. Tao, “ROSE Doesn’t Do thoughts prompting: Disentangling computation from reasoning
That: Boosting the Safety of Instruction-Tuned Large Language for numerical reasoning tasks,” arXiv preprint arXiv:2211.12588,
Models with Reverse Prompt Contrastive Decoding,” 2024. 2022.
[Online]. Available: https://arxiv.org/abs/2402.11889 [181] L. Gao, A. Madaan, S. Zhou, U. Alon, P. Liu, Y. Yang, J. Callan,
[159] Y. Weng, M. Zhu, F. Xia, B. Li, S. He, S. Liu, B. Sun, K. Liu, and G. Neubig, “Pal: Program-aided language models,” in Inter-
and J. Zhao, “Large Language Models are Better Reasoners with national Conference on Machine Learning. PMLR, 2023, pp. 10 764–
Self-Verification,” in Findings of the Association for Computational 10 799.
JOURNAL OF LATEX CLASS FILES, JANUARY 2025 21

[182] T. Khot, H. Trivedi, M. Finlayson, Y. Fu, K. Richardson, inforcement learning on the base model,” https://github.com/
P. Clark, and A. Sabharwal, “Decomposed prompting: A Open-Reasoner-Zero/Open-Reasoner-Zero, 2025.
modular approach for solving complex tasks,” arXiv preprint [207] T. Yu, Y. Yao, H. Zhang, T. He, Y. Han, G. Cui, J. Hu, Z. Liu,
arXiv:2210.02406, 2022. H.-T. Zheng, M. Sun et al., “Rlhf-v: Towards trustworthy mllms
[183] D. Zhou, N. Schärli, L. Hou, J. Wei, N. Scales, X. Wang, D. Schuur- via behavior alignment from fine-grained correctional human
mans, C. Cui, O. Bousquet, Q. Le et al., “Least-to-most prompting feedback,” in Proceedings of the IEEE/CVF Conference on Computer
enables complex reasoning in large language models,” arXiv Vision and Pattern Recognition, 2024, pp. 13 807–13 816.
preprint arXiv:2205.10625, 2022. [208] H. Lee, S. Phatale, H. Mansoor, K. R. Lu, T. Mesnard, J. Ferret,
[184] B. Klieger et al., “g1: Using Llama-3.1 70b on Groq to C. Bishop, E. Hall, V. Carbune, and A. Rastogi, “Rlaif: Scaling
create o1-like reasoning chains,” 2024. [Online]. Available: reinforcement learning from human feedback with ai feedback,”
https://github.com/bklieger-groq/g1 2023.
[185] X. Hou, M. Yang, W. Jiao, X. Wang, Z. Tu, and W. X. Zhao, [209] Y.-F. Zhang, T. Yu, H. Tian, C. Fu, P. Li, J. Zeng, W. Xie, Y. Shi,
“CoAct: A Global-Local Hierarchy for Autonomous Agent Col- H. Zhang, J. Wu et al., “Mm-rlhf: The next step forward in
laboration,” arXiv preprint arXiv:2406.13381, 2024. multimodal llm alignment,” arXiv preprint arXiv:2502.10391, 2025.
[186] M. Shen, G. Zeng, Z. Qi, Z.-W. Hong, Z. Chen, W. Lu, G. Wornell, [210] J. Ji, J. Zhou, H. Lou, B. Chen, D. Hong, X. Wang,
S. Das, D. Cox, and C. Gan, “Satori: Reinforcement Learning with W. Chen, K. Wang, R. Pan, J. Li, M. Wang, J. Dai, T. Qiu,
Chain-of-Action-Thought Enhances LLM Reasoning via Autore- H. Xu, D. Li, W. Chen, J. Song, B. Zheng, and Y. Yang,
gressive Search,” arXiv preprint arXiv:2502.02508, 2025. “Align anything: Training all-modality models to follow
[187] OpenAI, “Reinforcement fine-tuning,” 2024. instructions with language feedback,” 2024. [Online]. Available:
[188] Y. Chang, X. Wang, J. Wang, Y. Wu, L. Yang, K. Zhu, H. Chen, https://arxiv.org/abs/2412.15838
X. Yi, C. Wang, Y. Wang et al., “A survey on evaluation of large [211] L. Chen, L. Li, H. Zhao, Y. Song, and Vinci, “R1-V: Reinforcing
language models,” ACM Transactions on Intelligent Systems and Super Generalization Ability in Vision-Language Models with
Technology, vol. 15, no. 3, pp. 1–45, 2024. Less Than $3,” https://github.com/Deep-Agent/R1-V, 2025, ac-
[189] L. Trung, X. Zhang, Z. Jie, P. Sun, X. Jin, and H. Li, “Reft: cessed: 2025-02-02.
Reasoning with reinforced fine-tuning,” in Proceedings of the 62nd [212] H. Shen, Z. Zhang, Q. Zhang, R. Xu, and T. Zhao, “Vlm-r1: A
Annual Meeting of the Association for Computational Linguistics stable and generalizable r1-style large vision-language model,”
(Volume 1: Long Papers), 2024, pp. 7601–7614. https://github.com/om-ai-lab/VLM-R1, 2025, accessed: 2025-
[190] interconnects.ai, “Blob reinforcement fin-tuning,” 2024. 02-15.
[191] K. Gandhi, D. Lee, G. Grand, M. Liu, W. Cheng, A. Sharma, and [213] Y. Peng, G. Zhang, X. Geng, and X. Yang, “Lmm-r1,” https://
N. D. Goodman, “Stream of search (sos): Learning to search in github.com/TideDra/lmm-r1, 2025, accessed: 2025-02-13.
language, 2024,” URL https://arxiv. org/abs/2404.03683, 2024. [214] X. Wang and P. Peng, “Open-r1-video,” https://github.com/
[192] K. Team, A. Du, B. Gao, B. Xing, C. Jiang, C. Chen, C. Li, C. Xiao, Wang-Xiaodong1899/Open-R1-Video, 2025.
C. Du, C. Liao et al., “Kimi k1. 5: Scaling Reinforcement Learning [215] Y. Zheng, J. Lu, S. Wang, and Y. Xiong, “EasyR1: An Effi-
with LLMs,” arXiv preprint arXiv:2501.12599, 2025. cient, Scalable, Multi-Modality RL Training Framework,” https:
[193] K. Zhang, Q. Yao, B. Lai, J. Huang, W. Fang, D. Tao, M. Song, //github.com/hiyouga/EasyR1, 2025.
and S. Liu, “Reasoning with reinforced functional token tuning,”
[216] E. Yeo, Y. Tong, M. Niu, G. Neubig, and X. Yue, “Demystify-
arXiv preprint arXiv:2502.13389, 2025.
ing Long Chain-of-Thought Reasoning in LLMs,” arXiv preprint
[194] Z. Lin, Y. Tang, X. Yao, D. Yin, Z. Hu, Y. Sun, and arXiv:2502.03373, 2025.
K.-W. Chang, “QLASS: Boosting Language Agent Inference
[217] Z. Hou, P. Du, Y. Niu, Z. Du, A. Zeng, X. Liu, M. Huang, H. Wang,
via Q-Guided Stepwise Search,” 2025. [Online]. Available:
J. Tang, and Y. Dong, “Does RLHF Scale? Exploring the Impacts
https://arxiv.org/abs/2502.02584
From Data, Model, and Method,” arXiv preprint arXiv:2412.06000,
[195] G. Cui, L. Yuan, Z. Wang, H. Wang, W. Li, B. He, Y. Fan, T. Yu,
2024.
Q. Xu, W. Chen et al., “Process Reinforcement through Implicit
[218] J. Kim, D. Wu, J. Lee, and T. Suzuki, “Metastable Dynamics of
Rewards,” arXiv preprint arXiv:2502.01456, 2025.
Chain-of-Thought Reasoning: Provable Benefits of Search, RL
[196] M. Luo, S. Tan, J. Wong, X. Shi, W. Tang, M. Roongta, C. Cai,
and Distillation,” 2025. [Online]. Available: https://arxiv.org/
J. Luo, T. Zhang, E. Li, R. A. Popa, and I. Stoica, “DeepScaleR:
abs/2502.01694
Surpassing O1-Preview with a 1.5B Model by Scaling RL,” 2025,
notion Blog. [219] Z. Liu, C. Chen, W. Li, T. Pang, C. Du, and M. Lin, “There May
[197] J. Cheng, L. Li, G. Xiong, J. Shao, and Y. Lv, “Stop gamma decay: Not be Aha Moment in R1-Zero-like Training — A Pilot Study,”
Min-form credit assignment is all process reward model needs https://oatllm.notion.site/oat-zero, 2025, notion Blog.
for reasoning,” 2025, notion Blog. [220] A. Yang, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Li, D. Liu,
[198] H. Team, “Open r1: A fully open reproduction of deepseek-r1.” F. Huang, H. Wei et al., “Qwen2. 5 technical report,” arXiv preprint
https://github.com/huggingface/open-r1, 2025, github Project. arXiv:2412.15115, 2024.
[199] J. Pan, J. Zhang, X. Wang, L. Yuan, H. Peng, and A. Suhr, [221] H. Luo, L. Shen, H. He, Y. Wang, S. Liu, W. Li, N. Tan, X. Cao, and
“TinyZero,” https://github.com/Jiayi-Pan/TinyZero, 2025, ac- D. Tao, “O1-Pruner: Length-Harmonizing Fine-Tuning for O1-
cessed: 2025-01-24. Like Reasoning Pruning,” arXiv preprint arXiv:2501.12570, 2025.
[200] Z. Liu, C. Chen, W. Li, T. Pang, C. Du, and M. Lin, “There may not [222] J. Hu, “REINFORCE++: A Simple and Efficient Approach
be aha moment in r1-zero-like training — a pilot study,” https: for Aligning Large Language Models,” arXiv preprint
//oatllm.notion.site/oat-zero, 2025, notion Blog. arXiv:2501.03262, 2025.
[201] Z. Liu, C. Chen, C. Du, W. S. Lee, and M. Lin, “Oat: A [223] J. Muralidharan and T. Thomas, “Deliberate Problem-solving
research-friendly framework for llm online alignment,” with a Large Language Model as a Brainstorm Aid Using a
[https://github.com/sail-sg/oat](https://github.com/sail- Checklist for Prompt Generation,” The Journal of the Association
sg/oat), 2025. of Physicians of India, vol. 72, no. 5, pp. 89–90, 2024.
[202] X. Li, H. Zou, and P. Liu, “Limr: Less is more for rl scaling,” arXiv [224] J. Jiang, Z. Chen, Y. Min, J. Chen, X. Cheng, J. Wang, Y. Tang,
preprint arXiv:2502.11886, 2025. H. Sun, J. Deng, W. X. Zhao et al., “Technical Report: Enhancing
[203] Z. Xie, L. Chen, W. Mao, J. Xu, L. Kong et al., “Teaching language LLM Reasoning with Reward-guided Tree Search,” arXiv preprint
models to critique via reinforcement learning,” arXiv preprint arXiv:2411.11694, 2024.
arXiv:2502.03492, 2025. [225] F. Lyu et al., “Thinking Claude,” 2024. [Online]. Available:
[204] T. Xie, Z. Gao, Q. Ren, H. Luo, Y. Hong, B. Dai, J. Zhou, K. Qiu, https://github.com/richards199999/Thinking-Claude
Z. Wu, and C. Luo, “Logic-rl: Unleashing llm reasoning with [226] AI-MO, “Aime 2024,” https://huggingface.co/datasets/
rule-based reinforcement learning,” 2025. [Online]. Available: AI-MO/aimo-validation-aime, 2024.
https://arxiv.org/abs/2502.14768 [227] ——, “Amc 2023,” https://huggingface.co/datasets/AI-MO/
[205] H. Zhang, J. Yao, C. Ye, W. Xiong, and T. Zhang, “Online-dpo-r1: aimo-validation-amc, 2024.
Unlocking effective reasoning without the ppo overhead,” 2025, [228] C. He, R. Luo, Y. Bai, S. Hu, Z. L. Thai, J. Shen, J. Hu, X. Han,
notion Blog. Y. Huang, Y. Zhang et al., “Olympiadbench: A challenging bench-
[206] J. Hu, Y. Zhang, Q. Han, D. Jiang, and H.-Y. S. Xiangyu Zhang, mark for promoting agi with olympiad-level bilingual multi-
“Open-reasoner-zero: An open source approach to scaling re- modal scientific problems,” arXiv preprint arXiv:2402.14008, 2024.
JOURNAL OF LATEX CLASS FILES, JANUARY 2025 22

[229] C. E. Jimenez, J. Yang, A. Wettig, S. Yao, K. Pei, O. Press, the Association for Computational Linguistics: EMNLP 2024, 2024,
and K. R. Narasimhan, “SWE-bench: Can Language Models pp. 528–541.
Resolve Real-world Github Issues?” in The Twelfth International [248] Claude, “Claude 3.5 Sonnet,” June 2024. [Online]. Available:
Conference on Learning Representations, 2024. [Online]. Available: https://www.anthropic.com/news/claude-3-5-sonnet
https://openreview.net/forum?id=VTF8yNQM66 [249] G. DeepMind, “Gemini 2.0 Pro,” October 2024. [Online].
[230] N. Jain, K. Han, A. Gu, W.-D. Li, F. Yan, T. Zhang, S. Wang, Available: https://deepmind.google/technologies/gemini/pro/
A. Solar-Lezama, K. Sen, and I. Stoica, “Livecodebench: Holistic [250] I. Team, “InternLM2 Technical Report,” 2024.
and contamination free evaluation of large language models for [251] N. Muennighoff, Z. Yang, W. Shi, X. L. Li, L. Fei-Fei, H. Hajishirzi,
code,” arXiv preprint arXiv:2403.07974, 2024. L. Zettlemoyer, P. Liang, E. Candès, and T. Hashimoto, “s1:
[231] D. Rein, B. L. Hou, A. C. Stickland, J. Petty, R. Y. Simple test-time scaling,” arXiv preprint arXiv:2501.19393, 2025.
Pang, J. Dirani, J. Michael, and S. R. Bowman, “GPQA: [252] OpenAI, “OpenAI o1-mini,” September 2024.
A Graduate-Level Google-Proof Q&A Benchmark,” in First [Online]. Available: https://openai.com/index/
Conference on Language Modeling, 2024. [Online]. Available: openai-o1-mini-advancing-cost-efficient-reasoning/
https://openreview.net/forum?id=Ti67584b98 [253] Y. Ye, Z. Huang, Y. Xiao, E. Chern, S. Xia, and P. Liu,
[232] Y. Wang, X. Ma, G. Zhang, Y. Ni, A. Chandra, S. Guo, W. Ren, “LIMO: Less is More for Reasoning,” 2025. [Online]. Available:
A. Arulraj, X. He, Z. Jiang et al., “Mmlu-pro: A more robust and https://arxiv.org/abs/2502.03387
challenging multi-task language understanding benchmark,” [254] Q. Team, “QVQ: To See the World with Wisdom,”
arXiv preprint arXiv:2406.01574, 2024. December 2024. [Online]. Available: https://qwenlm.github.
[233] S. Yao, H. Chen, J. Yang, and K. Narasimhan, “Webshop: Towards io/blog/qvq-72b-preview/
scalable real-world web interaction with grounded language [255] K. Zhang, B. Li, P. Zhang, F. Pu, J. A. Cahyono, K. Hu, S. Liu,
agents,” Advances in Neural Information Processing Systems, vol. 35, Y. Zhang, J. Yang, C. Li, and Z. Liu, “LMMs-Eval: Reality Check
pp. 20 744–20 757, 2022. on the Evaluation of Large Multimodal Models,” 2024. [Online].
Available: https://arxiv.org/abs/2407.12772
[234] S. Zhou, F. F. Xu, H. Zhu, X. Zhou, R. Lo, A. Sridhar,
[256] O. Contributors, “OpenCompass: A Universal Evaluation
X. Cheng, Y. Bisk, D. Fried, U. Alon et al., “WebArena: A
Platform for Foundation Models,” https://github.com/
Realistic Web Environment for Building Autonomous Agents,”
open-compass/opencompass, 2023.
arXiv preprint arXiv:2307.13854, 2023. [Online]. Available:
[257] M. Song, Z. Su, X. Qu, J. Zhou, and Y. Cheng, “PRMBench:
https://webarena.dev
A Fine-grained and Challenging Benchmark for Process-Level
[235] R. Wang, P. Jansen, M.-A. Côté, and P. Ammanabrolu, Reward Models,” arXiv preprint arXiv:2501.03124, 2025. [Online].
“ScienceWorld: Is your Agent Smarter than a 5th Grader?” 2022. Available: https://arxiv.org/pdf/2501.03124
[Online]. Available: https://arxiv.org/abs/2203.07540 [258] Y. Leviathan, M. Kalman, and Y. Matias, “Fast inference from
[236] A. Prasad, A. Koller, M. Hartmann, P. Clark, A. Sabharwal, transformers via speculative decoding,” in International Confer-
M. Bansal, and T. Khot, “ADaPT: As-Needed Decomposition and ence on Machine Learning, 2023, pp. 19 274–19 286.
Planning with Language Models,” in Findings of the Association [259] X. Ning, Z. Lin, Z. Zhou, Z. Wang, H. Yang, and Y. Wang,
for Computational Linguistics: NAACL 2024, 2024, pp. 4226–4252. “Skeleton-of-thought: Large language models can do parallel
[237] H. Chen, Z. Fang, Y. Singla, and M. Dredze, “Benchmarking decoding,” Proceedings ENLSP-III, 2023.
Large Language Models on Answering and Explaining Challeng- [260] X. Miao, G. Oliaro, Z. Zhang, X. Cheng, Z. Wang, Z. Zhang,
ing Medical Questions,” arXiv preprint arXiv:2402.18060, 2024. R. Y. Y. Wong, A. Zhu, L. Yang, X. Shi et al., “SpecInfer:
[238] D. Jin, E. Pan, N. Oufattole, W.-H. Weng, H. Fang, and Accelerating Generative Large Language Model Serving with
P. Szolovits, “What disease does this patient have? a large-scale Tree-based Speculative Inference and Verification,” arXiv preprint
open domain question answering dataset from medical exams,” arXiv:2305.09781, 2023.
Applied Sciences, vol. 11, no. 14, p. 6421, 2021. [261] B. Qi, X. Chen, J. Gao, D. Li, J. Liu, L. Wu, and B. Zhou,
[239] X. Yue, Y. Ni, K. Zhang, T. Zheng, R. Liu, G. Zhang, S. Stevens, “Interactive continual learning: Fast and slow thinking,” in Pro-
D. Jiang, W. Ren, Y. Sun et al., “Mmmu: A massive multi- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern
discipline multimodal understanding and reasoning benchmark Recognition, 2024, pp. 12 882–12 892.
for expert agi,” in Proceedings of the IEEE/CVF Conference on [262] Y. Zheng, S. Sun, L. Qiu, D. Ru, C. Jiayang, X. Li, J. Lin, B. Wang,
Computer Vision and Pattern Recognition, 2024, pp. 9556–9567. Y. Luo, R. Pan et al., “OpenResearcher: Unleashing AI for Accel-
[240] P. Lu, H. Bansal, T. Xia, J. Liu, C. Li, H. Hajishirzi, H. Cheng, K.- erated Scientific Research,” arXiv preprint arXiv:2408.06941, 2024.
W. Chang, M. Galley, and J. Gao, “MathVista: Evaluating Mathe- [263] B. Romera-Paredes, M. Barekatain, A. Novikov, M. Balog, M. P.
matical Reasoning of Foundation Models in Visual Contexts,” in Kumar, E. Dupont, F. J. Ruiz, J. S. Ellenberg, P. Wang, O. Fawzi
International Conference on Learning Representations (ICLR), 2024. et al., “Mathematical discoveries from program search with large
[241] K. Wang, J. Pan, W. Shi, Z. Lu, M. Zhan, and H. Li, “Measuring language models,” Nature, vol. 625, no. 7995, pp. 468–475, 2024.
multimodal mathematical reasoning with math-vision dataset,” [264] T. H. Trinh, Y. Wu, Q. V. Le, H. He, and T. Luong, “Solving
arXiv preprint arXiv:2402.14804, 2024. olympiad geometry without human demonstrations,” Nature,
[242] Z.-Z. Li, M.-L. Zhang, F. Yin, Z.-L. Ji, J.-F. Bai, Z.-R. Pan, F.-H. vol. 625, no. 7995, pp. 476–482, 2024.
Zeng, J. Xu, J.-X. Zhang, and C.-L. Liu, “Cmmath: A chinese [265] Y. Chervonyi, T. H. Trinh, M. Olšák, X. Yang, H. Nguyen,
multi-modal math skill evaluation benchmark for foundation M. Menegali, J. Jung, V. Verma, Q. V. Le, and T. Luong, “Gold-
models,” arXiv preprint arXiv:2407.12023, 2024. medalist Performance in Solving Olympiad Geometry with Al-
phaGeometry2,” arXiv preprint arXiv:2502.03544, 2025.
[243] M.-L. Zhang, F. Yin, and C.-L. Liu, “A Multi-Modal Neural
[266] G. Team, R. Anil, S. Borgeaud, J.-B. Alayrac, J. Yu, R. Soricut,
Geometric Solver with Textual Clauses Parsed from Diagram,”
J. Schalkwyk, A. M. Dai, A. Hauth, K. Millican et al., “Gemini:
in IJCAI, 2023.
a family of highly capable multimodal models,” arXiv preprint
[244] Z. Xi, Y. Ding, W. Chen, B. Hong, H. Guo, J. Wang, D. Yang, arXiv:2312.11805, 2023.
C. Liao, X. Guo, W. He, S. Gao, L. Chen, R. Zheng, Y. Zou, T. Gui, [267] G. Team, P. Georgiev, V. I. Lei, R. Burnell, L. Bai, A. Gulati,
Q. Zhang, X. Qiu, X. Huang, Z. Wu, and Y.-G. Jiang, “AgentGym: G. Tanzer, D. Vincent, Z. Pan, S. Wang et al., “Gemini 1.5:
Evolving Large Language Model-based Agents across Diverse Unlocking multimodal understanding across millions of tokens
Environments,” 2024. of context,” arXiv preprint arXiv:2403.05530, 2024.
[245] R. Zhang, D. Jiang, Y. Zhang, H. Lin, Z. Guo, P. Qiu, A. Zhou, [268] N. Chen, Z. Zheng, N. Wu, L. Shou, M. Gong, Y. Song, D. Zhang,
P. Lu, K.-W. Chang, P. Gao et al., “MathVerse: Does Your Multi- and J. Li, “Breaking language barriers in multilingual math-
modal LLM Truly See the Diagrams in Visual Math Problems?” ematical reasoning: Insights and observations,” arXiv preprint
arXiv preprint arXiv:2403.14624, 2024. arXiv:2310.20246, 2023.
[246] P. Lu, S. Mishra, T. Xia, L. Qiu, K.-W. Chang, S.-C. Zhu, O. Tafjord, [269] Y. Du, Z. Liu, Y. Li, W. X. Zhao, Y. Huo, B. Wang, W. Chen, Z. Liu,
P. Clark, and A. Kalyan, “Learn to Explain: Multimodal Reason- Z. Wang, and J.-R. Wen, “Virgo: A preliminary exploration on
ing via Thought Chains for Science Question Answering,” in The reproducing o1-like mllm,” arXiv preprint arXiv:2501.01904, 2025.
36th Conference on Neural Information Processing Systems (NeurIPS), [270] M. Parmar and Y. Govindarajulu, “Challenges in Ensuring AI
2022. Safety in DeepSeek-R1 Models: The Shortcomings of Reinforce-
[247] Y. Li, Y. Guo, F. Guerin, and C. Lin, “An open-source data ment Learning Strategies,” arXiv preprint arXiv:2501.17030, 2025.
contamination report for large language models,” in Findings of

You might also like