Reasoning Language Models - A Blueprint
Reasoning Language Models - A Blueprint
Abstract—Reasoning language models (RLMs), also known as Large Reasoning Models (LRMs), such as OpenAI’s o1 and o3,
DeepSeek-V3, and Alibaba’s QwQ, have redefined AI’s problem-solving capabilities by extending large language models (LLMs) with
advanced reasoning mechanisms. Yet, their high costs, proprietary nature, and complex architectures—uniquely combining
Reinforcement Learning (RL), search heuristics, and LLMs—present accessibility and scalability challenges. To address these, we
propose a comprehensive blueprint that organizes RLM components into a modular framework, based on a survey and analysis of all
arXiv:2501.11223v2 [cs.AI] 22 Jan 2025
RLM works. This blueprint incorporates diverse reasoning structures (chains, trees, graphs, and nested forms), reasoning strategies
(e.g., Monte Carlo Tree Search, Beam Search), RL concepts (policy, value models and others), supervision schemes (Outcome-Based
and Process-Based Supervision), and other related concepts (e.g., Test-Time Compute, Retrieval-Augmented Generation, agent
tools). We also provide detailed mathematical formulations and algorithmic specifications to simplify RLM implementation. By showing
how schemes like LLaMA-Berry, QwQ, Journey Learning, and Graph of Thoughts fit as special cases, we demonstrate the blueprint’s
versatility and unifying potential. To illustrate its utility, we introduce x1, a modular implementation for rapid RLM prototyping and
experimentation. Using x1 and a literature review, we provide key insights, such as multi-phase training for policy and value models,
and the importance of familiar training distributions. Finally, we discuss scalable RLM cloud deployments and we outline how RLMs can
integrate with a broader LLM ecosystem. Our work demystifies RLM construction, democratizes advanced reasoning capabilities, and
fosters innovation, aiming to mitigate the gap between “rich AI” and “poor AI” by lowering barriers to RLM development and
experimentation.
Index Terms—Reasoning Language Model, Large Reasoning Model, Survey of Reasoning Language Models, Survey of RLMs, RLM,
LRM, Reasoning LLMs, Reinforcement Learning for LLMs, MCTS for LLMs, Large Language Model, LLM, Generative AI.
HPC
Petascale era Titan P100 GH200
A100 GB200
Piz Daint Tianhe-2 V100 Start of the Alps
Exascale era
Breakthroughs Breakthroughs
in compute in compute Breakthroughs in
The ongoing growth of compute power and data processing capabili�es of supercomputers and resources enabled resources enabled compute resources
high performance systems, previously driven by Moore's law and now by the massively parallel breakthroughs the introduc�on enabled the
processing capabili�es of GPUs, TPUs, and AI accelerators. in RL models of LLMs introduc�on of RLMs
AlphaZero
Deep OpenAI
Q-Network
RL
Value model Policy model
AlphaGo Five AlphaFold
(neural network) the branching factor is (neural network)
identical for all nodes AlphaZero MuZero DreamerV3
π π
π π
v=0.001 v=0.11 v=0.04 v=0.09
RL models for board games
become a pillar of RLMs
π π π π
π
LLM
the path, which keeps sta�s�cs for each (state, ac�on)-pair (edge). At the end, it
Transformer GPT-3
chooses the most promising ac�on from the root and prepares the next move. GPT-2 LaMDA LLaMA
LLM
PaLM GPT-4o
GPT-4
RLM Transformer
RLM
autoregressive autoregressive
token generation token generation
π π x1
π π TS-LLM: the first proposal
Numbers are blue Look up Quicksort Sor�ng is simple Split into two sets to use AlphaZero-like tree DeepSeek o3
search to enhance LLM's
v=0.001 v=0.11 v=0.04 v=0.09 π π
training & decoding
π π π
Quicksort sorts Split into two sets "3,2,4,5" & "7,12,5,6" "3,4" & "5,2"
Pick Pivot
numbers
v=0.08 v=0.01 v=0.05 v=0.12 v=0.01 2010 2015 2020 2025
Fig. 2: The history of RLMs. This class of models has been the result of the development of three lines of works: (1) Reinforcement Learning based models such as
AlphaZero [128], (2) LLM and Transformer based models such as GPT-4o [109], and (3) the continuous growth of compute power and data processing capabilities of
supercomputers and high performance systems.
The technical foundations of RLMs remain opaque and imize the clarity and comprehensiveness, we present the
complex, compounding the accessibility challenge. Emerg- blueprint using three perspectives: (1) architecture diagrams
ing analyses suggest that their design likely integrates el- and descriptions, (2) detailed mathematical formulations,
ements such as Monte Carlo Tree Search (MCTS) or Beam and (3) in-depth algorithmic specifications. By employing
Search, reinforcement learning (RL), process-based super- these complementary perspectives, we aim to provide a
vision (PBS) [83], [83], [143], [143], and advanced in-context clear and actionable guide for developing RLMs tailored to
learning (ICL) techniques like Chain-of-Thought (CoT) [152] specific applications, settings, and constraints.
or Tree of Thoughts (ToT) [161], and possibly even retrieval- Our blueprint comprehensively encompasses the poten-
augmented generation (RAG) [13], [53], [78], [79]. tial building blocks of RLMs, offering a flexible and modular
Additionally, these architectures employ multiple spe- framework. It incorporates a variety of reasoning structures,
cialized subcomponents—such as synthetic data generation such as chains, trees, graphs, and even higher-order struc-
engines and policy, value, and reward models—trained tures such as hierarchical (or nested) trees, along with nu-
through some form of novel loss functions and possibly sev- merous operations that transform and advance the reason-
eral fine-tuning schemes. However, the intricate interplay of ing process. The blueprint supports different granularities
these components and their integration into a cohesive and of reasoning steps, ranging from individual tokens to full
effective architecture remains poorly understood. Here, the sentences or structured segments. Additionally, it enables
“holy-grail question” is: what is the detailed design of an RLM diverse training schemes, including Outcome-Based Super-
and how to make it simultaneously achieve effectiveness (i.e., high vision (OBS) and PBS, and the related Outcome & Process
accuracy in delivered answers), low cost, and scalability? Reward Models (ORMs & PRMs). Next, in order to illustrate
To help answer this question and to address the above the capability of the blueprint to accommodate novel design
challenges, we propose a comprehensive blueprint for ideas, we describe several novel schemes and how they
constructing, analyzing, and experimenting with RLMs fit within the blueprint. One such example is Trace-Based
(contribution #1; a roadmap of all the contributions and the Supervision (TBS), which extends PBS by incorporating
paper is in Figure 1). Our approach identifies and crystal- labeled traces of traversal paths through entire reasoning
lizes the fundamental building blocks of RLMs, organizing structures, rather than just linear chains of reasoning steps.
them into a cohesive framework. This blueprint is presented By unifying all these components, our blueprint serves as
with increasing levels of granularity, starting from high- a versatile toolbox for constructing RLMs—ranging from
level overview, finishing at low-level details that can be simple models to sophisticated designs—tailored to specific
directly harnessed when implementing. Further, to max- reasoning tasks and performance objectives.
3
Examples: OpenAI o1, OpenAI o3, QwQ, DeepSeek-V3 Large Language Models (LLMs) Reasoning Language Models (RLMs)
See §2.1.1 See §2.2
See §2.1.1 See §2.1.2 See §2.1.3 Capable of System 1 Thinking; Capable of System 2 Thinking;
Pillar 1: Pillar 2: Pillar 3: can do Interpola�on (see §2.3) be�er at Extrapola�on (see §2.3)
Large Language Reinforcement High-Performance
Models (LLMs) Learning (RL) Compu�ng (HPC) Examples: GPT-4o, LLaMA, Qwen Examples: o1, o3, DeepSeek-V3, QwQ
autoregressive autoregressive
token generation token generation autoregressive
token generation
Fig. 3: Hierarchy of language models (right) and the three pillars of RLMs (left).
We conduct a broad analysis of existing reasoning optimize performance across diverse applications. By bridg-
schemes (contribution #2), demonstrating how they fit into ing the gap between conceptual advancements and practical
our blueprint as special cases. This analysis encompasses implementations, this work seeks to accelerate progress in
not only standard MCTS and reinforcement learning-based the field, unlock new possibilities for intelligent systems
designs, such as LLaMA-Berry [169], but also models like across research, industry, and education, and to mitigate the
QwQ [140]. Additionally, we include paradigms diverging risk of the growing gap between “rich AI” and “poor AI”.
from standard MCTS, such as Journey Learning [113] or
Beam Search, which redefines reasoning through implicit
2 E VOLUTION & F OUNDATIONS OF RLM S
long-chain structures, and advanced structured prompt-
ing techniques like CoT [152], ToT [161], and Graph of We first summarize the evolution and foundations of rea-
Thoughts [9]. We also consider reasoning utilities such as soning language models. Figure 2 shows an overview of the
Retrieval-Augmented Generation (RAG) and data stores, history of the development of these models.
tools, and others. By mapping these diverse approaches to
one blueprint, we showcase its versatility and expressive 2.1 Basic Pillars of Reasoning LMs
power, highlighting its ability to unify a wide range of The development of reasoning-capable LLMs represents a
reasoning methodologies within a coherent framework. convergence of three critical threads: (1) advances in LLMs
To demonstrate the utility of our framework, we in- such as GPT-4, (2) RL designs such as AlphaZero, and (3)
troduce x1, a modular and user-friendly implementation2 High-Performance Computing (HPC) resources. Together,
designed to simplify the process of developing and exper- these threads have shaped models capable of efficient Sys-
imenting with new RLM architectures, covering not only tem 2 Thinking – a level of reasoning that combines explicit
training and inference, but also synthetic data generation deliberation with novel problem-solving abilities, distinct
(contribution #3). We design x1 to facilitate supporting var- from the intuitive, fast, and automatic heuristics of System 1
ious optimizations, design decisions, and overall scalability, Thinking. Figure 2 compares example designs in these pillars
such as batch processing, making it a well-suited founda- while Figure 3 (left side) further discusses the details of
tion of experimentation infrastructure. We also discuss key these pillars.
aspects of deployment in cloud environments, ensuring that
x1 can be seamlessly integrated into modern infrastructure 2.1.1 Large Language Models: A Reservoir of Knowledge
for both research and production use cases. LLMs such as GPT-4o [109] or Llama [50] represent an
By providing both theoretical insights and practical extraordinary leap in the field of AI, constituting a vast
tools, this work aims to democratize access to advanced repository of world knowledge encoded directly in their
RLMs, enabling researchers and practitioners to design, weights. Trained on huge corpora of text from diverse
train, and deploy sophisticated reasoning models with re- sources, LLMs are capable of understanding and generating
duced complexity and cost. Our blueprint offers a clear and human language with remarkable fluency. However, their
adaptable framework that lowers the barriers to entry, fos- reasoning abilities largely align with the fast, automatic, and
tering broader experimentation and innovation. Addition- intuitive System 1 Thinking. While they can generate co-
ally, the modular implementation of x1 serves as a founda- herent responses and even perform simple reasoning tasks,
tion for rapid prototyping and large-scale experimentation, LLMs have limitations. The reasoning they exhibit is often
empowering users to explore new reasoning paradigms and shallow, rooted in the simple mechanism of predicting the
next most probable token in a sequence rather than engag-
2
https://github.com/spcl/x1 ing in explicit problem-solving or structured analysis. While
4
LLMs may generate plausible-sounding solutions to a prob- models, supporting the combination of vast knowledge, rea-
lem, these outputs are the result of statistical language mod- soning capabilities, and computational scalability – allowing
eling rather than a deliberate, iterative reasoning process. AI evolution to continue beyond the limits of traditional
This distinction highlights the need for integrating more Moore’s Law scaling.
advanced mechanisms capable of explicit reasoning into AI
systems—paving the way for hybrid designs that combine
the knowledge-rich foundation of LLMs with structured 2.2 The Convergence: System 2 Thinking in AI
reasoning methodologies. The intersection of these three threads – LLMs, RL, and HPC
– has culminated in the emergence of models capable of
2.1.2 Reinforcement Learning: Exploring and Innovating
System 2 Thinking. These advanced systems combine the
RL has historically provided a framework for decision- knowledge-rich foundation of LLMs with the exploratory
making and exploration in environments where an agent and optimization capabilities of RL, all supported by the
must learn optimal strategies through trial and error. Land- scalability and performance of modern HPC. The result is a
mark systems like AlphaZero [128] and a long line of others new class of AI models that can engage in explicit, deliberate
such as AlphaGo [127] or MuZero [124] demonstrated the reasoning processes.
profound potential of RL by achieving superhuman per-
These models possess a world model encoded in the
formance in games such as chess, shogi, and Go. Unlike
weights of their LLM components, allowing them to reason
traditional AI systems, AlphaZero began with no embedded
about complex scenarios and contexts. Their RL capabilities
domain knowledge. Instead, it mastered these games purely
combined with the HPC capabilities enable them to navigate
through self-learning, discovering novel strategies that even
truly immense decision spaces, evaluate multiple strategies,
human experts had not considered.
and iteratively refine solutions.
One of the most striking examples of RL’s innovative
capacity came during an AlphaZero match, where the sys-
tem made a move initially deemed a mistake by human 2.3 Interpolation (LLMs) vs. Extrapolation (RLMs)
observers. This move [100] later proved to be both sur-
prising and strategically brilliant, illustrating the capacity Standard LLMs, driven by their autoregressive token pre-
of RL agents to explore unconventional solutions that lie diction mechanism, primarily perform interpolation within
outside the bounds of human intuition. Such capabilities are the vast search space of solutions. They excel at generating
fundamentally rooted in RL’s ability to navigate vast search responses that align with patterns seen in their training data,
spaces effectively. effectively synthesizing knowledge from known contexts.
However, traditional RL systems lacked the ability to However, this process limits them to producing outputs that
encode real-world knowledge or handle complex, multi- remain within the boundaries of their training distribution.
faceted reasoning tasks. This limitation spurred the integra- In contrast, reasoning LMs enable extrapolation beyond
tion of RL principles with LLMs, combining the structured these boundaries. By combining structured exploration, rea-
exploration and optimization capabilities of RL with the soning LMs navigate uncharted areas of the solution space,
knowledge-rich reasoning foundation of language models. generating novel insights and solutions that extend past the
limits of their training data. This enables a shift from basic
2.1.3 HPC: Scalability & Efficiency pattern completion to active problem-solving.
The growth of LLM and RL systems has been propelled
by advancements in High-Performance Computing (HPC).
Initially driven by Moore’s Law, which enabled a doubling 2.4 Hierarchy of Reasoning-Related Models
of transistor density approximately every two years, HPC The evolution of RLMs can be understood as a hierarchical
benefited from both technological advancements and the progression, with earlier models such as GPT-4o being less
economic feasibility of manufacturing smaller transistors. capable in terms of reasoning, and the o1-like architectures
However, as the costs of further miniaturization have risen demonstrating increasing sophistication and explicit reason-
sharply, Moore’s Law has reached practical limits, necessi- ing abilities. This hierarchy reflects the integration of System
tating alternative strategies like parallelism and heteroge- 1 (LLMs) and System 2 (RLMs) Thinking. RLMs can be
neous computing. further divided based on how reasoning is implemented
Modern HPC systems rely heavily on GPUs, TPUs, into Implicit RLMs and Explicit RLMs; the details of this
and AI accelerators for their parallel processing capabil- categorization can be found in Figure 3 (the right side).
ities, alongside CPUs for sequential and general-purpose
tasks. Heterogeneous computing leverages these compo-
2.4.1 Implicit Reasoning Models
nents to optimize task-specific performance. Distributed
frameworks, employing techniques such as data, model, In this subclass, the reasoning structure is embedded
and pipeline parallelism [8], [12], [16], further enable the entirely within the model’s weights. Models such as
training of enormous models across thousands of compute QwQ [140] operate as “black boxes”, where reasoning is im-
nodes. plicit and cannot be explicitly disentangled or manipulated.
Energy efficiency innovations, including sparsity, quanti- While these models exhibit improved reasoning capabilities
zation, and pruning, mitigate the growing energy demands compared to standard LLMs, their reasoning processes are
of scaling AI systems. These advancements ensure that HPC opaque and rely on the internalized patterns learned during
remains a cornerstone for developing and deploying AI training.
5
2.4.2 Explicit Reasoning Models strategies, described below). This approach, inspired by
These models introduce explicit reasoning mechanisms ex- methods used in AlphaZero, ensures that the search process
ternal to the model’s core weights. Examples include de- is both efficient and directed toward promising solutions.
signs such as LLaMA-Berry [169], Marco-o1 [174], and po- The policy model 4 is responsible for generating new
tentially OpenAI’s o3, which incorporate mechanisms like reasoning steps at each node, predicting the next most
explicit MCTS combined with RL for decision-making. This likely and logical steps to expand the reasoning process.
explicit structure enables the model to simulate, evaluate, Meanwhile, the value model 5 evaluates the quality of a
and refine solutions iteratively, facilitating novel problem- reasoning path starting at a given node, helping the system
solving and extrapolation. By separating reasoning from prioritize the most promising steps to follow. Sometimes, a
the static knowledge encoded in the weights, these models reward model3 6 is used instead, to assess the quality of
achieve greater flexibility and interpretability in their rea- an individual specific node and its corresponding reasoning
soning processes. Note that the explicit reasoning can be step. In our blueprint, as detailed in the next section, we
internalized via training making it implicit – we discuss it abstract the models into a more general notion of operators
later in the blueprint. 7 to enable more flexibility in how they are implemented.
The search and reasoning processes continue iteratively
until a terminal step is reached 8 . This terminal step
3 E SSENCE OF R EASONING LM S represents a completion of the reasoning chain that forms
We now describe the general architecture of RLMs, which the final answer to the posed problem. It serves as the leaf
we summarize in Figure 4. In the following sections, we node in the tree, concluding that particular reasoning path.
generalize this description to the full RLM blueprint. This architecture provides a unified framework that
accommodates a wide range of reasoning tasks. Whether
3.1 Basic Architecture, Pipelines, & Concepts reasoning steps are fine-grained (e.g., individual token se-
quences) or coarse-grained (e.g., entire reasoning chains
We now outline the foundational architecture, operational treated as single nodes), the architecture adapts seamlessly.
pipelines, and core concepts. Figure 4 offers three levels of By structuring the search space explicitly and guiding ex-
detail. In general (the top-left part), the whole RLM archi- ploration with policy and value models, the RLM achieves
tecture consists of three main pipelines: inference, training, a level of reasoning capability bridging intuitive pattern
and data generation. The inference serves user requests, recognition and deliberate problem-solving.
using models (e.g., the value or policy model) provided by A detailed specification of the inference pipeline can be
the training pipeline. Data generation mirrors the inference found in Appendix C.1 and in Algorithm 1.
pipeline in its internal design; the main difference is that
it runs independently of the user requests, generating data
3.1.2 Training
that is then used to re-train the models. As such, training
combined with data generation from various domains [121], Training details depend on what model is trained (value,
[168] offers self-learning capabilities and is analogous to the policy, reward, ...). In general, we assume fine-tuning a
self-play setting of AlphaZero [128]. model such as Llama. Here, we follow an approach where
one first harnesses supervised data, usually coming from
3.1.1 Inference existing datasets such as PRM800K [83] 1 , which becomes
The inference process begins when the user provides an a part of the supervised training data 2 used in the su-
input prompt 1 , which typically describes the problem or pervised training pipeline 3 of the framework to train
question to be addressed by the RLM. This input serves some, or all, of the models 4 considered in the blueprint.
as the root of the reasoning process and initiates the con- The second part of the overall training framework in RLMs
struction of a reasoning structure 2 that organizes RLM’s is the unsupervised (self-learning) training pipeline, in
progress. The structure is usually represented as a tree. which training data is being continually generated 5 and
The root of this tree corresponds to the user’s input, and used to improve the models. The data can be obtained from
subsequent nodes are generated to explore the search space inference, assuming quality control [52], but also from a
– the domain of possible reasoning paths or solutions. dedicated synthetic data generation pipeline that mirrors
The purpose of this reasoning structure is to systematically that of the inference. To collect the data, one executes the
investigate potential solutions, progressively refining and respective RLM pipeline for a given input task and gathers
extending reasoning paths to converge on an optimal or the results 6 ; depending on how detailed the gathering
satisfactory answer. process is, the data collected can contain only outcome-
An individual point in the search space, represented as a based labels 7 , process-based labels 8 , or some other
node in the reasoning structure, corresponds to a reasoning variant such as trace-based labels 9 suggested in our
step 3 . A reasoning step is defined as a coherent and blueprint, that generalize process-based samples to samples
self-contained unit of thought – a sequence of tokens that that contain also information about operators applied dur-
advances the solution by either exploring a new branch of ing the task solution process. All this data becomes a part of
the problem or building upon existing progress. These steps the replay buffer 10 and is used in the unsupervised training
form the building blocks of the reasoning process.
3
The details of how the structure evolves are usually We use a naming scheme in which a model used to estimate the quality
of a whole reasoning path starting at a given node, is called the value
governed by the MCTS scheme, enhanced with policy model, while a model used to estimate the quality of a given reasoning
and value models (we also distinguish other reasoning step, is called the reward model.
6
Legend Medium-level overview (§3.1)
References to descrip�ons
Part of the Reasoning Models & Training 1 in text (inference pipeline) One can use external data
pipeline scheme Operators Data References to descrip�ons such as human-prepared External sources
1 in text (training pipelines) user chains of thoughts
provide data
Implicit Explicit RLM Reasoning
High-level overview (§3.1) RLM Scheme (§4.2.2)
executes
Inference
user executes ... Training
Models are used to run inference uses
New self-learning data is generated and used by training uses Data ... ...
Data
Inference Reasoning Scheme Training Genera�on
...
· Policy model uses uses
Inference uses reasoning scheme ... generates
· Value model
More Reasoning uses
Self-Learning ... ...
details Training uses
· Supervised u�li�es
fine-tuning data
Data ... · Replay buffer trains
Genera�on
More
details
Detailed view (§3.1.1, §3.1.2)
user External sources
1 provide data
Implicit 1 Explicit RLM
13 RLM
Reasoning Scheme Reasoning Reasoning Reasoning
Scheme = Structure + Strategy
Inference executes
Input Sort the numbers "3,2,4,5,6,12,5,6"
Numbers
are blue
(can also be used for
Split into two sets 2
data genera�on 5 )
uses uses v=3 v=9 3 v=2 v=6
Look up Sor�ng "3,4" & "5,2" Supervised
Quicksort is simple Instance of
Reasoning fine-tuning
Pick Structure
Pivot v=1 v=2 v=8 v=3 v=5 data
Split into
2
5 Quicksort two sets "3,2,4,5" & "7,12,5,6"
sorts nubers
Data v=3 v=2
Genera�on executes "3,2,4" & "5,2,6"
"3,2,4,5" & "6,12,5,6"
8
Output "2,3,4,5,5,6,6,12"
uses uses Training
Data collec�on Data
6
Reasoning 7 Samples for Outcome-based Supervision
is included
u�li�es Sort the numbers "3,2,4,5,6,12,5,6" "2,3,4,5,5,6,6,12"
into
8 Samples for Process-based Supervision
Tools
Sort the numbers "3,2,4,5,6,12,5,6" Look up Quicksort is included Unsupervised
Databases, RAG Split into two sets "3,2,4" & "5,2,6" "2,3,4,5,5,6,6,12"
into fine-tuning
data
Web access 9 Samples for Trace-based Supervision
is included
(replay
Sort the numbers "3,2,4,5,6,12,5,6" Generate into 10
Agents buffer)
Numbers are blue Evaluate Backtrack Generate
Coding on-the-fly
Look up Quicksort Select "2,...,12"
...
7 Operators
(§4.3) Generate Refine Evaluate Backtrack Select Prune
Fig. 4: Overview of a general RLM design and core concepts. We provide a high-level overview (the top-left part), a more detailed medium-level overview (the
top-right part), and a very detailed diagram showing the inference and training pipelines (the bottom part). A detailed specification of the inference pipeline can be
found in Appendix C.1 and in Algorithm 1. Details on the pipelines for different training phases and paradigms can be found in Appendices C.2 and C.3 as well as in
Algorithms 2–7. The data generation pipeline is detailed in Appendix D.
7
scheme 11 or it can also be used to train 12 a model that 4.1 Overview & Main Components
would become an Implicit RLM 13 .
The blueprint specifies a toolbox of components that can be
A detailed specification of the pipelines for different
used to build an arbitrary RLM. We identify several classes
training phases and paradigms can be found in Appen-
of such components. First, an RLM includes a reasoning
dices C.2 and C.3 as well as in Algorithms 2–7. The data
scheme, which specifies a reasoning structure (e.g., a tree)
generation pipeline is detailed in Appendix D.
together with a reasoning strategy (e.g., MCTS) of how
this structure evolves in order to solve a given input task.
3.2 Encompassing Diverse RLM Architectures Second, there is a set of operators (e.g., Refine) that can
The above-described design is applicable to many RLM be applied to the reasoning structure (as specified by the
designs. However, there are numerous other variants of reasoning strategy) in order to evolve it and make progress
architectures, some of which do not fully conform to this towards solving the input task. Operators are specified
framework. In this section, we discuss these variants, high- based on what they do (i.e., what effect they have on the
lighting how our blueprint accommodates such variations. reasoning structure). How this effect is achieved, depends on
In some RLM designs [169], a single node in the MCTS how a given operator is implemented. Here, many operators
tree could represent an entire reasoning structure, such as rely on neural models (e.g., Policy Model), which – together
a complete chain of reasoning steps. In this case, the ac- with their training paradigms – form the third class of
tion space involves transitioning between different reason- the blueprint components. Finally, we also distinguish a
ing structures rather than individual steps. This approach set of pipelines, i.e., detailed specifications of operations that
changes the nature of the search, as the focus shifts from orchestrate the interaction between the reasoning scheme
iteratively constructing a single reasoning path to evaluating and the operators in order to achieve a specific objective,
and refining entire structures within the search space. Our such as training, inference, or data generation. Hence, an
blueprint accommodates this with the concept of nesting, RLM can be defined as a composition of a reasoning scheme, a
where a node in the reasoning structure can contain another set of operators and associated models, and a set of pipelines.
reasoning structure.
Other architectures introduce even more novel
paradigms. For instance, Journey Learning [113] adds 4.2 Reasoning Scheme
an additional layer of complexity by incorporating a
A reasoning scheme is the part of the blueprint that specifies
transformation step that “rewires” the search or reasoning
the details of the reasoning steps progressing toward the
structure. This transformation consolidates multiple paths
solution, how they are interconnected to form coherent
in the tree, synthesizing them into a new form that is used
chains, trees, or more complex reasoning structures, and
as input for subsequent reasoning iterations.
how these structures evolve in the course of solving the
Despite these variations, our blueprint is sufficiently
input task.
general to encompass all these cases and beyond, as we
illustrate more formally in the following. This generality
ensures that the blueprint is not only applicable to existing 4.2.1 Reasoning Step
designs but also provides a foundation for future innova- A reasoning step is a fundamental unit of the reasoning
tions in RLM development. structure – a sequence of tokens that advances the RLM
towards the solution. Reasoning steps can vary in length,
3.3 Integration with Broader LLM Agent Ecosystems ranging from a single token to entire segments of text. The
The integration of RLMs into broader LLM agent ecosys- variability in their granularity depends on the user design
tems would enable these models to interact dynamically choice. In existing schemes, a reasoning step is typically
with external tools, databases, and resources during exe- conceptualized as a “coherent and self-contained unit of
cution. This interaction can occur within the inference or thought”. For instance, in mathematical proofs, this may
data generation pipeline, leveraging value or policy models correspond to an individual logical argument or deduction.
to extend the reasoning process through access to retrieval- The flexibility in defining reasoning steps allows mod-
augmented generation (RAG), web queries, and specialized els to adapt to different problem domains, balancing fine-
tools. For example, during a reasoning task, the value or the grained and coarse-grained reasoning. Coarse steps, such
reward model could query a database to verify intermediate as logical arguments (or even complete reasoning path-
steps, ensuring factual correctness or retrieving additional ways [169]), simplify preparation and adoption of training
context to refine its reasoning. Similarly, these models could data, enhance interpretability, and – as we discuss in Sec-
utilize computational tools for mathematical or symbolic tion 8 – reduce computational overhead. On the other hand,
computations, thereby expanding the scope and accuracy single-token steps enable the utilization of concepts like
of their reasoning. token entropy [96] to incorporate the model’s uncertainty,
as well as the integration of advanced decoding schemes
(e.g., speculative decoding [77] or contrastive decoding [80])
4 B LUEPRINT FOR R EASONING LM S explicitly into the RLM design. Yet, while making the rea-
We now introduce our RLM blueprint that can be used to soning steps more fine-grained allows for a more detailed
develop novel reasoning models and to provide ground for exploration of solution paths, this increased flexibility re-
analysis, evaluation, and comparison of such designs. We sults in greater computational demands, particularly when
overview the blueprint in Figure 5. combined with search algorithms such as MCTS.
8
1 Reasoning Scheme (§4.2) A toolbox of paradigms for modeling and evolving the reasoning structure
Coarse-grained (e.g., unit of thought) Chain Tree Example: TS-LLM, Graph Example: Nes�ng A node can
(not a par�cularly Tree of Thoughts Graph of Thoughts contain another
good reasoning step) Sort the numbers "3,2,4,5,6,12,5,6"
Input task Input task Input task Input task Input task structure
(input task statement)
Numbers
are blue
(a reasonably good
reasoning step)
Split into two sets
Look up
Quicksort Sor�ng
is simple
Toolbox of
reasoning schemes Toolbox of pipelines 4 Pipelines Inference: §3.1.1, Appendix 3.1
Toolbox of operators
RLM Toolbox of models
Training: §3.1.2, Appendix 3.2 - 3.4
Data genera�on: Appendix D
More details
Supervised fine-tuning Reasoning Policy Op�miza�on on training in
Appendix C
+ =
Example:
Journey
Learning 3.3 Training Data Scope (§4.4.2)
What informa�on does a single training sample contain?
Example 21 Example 1
back- Input:
-track Order the following numbers in ascending order: "3,2,4,5,6,12,5,6"
Input Output
Output:
Example: Backtrack "2,3,4,5,6,12"
Step selected for further scheme in Tree
expansion (e.g., due to a high
score from a Value Model) of Thoughts Value
Fig. 5: A blueprint for reasoning LMs. It consists of four main toolboxes: the reasoning scheme (the top part), operators (the bottom-left part), and models (the
bottom-right part); pipelines are mentioned in the center and detailed in Appendix C.1 and in Algorithm 1 (the inference pipeline), Appendix C.2, Appendix C.3, and
in Algorithms 2–7 (the training pipelines), and in Appendix D (the data generation pipeline).
9
organization could be particularly useful for multi-step reasoning step. For example, it could address ambiguities,
tasks where high-level decisions guide low-level computa- correct errors, and optimize inefficiencies, resulting in a
tions, such as meta-reasoning frameworks [169]. One could more robust version of the step [94]. It could also integrate
harness any other higher-order structures, such as hyper- suggestions from self-critique [122] (evaluates steps to
graphs, motifs, and others [10], [11], [14], [17]. identify weaknesses and suggest targeted improvements),
summarization [178] (condenses key elements into concise
4.2.3 Reasoning Strategy representations to streamline the reasoning structure), or
The reasoning strategy governs how the reasoning structure rephrasing [42] (reformulates steps to improve clarity and
evolves, specifying the process by which new reasoning coherence while preserving their logical integrity).
steps are added and integrated. Example strategies include: • Aggregate This operator combines multiple reasoning
• MCTS [73] A popular approach that balances exploration steps, paths, or structures into the next individual step.
and exploitation by simulating multiple reasoning paths This enables consolidating information or improving co-
and selecting the most promising one based on a scoring herence. It is used in Ensemble Methods [18] or in Graph
function. of Thoughts [9].
• Beam Search [131] A breadth-limited search that keeps • Prune This operator removes nodes or reasoning steps
a fixed number of top-ranked continuations at each step. from the structure that are deemed suboptimal or irrel-
While commonly used for decoding token sequences, evant based on evaluation metrics. It enables optimizing
beam search can also apply to reasoning steps. the reasoning structure in order to, e.g., reduce token costs.
• Ensemble Methods These methods involve aggregating • Restructure The Restructure operator applies arbitrary
multiple independent reasoning strategies, such as com- transformations to the reasoning structure, enabling flex-
bining chains and trees to enhance robustness and accu- ible reorganization of its components. A notable example
racy. One example is Best-of-N [45], [150] – a strategy is the conversion of a reasoning tree into a linear chain by
where multiple independent reasoning paths are gener- rearranging its branches into a sequential series of steps,
ated, and the most effective solution is selected based on as done in Journey Learning [113]. This restructuring
predefined criteria, e.g., accuracy or completeness. An- facilitates the integration of insights from diverse branches
other example is tree ensemble (Forest) [18] where, instead into a cohesive flow, “flattening” it and making it easier
of a single reasoning tree, a reasoning “forest” consists of for the model to process and utilize information within a
multiple disconnected trees, which may eventually con- single, unified context.
verge at a shared solution node. This approach supports Discussion on Diversity In structure operators, there is
diverse reasoning pathways that parallelize exploration. a notion of how diverse the outcomes of the operator are.
Reasoning Strategy vs. Decoding Strategy. It is crucial to For example, when generating k new reasoning steps, one
distinguish reasoning strategies from token-level decoding may want to make the contents of these steps as different to
strategies. While decoding strategies, such as greedy search one another as possible. While different mechanisms to steer
and nucleus sampling [60], generate the internal token se- diversity exist, a typical approach is the use of the policy
quences within a reasoning step, reasoning strategies focus model temperature. We additionally propose to consider
on the higher-level process of integrating and expanding the Diverse Beam Search [144] which promotes diversity by
reasoning steps within the reasoning structure. maintaining multiple diverse candidate sequences during
10
decoding. In MCTS, there is also a distinction between ex- for more efficient assessments. Other methods such as
ploitation (expanding the structure by applying generation embedding-based verification could also potentially be har-
operators within an already established tree branch) and ex- nessed [15].
ploration (generating new branches). Here, one impacts di- Another form of evaluation employs a value estimator,
versity by manipulating the exploitation-exploration trade- which judges a given reasoning step based on its expected
off, as determined by the Upper Confidence Bound for Trees contribution to a correct final outcome. This method evalu-
(UCT) formula [73] or its variants. ates both the correctness of the step and its alignment with
the overall solution goal. Such evaluations can be performed
4.3.2 Traversal Operators through simulations, as in the original MCTS algorithm, or
Traversal operators define how the reasoning process navi- more efficiently using a learned value model [129].
gates through the existing reasoning structure. These opera- A critical aspect of evaluation is the selection of appro-
tors play a crucial role in shaping the flow of reasoning by priate metrics. For instance, in value estimation, an ideal
determining which paths to pursue. metric considers both the correctness of a reasoning step and
the extent of progress it represents toward the final solution,
• Select The Select operator determines which reasoning
ensuring a balanced assessment of its contribution.
steps to pick for further exploration, evaluation, or refine-
ment within the reasoning process. It evaluates existing 4.3.5 Discussion: Test-Time Compute
elements based on predefined criteria, such as heuris-
One of the recent trends in next-generation LLMs [95], [145]
tic scores, likelihood estimates, performance metrics or
is to shift from merely increasing model sizes to enhancing
search strategies like PUCT [117] or UCT [73], selecting the
computational strategies during inference, a concept known
most promising candidates to guide the next stages of
as the test-time compute (TTC). This approach allocates
reasoning. By balancing exploration (considering diverse
additional computational resources during a model’s ex-
alternatives) and exploitation (focusing on high-potential
ecution to improve performance, particularly in complex
paths), the selection operator optimizes resource allocation
reasoning tasks. This methodology mirrors human cognitive
and ensures efficient reasoning progression.
processes, where increased deliberation is applied to more
• Backtrack The Backtrack operator enables the model to
challenging problems.
explicitly return to a previous reasoning step and continue
Recent studies [131] indicate that optimizing test-time
along a different reasoning path. This operator supports
compute can be more effective than merely increasing
error correction, divergence handling, and hypothesis re-
model size. For instance, employing a compute-optimal
vision by abandoning unproductive directions in favor of
strategy—where computational resources are adaptively al-
alternative trajectories. The QwQ model output indicates
located based on the problem’s complexity—can enhance
that the reasoning structures used as training data in this
efficiency by over four times compared to traditional meth-
model harnessed Backtrack.
ods. Moreover, in scenarios where smaller base models
achieve moderate success rates, augmenting test-time com-
4.3.3 Update Operators
pute enables them to outperform models up to 14 times
The Update operator enhances specific parts of the rea- larger.
soning structure without altering the structure itself. A While test-time compute offers significant benefits, it
common example is the backpropagation phase in MCTS, also presents challenges, related to – among others – re-
where evaluation scores are propagated and updated along source allocation (determining the optimal amount of com-
existing reasoning steps to inform future decisions. Another putational resources for each inference task requires sophis-
form of update involves refining the content of individual ticated strategies to balance performance gains against com-
nodes or subsets of nodes, replacing their original versions putational costs), dynamic scaling (implementing adaptive
with improved iterations, such as the “enhance” thought compute strategies necessitates models capable of assessing
transformation in Graph of Thoughts [9]. problem difficulty in real-time and adjusting their computa-
tional efforts accordingly) [97], and hardware implications
4.3.4 Evaluate Operators (the shift towards increased test-time computation may
Evaluate operators take as input a segment of the reasoning influence hardware requirements, putting more pressure
structure and output a value without any modifications to on delivering specialized inference-focused hardware solu-
the structure. They are widely used with reasoning strate- tions).
gies, such as MCTS. Test-Time Compute in the Context of the Blueprint. Our
One important type of evaluation occurs when the rea- blueprint offers mechanisms to dynamically allocate com-
soning structure reaches a terminal state, allowing the full putational resources during inference to improve perfor-
reasoning sequence to be assessed against a known solu- mance, particularly for more complex problems. By lever-
tion—applicable to tasks with definitive answers, such as aging the modular structure of the blueprint, TTC can be ef-
mathematical problems. This terminality evaluation veri- fectively implemented through specific operators designed
fies whether the final step provides a correct and complete for reasoning tasks. We now provide several examples.
solution. • The Generate operator can be used to implement TTC by
One can also evaluate intermediate steps (i.e., non- dynamically increasing the number of next reasoning steps
terminal ones). This can involve estimating the reward generated for harder problems. For simpler tasks, the op-
associated with specific reasoning steps, using heuristics, erator may only generate a minimal set of continuations.
aggregated simulation outcomes, or a trained reward model However, for more complex problems, the operator can
11
be used to create a larger set of potential reasoning steps, RL-based methods such as Proximal Policy Optimization
thereby expanding the search space. (PPO) [125], Direct Preference Optimization (DPO) [115],
• The Refine operator provides another avenue for imple- and reasoning-specific variants like Reasoning Policy Op-
menting TTC by enhancing a given reasoning step multi- timization (RPO) [111]. Several training paradigms also
ple times for harder problems. In this approach, the oper- incorporate self-learning, where the model iteratively im-
ator iteratively improves the quality of a reasoning step, proves by generating and evaluating its own reasoning
addressing ambiguities, rectifying errors, or improving sequences, thereby simulating competitive or cooperative
clarity. For simpler tasks, the operator might only refine reasoning scenarios.
a step once, while for more complex reasoning, it can per-
form multiple enhancement iterations to ensure the output
meets a higher standard of precision and robustness. 4.4.2 Training Data Scope
• The Traversal operators, such as Select, enable the explo-
ration of multiple reasoning paths at test time, offering The training data for RLMs can vary significantly in terms
another key mechanism for implementing TTC [171]. By of how much of the reasoning structure it captures. We
using Select on several next reasoning steps, the model now outline two established approaches, outcome-based
can dynamically expand its search tree for more challeng- supervision (OBS) and process-based supervision (PBS).
ing problems, thereby increasing the diversity and depth More details regarding both OBS and PBS can be found in
of reasoning paths under consideration. For example, in Appendix B.1.
a complex task, the model might select multiple high- In outcome-based supervision (also known as a sparse
probability steps and explore their corresponding contin- training signal) [35], [143] each training sample consists
uations in parallel. This approach facilitates broader ex- solely of the input and the corresponding output. For exam-
ploration of the reasoning space, ensuring that promising ple, in mathematical problem-solving, a sample may include
paths are not prematurely discarded. the task statement and the final solution, labeled as correct
• To efficiently manage the expanded set of possibilities, the or incorrect. This approach is straightforward to implement,
blueprint allows integration with the Aggregate operator. and the required data is relatively easy to collect. However,
This operator evaluates the generated reasoning paths it can limit the model’s reasoning accuracy, as it provides
and selects the most promising ones based on prede- minimal insight into the intermediate steps that led to the
fined criteria, such as the likelihood of correctness or the solution [83].
quality of intermediate steps. This combination ensures
An alternative approach is process-based supervision
that while more computational resources are allocated
(also known as a dense training signal) [83], [147], where
for challenging tasks, only the most relevant paths are
a training sample reflects the entire reasoning structure. In
explored further, optimizing both accuracy and efficiency.
this case, the sample contains not only the input and final
output but also all intermediate reasoning steps, annotated
4.4 Models with labels indicating the quality of each step. This richer
Models are used to implement various types of operators. training data allows the model to learn more granular rea-
Most common are the value model (implementing the value soning patterns, improving its ability to generate accurate
evaluation operator) and the policy model (implementing and interpretable solutions by understanding the reasoning
the generate operator). process in detail. However, such data is much more chal-
Models are further categorized and discussed in detail in lenging to generate or gather [83].
Appendix B; we discuss the variants of the value model (Q OBS vs. PBS By varying the training data scope,
Value model, V Value model), we compare Process Reward developers can strike a balance between ease of data col-
and Outcome Reward models, and we formally identify lection and the depth of reasoning insights provided to the
a new variant of models, the Outcome-Driven Process model, with dense supervision generally offering improved
Reward Model. performance at the cost of increased data complexity. We
detail these, and additional aspects of ORMs and PRMs in
4.4.1 Training Paradigm Pipelines for different training phases and paradigms can be
Each model must be trained according to a specified found in Appendix B, Appendix C.2, Appendix C.3, and in
paradigm, which outlines the methodology for optimizing Algorithms 2–7.
its performance. This paradigm defines key training compo- Trace-based supervision (TBS) is a potential way to
nents such as the loss function, data generation and labeling extend PBS by incorporating detailed information about
procedures, and other critical training details. the sequence of applied operators, including traversal op-
A wide range of training schemes has been developed erators, within the reasoning structure. By capturing the
for models used in RLMs, with early foundational work full trace of how reasoning steps are generated, refined, or
stemming from advancements related to AlphaZero. These revisited, TBS would provide richer supervision that teaches
schemes have since evolved to support the complex require- the model to internalize not just the reasoning steps but also
ments of reasoning tasks within LLMs. Common training the process of navigating and manipulating the reasoning
paradigms include supervised fine-tuning (SFT), where structure itself. This approach could enable the training of
models are trained on reasoning sequences labeled with more powerful Implicit RLMs by guiding them to replicate
q-values; rejection sampling [22], [134], which involves the reasoning dynamics of explicit structures, improving
filtering generated outputs based on quality criteria; and their ability to reason flexibly and efficiently.
12
4.5 Pipelines reasoning chains from it and combining them together into
A pipeline is a detailed specification of operations that an individual long chain. This way, the scheme attempts to
orchestrates the details of the interaction between the rea- harness insights from different tree branches. By maintain-
soning scheme and the operators and models to achieve a ing a chain-based structure, Journey Learning preserves the
specific objective. Typically, an RLM would incorporate a simplicity of linear reasoning while embedding the capacity
single pipeline for inference and a separate pipeline for for self-correction and exploration of multiple hypotheses.
training each model used in an RLM. Moreover, there could Additionally, Journey Learning introduces a pipeline for
also be pipelines for synthetic data generation used for the internalization of such long reasoning chains into its
training models. One can also distinguish a pipeline that weights. This enables the final model to generate such long
trains an Implicit RLM using the provided reasoning traces reasoning chains, possibly containing different reasoning
from the Explicit RLM. branches, directly from its weights, making it an implicit
The details of pipelines depend on arbitrary design RLM.
choices. In Section 3, we provided a general description
of how these pipelines work. In Appendix C, we present 5.2 Implicit RLMs
detailed algorithmic specifications of our pipelines, along Qwens’s QwQ [140] embodies a fully implicit reasoning
with insights into the reasoning behind these design choices. Specifically, the inference pipeline can be found in Appendix C.1 and in Algorithm 1. Pipelines for different training phases and paradigms can be found in Appendix C.2, Appendix C.3, and in Algorithms 2–7. The data generation pipeline is detailed in Appendix D.

5 EXPRESSING EXISTING SCHEMES
We now showcase the expressivity of our blueprint by illustrating how it can be used to model a broad scope of existing RLMs and other related works. We summarize the outcomes of the analysis in Table 1. We start with the typical and most prevalent Explicit RLM architectures based on MCTS and policy and/or value models, where a single reasoning step is an individual logical argument (Section 5.1). There, we also discuss schemes that generalize this typical design by harnessing nesting or Linearization Structure operators. Finally, we study Implicit RLMs (Section 5.2) and various structured prompting schemes such as CoT or ToT (Section 5.3), showing that they also fit our blueprint.

5.1 Explicit RLMs
We start with the most widespread variant of RLMs, which follows the architecture outlined in Section 3.1. These reasoning models, such as TS-LLM [45], AlphaLLM [141], MCTS-DPO [155], and others [23], [52], [145], [169], [170], [174], generally employ an explicit tree structure in which a node represents a distinct reasoning step. The reasoning strategy is based on MCTS and focuses on iterative exploration, expansion, and evaluation of nodes within the tree. By incorporating value mechanisms—such as prompt-based evaluation or dedicated value models—the system identifies and prioritizes promising branches, facilitating more informed decision-making and refinement of the reasoning process. All MCTS-based reasoning models implement at least a next-step generation operator, an evaluation operator, and the update operator for back-propagating the values. In addition, ReST-MCTS*, LLaMA-Berry, and Marco-o1 support a refinement operator to further improve produced reasoning steps.

Journey Learning [113] exhibits two main differences to typical MCTS-based RLMs. First, it harnesses the Linearization Structure operator, in which the tree reasoning structure is transformed into a chain by extracting several selected

5.2 Implicit RLMs
model, characterized by an implicit reasoning structure that is generated autoregressively directly by the model weights. The reasoning strategy in QwQ – as indicated by the model output – harnesses next-step generation, backtracking, summarization, and critique generation to derive the final solution. At each step, the model implicitly generates a new node within the chain by employing one of these four implicit generate operators, presumably implemented using special tokens.

5.3 Structured Prompting Schemes
Finally, we also illustrate that advanced structured prompting schemes, such as CoT, ToT, and GoT, constitute a fully explicit RLM structure without any implicit reasoning beyond what is originally present in the used LLM, i.e., they involve no models nor training or data generation pipelines.

CoT [152] utilizes an implicit reasoning structure consisting of a chain of reasoning steps. The reasoning strategy employed in CoT is oriented towards constructing a single coherent chain of reasoning, culminating in a solitary solution, thus only needing the generation operator. CoT serves as the foundational framework for a range of advanced reasoning strategies, including prompting methodologies such as Self-Consistency and Self-Refinement, among others.

Self-Consistency (SC) [150] extends the CoT framework by introducing redundancy into the reasoning process. It generates multiple reasoning chains and employs a majority-voting mechanism to determine the most consistent solution, which implements a Select operator from our blueprint.

ToT [161] adopts an explicit reasoning structure organized in a hierarchical, tree-based format. Within this framework, each node corresponds to a distinct reasoning step, and branching facilitates exploration across multiple inferential pathways (the Generate operator). Additionally, an evaluation operator, implemented via a specialized prompt and the LLM itself, assesses branches of the tree.

GoT [9] introduces a more intricate reasoning structure by employing an explicit graph-based representation. In this framework, nodes represent individual reasoning steps, and the graph architecture supports non-linear, interdependent relationships between these steps. The reasoning strategy in GoT is orchestrated by an external controller, realized as a separate LLM, which guides the exploration, refinement and aggregation of the graph's nodes.
TABLE 1: Comparison of RLMs with respect to the provided taxonomy (Section 4 and Figure 5). "Reasoning": Details of the reasoning approach, specifically what is its Structure and its Strategy? "Reasoning Operator": Does a given scheme support operators on the reasoning structure? If yes, which classes (and specific functionalities) are supported: Structure ("Gen.": generate, "Ref.": refine, "Agg.": aggregate, "Pr.": prune, "Res.": restructure), Traversal ("Sel.": select, "BT": backtrack), Update ("Bp.": backpropagate), and Evaluation of "Inter.": intermediate steps and "Final.": final steps? "Model": Does a given scheme use models to implement its operators and, if so, which ones ("PM": policy model, "VM": value model)? "Pipeline": Which pipelines are harnessed by a given scheme ("Inf.": inference, "Tr.": training, "DG": data generation)? When describing representations, we use the following abbreviations: "E": explicit, "I": implicit, "F": fine-grained, "C": coarse-grained. Support is marked as full (YES), partial, or none (NO).
the eois token, the framework enables the explicit identification of intermediate reasoning steps, allowing for greater interpretability and precise determination of whether the reasoning process is complete or ongoing. This dual-token strategy enhances the LLM's capability to decompose complex problems into manageable substeps while ensuring the model recognizes when a solution has been reached.

7.3.2 Training the Value Model
The value model is designed to estimate the sum of the expected discounted future rewards for a sequence of reasoning steps and a newly proposed reasoning step, quantifying the value of the node modeling this step. For a given node in the MCTS tree, its value (referred to in the MCTS literature as the state-action value or q-value) is defined as the expected cumulative reward discounted by the number of steps required to achieve it. Formally, the q-value Qπ(st, at) for traversing the edge to node st+1 when taking action at from st at depth t in the MCTS tree is expressed as

Qπ(st, at) = E[ γ^(T−t) r(sT, aT) | st, at ]    (1)
           ≈ (1/N) Σ_{i=1}^{N} γ^(T−t) r(sT^(i), aT^(i))    (2)

where γ is the discount factor, T marks the last reasoning step aT that is added, resulting in the terminal state sT+1 containing the complete reasoning structure, and rewards are modeled as sparse. The terminal state sT+1 is defined as the state in which no additional reasoning steps can be added. It typically represents the state containing the final solution to the problem at hand. Accordingly, r(sT, aT) is the terminal reward. We chose to model rewards as sparse, where only the final reasoning step receives a non-zero reward, since for most reasoning tasks, only the final answer can be evaluated against the true solution. As a result, one can only obtain a reward signal when the last step is reached. We can approximate the q-value by sampling N reasoning chains until the terminal state, as in Eq. (2), and averaging the terminal rewards discounted by the depth required.

The q-value model is trained using data from completed MCTS searches. Initially, when the q-value model is unavailable, N simulations (complete rollouts) are performed, and the average discounted reward is used to initialize the q-values for each node. More information can be found in Appendix D.2.

7.4 Enabling Scalability and Efficiency
The current implementation is built to scale to multiple GPUs on multiple nodes. To further enhance the scalability and computational efficiency, several architectural and operational improvements have been implemented.

One design decision involves the decoupling of the value and policy models. The deployment of dedicated Value and Policy servers confers several advantages:
• Scalability The decoupling of Value and Policy servers from the MCTS instance facilitates scalability and the execution of multiple parallel MCTS instances.
• Batch Processing The policy server incorporates batching capabilities, allowing the concurrent processing of multiple queries, thereby enhancing throughput.
• Resource Optimization The independent allocation of computational resources to the value and policy models is inherently supported by the framework's architecture, enhancing efficient resource utilization.
• Replication and Distribution The separation of value and policy models facilitates the application of distinct replication and distribution strategies.

Figure 6 illustrates the implementation of the framework as a server architecture, demonstrating how these structural enhancements contribute to improved scalability and efficiency. Building on these architectural enhancements, we employ the following strategies to further optimize the framework's efficiency and scalability, focusing on inference and parallelization.

Fig. 6: An overview of the x1 framework is presented, highlighting its two-phase training process. In phase 1, the models are initialized, while in phase 2, the models are iteratively refined by alternating between constructing a sufficient number of MCTS trees and training the models on data derived from these trees.

In the framework, we incorporate the standard optimizations of batching, quantization, and KV caching. Inference calls are batched in the policy model, enabling simultaneous processing of multiple queries. To expedite the reasoning process, the framework creates multiple child nodes in parallel during the node expansion phase. Specifically, N new nodes are generated concurrently in each expansion step, reducing computational overhead and enhancing overall system performance. Further optimization of inference speed is achieved through KV caching and quantization. KV caching mechanisms mitigate redundant computations, while quantization techniques reduce the memory consumption of both policy and value models.

7.5 Blueprint for Efficient Scaling
Our blueprint can be deployed to AI HPC systems and clouds, as both provide the performance and resources necessary to scale RLMs. Deployment on HPC systems is straightforward: compute tasks are distributed across statically allocated nodes, connected with a low-latency and high-bandwidth interconnect, and with training data being available on a high-performance parallel filesystem. The cloud, on the other hand, provides many configurable services that offer different trade-offs between performance, cost, and reliability. There, it becomes the user's responsibility to choose the storage options and compute granularity that provide the best match for expected performance and cost. The architecture of our blueprint fits the microservice architecture, with a clear separation of compute tasks, data storage, and coordination. This architecture helps to ease the configuration process, as different components of the system can be deployed, scaled, and optimized independently. In particular, the separation of value and policy servers allows them to be scaled separately according to the complexity of reasoning steps, which might require different resource allocations to handle task generation and evaluation.

First, we outline the major decisions users must make before deploying the x1 scaling blueprint:
• Deployment Training and inference tasks are typically allocated to virtual machines and containers, with the latter typically deployed as managed services with an orchestrator such as Kubernetes. There, x1 can benefit from modern frameworks like Ray [105] that hide the complexity of managing a service in a Kubernetes cluster.
• Data Storage In the cloud, object storage provides automatic bandwidth scalability that allows scaling computations operating on the same data. To overcome latency and power constraints, data can also be placed in in-memory caches like Redis and hybrid solutions that combine disks with flash memory [172].
• Communication Requirements of the x1 blueprint differ from classical microservices, which rely on high-level abstractions like RPC and REST interfaces. RLMs must utilize high-performance network fabrics offered by modern clouds, such as InfiniBand on Azure and Elastic Fabric Adapter (EFA) on AWS, both capable of achieving throughput of 400 Gb/s [39]. These are also available to training processes distributed across many GPUs, e.g., through specializations of the NVIDIA collectives library NCCL.
• Parallelism We apply parallelism at multiple blueprint levels, including the classic data, model, and pipeline parallelism. These can be scaled horizontally across a larger number of virtual machines and containers. On the other hand, reasoning steps can benefit from elastic scaling, like in distributed MCTS and Beam Search, where each path can be explored in parallel. There, containers can be allocated on the fly to support new paths and deallocated as soon as the parallelism scale of the computation decreases.

New developments in the machine learning infrastructure can significantly impact RLM deployment strategies:
• Elastic Compute Computing tasks can be executed on ephemeral resources that trade guaranteed lifetime and reliability for lower costs, such as spot virtual machines [101]. Serverless functions provide elastic scalability with fine-grained pricing models [37], which can be a good fit for dynamically generated reasoning steps. However, serverless functions are stateless and suffer from cold starts, which requires optimization techniques dedicated to LLMs [47]. Furthermore, restricted network communication in functions forces the adoption of new communication protocols [36], [70].
• GPU Management Cloud rental of GPU devices is particularly expensive, and procuring a sufficient number of devices can be challenging, specifically when constrained to a single cloud region. Given the large compute and memory requirements of base models, space-sharing might not be feasible. On the other hand, time-sharing of GPU devices between different x1 services could be a viable alternative, but it is currently constrained by large memory allocations and the cost of swapping model checkpoints between CPU and GPU memory. To increase resource utilization, new techniques for efficient GPU checkpoint and restore are needed [47].
• Parameter-Efficient Resource Sharing Resource sharing can be further enhanced by utilizing a shared base model architecture for the policy and value models, while dynamically swapping task-specific parameter layers - such as Low-Rank Adaptation [62], prefix tuning [81], or other adapter layers - on the GPU during inference. These modular strategies keep the base model loaded in device memory and replace only the lightweight task-specific layers, eliminating redundant loading and reducing both latency and memory usage. An example of an RLM that uses a shared base model with separate additional linear layers for the policy and value model is AlphaMath [23].
• Cross-Region Deployment Cloud applications are often deployed in a single region to avoid the performance and cost of cross-region data access. However, workloads can be scheduled globally, suspended, and migrated across regions to avoid hardware resource exhaustion and achieve lower carbon emissions [33], [153].

Fig. 7: Four examples of model output with highlighted tokens indicating uncertainty levels. The outputs have been color-coded to reflect the confidence levels of the model's token predictions. Tokens are highlighted in purple when the highest probability is below 0.8 (indicating lower certainty without significant contention), in blue when the second-highest probability exceeds 0.1 (indicating contention, where another token is a close alternative), and in red when both conditions are met (indicating high uncertainty). These examples illustrate varying levels of prediction confidence and contention in reasoning steps, emphasizing regions of high ambiguity or competition between plausible continuations. This type of visual analysis is useful for identifying points in the reasoning process where the model lacks confidence or is torn between alternatives, guiding refinements in reasoning strategies and model design. It also helps pinpoint critical areas where additional supervision or context may improve model performance.

7.6 Example Analysis: Token Probability Distributions
As an illustrative example, we use the framework to directly leverage the token probability distribution, thereby facilitating the use of associated properties—such as entropy and variance—for guiding subsequent reasoning decisions. By focusing
on these probabilistic characteristics, the framework can help identify when to expand a given reasoning step. Token probability distributions can be used for navigating the reasoning based on both coarse and fine steps. To support this analysis, the x1 implementation includes scripts that provide insights into token-level metrics, such as entropy fluctuations and distribution patterns, to inform reasoning strategies.

7.6.1 Relevance of Token Probability Distribution
The token probability distribution provides critical information about the likelihood of different next-step candidates in a reasoning process. By examining this distribution, we can gain insight into how certain tokens dominate or diversify the reasoning space, and in turn, guide more informed decisions about which step to take next.

We now list a few scenarios where different token distributions offer insights into which reasoning decision is best to take at a given step.

• Flat Token Distribution. A flat probability distribution occurs when all tokens have roughly equal probabilities. In this scenario, there is significant uncertainty about which step is the best to choose because no single token stands out as a clear candidate. This can make the reasoning process more exploratory, as the model may need to consider multiple tokens equally and rely on additional strategies—such as external heuristics or learned policies—to identify the most promising step. While this can foster exploration, it may also lead to inefficiencies since the model might need to evaluate many equally plausible paths before finding an optimal solution. Another decision that could be taken in such a scenario is to delay initiating a reasoning step until the token distribution becomes more skewed.
• Skewed Distribution with One Dominant Token. When one token has a much higher probability than others, the distribution is highly skewed. This often signals that the model is confident about the next step in the reasoning process. If the dominant token corresponds to a logical or well-supported continuation, this confidence can streamline decision-making and reduce computational overhead. However, if the model's confidence is misplaced—perhaps due to biases in the training data or a lack of context—relying on a single dominant token may cause the reasoning process to follow a suboptimal path. In such cases, it is crucial to assess whether the high-probability token genuinely represents the most logical next step or if additional validation is needed.
• Skewed Distribution with Multiple High-Probability Tokens. In some cases, the distribution may be skewed with a small set of tokens receiving much higher probabilities than others. This indicates that the model sees several plausible continuations, each with a reasonable chance of being correct. While this is generally a positive sign—offering a diversity of credible options—it also complicates the decision-making process. The reasoning strategy must weigh the trade-offs between these top candidates, considering not only their individual probabilities but also how each choice impacts the subsequent reasoning trajectory. This scenario highlights the need for effective evaluation metrics (like entropy or the Gini coefficient) to help select the step that contributes most to reaching the correct or desired outcome.

By analyzing the token probability distribution and identifying the cases above and others, reasoning strategies can, for example, improve efficiency (identifying when a distribution is flat allows the reasoning algorithm to focus on diversification or introduce additional constraints to narrow down choices), enhance decision confidence (recognizing when one token is dominant can help expedite decisions, provided the model's confidence is well-founded), or foster balanced exploration (detecting multiple high-probability tokens facilitates exploring various credible paths without being overly committed to a single option).

7.6.2 Analyzing Token Probability Distribution
To understand the form of a token probability distribution, we examine variance, entropy, VarEntropy, and the Gini
coefficient as key metrics that offer distinct perspectives on In Figures 8a and 8d, specific regions emerge where the
the distribution’s shape and characteristics. top two probabilities are very close, while the remaining
Variance provides a broad measure of uncertainty by probabilities are significantly smaller. Such regions likely
reflecting how spread out the probabilities are across the vo- indicate scenarios where forking the reasoning process (e.g.,
cabulary. When variance is low, the probabilities are nearly exploring multiple paths) could disproportionately benefit
uniform, indicating a flat distribution. However, variance future outcomes, as the competing high-probability tokens
alone does not capture the specific structure or shape of the suggest alternative plausible continuations. Conversely, in
distribution. For example, two distributions can have the instances where the first probability is notably high, with
same variance but differ in their overall form, such as one much lower second and remaining probabilities, the model
having multiple minor peaks versus another being nearly exhibits strong confidence in a single continuation. These
uniform with a single dominant token. To address this, we cases are conducive to more deterministic reasoning, as
consider further measures below. forking may be unnecessary.
Entropy has long been a standard measure of uncer- Additionally, regions with a relatively high sum of the re-
tainty and information content in a probability distribu- maining probabilities (close to the top two) highlight flatter
tion. Higher entropy corresponds to greater unpredictabil- distributions with high uncertainty. These scenarios signal
ity—requiring more information to describe the system’s a need for cautious reasoning, where clarification or addi-
state. For instance, if all tokens have nearly equal proba- tional contextual refinement may help reduce ambiguity. For
bilities, the entropy is high, reflecting a flat distribution. In instance, such uncertainty may suggest that the model has
contrast, low entropy occurs when a small number of tokens not yet committed to a specific path and could benefit from
dominate, resulting in a skewed distribution.
P The entropy revisiting earlier reasoning steps to address potential errors
of a distribution is given by H = − i pi log2 (pi ), where or misalignments.
pi is the probability of the i-th token. This metric provides Figure 9 further analyzes these results using metrics such
valuable insight into whether the distribution is diffuse and as variance, entropy, VarEntropy, and the Gini coefficient. In
exploratory or concentrated and decisive. Figure 9a, a zero-shot prompt demonstrates lower uncer-
VarEntropy extends this analysis by measuring the vari- tainty overall, suggesting that it yields more confident pre-
ability of entropy itself, thus offering a dynamic view of how dictions and potentially higher-quality outputs. However,
uncertainty changes. A high VarEntropy combined with low the presence of specific high-probability tokens (e.g., “472”)
entropy often indicates a sharp, focused distribution with a raises concerns about potential data leakage into the training
few dominant outcomes. Conversely, low VarEntropy and set or the tokenizer, which could bias the results. Another
high entropy typically reflect a flat, uniform distribution notable observation is the high uncertainty associated with
where
P no single token stands out. The VarEntropy is defined <thought>tokens, which appear challenging for the model
2
as i pi (| log(pi )| − |H|) . This metric captures the nu- to predict accurately. This highlights the complexity intro-
anced shifts in distribution shape, helping to pinpoint how duced by token granularity, where most words correspond
tightly probabilities cluster around certain tokens versus to single tokens, resulting in a roughly even distribution for
how broadly they spread. the next token across the vocabulary in some contexts.
The Gini Coefficient, traditionally used to measure in- The uncertainty metrics provide actionable insights for
equality, provides another lens on the form of the distribu- reasoning strategy design. For example, cases with high
tion. A perfectly equal distribution has a Gini coefficient of VarEntropy and low entropy indicate a distribution where a
0, signifying that all tokens have identical probabilities. A few outcomes dominate, making tree-based search strate-
Gini coefficient closer to 1 indicates high inequality, where a gies effective. These strategies prioritize exploring high-
few tokens hold most of the probability mass. By visualizing probability outcomes while avoiding unnecessary evalua-
the cumulative distribution of sorted probabilities, the Gini tions of less probable branches. In contrast, low VarEntropy
coefficient highlights how the probability is concentrated or and high entropy reflect a flat distribution where no clear
dispersed. outcome dominates. Such cases could benefit from clarifica-
Together, these metrics—variance, entropy, VarEntropy, tion mechanisms or intermediate step refinements to reduce
and Gini—enable a detailed examination of token prob- ambiguity before proceeding further.
ability distributions. By leveraging each metric’s unique Interestingly, the Gini coefficient often highlights critical
strengths, we can effectively characterize whether a distri- regions more effectively than other metrics. In vital reason-
bution is flat, skewed with a dominant token, or skewed ing areas, it captures the inequality in token probabilities,
across several highly probable tokens, ultimately guiding helping to identify tokens that significantly influence the
more informed decisions in reasoning and model develop- reasoning process. This contrasts with metrics like entropy
ment. and VarEntropy, which may also flag tokens related to
formatting or stylistic choices, providing less task-specific
7.6.3 Example Results utility.
Figure 7 and 8 illustrate example model outputs and their Overall, these visualizations and metrics emphasize the
respective token probability distributions. By analyzing the importance of analyzing token probability distributions to
highest probabilities, the second-highest probabilities, and design effective reasoning strategies. By leveraging the nu-
the sum of the remaining probabilities, we gain valuable anced patterns revealed by these metrics, models can better
insights into the underlying token distribution, which can adapt to uncertainty, balance exploration and exploitation,
subsequently be quantified through the uncertainty metrics and optimize decision-making during the reasoning pro-
discussed earlier. cess.
(a) To evaluate ⌈√20⌉², we need to follow the order of operations (PEMDAS): 1. Evaluate the square root of 20: √20 ≈ 4.472. 2. Round up to the nearest integer using the ceiling function: ⌈4.472⌉ = 5. 3. Square the result: 5² = 25. Therefore, the final answer is 25.
(b) <thought>First, we need to find the square root of 20, so let's use a calculator to get an approximate value of √20 ≈ 4.472.</thought> <thought>Since we're dealing with the ceiling function, we should consider the smallest integer greater than or equal to the square root of 20, which is the ceiling of √20.</thought> <thought>We can also consider the perfect squares that are closest to 20, such as 16 and 25, to estimate the value of √20 and then apply the ceiling function.</thought>
(c) <thought>First, we need to find the square root of 20, which is approximately 4.472, so we can round up to the nearest whole number using the ceiling function, which gives us 5.</thought> <thought>We can evaluate the expression ⌈√20⌉² by first finding the square root of 20, then rounding up to the nearest whole number, and finally squaring the result.</thought> <thought>Since ⌈x⌉ is the ceiling function, we can rewrite the expression as ⌈√20⌉² = ⌈4.472⌉² = 5² = 25.</thought>
Fig. 8: Probabilities of the first 64 tokens of example model outputs (a)–(c). We show the two highest probabilities (Highest Probability, 2nd Highest Probability) as well as the sum of the other probabilities (Sum of the Rest).
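The quantities plotted in Fig. 8 are straightforward to recover from a model's per-position probabilities; the sketch below (our illustration, assuming `probs` is a [sequence_length, vocab_size] numpy array of softmax probabilities) extracts the highest probability, the second-highest probability, and the sum of the rest for each position.

# Sketch: per-position quantities shown in Fig. 8, computed from next-token probabilities.
import numpy as np

def top2_and_rest(probs):
    """probs: array of shape [sequence_length, vocab_size]; returns three 1-D arrays."""
    probs = np.asarray(probs, dtype=np.float64)
    # Two largest probabilities per position without a full sort
    top2 = -np.partition(-probs, 1, axis=-1)[:, :2]
    highest, second = top2[:, 0], top2[:, 1]
    rest = probs.sum(axis=-1) - highest - second   # "Sum of the Rest"
    return highest, second, rest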
Fig. 9: Uncertainty metrics (variance, entropy, VarEntropy, and the Gini coefficient) plotted against the first 64 tokens of the output token sequence, for the same example model outputs (a)–(c) as in Fig. 8.
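The regime analysis of Section 7.6.3 can be turned into a simple dispatch rule over these metrics; the sketch below is purely illustrative (the thresholds and the fallback branch are our assumptions, not values prescribed by the paper).

# Hedged sketch of the metric-based regime classification suggested in Sec. 7.6.3.
def classify_regime(entropy, varentropy, high=3.0, low=1.0):
    """Thresholds are illustrative placeholders and should be calibrated per model."""
    if varentropy > high and entropy < low:
        return "few dominant outcomes: favor tree-based search over the top candidates"
    if varentropy < low and entropy > high:
        return "flat distribution: clarify or refine intermediate steps before expanding"
    return "mixed regime: defer to value-model guidance"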
mathematical problem-solving by language models were achieved by training on the training subset of this benchmark.

GSM Symbolic [103] introduces a generator that can use 100 templated questions, which are derived from the questions of the GSM8K dataset. This approach emphasizes the limited generalization capabilities of current RLMs and highlights the importance of templated benchmarks in evaluating LLMs' performance in mathematical reasoning.

The MATH [59] benchmark contains questions ranging in difficulty from high school to competition-level mathematics, containing 12,500 problems, split into 7,500 for training and 5,000 for testing. These problems are sourced from various mathematics competitions such as the AMC 10, AMC 12, and AIME (Level 5).

Functional MATH [133] builds upon the MATH dataset by introducing templated problem formats designed to assess the functional understanding of mathematical concepts by LLMs. However, the code and templates remain inaccessible to the public, limiting its broader adoption.

AIME [4], AMC [3], and GaoKao [82] feature mathematical tasks ranging from Olympiad level to college entrance level difficulty. The AMC is generally easier, the GaoKao offers a broader range of difficulty levels, while the AIME is likely the most challenging. AIME consists of 30 problems, the AMC includes 40 problems, and the GaoKao contains around 300 questions.

OlympiadBench [56] is a more advanced benchmark that spans Olympiad-level mathematics and physics problems, comprising 8,476 problems sourced from international and Chinese Olympiad competitions, as well as the Chinese College Entrance Exam (GaoKao).

CollegeMATH [139] is designed for evaluating college-level mathematics, with a dataset that contains 1,281 training problems and 2,818 test problems. These problems are sourced from textbooks, extracted with the help of LLMs.

The U-MATH [31] benchmark features 880 university-level test problems without images, sourced from ongoing courses across various institutions and currently available through the Gradarius platform. This benchmark presents unpublished, open-ended problems balanced across six core subjects.

FrontierMath [49] is an expert-level benchmark containing exceptionally challenging mathematics problems covering a wide array of modern mathematical domains. The dataset size remains undisclosed, but the problems have been carefully crafted and tested by expert mathematicians. Notably, current state-of-the-art models can solve less than 2% of the problems, revealing a still significant gap between AI capabilities and human expertise in the field of mathematics.

In general, it is recommended to utilize templated versions of these benchmarks where available, rather than relying solely on question-answer (QA) pairs. Templated benchmarks minimize the likelihood of contamination from prior exposure during model training, thus providing a more accurate measure of performance [103], [133].

Other related benchmarks include MATH-401 [164], MultiArith [118], AddSub [61], CHAMP [98], MathQA [5], ARB [123], FIMO [85], Geometry3K [88], GeoQA [26], UniGeo [24], miniF2F [175], LeanDojo [159], TheoremQA-MATH [29], TRIGO [157], LISA [69], MathVista [87], ChartQA [99], TABMWP [89], MultiHiertt [173], and SCIBENCH [148].

9.2 Logical Reasoning
Logical reasoning emphasizes formal processes, from propositional and predicate logic to automated theorem proving.

PrOntoQA [121] generates ontology graphs, similar to causality graphs, which do not necessarily reflect natural patterns. From these graphs, it constructs statements and poses questions that necessitate logical reasoning for resolution. Due to the abstract and artificial nature of some ontology graphs, models must focus more on step-by-step logical reasoning rather than relying on commonsense inference to derive correct conclusions.

BIG-Bench [132] is one of the most extensive benchmarks for reasoning tasks, encompassing over 200 tasks, each potentially comprising numerous questions. It encompasses a broad range of domains and employs templated
question formats, enabling a systematic evaluation of reasoning capabilities across diverse contexts.

ARC Challenge [32] assesses the ability to understand formal patterns, rules, and transformations within structured, grid-based environments. Tasks focus on identifying logical structures such as conditional relationships and sequences. For instance, deducing transformations between grids based on abstract rules exemplifies the application of formal logical reasoning paradigms.

Other benchmarks include ProofWriter [137], FOLIO [54], WANLI [84], CLUTRR [130], Adversarial NLI [106], AbductionRules [162], and Adversarial ARCT [107].

9.3 Coding
There also exist benchmarks related to how well a given model can code. These include ODEX [151], SWE-bench [71], DS-1000 [76], APPS [57], MBPP [6], and HumanEval [27].

challenging for experts from unrelated domains. The diamond subset contains 198 samples.

MMLU (STEM) [58] incorporates questions across a spectrum of difficulty, ranging from general commonsense reasoning to highly specialized domain knowledge.

Other related benchmarks include Social IQa [120], SWAG [165], HellaSWAG [166], CommonsenseQA [138], PIQA [19], PHYRE [7], OpenBookQA [102], CConS [74], WinoGrande [119], and FactCC [75].

9.6 Reasoning Utilities
Benchmarking RLM capabilities related to reasoning utilities involves testing how an RLM acts as an agent. This includes benchmarks such as GAIA [66], WebArena [177], Mind2Web [41], WebShop [160], ALFWorld [126], AgentBench [86], AgentGym [154], and AgentBoard [21]. Another line of related benchmarks tests the RAG capabilities [25], [44], [93], [156].
10.2 Explicit Reasoning Models
The following works explore techniques that extend beyond basic mechanisms applied during pre-training or inference. These methods involve additional computation to iteratively refine reasoning paths, often increasing computational demands during training and/or inference.

Dong et al. [43] provide a taxonomy and survey of inference-time self-improvement methods, including independent, context-aware, and model-aided approaches. Guan et al. [51] propose verifier engineering, a post-training paradigm for foundation models involving three stages: Search, Verify, and Feedback, to enhance model outputs with scalable supervision signals. Zeng et al. [167] provide a comprehensive roadmap for reproducing OpenAI's o1 reasoning model from a reinforcement learning perspective. Although the work thoroughly examines all core components (policy initialization, reward design, search, and learning), no implementation is provided. Various specific implementations of RLMs exist; we provide a summary in Table 1. There are also other works related to Explicit RLMs, considering both coarse reasoning steps [149], [155] and fine reasoning steps [40], [149], [155].

Our blueprint provides a more foundational and universally applicable framework for RLMs. We further supplement the theoretical and algorithmic overview with a modular and scalable implementation to enable practical development and experimentation.

11 CONCLUSION
This work introduces a comprehensive blueprint for reasoning language models (RLMs), providing a flexible and modular toolbox that demystifies the intricate design and operation of these advanced systems. By encompassing diverse reasoning structures, operations, and training schemes, the blueprint establishes a robust foundation for constructing, analyzing, and extending RLMs tailored to various applications. The accompanying x1 implementation enhances this contribution, offering a modular, minimalist, and user-friendly platform for experimentation and rapid prototyping of novel RLM architectures.

Our blueprint and x1 pave the way for several exciting avenues of future research and development in reasoning AI. One example is Trace-Based Supervision (TBS), which extends process-based supervision by incorporating labeled traces of traversal through reasoning structures. TBS has the potential to train more powerful implicit RLMs capable of internalizing reasoning structures and improving generalization.

The work also explores new directions in value and reward modeling, introducing a hierarchy of models and formally identifying several recent designs as instances of a new class of models, namely the Outcome-Driven Process Reward Model. This model class bridges the gap between outcome-based evaluation and process-based supervision by dynamically connecting intermediate reasoning steps to terminal outcomes, enabling more granular feedback during training without the need for explicit step-level annotations.

Additionally, the blueprint's extensive set of operators can inspire the development of innovative reasoning strategies, such as advanced tree-based searches, multi-step refinement processes, or hybrid search algorithms that adapt dynamically to the task's complexity. These strategies can be tailored using the token probability distribution analysis tools provided, leading to more effective generation strategies that optimize reasoning steps through probabilistic insights. The blueprint also provides a foundation for developing nested architectures where reasoning structures such as trees and graphs are embedded hierarchically. These designs can address multi-layered reasoning tasks, expanding the scope of RLM applications to domains requiring deep, structured reasoning processes.

Scalability remains a key focus of this work. The blueprint's modular design supports future scalable cloud deployments that enable efficient distribution of compute-intensive tasks across cloud infrastructures. These deployments will not only enhance scalability but also optimize cost and resource utilization, making RLMs more accessible for real-world applications.

By exploring and integrating these ideas, this work aims to empower the next generation of reasoning language models, democratize access to advanced reasoning capabilities, and foster innovation across research and industry. The blueprint's versatility, combined with the x1 platform, will contribute to further progress in RLM research and applications.

ACKNOWLEDGEMENTS
We thank Nicolas Dickenmann for writing the initial MCTS codebase. We thank Hussein Harake, Colin McMurtrie, Mark Klein, Angelo Mangili, and the whole CSCS team for granting access to the Ault, Piz Daint, and Alps machines, and for their excellent technical support. We thank Timo Schneider for help with infrastructure at SPCL. This project received funding from the European Research Council (Project PSAP, No. 101002047) and the European High-Performance Computing Joint Undertaking (JU) under grant agreement No. 955513 (MAELSTROM). This project received funding from the European Union's HE research and innovation programme under grant agreement No. 101070141 (Project GLACIATION). We gratefully acknowledge the Polish high-performance computing infrastructure PLGrid (HPC Center: ACK Cyfronet AGH) for providing computer facilities and support within computational grant no. PLG/2024/017103.
A.1 Markov Decision Process
A Markov Decision Process (MDP) is defined as a 5-tuple M = (S, A, p, r, γ), where S is the state space, A is the action space with As ⊆ A denoting the set of actions which can be taken in the state s, p represents the dynamics of transitions between states, i.e., p : S × A × S → [0, 1] where p(s, a, s′) is the probability of transitioning to the state s′ when action a was selected in the state s, r : S × A × S → R is the reward function, i.e., r(s, a, s′) represents the reward for arriving in the state s′ after selecting the action a in the state s, and γ ∈ [0, 1] is a discount factor.

A.1.1 Solving an MDP
Before stating what it means formally to solve an MDP, we first need several definitions.

A trajectory τπ = (s0, a0, . . . , sT, aT, sT+1) is a sequence of interleaved states and actions, selected according to the policy π (see below for the policy definition). Each trajectory starts at an initial state s0 ∈ S and ends with sT+1 ∈ S, which represents the terminal state where no further actions can be taken.

A policy π(s) is a function assigning a probability distribution over the action space to a given state s; π : S → ∆(A), where ∆(A) is the set of probability distributions over the action space A. The expression π(a | s) denotes the probability of selecting the action a in the state s according to the policy π.

The state value function Vπ(st) represents the expected cumulative future reward for a given state st under policy π:

Vπ(st) = E[ Σ_{k=t}^{T} γ^(k−t) r(sk, ak, sk+1) | st ]    (3)

where T is a predefined time horizon. Note that, in order to obtain the state sk+1, an action ak is first derived by sampling from the distribution π(sk). Once the action ak is chosen, the environment dynamics p(sk+1 | sk, ak) determine the probability distribution of the next state sk+1.

The goal of solving an MDP is to find a policy π* which maximizes the value function as defined above for all states s ∈ S: π* = arg max_π Vπ(s).

Each t^j_i is a token from the RLM's vocabulary, and the total number of tokens per reasoning step Mi can vary. One can use a special token t^{Mi} = t_end to indicate the end of the reasoning step. Typically, the initial query q is used as the first reasoning step z0 = q. In the study of RLMs, an action a ∈ As usually represents appending a new reasoning step z(a) to the current state s = (z0, . . . , zn), resulting in a new state s′ = (z0, . . . , zn, z(a)). Since every action a is uniquely associated with exactly one reasoning step z(a), for every s = (z0, . . . , zn) and s′ = (z0, . . . , zn, zn+1) we have

p(s, a, s′) = 1 if zn+1 = z(a), and 0 if zn+1 ≠ z(a).

The definition of the reward function depends on the specific task. A reward commonly seen in reasoning tasks assigns non-zero reward only in the terminal states and hence only at the final reasoning step. This approach reflects the fact that for most tasks, only the final answer can be evaluated against the ground-truth solution to the original query. We call such reward functions sparse to clearly distinguish them from other settings in which intermediate rewards can be observed by the algorithm in the non-terminal states. The discount factor γ determines how future rewards influence the current decision-making process. A higher discount factor (γ → 1) places greater emphasis on long-term reasoning success, allowing the model to generate long reasoning sequences, while a lower discount factor prioritizes immediate rewards, incentivizing faster progress and shorter reasoning sequences.

In the RLM setting, a trajectory τπ = (s0, a0, . . . , sT, aT, sT+1) represents the progression of states st and actions at ending with a terminal state sT+1 in which no further reasoning steps can be added. The final reasoning step contains the RLM's answer to the original query.

The policy π(a | s) in the context of RLMs defines the probability of selecting an action a that corresponds to appending a reasoning step z(a) to the current reasoning sequence represented by the state s. Since there exists a bijective mapping f : A → Z between the action space A and the reasoning step space Z, the probability distributions can be equated using the change of variables. Formally:

π(a | s) = π(z | s), where z = f(a).
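To make the RLM instantiation of this MDP concrete, the following minimal Python sketch (our illustration; the class and function names are ours, not identifiers from the x1 codebase) encodes states as step sequences, deterministic append transitions, and a sparse terminal reward.

# Sketch of the RLM-specific MDP described above: a state is a sequence of reasoning steps,
# an action appends one step, the transition is deterministic, and rewards are sparse.
from dataclasses import dataclass

@dataclass(frozen=True)
class State:
    steps: tuple          # (z0, ..., zn); z0 is the original query q
    terminal: bool = False

def transition(state: State, new_step: str, is_final: bool) -> State:
    """Deterministic transition: appending reasoning step z(a) yields s' = (z0, ..., zn, z(a))."""
    return State(steps=state.steps + (new_step,), terminal=is_final)

def sparse_reward(state: State, answer_is_correct: bool) -> float:
    """Sparse reward: non-zero only in the terminal state (here +1 / -1, as used later in C.1.1)."""
    if not state.terminal:
        return 0.0
    return 1.0 if answer_is_correct else -1.0

def discounted_return(rewards, gamma=0.95):
    """sum_k gamma^k * r_k, matching the value-function definition in Eq. (3)."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))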
TABLE 2: Overview of mathematical notation used in the paper
Symbol: Description
M = (S, A, p, r, γ): Markov Decision Process (MDP) definition.
s ∈ S: A state in the state space, representing a sequence of reasoning steps.
a ∈ A: An action in the action space, corresponding to selecting the next reasoning step.
As ⊆ A: The set of actions available in state s.
p(s′ | s, a): The probability of transitioning to state s′ when taking action a in state s.
r(s): The reward received when arriving in state s.
γ ∈ [0, 1]: Discount factor, determining the present value of future rewards.
πθ(a | s): Policy parameterized by θ, representing the probability of taking action a in state s.
Vπ(s): Value function under policy π, representing the expected return starting from state s.
Qπ(s, a): State-action value function under policy π, representing the expected return of taking action a in state s.
τπ: A trajectory consisting of states and actions, (s0, a0, s1, . . . , sT+1), following policy π.
Based on the definition of the reasoning step and applying the chain rule, we can then rewrite the policy as:

π(z_{t+1} | st) = ∏_{j=0}^{M_{t+1}} π(t^j_{t+1} | st, z^0_{t+1}, . . . , z^{j−1}_{t+1}),

In the RLM setting, the state value function V(st) assesses the expected cumulative reward of a partial reasoning sequence st, estimating its overall potential to lead to a successful solution. The state-action value function Q(st, at) extends this by quantifying the expected cumulative reward for taking a specific action at (e.g., appending a reasoning step z_{t+1}) to the current state st and then following the policy π. It incorporates both the immediate reward for appending the reasoning step and the anticipated future rewards from completing the reasoning sequence. Together, these functions inform and guide the policy π to prioritize actions that maximize the expected cumulative reward. By leveraging V(st) or Q(st, at), the policy can be trained to select reasoning steps that progress toward correct and complete solutions, transforming an LLM into an RLM.

1) Selection - a leaf node in the current tree is selected for expanding its child (children).
2) Expansion - if the selected node does not correspond to a terminal state, it is expanded by taking an action (or multiple actions) in the underlying MDP and by adding the resulting state (states) to the tree as children of the current node. A trajectory unroll is performed for every added node to obtain a reward. "Unroll" refers to simulating a sequence of steps from a newly added node in the tree down to a terminal state. This simulated trajectory represents a hypothetical path the system might take if it continued from the current node. Once the simulation reaches a terminal state, a reward value is calculated based on the outcome of that path.
3) Backpropagation - update the value estimates and the visit counts for the selected node and all its ancestors based on the obtained reward.

The MCTS algorithm finishes when a stop criterion, such as the number of iterations, the predefined computational budget, or a convergence criterion, is met.
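The three phases above compose into the familiar MCTS loop; the schematic sketch below (our own, with hypothetical select_leaf, expand, unroll, and backpropagate helpers, not the x1 implementation) shows how they repeat until a budget is exhausted.

# Schematic MCTS loop assembling the three phases above (hypothetical helper functions).
def mcts(root, select_leaf, expand, unroll, backpropagate, n_iterations=100):
    for _ in range(n_iterations):                 # stop criterion: fixed iteration budget
        leaf = select_leaf(root)                  # 1) Selection
        if not leaf.terminal:
            children = expand(leaf)               # 2) Expansion: add one or more children
            for child in children:
                reward = unroll(child)            # simulate down to a terminal state
                backpropagate(child, reward)      # 3) Backpropagation along the ancestors
        else:
            backpropagate(leaf, leaf.reward)
    # Return the child of the root with the highest estimated value
    return max(root.children, key=lambda c: c.value)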
for evaluating intermediate steps for several reasons. First, the training data and objective are inherently misaligned with step-wise evaluation, as they focus exclusively on final outcomes. Second, ORM evaluations tend to be overly pessimistic for intermediate steps since a subsequent erroneous step can obscure the correctness of earlier steps. This observation aligns with Havrilla et al. [55], who noted that ORMs often underestimate the solvability of a problem from

still oversimplify complex dependencies within reasoning chains.

[Figure: comparison of Outcome-Based Models, Process-Based Models, and Outcome-Driven Process-Based Models; intermediate steps are scored with a V-Value Model, and correctness labels ("is_correct?") come from a human or a model, or are not available.]

B.4 Evaluation Schemes
We also provide additional categorizations and details regarding overall evaluation.
transformed into rankings if needed, providing flexibility across various applications.

In addition to numerical evaluations, there are text-based evaluations, which are commonly used to provide detailed feedback and guidance for refining reasoning steps. Examples include "LLM-as-a-Judge" [176] (which uses a larger LLM to provide a pairwise comparison or a single graded answer with an explanation) and self-critique approaches [122] that allow models to reflect on and evaluate their own reasoning. These textual evaluations, often including rationales, are particularly useful for structural transformations rather than numerical guidance, enhancing interpretability by offering context and detail.

B.4.2 Evaluation of Reasoning Steps
Step-wise evaluations are vital for integrating reasoning into MCTS. Numerical evaluations—whether relative or absolute—provide straightforward metrics to compare nodes and steer exploitation and exploration. Text-based evaluations, in contrast, are better suited for guiding structural refinements rather than directly influencing search paths.

Given that reasoning steps are typically textual sequences, language models are a natural fit for such evaluation tasks. LLM-based approaches can involve external model approaches, where a dedicated value model is trained to predict scores, or internal model approaches, which leverage existing policy models.

External model approaches include value models that predict scalar reward signals (reward models) [34], [83], [143], reinforcement learning values like state-values (V-value models) [128], state-action values (q-value models), or pairwise models like the Bradley-Terry and PairRM frameworks. A more detailed comparison of reward models, V-value models, and q-value models can be found in Appendix B.3.2.

There exists a large range of internal model approaches as substitutes for value models. They typically rely on methods like prompting the policy to output scores. Examples include MCT Self-Refine (MCTSr) [168], querying for binary feedback (e.g., "Is the answer correct? Answer 'yes' or 'no'") [171], and evaluating the probability of the output, leveraging uncertainty metrics such as token entropy or aggregated probabilities [174], and others [170].

Heuristics may also serve as substitutes for evaluations in resource-constrained scenarios.

Simulating reasoning steps to terminal states for evaluation against golden answers is another option, as done for example in MCTS, though often computationally prohibitive.

External tools provide an alternative path for evaluation, especially in domain-specific tasks. For programming, compilers can supervise tasks, as seen in Codex [27], self-debugging [30], and similar methods. Program-of-Thought [28] and Program-aided-Language (PAL) [48] use a formal language and Python interpreters to evaluate solutions. In mathematical tasks, ensemble approaches like MathPrompter [68] generate multiple algebraic expressions or Python functions to validate steps. These tool-based approaches excel at detecting errors due to their reliance on precise domain-specific rules, such as compilers for programming or interpreters for mathematics. While their applicability is limited to well-defined domains, they provide objective and verifiable feedback that complements language models. By injecting precise knowledge into the evaluation process, external tools mitigate model-specific limitations like hallucinations and offer actionable feedback for iterative refinement. This hybrid approach enhances reliability and ensures that the evaluation benefits from both the flexibility of language models and the precision of formal systems.

APPENDIX C
ALGORITHMIC DESCRIPTIONS
C.1 Reasoning with Monte Carlo Tree Search
C.1.1 Setup and Notation
We will now present the details of the training pipeline of x1.

MDP Design x1 assumes the MDP following the definition presented in Appendix A.1, with γ values in [0.95, 1] to avoid over-penalizing long reasoning sequences. In the RLM setup, the state space and action space of the underlying MDP constitute a tree in which every state s other than the starting state s0 has exactly one action leading to it. This allows us to simplify the notation by omitting actions wherever it is clear from the context that we are referring to the only action leading to a given state. For every action a leading from the state s to the state s′ we will write:
π(s′ | s) := π(a_{s′} | s)
r(s′) := r(s, a, s′)
q(s′) := q(s, a)
τ := (s0, s1, . . . , sT+1)

The final reasoning step in the terminal state contains the RLM's answer to the original query. The final answer is compared to the ground truth solution, commonly referred to as the golden answer. This matches the common setup in many reasoning tasks and math problems, where no ground truth and no reward source is available for the intermediate reasoning steps.

Consider a trajectory τ := (s0, s1, . . . , sT+1). We assign a reward of r(sT+1) = 1 if the last reasoning step in the final state sT+1 contains the correct answer and r(sT+1) = −1 otherwise. The state value function simplifies to

Vπ(st) = Eπ[ γ^(T−t) r(sT+1) ] ∈ [−1, 1]    (4)

and the state-action value function can be rewritten as:

Qπ(st) = r(sT+1) if t = T + 1, and γ Vπ(st+1) otherwise; in both cases Qπ(st) ∈ [−1, 1]    (5)

hence both the value and the state-action value functions are bounded between −1 and 1 for all states and state-action pairs.

MCTS Design We define the MCTS tree as in Appendix A.2 as T = (N, E), where N is a set of nodes and E is the set of edges. We use the notation of a node-edge-node relationship denoted by (s, a′, s′), where s represents the origin node, a′ describes the action corresponding to an edge, and s′ denotes the target node. This notation symbolically ties the action and the target state together, as the action uniquely
identifies the target state and is therefore indicative of it.

The policy model We use a pretrained LM with parameters θ as a policy model and denote it πθ. The model autoregressively generates a sequence of tokens. We use a special token 'End of Intermediate Step' (eois) to indicate the end of the reasoning step. We use a standard end-of-sequence (eos) token to indicate the end of the final reasoning step concluding the reasoning trajectory.

The value model A parametric value model is used to evaluate the quality of states. While MCTS traditionally approximates these values through extensive simulations, such an approach is computationally expensive and impractical in the RLM context. Inspired by AlphaZero [128], which replaces simulations with a parameterized value model, we estimate state-action values (short: q-values) for reasoning sequences using a value model — effectively employing a process-based q-value model Qφ (see Appendix B.3). The value model is instantiated as a pretrained transformer-based LM, modified by adding three linear layers and a shifted, rescaled sigmoid activation to align the output domain to the state-action value function domain [−1, 1] (see Eq. 5). This setup proved more stable than alternatives, such as a tanh activation or a cropped linear layer. We will show in the following how such a model can be trained and provide a description of the data generation process in Appendix D. During training, we assume access to a final answer verifier, which evaluates the correctness of the model's final answer and provides the true reward.

C.1.2 MCTS Algorithm
We now present the algorithmic steps of a Monte Carlo Tree Search variant similar to AlphaZero as implemented in the x1 reasoning framework. The MCTS search operates in two distinct modes: training and inference. The core difference is that, during training, a final answer verifier evaluates and scores the final reasoning steps, providing a true reward signal that is backpropagated through the MCTS tree. This reward serves as a reliable learning signal for the value model Qφ. During inference, however, the verifier is unavailable, and decisions rely solely on the value model.

Notation. We chose to store all values in nodes instead of edges, which defines the following set of statistics saved for each node s:
• N(s) - the visit count of node s,
• q(s) - the running estimate of the q-value of the transition leading to state s,
• β(s) - the binary terminality function, which returns 1 if the node s is terminal and 0 otherwise.

Selection. The selection phase iteratively identifies the most promising child node with a selection policy. We use the following selection policy, which is the node-based variant of the PUCT algorithm in AlphaZero [129] (which is defined on edge-based values) without a prior, for selecting a child of s:

arg max_{sc ∈ C(s)} [ q(sc) + ( √(N(s) − 1) / (1 + N(sc)) ) · ( c1 + log( (N(s) + c2) / c2 ) ) ]

where c1 and c2 are hyperparameters controlling the exploration bias, and the other values can be taken from the node statistics.

Expansion. We append M nodes to the selected leaf, M being a hyperparameter. One of the major challenges in applying RLMs is maintaining the diversity of reasoning paths. By adding M nodes, we increase the exploration of alternative reasoning paths.

Backpropagation. The backpropagation step serves to propagate information from the terminal nodes back to their ancestors. In our implementation, we update the running estimates of the q-values using the following formula:

q(s) ← (1 − α) q(s) + α γ Σ_{sc ∈ C(s)} ws(sc) · q(sc),

where we look at the node-edge-node tuples (s, ac, sc) and sc ∈ C(s). The weights ws(sc) for combining the children q-values are defined over the visit scores of the nodes as follows:

ws(sc) = N(sc) / Σ_{sc̃ ∈ C(s)} N(sc̃).

True Reward Propagation. We improve the quality of the q-values by propagating the real final rewards back through the tree when a terminal state sT+1 is reached. During training, terminal nodes can be evaluated against a reference golden answer g* using an external verifier. For actions leading to terminal states, the associated reward is equal to the q-value (see Eq. 5). Therefore, instead of using the prediction of the q-value model, we initialize q(sT+1) with the true reward r(sT+1) based on the evaluation of the external verifier. The reward is then backpropagated via the q-values through the tree with our backpropagation operator. This adjustment anchors the q-value model predictions with real reward signals and prevents the q-value model predictions from diverging.

Best Path Selection. After N iterations, MCTS will have formed a tree in which every path corresponds to one of the explored reasoning trajectories. The final reasoning step in a path with the highest terminal value estimate is returned as the final solution.
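The following Python sketch illustrates the node-based selection score and the weighted backpropagation update described above. It is a simplified illustration rather than the x1 implementation: the Node class, the list-based tree, and the default values of c1, c2, alpha, and gamma are assumptions made for brevity.

import math
from dataclasses import dataclass, field

@dataclass
class Node:
    q: float = 0.0          # q(s): running q-value estimate of the transition into s
    visits: int = 1         # N(s)
    terminal: bool = False  # beta(s)
    children: list = field(default_factory=list)

def selection_score(parent: Node, child: Node, c1: float = 1.25, c2: float = 19652.0) -> float:
    """q(s_c) + sqrt(N(s) - 1)/(1 + N(s_c)) * (c1 + log((N(s) + c2)/c2)), without a prior."""
    explore = math.sqrt(max(parent.visits - 1, 0)) / (1 + child.visits)
    return child.q + explore * (c1 + math.log((parent.visits + c2) / c2))

def select_child(parent: Node) -> Node:
    return max(parent.children, key=lambda c: selection_score(parent, c))

def backpropagate(path: list, alpha: float = 0.5, gamma: float = 0.95) -> None:
    """Walk the selected path bottom-up, blending each node's q-value with the
    visit-weighted average of its children's q-values."""
    for node in reversed(path):
        node.visits += 1
        if not node.children:
            continue
        total_visits = sum(c.visits for c in node.children)
        weighted = sum((c.visits / total_visits) * c.q for c in node.children)
        node.q = (1 - alpha) * node.q + alpha * gamma * weighted

In this form, selection favors children with high running q-values while the visit-count term keeps rarely explored siblings competitive, mirroring the exploration bias controlled by c1 and c2.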
Algorithm 1 MCTS for Reasoning (Training mode in blue)
Input: Policy model π_θ, value model Q_φ, question z_0, golden answer g*, binary correctness verifier Γ, number of MCTS iterations N, number of children expanded in every expansion phase M, exploration constants c1, c2, backpropagation weight α.
Output: Search tree T = (N, E) containing the best path τ*.

1: s_0 ← (z_0) {Initialize root node}
2: N(s_0) ← 0
3: N ← {s_0} {Initialize node set}
4: E ← ∅ {Initialize edge set}
5: i ← 1
6: while i ≤ N or β(s) ≠ 1 do
7:   s ← s_0 {Start from root node}
8:   ----- Selection -----
9:   while s is not a leaf node do
10:    {Select child s_c ∈ C(s) with highest selection score}
11:    s_c ← arg max_{s_c ∈ C(s)} q(s_c) + ( √(N(s) − 1) / (1 + N(s_c)) ) · ( c1 + log( (N(s) + c2) / c2 ) )
12:    s ← s_c {Move to the selected child}
13:  end while
14:  ----- Expansion -----
15:  for j = 1 to M do
16:    z_c ← (t_1, ..., t_{M_{z_c}}) ∼ π_θ {Sample a new reasoning step}
17:    s_c ← s ⌢ z_c {Append z_c to the current state s}
18:    q(s_c) ← Q_φ(s) {Predict with the Q-VM}
19:    N(s_c) ← 1 {Initialize visit count}
20:    β(s_c) ← 0 {Initialize terminality function}
21:    if s_c terminal then
22:      β(s_c) ← 1 {Mark as terminal}
23:      r(s_c) ← 1 if Γ(s_c, g*) = 1, and r(s_c) ← −1 if Γ(s_c, g*) = 0 {Check for correctness to determine the reward}
24:      q(s_c) ← r(s_c) {Overwrite by true reward}
25:    end if
26:    N ← N ∪ {s_c} {Add the node to the tree}
27:    E ← E ∪ {(s, s_c)} {Add the edge to the tree}
28:  end for
29:  ----- Backpropagation -----
30:  while s ≠ s_0 do
31:    N(s) ← N(s) + 1 {Update the visit count}
32:    q(s) ← (1 − α) q(s) + α γ Σ_{s_c ∈ C(s)} w_s(s_c) q(s_c)
33:    {Update the value}
34:    s ← s_p {Move to the parent}
35:  end while
36:  i ← i + 1
37: end while
38: Best Path Selection:
39: Select the best reasoning sequence s*_T.
40:
41: return s*_T, all reasoning sequences {s_j^(i)}_j

C.2 Training Phase 1

Overall Training Pipeline. To adequately employ the MCTS-based reasoning scheme introduced in Appendix C.1, the policy model must be fine-tuned to generate responses in the format of semantically relevant reasoning steps. The value model, a q-value model in our case, must be trained to accurately estimate the values of sequences of reasoning steps.

We propose a two-phase training approach designed to let the policy effectively leverage the structured exploration and iterative refinement capabilities of the search process to generate optimal sequences of reasoning steps. A detailed algorithmic description of the pipeline is in Figure 14.

Phase 1: Supervised Fine-Tuning. The first phase focuses on preparing the policy and value models to generate and evaluate reasoning trajectories effectively. This is achieved by supervised fine-tuning (SFT) on a dataset of example sequences of reasoning steps (where intermediate reasoning steps are terminated by an 'End of Intermediate Step' eois token). The objective is twofold: (1) to fine-tune the policy model π_θ to produce semantically coherent reasoning steps, and (2) to train the q-value model Q_φ to accurately assign scalar scores to reasoning trajectories, distinguishing between high-quality and suboptimal reasoning paths.

This supervised fine-tuning phase ensures that the policy can generate reasoning steps consistent with the structured format required for downstream MCTS-based exploration, while the q-value model provides reliable evaluations of intermediate and terminal states. Together, these components form the foundation for the subsequent online reinforcement learning in Phase 2, where the policy and q-value models are further refined through interaction with the reasoning framework.

C.2.1 Dataset Generation and Preparation

Dataset for SFT of the Policy. Performing SFT of the policy requires a dataset of high-quality reasoning sequences, denoted as D_SFT = {(x_SFT^(i), y_SFT^(i))}. Each pair in the dataset consists of a prompt x_SFT^(i) composed of a sequence of reasoning steps (for example x_SFT^(i) = (z_0^(i), ..., z_j^(i))) and a target completion y_SFT^(i) = z_{j+1}^(i), which is the subsequent reasoning step or the final answer. Appendix D contains a detailed account of the dataset creation and processing. It covers how the special eois token is appended to reasoning steps to mark the end of a step during inference.

Dataset for Q-Value Model Training. Similarly to SFT, training the q-value model requires a supervised dataset of reasoning sequences and corresponding scores. We denote this dataset D_QVM-train = {(x_QVM-train^(i), y_QVM-train^(i))}, with reasoning sequences x_QVM-train^(i) = (z_0^(i), ..., z_t^(i)) and target q-values y_QVM-train^(i). Appendix D explains how this dataset can be generated using an initial list of questions, a base LLM for querying, and a verifier program to label reasoning sequences as conducive to a correct final answer or not.
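As an illustration of how such SFT pairs can be assembled, the short Python sketch below turns a question and its reasoning steps into prefix/next-step pairs, terminating intermediate steps with an eois marker and the final step with an eos marker. The token strings and the helper name are placeholders chosen for this example, not the framework's actual special tokens.

EOIS, EOS = "<|eois|>", "<|eos|>"

def make_sft_pairs(question: str, steps: list[str]) -> list[tuple[str, str]]:
    """For every prefix (z_0, ..., z_j), the target is the next step z_{j+1}."""
    pairs = []
    prefix = question
    for j, step in enumerate(steps):
        is_last = (j == len(steps) - 1)
        target = step + (EOS if is_last else EOIS)
        pairs.append((prefix, target))
        prefix = prefix + " " + target
    return pairs

pairs = make_sft_pairs(
    "What is 12 * 7?",
    ["First compute 10 * 7 = 70.", "Then 2 * 7 = 14.", "So 12 * 7 = 70 + 14 = 84."],
)
for prompt, target in pairs:
    print(repr(prompt), "->", repr(target))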
C.3 Training Phase 2: RL Tuning of Policy with MCTS

Phase 2 involves generating reasoning sequences from the policy with MCTS and the q-value model, and fine-tuning the policy with an RL-based alignment algorithm to generate better completions. The q-value model must also be continually updated in this training loop to stay in-distribution with the policy's outputs. Sufficient Phase 1 pre-training of the policy and q-value model is crucial to ensure stable training of the models in Phase 2. The MCTS structure, which provides a balanced exploration-exploitation search, combined with repeated sampling of the policy, ensures sufficient exploration during this online-RL phase. This final training phase returns the fine-tuned policy and q-value model.

C.3.1 Phase 2 Algorithm

Phase 2 uses a set D_p = {p^(i)} of prompt questions; these questions may be isolated from the Phase 1 dataset D_SFT. The training process (Algorithm 4) involves a repetition of an MCTS rollout phase followed by a training (reinforcement) phase.

Further regularization can be imposed on the PPO training procedure. To align the policy π_θ with a reference policy π_ref (usually instantiated as π_θ before Phase 2) during training, the KL divergence KL(π_θ || π_ref) between the two distributions can be added to the training loss. Additionally, to maintain the diversity of policy generations (and exploration during training), the entropy of the policy distribution can be enhanced by subtracting it from the loss. The entropy penalty is estimated over a batch D of state-action pairs (s, a), where s denotes a reasoning sequence and a the next reasoning step. The entropy of a single completion a is computed by summing the entropy of its individual tokens a_{1:|a|}:

L_H = −(1/|D|) Σ_{(s,a) ∈ D} Σ_{a_i ∈ a} π_θ(a_i | [s, a_{1:i−1}]) log π_θ(a_i | [s, a_{1:i−1}]).

Direct Preference Optimization (DPO). DPO [115] aligns the policy to user preferences expressed as pairwise comparisons between reasoning sequences, given pairs (s^+, s^−) where s^+ is preferred over s^−. This method may not require a process reward/value model. The loss involves the sigmoid function, which we denote as σ.

Supervised Fine-Tuning (SFT). As a straightforward alternative to RL, high-value reasoning sequences can be selected to perform SFT, i.e., to train the policy to maximize the likelihood of these reasoning steps. The high-value reasoning sequences may be selected as terminal nodes having the highest q-value, or the highest aggregated intermediate-step values. This approach is inspired by AlphaZero-like frameworks, focusing on iteratively refining the policy to generate high-quality reasoning trajectories without requiring explicit rewards.

C.3.3 Advantage Calculation (for PPO Policy Updates)

While standard advantage computation in PPO (e.g., via Generalized Advantage Estimation (GAE) [125]) is widely applicable, we propose an alternative approach tailored to our reasoning framework in Algorithm 6. Specifically, for each state/node s, we leverage the q-value estimates q(s) obtained during the MCTS process. They were updated in the backpropagation phase to provide a more informed estimate of the q-values, incorporating the estimates of the children and potentially true reward signals from terminal paths in the tree. We expect these MCTS-derived values to be more reliable, as they incorporate the ground-truth terminal reward propagated back through the tree, ensuring that a node's value reflects both its immediate reward and the aggregated values of subsequent child states. This has the benefit of both improving the accuracy of the value model and keeping it "in-distribution" with the new policy outputs during this online-RL training.

Algorithm 4 Phase 2: RL of the Policy and Q-Value Model
Input: Policy π_θ, q-value model Q_φ, dataset D_p = {p^(i)}, MCTS hyperparameters Ξ_MCTS.
Output: Trained π_θ and updated Q_φ.

1: for each training iteration do
2:   ----- Rollout -----
3:   for each question p^(i) ∈ D_p do
4:     {Generate MCTS tree with π_θ and Q_φ (Algorithm 1)}
5:     T^(i) ← MCTS(p^(i), Q_φ, π_θ, Ξ_MCTS)
6:     {Remove incomplete paths from the tree}
7:     T̃^(i) ← Prune(T^(i))
8:     {Extract nodes and values, store them in replay buffer}
9:     R ← R ∪ {(s_j^(i), z_j^(i), q(s_j^(i)))}_{s_j ∈ Ñ^(i)}
10:  end for
11:  ----- Training -----
12:  for each epoch do
13:    Sample a batch B from replay buffer R.
14:    Update policy π_θ (Algorithm 5).
15:    Update q-value model Q_φ (Algorithm 7).
16:  end for
17: end for

Algorithm 5 Policy Update (PPO, DPO, or SFT)
Input: Batch D, policy π_θ, reference policy π_ref, learning rate η, clipping parameter ε, preference data D_pref for DPO.
Output: Updated policy π_θ.

1: ----- Train via PPO -----
2: Select state-action-value triplets from sequences in D
3: for each (s_t, a_t, q_t) ∈ D do
4:   Compute the policy ratio: r_θ = π_θ(a_t | s_t) / π_θref(a_t | s_t).
5:   Compute the advantages Â(s_t) (Algorithm 6).
6:   Compute the PPO loss: L_PPO = min( r_θ Â(s_t), clip(r_θ, 1 − ε, 1 + ε) Â(s_t) ).
7: end for
8: Optional: add KL divergence or entropy regularization: L_PPO ← L_PPO + λ_KL KL(π_θ || π_ref) + λ_H L_H.
9: Perform gradient update to refine π_θ.
10:
11: ----- Train via DPO (pairwise preferences) -----
12: Select preference pairs of reasoning sequences in D
13: for each pair (s^+, s^−) ∈ D_pref do
14:   Compute the DPO objective: L_DPO = (1/|D_pref|) Σ_{(s^+, s^−)} log σ( β log( π_θ(s^+) / π_θ(s^−) ) ).
15: end for
16: Perform gradient update to refine π_θ.
17:
18: ----- Train via SFT (single target sequence) -----
19: Select high-value reasoning sequences s^+ from D
20: for each reasoning sequence s^+ do
21:   Perform SFT on s^+
22: end for

Algorithm 6 Advantage Calculation in MCTS Framework
Input: MCTS tree T = (N, E), node statistics (rewards and q-values), q-value model Q_φ, discount factor γ, and λ.
Output: Advantages {Â(s_t)}.

1: for each node s_i ∈ N do
2:   Compute state value: v^MCTS_{s_{i+1}} = (1/γ) q^MCTS(s_i)
3:   Compute state value: v^MCTS_{s_i} = (1/γ) q^MCTS(s_{i−1})
4:   Compute the advantage based on the TD error: Â(s_i) = r(s_i, a_i) + γ v^MCTS_{s_{i+1}} − v^MCTS_{s_i}
5: end for
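A compact PyTorch-style sketch of the clipped PPO objective from Algorithm 5, combined with the optional KL and entropy regularizers, is given below. It is an illustration under simplifying assumptions (per-step log-probabilities and advantages are precomputed, for example with Algorithm 6, and the KL term uses a simple sample-based estimate); it is not the x1 implementation.

import torch

def ppo_loss(logp_new: torch.Tensor,      # log pi_theta(a_t | s_t)
             logp_ref: torch.Tensor,      # log pi_ref(a_t | s_t)
             advantages: torch.Tensor,    # MCTS-derived advantages A_hat(s_t)
             entropy: torch.Tensor,       # per-step policy entropy estimate
             eps: float = 0.2,
             lambda_kl: float = 0.1,
             lambda_h: float = 0.01) -> torch.Tensor:
    ratio = torch.exp(logp_new - logp_ref)                       # r_theta
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantages
    policy_term = -torch.min(unclipped, clipped).mean()          # minimize the negative clipped surrogate
    kl_term = (logp_new - logp_ref).mean()                       # crude estimate of KL(pi_theta || pi_ref)
    entropy_term = -entropy.mean()                               # subtract entropy to encourage diversity
    return policy_term + lambda_kl * kl_term + lambda_h * entropy_term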
Algorithm 7 Q-Value Model Update
Input: Batch D, q-value model Q_φ, learning rate η.
Output: Updated Q_φ.

1: Compute the loss: L_q = (1/|D|) Σ_{(s,a,s′)} ( Q_φ(s, a) − q^MCTS(s′) )².
2: Perform a gradient update on L_q.

APPENDIX D
DATA GENERATION

D.1 Generating Data for Phase 1 Policy Model Training

The objective of this training process is to introduce a new 'End of Intermediate Step' (EOIS) token that serves to delimit individual reasoning steps while preserving the original distribution of the model as much as possible. To achieve this, the model is trained on data generated by itself using greedy decoding.

The training data are derived from eight chain-of-thought (CoT) completions generated for 1,000 questions sampled from the training split of the MATH dataset [59]. These completions are produced with greedy decoding using the same model intended for subsequent training. During this generation process, the reasoning steps in the data are observed to be separated by two consecutive newlines ('\n\n'). This observation informs the method of delimitation used to construct pairs of questions and their corresponding sequences of reasoning steps.

For each data point, consisting of a question prompt and its associated target response comprising multiple reasoning steps (q^(i), [s_1^(i), ..., s_n^(i)]), additional tokens are introduced to explicitly mark the boundaries of the reasoning steps. Specifically, the 'End of Intermediate Step' (EOIS) token is defined and inserted after each reasoning step s_j^(i), resulting in a modified step s_j^(i)*. Additionally, the 'End of Sequence' (EOS) token is appended to the final reasoning step s_n^(i), yielding s_n^(i)* = [s_n^(i); eos]. This augmentation ensures that the model can consistently identify when a final solution has been reached during inference.

For Llama models, it has been empirically observed that introducing an 'assistant' token after each reasoning step enhances the model's effective utilization of the EOIS token. However, this behavior may not generalize to other base models, necessitating careful consideration when applying this approach.

Accordingly, the target sequence for supervised fine-tuning (SFT) is constructed as:

y_SFT^(i) = [s_1^(i), eois, assistant, s_2^(i), ..., s_n^(i), eos].

This approach yields a training dataset comprising pairs of prompts and their corresponding target completions, formally represented as:

D_SFT = {(q^(i), y_SFT^(i))}.

D.2 Generating Data for Phase 1 Value Model Training

The original MCTS framework relies on simulations to evaluate a state. Given the state, n rollouts are performed until a terminal state is reached. The terminal states usually can be evaluated (e.g., in math by comparing them with the golden answer). This enables the distribution of terminal rewards based on their success, which are then aggregated to provide a value estimate of the state. These Monte Carlo simulations serve as an estimate of a state's ability to lead to a correct answer. The value estimated in this manner corresponds to the expected cumulative future reward for a given state:

V_{π_θ}(s) = E_{τ∼π_θ}[ Σ_{t=i}^{T} γ^{t−i} r(s_t, a_t) | s_i = s ],

where T is the terminal step of the (sub-)reasoning chain τ = (s_i, a_i, r_i, s_{i+1}, ..., s_T, a_T, r_T, s_{T+1}).

Since rewards are sparse (i.e., r(s_t, a_t) = 0 for all t < T), the value function simplifies to:

V_{π_θ}(s_t) = E_{π_θ}[ γ^{T−t} r(s_T, a_T) | s_t ].

This represents the expected terminal reward, which can be empirically estimated using Monte Carlo (MC) estimates:

V_{π_θ}(s_t) ≈ (1/N) Σ_{i=1}^{N} γ^{T−t} r(s_T^(i), a_T^(i)) := V̂(s_t),

where N is the number of sampled reasoning chains, and s_T^(i), a_T^(i), s_{T+1}^(i) denote the last transition of the simulation trajectory τ^(i) = (s_t^(i), a_t^(i), s_{t+1}^(i), ..., s_T^(i), a_T^(i), s_{T+1}^(i)) for i ∈ {1, ..., N}.

To avoid sample inefficiencies and high computational burdens, AlphaGo Zero [129] and AlphaZero [128] introduce a value model to replace simulations by using its predictions for a state. We follow this approach by defining a process-based value model V_φ. Notably, we train this model with simulation data (instead of true value functions), thereby building a model that predicts state value function estimates V̂. We denote this model as V̂_φ, parameterized by φ.

Given that the input of a value model is a sequence of reasoning steps, and therefore a sequence of tokens, the natural value model architecture is an LLM to which one adds linear layer(s) and a suitable output activation function. Typically, it is designed to output a scalar value V̂_φ(s_t) ∈ C ⊆ R.

The core distinction between different modeling approaches to state value functions lies in how rewards are modeled. Depending on whether a binary reward setting or a continuous (bounded) one is used, the aggregation mechanism, model architecture, training loss, and interpretation of the predictions vary. We provide an overview of both scenarios and, although often omitted for simplicity, we consider both γ = 1 and γ ∈ (0, 1] for continuous rewards in our analysis.
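The Monte Carlo estimator above can be sketched in a few lines of Python. The rollout_fn and verify_fn callables below are assumed stand-ins for the policy-driven simulation and the external verifier; the default rewards of +1/−1 match the bounded setting, and setting them to 1/0 with gamma = 1 recovers the binary correctness-probability estimate discussed next.

def mc_value_estimate(state, rollout_fn, verify_fn, n_rollouts=8, gamma=1.0,
                      correct_reward=1.0, incorrect_reward=-1.0):
    """Return V_hat(s) = (1/N) * sum_i gamma**(T_i - t) * r_i over N rollouts from state s."""
    total = 0.0
    for _ in range(n_rollouts):
        final_answer, num_steps = rollout_fn(state)   # simulate until a terminal state
        reward = correct_reward if verify_fn(final_answer) else incorrect_reward
        total += (gamma ** num_steps) * reward        # discount by the rollout depth
    return total / n_rollouts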
D.2.1 Binary Rewards: Modeling the Likelihood of a Correct Terminal State

For this approach the rewards are modeled as binary, i.e., r(s_T, a_T) = +1 for correct solutions and r(s_T, a_T) = 0 for incorrect solutions. We adopt a discount factor of γ = 1, which, as we will see, aligns better with the interpretation this reward model provides and is widely adopted in the literature. This approach corresponds to the value model proposed in AlphaGo Zero [129].

D.2.1.1 State Value Estimation: The value function then further simplifies to:

V_{π_θ}(s_t) = E_{π_θ}[ r(s_T, a_T) | s_t ] = P_{π_θ}( r(s_T, a_T) = 1 | s_t ).

This formulation represents the probability of reaching a correct terminal state from a given state s_t. Empirically, this probability is estimated using simulations as follows:

V_{π_θ}(s_t) ≈ #correct simulations / #simulations := V̂(s_t).

D.2.1.2 Data Generation: To generate labels for estimating the state-value function during the training of a value model, we use MCTS with simulations until a terminal node is reached and calculate the ratio of the number of correct simulations to the number of simulations. There is one very important detail for a trajectory τ = (s_i, a_i, r_i, s_{i+1}, ..., s_{T+1}) where s_{T+1} is a terminal state. By definition, the true state value function at s_{T+1} is zero. However, in training the value model, we avoid instructing it to output zero for terminal states. Instead, in a supervised learning setting, we can identify terminal states and directly compare the model's predictions against the known correct outcomes (referred to here as "golden answers"). This comparison negates the need to rely solely on the value model to estimate the value of terminal states or to determine the reward associated with transitioning into these states. During inference, while we can still recognize terminal states, we cannot evaluate them by comparing the model's output to a golden answer. Therefore, an alternative metric is necessary. We train the value model to predict whether transitioning to s_{T+1} leads to a correct terminal outcome. By learning the relationship between a node's content and the correctness of the resulting terminal state, the model can estimate the likelihood that a terminal state leads to a correct answer. To approximate the terminal reward during inference, we define:

r(s_T, a_T, s_{T+1}) ≈ 1_{[0.5,1]}( V̂_φ(s_{T+1}) ).

Here V̂_φ(s_{T+1}) represents the value predicted by the value model for the terminal state s_{T+1}. If this predicted likelihood exceeds a threshold (e.g., 0.5), we assign a terminal reward of 1; otherwise, we assign a reward of 0. This approach allows the value model to indirectly influence the terminal reward by predicting the likelihood of a correct outcome. Consequently, during training, terminal rewards serve as labels for terminal states in the value model. It is important to note that V̂_φ(s_{T+1}) is not used in any other context but solely to estimate the terminal reward:

V̂_φ(s_{T+1}) ≠ V̂(s_{T+1}).

This distinction clarifies that the predicted value for the terminal state V̂_φ(s_{T+1}) differs from the standard value function's definition V̂(s_{T+1}) = 0.

D.2.1.3 Model Training V̂_φ: S → [0, 1]: When trained with these labels, we obtain a value model V̂_φ, parameterized by φ, that represents the likelihood of a correct terminal state emanating from state s_t. Therefore, the model will output values between 0 and 1. To accommodate the binary classification nature of this task, the model should employ a sigmoid activation function in the output layer. The training objective is then to minimize the binary cross-entropy (CE) loss between the predicted probabilities and the empirical estimates derived from the simulations:

L(φ) = −(1/N) Σ_{i=1}^{N} [ y_i log V̂_φ(s_t^(i)) + (1 − y_i) log( 1 − V̂_φ(s_t^(i)) ) ],

where y_i ∈ {0, 1} denotes the binary label indicating whether the i-th simulation resulted in a correct terminal state.

Employing a binary reward structure offers several benefits. First, simplicity: binary rewards simplify the learning process, reducing the complexity associated with continuous reward signals. Moreover, the clear distinction between correct and incorrect states facilitates faster convergence during training, making this approach effective. In addition, binary classification is less susceptible to noise in reward signals, ensuring more stable value estimates. Furthermore, this approach aligns with the objectives of reinforcement learning in achieving clear and unambiguous rewards, thereby streamlining the optimization of the policy π_θ.

D.2.2 Continuous and Bounded Rewards: Modeling the Expected Future Reward

We model the rewards as continuous and bounded by allowing values in [a, b]:

V_{π_θ}(s_t) ∈ [a, b].

A common design is to set the bounds to −1 and 1, such that the terminal reward is r(s_T, a_T) = +1 for correct terminal states and r(s_T, a_T) = −1 for incorrect states. This approach models the expected future reward as a continuous and bounded value, capturing the degree of correctness or quality of the terminal state. In contrast to the binary reward structure, continuous and bounded rewards provide a more nuanced representation of the outcomes in reasoning tasks. Note that without discounting this approach resembles the value model proposed in AlphaZero [128].

D.2.2.1 Bounded rewards: By constraining rewards within a predefined interval [a, b], we effectively create a correctness scale where the extremities represent the definitive outcomes of the reasoning process. Specifically, the lower bound a corresponds to reaching an incorrect terminal state, while the upper bound b signifies a correct terminal state. This bounded framework mirrors the spectrum of possible correctness, allowing the model to capture varying degrees of solution quality between these extremes. Such a scale facilitates a more nuanced evaluation of intermediate states, reflecting partial correctness or varying levels of reasoning quality. Moreover, this approach ensures that the reward signals remain interpretable and consistent, fostering a clear distinction between successful and unsuccessful outcomes.
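A minimal PyTorch-style sketch of the binary-reward value head and its BCE objective from D.2.1.3 is shown below. The single linear projection, the hidden size, and the random tensors standing in for LLM hidden states and rollout labels are simplifying assumptions; the x1 value model described earlier adds three linear layers on top of a pretrained transformer-based LM.

import torch
import torch.nn as nn

class BinaryValueHead(nn.Module):
    def __init__(self, hidden_size: int):
        super().__init__()
        self.proj = nn.Linear(hidden_size, 1)

    def forward(self, last_hidden: torch.Tensor) -> torch.Tensor:
        # last_hidden: (batch, hidden_size) embedding of the final token of the sequence
        return torch.sigmoid(self.proj(last_hidden)).squeeze(-1)   # V_hat_phi(s_t) in [0, 1]

head = BinaryValueHead(hidden_size=4096)
loss_fn = nn.BCELoss()

hidden = torch.randn(8, 4096)                # stand-in for LLM hidden states
labels = torch.randint(0, 2, (8,)).float()   # 1 if the rollout reached a correct answer
loss = loss_fn(head(hidden), labels)
loss.backward()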
D.2.2.2 State Value Estimation: With a discount factor γ ∈ (0, 1], the value function is defined as:

V_{π_θ}(s_t) = E[ γ^{T−t} r(s_T, a_T) | s_t ],

where r(s_T, a_T) = b for correct terminal states and r(s_T, a_T) = a for incorrect ones. Empirically, this expectation is approximated by averaging the rewards of the simulations:

V_{π_θ}(s_t) ≈ (1/N) Σ_{i=1}^{N} γ^{T−t} r(s_T^(i), a_T^(i)) := V̂(s_t),

where N denotes the number of sampled reasoning chains, and (s_T^(i), a_T^(i), s_{T+1}^(i)) represent the final transition of the i-th simulation trajectory τ^(i) = (s_t^(i), a_t^(i), s_{t+1}^(i), ..., s_T^(i), a_T^(i), s_{T+1}^(i)) for i ∈ {1, ..., N}. If a discount factor γ ∈ (0, 1) is applied, then each terminal reward is discounted proportionally to the number of steps needed to reach the terminal state. This corresponds to the soft estimation proposed by Wang et al. [147]. We note that this estimator typically underestimates V due to its proneness to false negatives [55], [163].

D.2.2.3 Data Generation: To generate labels for state-value function estimate pairs to train a value model, we use MCTS with simulations and average the outcomes of the simulations. At each newly generated node s, we simulate until a terminal node is reached and record the depth, i.e., the number of steps needed starting from s (since T is not identical per trajectory). We then record the terminal reward, which in our case is r(s_T, a_T) = 1 for correct and r(s_T, a_T) = −1 for incorrect answers. Discounted by the depth, we can average these rewards and obtain an estimate of the node value, which serves as a label for the initial value model training.

D.2.2.4 Model Training V̂_φ: S → [a, b]: The value model V̂_φ, parameterized by φ, is designed to predict the expected terminal reward from any given state s_t. To accommodate the continuous and bounded nature of this task, the model employs a scaled and shifted sigmoid activation function in the output layer, ensuring that the predictions remain within the range [a, b]. The training objective is to minimize the mean squared error (MSE) loss between the predicted values and the empirical estimates derived from the simulations:

L(φ) = (1/N) Σ_{i=1}^{N} ( V̂_φ(s_t^(i)) − γ^{T−t} r(s_T^(i), a_T^(i)) )².

We also experimented with a tanh activation output and a linear layer with clipping of the values. However, both methods proved to be unstable in training, in contrast to the scaled and shifted sigmoid layer. A tanh or sigmoid layer naturally bounds the output but also pushes values towards the extremes, enhancing the separation between high and low value estimates. This characteristic can improve the model's ability to distinguish between highly correct and highly incorrect states, which is why we are particularly interested in these activation functions.

D.2.2.5 Discounting: Introducing a discount factor γ aligns the value function with the incremental nature of reasoning tasks. Unlike traditional games, where all moves contribute indirectly and trajectories are not penalized for length, reasoning benefits from discouraging unnecessary or redundant steps. Because the discount factor γ ensures that rewards achieved sooner have a greater impact on the value function, the model is incentivized to reach correct solutions with fewer steps, which ultimately enhances efficiency and suppresses redundancies. Moreover, discounting models the uncertainty decay in the trajectories: the further into the future a reward lies, the more uncertain its prediction becomes. Discounting naturally reduces the reliance on these uncertain long-term rewards, thereby stabilizing the learning process by focusing on more predictable and immediate outcomes. However, the model's performance becomes sensitive to the choice of γ, requiring careful tuning to balance the influence of immediate versus long-term rewards. Balancing the discount factor is essential to ensure that the model effectively captures the importance of both progress and the final correctness of the reasoning chain.

Employing a continuous and bounded reward structure offers several benefits. Unlike binary rewards, continuous rewards provide a finer distinction between varying degrees of correctness, allowing the model to capture subtle differences in terminal states. Continuous rewards can encode more information about the quality of solutions, facilitating more informed decision-making during the search process. Bounded rewards prevent extreme values, promoting numerical stability and consistent training dynamics. However, this also shows that the choice of reward values and their scaling can significantly impact the learning process, necessitating careful calibration to ensure effective training.

D.3 State Action Value Function Modeling

The state-action value function, commonly denoted as Q_{π_θ}(s_t, a_t), represents the expected cumulative reward of taking action a_t in state s_t under policy π_θ. Formally, it is defined in our framework as:

Q_{π_θ}(s_t, a_t) = E_{τ∼π_θ}[ Σ_{i=t}^{T} γ^{i−t} r(s_i, a_i) | s_t, a_t ]
                 = r(s_t, a_t) + γ E_{τ∼π_θ}[ Σ_{i=t+1}^{T} γ^{i−(t+1)} r(s_i, a_i) | s_t, a_t ]
                 = r(s_t, a_t) + γ E_{s_{t+1}}[ V_{π_θ}(s_{t+1}) | s_t, a_t ]
                 = r(s_t, a_t) + γ V_{π_θ}(s_{t+1})    (deterministic transitions),

where T denotes the terminal step of the (sub-)reasoning chain τ = (s_t, a_t, r_t, s_{t+1}, ..., s_T, a_T, r_T, s_{T+1}). In environments characterized by sparse rewards, where r(s_t, a_t) = 0 for all t < T, the q-value simplifies to:

Q_{π_θ}(s_t, a_t) = γ V_{π_θ}(s_{t+1}).

At terminal states, where the state value V_{π_θ}(s_{T+1}) = 0, the q-value further reduces to:

Q_{π_θ}(s_T, a_T) = r(s_T, a_T).
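For the bounded-reward variant of D.2.2.4, the head can rescale a sigmoid output into [a, b] and be fit with MSE against discounted terminal-reward labels, as in the following sketch (an illustration with an assumed hidden size and random stand-in tensors, not the actual implementation). With a = 0, b = 1, and a BCE loss, the same construction recovers the binary head shown earlier.

import torch
import torch.nn as nn

class BoundedValueHead(nn.Module):
    def __init__(self, hidden_size: int, low: float = -1.0, high: float = 1.0):
        super().__init__()
        self.proj = nn.Linear(hidden_size, 1)
        self.low, self.high = low, high

    def forward(self, last_hidden: torch.Tensor) -> torch.Tensor:
        s = torch.sigmoid(self.proj(last_hidden)).squeeze(-1)    # in (0, 1)
        return self.low + (self.high - self.low) * s             # rescaled and shifted into [a, b]

head = BoundedValueHead(hidden_size=4096)
hidden = torch.randn(8, 4096)                    # stand-in for LLM hidden states
targets = torch.empty(8).uniform_(-1.0, 1.0)     # discounted terminal-reward labels
loss = nn.functional.mse_loss(head(hidden), targets)
loss.backward()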
D.3.1 Process-Based Q-Value Modeling

A process-based q-value model utilizes the same architecture as a process-based value model, typically leveraging an LLM enhanced with additional linear layers and an appropriate output activation function. The output is a scalar value Q̂_φ(s_t, a_t) ∈ C ⊆ R. Specifically, the q-value model takes a state-action pair, comprising a sequence of past steps and the current action, and predicts the corresponding q-value based on the aforementioned formulation.

D.3.1.1 Training Data Generation: To train the q-value model, it is essential to compute the q-values for various state-action pairs. For t < T, q-values can be estimated using N Monte Carlo simulations as follows:

Q_{π_θ}(s_t, a_t) = r(s_t, a_t) + γ V_{π_θ}(s_{t+1})
                  = γ V_{π_θ}(s_{t+1})    (since r(s_t, a_t) = 0)
                  ≈ γ · (1/N) Σ_{i=1}^{N} γ^{T−(t+1)} r(s_T^(i), a_T^(i))
                  = (1/N) Σ_{i=1}^{N} γ^{T−t} r(s_T^(i), a_T^(i)) := Q̂(s_t, a_t),

where N is the number of sampled reasoning chains, and τ^(i) = (s_t^(i), a_t^(i), s_{t+1}^(i), ..., s_T^(i), a_T^(i), s_{T+1}^(i)) represents the i-th simulation trajectory for i ∈ {1, ..., N}. This estimation aligns with the state value estimation under the sparse reward formulation:

Q̂(s_t, a_t) = V̂(s_t).

For t = T, the q-value is directly given by the immediate reward:

Q_{π_θ}(s_T, a_T) = r(s_T, a_T) ≠ V_{π_θ}(s_{T+1}) = 0.

D.3.1.2 Reward Modeling: For q-value models, the same considerations about reward modeling apply, since the models are trained very similarly; we therefore omit the discussion here.

D.3.2 The Difference between Value and Q-Value Models

The difference between VMs and QVMs can be easily shown in how they are used in the evaluation processes of an MCTS algorithm. QVMs predict Q̂_φ(s_t, a_t), which evaluates the action a_t taken in state s_t that deterministically transitions to s_{t+1}. Thus, the value Q̂(s_t, a_t) is used to evaluate adding the node s_{t+1} to the tree. On the other hand, for VMs, adding a node s_{t+1} to the tree is determined by V̂(s_{t+1}) = (1/γ) Q̂_φ(s_t, a_t), where γ is the discount factor.

This distinction makes the training processes different. Note that s_t ⌢ a_t = s_{t+1}. For QVMs, the training tuples are ((s_t, a_t), Q̂(s_t, a_t)) = (s_{t+1}, Q̂(s_t, a_t)) due to the deterministic transition. For VMs, the corresponding training tuples are (s_{t+1}, V̂(s_{t+1})). Since we propose training VMs on terminal rewards for terminal states instead of assigning a label of 0, VMs and QVMs become equivalent under the following transformation for any t ∈ {0, ..., T} for evaluating adding node s_{t+1}:

V̂(s_{t+1}) = (1/γ) Q̂_φ(s_t, a_t).

We introduced q-value models since they address a critical inconsistency of value models in terminal states. Specifically, while value models assign a flat value of zero to terminal states, q-value models provide a meaningful evaluation of the final action's correctness through Q_{π_θ}(s_T, a_T) = r(s_T, a_T). This distinction is essential for accurately assessing whether a terminal step leads to a correct or incorrect response during inference.

REFERENCES

[1] A. Ahmadian, C. Cremer, M. Gallé, M. Fadaee, J. Kreutzer, O. Pietquin, A. Üstün, and S. Hooker. Back to Basics: Revisiting REINFORCE-Style Optimization for Learning from Human Feedback in LLMs. In L.-W. Ku, A. Martins, and V. Srikumar, editors, Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL '24, pages 12248–12267, Bangkok, Thailand, Aug. 2024. Association for Computational Linguistics.
[2] J. Ahn, R. Verma, R. Lou, D. Liu, R. Zhang, and W. Yin. Large Language Models for Mathematical Reasoning: Progresses and Challenges. In N. Falk, S. Papi, and M. Zhang, editors, Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics: Student Research Workshop, EACL '24, pages 225–237, St. Julian's, Malta, Mar. 2024. Association for Computational Linguistics.
[3] AI-MO. AIME 2024. https://huggingface.co/datasets/AI-MO/aimo-validation-aime, July 2024. Accessed 2025-01-19.
[4] AI-MO. AMC 2024. https://huggingface.co/datasets/AI-MO/aimo-validation-amc, July 2024. Accessed 2025-01-19.
[5] A. Amini, S. Gabriel, P. Lin, R. Koncel-Kedziorski, Y. Choi, and H. Hajishirzi. MathQA: Towards Interpretable Math Word Problem Solving with Operation-Based Formalisms, May 2019. arXiv:1905.13319.
[6] J. Austin, A. Odena, M. Nye, M. Bosma, H. Michalewski, D. Dohan, E. Jiang, C. Cai, M. Terry, Q. Le, and C. Sutton. Program Synthesis with Large Language Models, Aug. 2021. arXiv:2108.07732.
[7] A. Bakhtin, L. van der Maaten, J. Johnson, L. Gustafson, and R. Girshick. PHYRE: A New Benchmark for Physical Reasoning. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Proceedings of the Thirty-third Annual Conference on Neural Information Processing Systems (NeurIPS '19), volume 32 of Advances in Neural Information Processing Systems, pages 5082–5093, Vancouver, Canada, Dec. 2019. Curran Associates.
[8] T. Ben-Nun and T. Hoefler. Demystifying Parallel and Distributed Deep Learning: An In-depth Concurrency Analysis. ACM Comput. Surv., 52(4):65:1–65:43, Aug. 2019.
[9] M. Besta, N. Blach, A. Kubicek, R. Gerstenberger, L. Gianinazzi, J. Gajda, T. Lehmann, M. Podstawski, H. Niewiadomski, P. Nyczyk, and T. Hoefler. Graph of Thoughts: Solving Elaborate Problems with Large Language Models. Proceedings of the AAAI Conference on Artificial Intelligence, 38(16):17682–17690, Mar. 2024.
[10] M. Besta, A. C. Catarino, L. Gianinazzi, N. Blach, P. Nyczyk, H. Niewiadomski, and T. Hoefler. HOT: Higher-Order Dynamic Graph Representation Learning with Efficient Transformers. In S. Villar and B. Chamberlain, editors, Proceedings of the Second Learning on Graphs Conference (LOG '23), volume 231 of Proceedings of Machine Learning Research, pages 15:1–15:20, Virtual Event, Nov. 2023. PMLR.
[11] M. Besta, R. Grob, C. Miglioli, N. Bernold, G. Kwaśniewski, G. Gjini, R. Kanakagiri, S. Ashkboos, L. Gianinazzi, N. Dryden, and T. Hoefler. Motif Prediction with Graph Neural Networks. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD '22, pages 35–45, Washington DC, USA, Aug. 2022. Association for Computing Machinery.
[12] M. Besta and T. Hoefler. Parallel and Distributed Graph Neural Networks: An In-Depth Concurrency Analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(5):2584–2606, May 2024.
[13] M. Besta, A. Kubicek, R. Niggli, R. Gerstenberger, L. Weitzendorf, M. Chi, P. Iff, J. Gajda, P. Nyczyk, J. Müller, et al. Multi-Head RAG: Solving Multi-Aspect Problems with LLMs, Nov. 2024. arXiv:2406.05085.
[14] M. Besta, F. Memedi, Z. Zhang, R. Gerstenberger, N. Blach, Dataset. In H. Bouamor, J. Pino, and K. Bali, editors, Proceedings
P. Nyczyk, M. Copik, G. Kwaśniewski, J. Müller, L. Gianinazzi, of the 2023 Conference on Empirical Methods in Natural Language
et al. Demystifying Chains, Trees, and Graphs of Thoughts, Apr. Processing, EMNLP ’23, pages 7889–7901, Singapore, Dec. 2023.
2024. arXiv:2401.14295. Association for Computational Linguistics.
[15] M. Besta, L. Paleari, A. Kubicek, P. Nyczyk, R. Gerstenberger, [30] X. Chen, M. Lin, N. Schärli, and D. Zhou. Teaching Large
P. Iff, T. Lehmann, H. Niewiadomski, and T. Hoefler. Check- Language Models to Self-Debug, Oct. 2023. arXiv:2304.05128.
Embed: Effective Verification of LLM Solutions to Open-Ended [31] K. Chernyshev, V. Polshkov, E. Artemova, A. Myasnikov,
Tasks, June 2024. arXiv:2406.02524. V. Stepanov, A. Miasnikov, and S. Tilga. U-MATH: A University-
[16] M. Besta, P. Renc, R. Gerstenberger, P. Sylos Labini, A. Ziogas, Level Benchmark for Evaluating Mathematical Skills in LLMs,
T. Chen, L. Gianinazzi, F. Scheidl, K. Szenes, A. Carigiet, P. Iff, Jan. 2025. arXiv:2412.03205.
G. Kwaśniewski, R. Kanakagiri, C. Ge, S. Jaeger, J. Was, F. Vella, [32] F. Chollet. On the Measure of Intelligence, Nov. 2019.
and T. Hoefler. High-Performance and Programmable Atten- arXiv:1911.01547.
tional Graph Neural Networks with Global Tensor Formulations. [33] A. Choudhury, Y. Wang, T. Pelkonen, K. Srinivasan, A. Jain,
In Proceedings of the International Conference for High Performance S. Lin, D. David, S. Soleimanifard, M. Chen, A. Yadav, R. Tijori-
Computing, Networking, Storage and Analysis, SC ’23, Denver, CO, wala, D. Samoylov, and C. Tang. MAST: Global Scheduling of
USA, Nov. 2023. Association for Computing Machinery. ML Training Across Geo-Distributed Datacenters at Hyperscale.
[17] M. Besta, Z. Vonarburg-Shmaria, Y. Schaffner, L. Schwarz, In Proceedings of the 18th USENIX Symposium on Operating Systems
G. Kwaśniewski, L. Gianinazzi, J. Beranek, K. Janda, T. Holen- Design and Implementation, OSDI ’24, pages 563–580, Santa Clara,
stein, S. Leisinger, P. Tatkowski, E. Ozdemir, A. Balla, M. Copik, CA, USA, July 2024. USENIX Association.
P. Lindenberger, M. Konieczny, O. Mutlu, and T. Hoefler. Graph- [34] P. Christiano, J. Leike, T. B. Brown, M. Martic, S. Legg, and
MineSuite: Enabling High-Performance and Programmable D. Amodei. Deep Reinforcement Learning from Human Pref-
Graph Mining Algorithms with Set Algebra. Proc. VLDB Endow., erences, Feb. 2023. arXiv:1706.03741.
14(11):1922–1935, July 2021.
[35] K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser,
[18] Z. Bi, K. Han, C. Liu, Y. Tang, and Y. Wang. Forest-of-Thought: M. Plappert, J. Tworek, J. Hilton, R. Nakano, C. Hesse, and
Scaling Test-Time Compute for Enhancing LLM Reasoning, Dec. J. Schulman. Training Verifiers to Solve Math Word Problems,
2024. arXiv:2412.09078. Nov. 2021. arXiv:2110.14168.
[19] Y. Bisk, R. Zellers, R. Le bras, J. Gao, and Y. Choi. PIQA: Reason-
[36] M. Copik, R. Böhringer, A. Calotoiu, and T. Hoefler. FMI: Fast and
ing about Physical Commonsense in Natural Language. Proceed-
Cheap Message Passing for Serverless Functions. In Proceedings
ings of the AAAI Conference on Artificial Intelligence, 34(05):7432–
of the 37th International Conference on Supercomputing, ICS ’23,
7439, Apr. 2020.
pages 373–385, Orlando, FL, USA, June 2023. Association for
[20] R. A. Bradley and M. E. Terry. Rank Analysis of Incomplete Computing Machinery.
Block Designs: I. The Method of Paired Comparisons. Biometrika,
[37] M. Copik, G. Kwaśniewski, M. Besta, M. Podstawski, and T. Hoe-
39(3/4):324–345, Dec. 1952.
fler. SeBS: A Serverless Benchmark Suite for Function-as-a-
[21] M. Chang, J. Zhang, Z. Zhu, C. Yang, Y. Yang, Y. Jin, Z. Lan,
Service Computing. In Proceedings of the 22nd International Mid-
L. Kong, and J. He. AgentBoard: An Analytical Evaluation
dleware Conference, Middleware ’21, pages 64–78, Virtual Event,
Board of Multi-turn LLM Agents. In Proceedings of the Thirty-
Dec. 2021. Association for Computing Machinery.
eighth Annual Conference on Neural Information Processing Systems
(NeurIPS ’24), volume 37 of Advances in Neural Information Process- [38] G. Cui, L. Yuan, Z. Wang, H. Wang, W. Li, B. He, Y. Fan, T. Yu,
ing Systems, Vancouver, Canada, Dec. 2024. Curran Associates. Q. Xu, W. Chen, et al. Process Reinforcement through Implicit
Rewards. https://curvy-check-498.notion.site/Process-Reinfor
[22] E. Charniak and M. Johnson. Coarse-to-Fine n-Best Parsing and
cement-through-Implicit-Rewards-15f4fcb9c42180f1b498cc9b2ea
MaxEnt Discriminative Reranking. In K. Knight, H. T. Ng, and
f896f, Jan. 2025.
K. Oflazer, editors, Proceedings of the 43rd Annual Meeting of the
Association for Computational Linguistics, ACL ’05, pages 173–180, [39] D. De Sensi, T. De Matteis, K. Taranov, S. Di Girolamo, T. Rahn,
Ann Arbor, MI, USA, June 2005. Association for Computational and T. Hoefler. Noise in the Clouds: Influence of Network
Linguistics. Performance Variability on Application Scalability. Proc. ACM
[23] G. Chen, M. Liao, C. Li, and K. Fan. AlphaMath Almost Zero: Meas. Anal. Comput. Syst., 6(3):49:1–49:27, Dec. 2022.
Process Supervision without Process. In Proceedings of the Thirty- [40] M. DeLorenzo, A. B. Chowdhury, V. Gohil, S. Thakur, R. Karri,
eighth Annual Conference on Neural Information Processing Systems S. Garg, and J. Rajendran. Make Every Move Count: LLM-
(NeurIPS ’24), volume 37 of Advances in Neural Information Process- based High-Quality RTL Code Generation Using MCTS, Feb.
ing Systems, Vancouver, Canada, Dec. 2024. Curran Associates. 2024. arXiv:2402.03289.
[24] J. Chen, T. Li, J. Qin, P. Lu, L. Lin, C. Chen, and X. Liang. [41] X. Deng, Y. Gu, B. Zheng, S. Chen, S. Stevens, B. Wang, H. Sun,
UniGeo: Unifying Geometry Logical Reasoning via Reformulat- and Y. Su. Mind2Web: Towards a Generalist Agent for the Web.
ing Mathematical Expression. In Y. Goldberg, Z. Kozareva, and In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt,
Y. Zhang, editors, Proceedings of the 2022 Conference on Empirical and S. Levine, editors, Proceedings of the Thirty-seventh Annual
Methods in Natural Language Processing, EMNLP ’22, pages 3313– Conference on Neural Information Processing Systems (NeurIPS ’23),
3323, Abu Dhabi, United Arab Emirates, Dec. 2022. Association volume 36 of Advances in Neural Information Processing Systems,
for Computational Linguistics. pages 28091–28114, New Orleans, LA, USA, Dec. 2023. Curran
[25] J. Chen, H. Lin, X. Han, and L. Sun. Benchmarking Large Lan- Associates.
guage Models in Retrieval-Augmented Generation. Proceedings of [42] Y. Deng, W. Zhang, Z. Chen, and Q. Gu. Rephrase and Respond:
the AAAI Conference on Artificial Intelligence, 38(16):17754–17762, Let Large Language Models Ask Better Questions for Them-
Mar. 2024. selves, Apr. 2024. arXiv:2311.04205.
[26] J. Chen, J. Tang, J. Qin, X. Liang, L. Liu, E. Xing, and L. Lin. [43] X. Dong, M. Teleki, and J. Caverlee. A Survey on LLM Inference-
GeoQA: A Geometric Question Answering Benchmark Towards Time Self-Improvement, Dec. 2024. arXiv:2412.14352.
Multimodal Numerical Reasoning. In C. Zong, F. Xia, W. Li, and [44] S. Es, J. James, L. Espinosa-Anke, and S. Schockaert. RAGAS:
R. Navigli, editors, Findings of the Association for Computational Automated Evaluation of Retrieval Augmented Generation, Sept.
Linguistics: ACL-IJCNLP 2021, pages 513–523, Virtual Event, Aug. 2023. arXiv:2309.15217.
2021. Association for Computational Linguistics. [45] X. Feng, Z. Wan, M. Wen, S. M. McAleer, Y. Wen, W. Zhang, and
[27] M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. de Oliveira Pinto, J. Wang. AlphaZero-Like Tree-Search Can Guide Large Language
J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman, et al. Model Decoding and Training, Feb. 2024. arXiv:2309.17179.
Evaluating Large Language Models Trained on Code, July 2021. [46] J. Frohberg and F. Binder. CRASS: A Novel Data Set and
arXiv:2107.03374. Benchmark to Test Counterfactual Reasoning of Large Language
[28] W. Chen, X. Ma, X. Wang, and W. W. Cohen. Program of Models. In N. Calzolari, F. Béchet, P. Blache, K. Choukri, C. Cieri,
Thoughts Prompting: Disentangling Computation from Reason- T. Declerck, S. Goggi, H. Isahara, B. Maegaard, J. Mariani,
ing for Numerical Reasoning Tasks. Transactions on Machine H. Mazo, J. Odijk, and S. Piperidis, editors, Proceedings of the
Learning Research, Nov. 2023. Thirteenth Language Resources and Evaluation Conference, LREC
[29] W. Chen, M. Yin, M. Ku, P. Lu, Y. Wan, X. Ma, J. Xu, X. Wang, ’22, pages 2126–2140, Marseille, France, June 2022. European
and T. Xia. TheoremQA: A Theorem-driven Question Answering Language Resources Association.
[47] Y. Fu, L. Xue, Y. Huang, A.-O. Brabete, D. Ustiugov, Y. Patel, Processing, EMNLP ’14, pages 523–533, Doha, Qatar, Oct. 2014.
and L. Mai. ServerlessLLM: Low-Latency Serverless Inference Association for Computational Linguistics.
for Large Language Models. In Proceedings of the 18th USENIX [62] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang,
Symposium on Operating Systems Design and Implementation, OSDI and W. Chen. LoRA: Low-Rank Adaptation of Large Language
’24, pages 135–153, Santa Clara, CA, USA, July 2024. USENIX Models. In Proceedings of the Tenth International Conference on
Association. Learning Representations, ICLR ’22, Virtual Event, Apr. 2022.
[48] L. Gao, A. Madaan, S. Zhou, U. Alon, P. Liu, Y. Yang, J. Callan, [63] J. Huang and K. C.-C. Chang. Towards Reasoning in Large
and G. Neubig. PAL: Program-Aided Language Models, Jan. Language Models: A Survey. In A. Rogers, J. Boyd-Graber, and
2023. arXiv:2211.10435. N. Okazaki, editors, Findings of the Association for Computational
[49] E. Glazer, E. Erdil, T. Besiroglu, D. Chicharro, E. Chen, A. Gun- Linguistics: ACL 2023, pages 1049–1065, Toronto, Canada, July
ning, C. F. Olsson, J.-S. Denain, A. Ho, E. de Oliveira Santos, 2023. Association for Computational Linguistics.
O. Järviniemi, M. Barnett, R. Sandler, M. Vrzala, J. Sevilla, Q. Ren, [64] J. Huang, X. Chen, S. Mishra, H. S. Zheng, A. W. Yu, X. Song, and
E. Pratt, L. Levine, G. Barkley, N. Stewart, B. Grechuk, T. Grechuk, D. Zhou. Large Language Models Cannot Self-Correct Reasoning
S. V. Enugandla, and M. Wildon. FrontierMath: A Benchmark for Yet. In Proceedings of the Twelfth International Conference on Learning
Evaluating Advanced Mathematical Reasoning in AI, Dec. 2024. Representations, ICLR ’24, Vienna, Austria, May 2024.
arXiv:2411.04872. [65] J. Huang, S. S. Gu, L. Hou, Y. Wu, X. Wang, H. Yu, and
[50] A. Grattafiori, A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al- J. Han. Large Language Models Can Self-Improve, Oct. 2022.
Dahle, A. Letman, A. Mathur, A. Schelten, A. Vaughan, et al. The arXiv:2210.11610.
Llama 3 Herd of Models, Nov. 2024. arXiv:2407.21783. [66] S. Huang, W. Zhong, J. Lu, Q. Zhu, J. Gao, W. Liu, Y. Hou,
[51] X. Guan, Y. Liu, X. Lu, B. Cao, B. He, X. Han, L. Sun, J. Lou, X. Zeng, Y. Wang, L. Shang, X. Jiang, R. Xu, and Q. Liu. Planning,
B. Yu, Y. Lu, and H. Lin. Search, Verify and Feedback: Towards Creation, Usage: Benchmarking LLMs for Comprehensive Tool
Next Generation Post-Training Paradigm of Foundation Models Utilization in Real-World Complex Scenarios. In L.-W. Ku,
via Verifier Engineering, Nov. 2024. arXiv:2411.11504. A. Martins, and V. Srikumar, editors, Findings of the Association for
[52] X. Guan, L. L. Zhang, Y. Liu, N. Shang, Y. Sun, Y. Zhu, Computational Linguistics: ACL 2024, pages 4363–4400, Bangkok,
F. Yang, and M. Yang. rStar-Math: Small LLMs Can Master Thailand, Aug. 2024. Association for Computational Linguistics.
Math Reasoning with Self-Evolved Deep Thinking, Jan. 2025. [67] Y. Huang, M. Kleindessner, A. Munishkin, D. Varshney, P. Guo,
arXiv:2501.04519. and J. Wang. Benchmarking of Data-Driven Causality Discovery
Approaches in the Interactions of Arctic Sea Ice and Atmosphere.
[53] K. Guu, K. Lee, Z. Tung, P. Pasupat, and M.-W. Chang. REALM:
Frontiers in Big Data, 4(32):642182:1–642182:19, Aug. 2021.
Retrieval-Augmented Language Model Pre-Training, Feb. 2020.
arXiv:2002.08909. [68] S. Imani, L. Du, and H. Shrivastava. MathPrompter: Math-
ematical Reasoning using Large Language Models, Mar. 2023.
[54] S. Han, H. Schoelkopf, Y. Zhao, Z. Qi, M. Riddell, L. Benson, arXiv:2303.05398.
L. Sun, E. Zubova, Y. Qiao, M. Burtell, D. Peng, J. Fan, Y. Liu,
[69] A. Q. Jiang, W. Li, J. M. Han, and Y. Wu. LISA: Language models
B. Wong, M. Sailor, A. Ni, L. Nan, J. Kasai, T. Yu, R. Zhang,
of ISAbelle proofs. In Proceedings of the 6th Conference on Artificial
S. Joty, A. R. Fabbri, W. Kryscinski, X. V. Lin, C. Xiong, and
Intelligence and Theorem Proving, AITP ’21, Aussois, France, Sept.
D. Radev. FOLIO: Natural Language Reasoning with First-Order
2021.
Logic, Oct. 2024. arXiv:2209.00840.
[70] J. Jiang, S. Gan, Y. Liu, F. Wang, G. Alonso, A. Klimovic, A. Singla,
[55] A. Havrilla, S. C. Raparthy, C. Nalmpantis, J. Dwivedi-Yu, W. Wu, and C. Zhang. Towards Demystifying Serverless Machine
M. Zhuravinskyi, E. Hambro, and R. Raileanu. GLoRe: When, Learning Training. In Proceedings of the 2021 International Confer-
Where, and How to Improve LLM Reasoning via Global and ence on Management of Data, SIGMOD ’21, pages 857–871, Virtual
Local Refinements. In R. Salakhutdinov, Z. Kolter, K. Heller, Event, June 2021. Association for Computing Machinery.
A. Weller, N. Oliver, J. Scarlett, and F. Berkenkamp, editors,
[71] C. E. Jimenez, J. Yang, A. Wettig, S. Yao, K. Pei, O. Press, and K. R.
Proceedings of the 41st International Conference on Machine Learn-
Narasimhan. SWE-bench: Can Language Models Resolve Real-
ing (ICML ’24), volume 235 of Proceedings of Machine Learning
world Github Issues? In Proceedings of the Twelfth International
Research, pages 17719–17733, Vienna, Austria, July 2024. PMLR.
Conference on Learning Representations, ICLR ’24, Vienna, Austria,
[56] C. He, R. Luo, Y. Bai, S. Hu, Z. Thai, J. Shen, J. Hu, X. Han, May 2024.
Y. Huang, Y. Zhang, J. Liu, L. Qi, Z. Liu, and M. Sun. Olympiad- [72] W. Knight. OpenAI Unveils New A.I. That Can ‘Reason’ Through
Bench: A Challenging Benchmark for Promoting AGI with Math and Science Problems. https://www.nytimes.com/2024/1
Olympiad-Level Bilingual Multimodal Scientific Problems. In 2/20/technology/openai-new-ai-math-science.html, Dec. 2024.
L.-W. Ku, A. Martins, and V. Srikumar, editors, Proceedings of the accessed 2024-12-27.
62nd Annual Meeting of the Association for Computational Linguistics
[73] L. Kocsis and C. Szepesvári. Bandit Based Monte-Carlo Planning.
(Volume 1: Long Papers), ACL ’24, pages 3828–3850, Bangkok,
In J. Fürnkranz, T. Scheffer, and M. Spiliopoulou, editors, Pro-
Thailand, Aug. 2024. Association for Computational Linguistics.
ceedings of the European Conference on Machine Learning ECML ’06,
[57] D. Hendrycks, S. Basart, S. Kadavath, M. Mazeika, A. Arora, volume 4212 of Lecture Notes in Computer Science (LNAI), pages
E. Guo, C. Burns, S. Puranik, H. He, D. Song, and J. Steinhardt. 282–293, Berlin, Germany, Sept. 2006. Springer.
Measuring Coding Challenge Competence with APPS. In J. Van- [74] K. Kondo, S. Sugawara, and A. Aizawa. Probing Physical Reason-
schoren and S. Yeung, editors, Proceedings of the Thirty-fifth Neural ing with Counter-Commonsense Context. In A. Rogers, J. Boyd-
Information Processing Systems: Track on Datasets and Benchmarks, Graber, and N. Okazaki, editors, Proceedings of the 61st Annual
volume 1 of NeurIPS ’21, Virtual Event, Dec. 2021. Meeting of the Association for Computational Linguistics (Volume 2:
[58] D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, Short Papers), ACL ’23, pages 603–612, Toronto, Canada, July 2023.
and J. Steinhardt. Measuring Massive Multitask Language Un- Association for Computational Linguistics.
derstanding. In Proceedings of the Ninth International Conference on [75] W. Kryscinski, B. McCann, C. Xiong, and R. Socher. Evaluating
Learning Representations, ICLR ’21, Virtual Event, May 2021. the Factual Consistency of Abstractive Text Summarization. In
[59] D. Hendrycks, C. Burns, S. Kadavath, A. Arora, S. Basart, E. Tang, B. Webber, T. Cohn, Y. He, and Y. Liu, editors, Proceedings of
D. Song, and J. Steinhardt. Measuring Mathematical Problem the 2020 Conference on Empirical Methods in Natural Language
Solving with the MATH Dataset. In J. Vanschoren and S. Yeung, Processing, EMNLP ’20, pages 9332–9346, Virtual Event, Nov.
editors, Proceedings of the Thirty-fifth Conference on Neural Informa- 2020. Association for Computational Linguistics.
tion Processing Systems: Track on Datasets and Benchmarks, NeurIPS [76] Y. Lai, C. Li, Y. Wang, T. Zhang, R. Zhong, L. Zettlemoyer, W.-T.
’21, Virtual Event, Dec. 2021. Yih, D. Fried, S. Wang, and T. Yu. DS-1000: A Natural and Reliable
[60] A. Holtzman, J. Buys, L. Du, M. Forbes, and Y. Choi. The Benchmark for Data Science Code Generation. In A. Krause,
Curious Case of Neural Text Degeneration. In Proceedings of the E. Brunskill, K. Cho, B. Engelhardt, S. Sabato, and J. Scarlett,
Eighth International Conference on Learning Representations, ICLR editors, Proceedings of the 40th International Conference on Machine
’20, Virtual Event, Apr. 2020. Learning, volume 202 of Proceedings of Machine Learning Research,
[61] M. J. Hosseini, H. Hajishirzi, O. Etzioni, and N. Kushman. Learn- pages 18319–18345, Honolulu, HI, USA, July 2023. PMLR.
ing to Solve Arithmetic Word Problems with Verb Categorization. [77] Y. Leviathan, M. Kalman, and Y. Matias. Fast Inference from
In A. Moschitti, B. Pang, and W. Daelemans, editors, Proceedings Transformers via Speculative Decoding. In A. Krause, E. Brun-
of the 2014 Conference on Empirical Methods in Natural Language skill, K. Cho, B. Engelhardt, S. Sabato, and J. Scarlett, editors,
Proceedings of the 40th International Conference on Machine Learning Reasoning in Language Models by Automated Process Supervi-
(ICML ’23), volume 202 of Proceedings of Machine Learning Re- sion, Dec. 2024. arXiv:2406.06592.
search, pages 19274–19286, Honolulu, HI, USA, July 2023. PMLR. [92] M. Luo, S. Kumbhar, M. shen, M. Parmar, N. Varshney, P. Baner-
[78] P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, jee, S. Aditya, and C. Baral. Towards LogiGLUE: A Brief Survey
H. Küttler, M. Lewis, W.-t. Yih, T. Rocktäschel, S. Riedel, and and a Benchmark for Analyzing Logical Reasoning Capabilities
D. Kiela. Retrieval-Augmented Generation for Knowledge- of Language Models, Mar. 2024. arXiv:2310.00836.
Intensive NLP Tasks. In H. Larochelle, M. Ranzato, R. Had- [93] Y. Lyu, Z. Li, S. Niu, F. Xiong, B. Tang, W. Wang, H. Wu, H. Liu,
sell, M. Balcan, and H. Lin, editors, Proceedings of the Thirty- T. Xu, and E. Chen. CRUD-RAG: A Comprehensive Chinese
fourth Annual Conference on Neural Information Processing Systems Benchmark for Retrieval-Augmented Generation of Large Lan-
(NeurIPS ’20), volume 33 of Advances in Neural Information Process- guage Models, July 2024. arXiv:2401.17043.
ing Systems, pages 9459–9474, Virtual Event, Dec. 2020. Curran [94] A. Madaan, N. Tandon, P. Gupta, S. Hallinan, L. Gao, S. Wiegr-
Associates. effe, U. Alon, N. Dziri, S. Prabhumoye, Y. Yang, S. Gupta, B. Ma-
[79] X. Li, G. Dong, J. Jin, Y. Zhang, Y. Zhou, Y. Zhu, P. Zhang, and jumder, K. Hermann, S. Welleck, A. Yazdanbakhsh, and P. Clark.
Z. Dou. Search-o1: Agentic Search-Enhanced Large Reasoning Self-Refine: Iterative Refinement with Self-Feedback, May 2023.
Models, Jan. 2025. arXiv:2501.05366. arXiv:2303.17651.
[80] X. L. Li, A. Holtzman, D. Fried, P. Liang, J. Eisner, T. Hashimoto, [95] F. Mai, N. Cornille, and M.-F. Moens. Improving Language Mod-
L. Zettlemoyer, and M. Lewis. Contrastive Decoding: Open- eling by Increasing Test-time Planning Compute. In Proceedings of
ended Text Generation as Optimization. In A. Rogers, J. Boyd- the Eighth Widening NLP Workshop, WiNLP ’24, Miami, FL, USA,
Graber, and N. Okazaki, editors, Proceedings of the 61st Annual Nov. 2024.
Meeting of the Association for Computational Linguistics (Volume 1: [96] A. Malinin and M. Gales. Uncertainty Estimation in Autoregres-
Long Papers), ACL ’23, pages 12286–12312, Toronto, Canada, July sive Structured Prediction. In Proceedings of the Ninth International
2023. Association for Computational Linguistics. Conference on Learning Representations, ICLR ’21, Virtual Event,
[81] X. L. Li and P. Liang. Prefix-Tuning: Optimizing Continuous May 2021.
Prompts for Generation. In C. Zong, F. Xia, W. Li, and R. Navigli, [97] R. Manvi, A. Singh, and S. Ermon. Adaptive Inference-Time
editors, Proceedings of the 59th Annual Meeting of the Association Compute: LLMs Can Predict If They Can Do Better, Even Mid-
for Computational Linguistics and the 11th International Joint Con- Generation, Oct. 2024. arXiv:2410.02725.
ference on Natural Language Processing (Volume 1: Long Papers), [98] Y. Mao, Y. Kim, and Y. Zhou. CHAMP: A Competition-level
ACL-IJCNLP ’21, pages 4582–4597, Virtual Event, Aug. 2021. Dataset for Fine-Grained Analyses of LLMs’ Mathematical Rea-
Association for Computational Linguistics. soning Capabilities. In L.-W. Ku, A. Martins, and V. Srikumar,
[82] M. Liao, W. Luo, C. Li, J. Wu, and K. Fan. MARIO: MAth Rea- editors, Findings of the Association for Computational Linguistics:
soning with code Interpreter Output – A Reproducible Pipeline, ACL 2024, pages 13256–13274, Bangkok, Thailand, Aug. 2024.
Feb. 2024. arXiv:2401.08190. Association for Computational Linguistics.
[83] H. Lightman, V. Kosaraju, Y. Burda, H. Edwards, B. Baker, T. Lee, [99] A. Masry, X. L. Do, J. Q. Tan, S. Joty, and E. Hoque. ChartQA:
J. Leike, J. Schulman, I. Sutskever, and K. Cobbe. Let’s Verify A Benchmark for Question Answering about Charts with Visual
Step by Step. In Proceedings of the Twelfth International Conference and Logical Reasoning. In S. Muresan, P. Nakov, and A. Villav-
on Learning Representations, ICLR ’24, Vienna, Austria, May 2024. icencio, editors, Findings of the Association for Computational Lin-
[84] A. Liu, S. Swayamdipta, N. A. Smith, and Y. Choi. WANLI: guistics: ACL 2022, pages 2263–2279, Dublin, Ireland, May 2022.
Worker and AI Collaboration for Natural Language Inference Association for Computational Linguistics.
Dataset Creation. In Y. Goldberg, Z. Kozareva, and Y. Zhang, [100] C. Metz. In Two Moves, AlphaGo and Lee Sedol Redefined the
editors, Findings of the Association for Computational Linguistics: Future. https://www.wired.com/2016/03/two-moves-alphago
EMNLP 2022, pages 6826–6847, Abu Dhabi, United Arab Emi- -lee-sedol-redefined-future/, Mar. 2016. Wired.
rates, Dec. 2022. Association for Computational Linguistics. [101] X. Miao, C. Shi, J. Duan, X. Xi, D. Lin, B. Cui, and Z. Jia.
[85] C. Liu, J. Shen, H. Xin, Z. Liu, Y. Yuan, H. Wang, W. Ju, C. Zheng, SpotServe: Serving Generative Large Language Models on Pre-
Y. Yin, L. Li, M. Zhang, and Q. Liu. FIMO: A Challenge emptible Instances. In Proceedings of the 29th ACM International
Formal Dataset for Automated Theorem Proving, Dec. 2023. Conference on Architectural Support for Programming Languages and
arXiv:2309.04295. Operating Systems, Volume 2, ASPLOS ’24, pages 1112–1127, La
[86] X. Liu, H. Yu, H. Zhang, Y. Xu, X. Lei, H. Lai, Y. Gu, H. Ding, Jolla, CA, USA, Apr. 2024. Association for Computing Machinery.
K. Men, K. Yang, S. Zhang, X. Deng, A. Zeng, Z. Du, C. Zhang, [102] T. Mihaylov, P. Clark, T. Khot, and A. Sabharwal. Can a Suit
S. Shen, T. Zhang, Y. Su, H. Sun, M. Huang, Y. Dong, and of Armor Conduct Electricity? A New Dataset for Open Book
J. Tang. AgentBench: Evaluating LLMs as Agents, Oct. 2023. Question Answering. In E. Riloff, D. Chiang, J. Hockenmaier,
arXiv:2308.03688. and J. Tsujii, editors, Proceedings of the 2018 Conference on Empirical
[87] P. Lu, H. Bansal, T. Xia, J. Liu, C. Li, H. Hajishirzi, H. Cheng, K.- Methods in Natural Language Processing, EMNLP ’18, pages 2381–
W. Chang, M. Galley, and J. Gao. MathVista: Evaluating Math- 2391, Brussels, Belgium, Nov. 2018. Association for Computa-
ematical Reasoning of Foundation Models in Visual Contexts. tional Linguistics.
In Proceedings of the Twelfth International Conference on Learning [103] I. Mirzadeh, K. Alizadeh, H. Shahrokhi, O. Tuzel, S. Bengio, and
Representations, ICLR ’24, Vienna, Austria, May 2024. M. Farajtabar. GSM-Symbolic: Understanding the Limitations of
[88] P. Lu, R. Gong, S. Jiang, L. Qiu, S. Huang, X. Liang, and S.-C. Zhu. Mathematical Reasoning in Large Language Models, Oct. 2024.
Inter-GPS: Interpretable Geometry Problem Solving with Formal arXiv:2410.05229.
Language and Symbolic Reasoning. In C. Zong, F. Xia, W. Li, and [104] J. M. Mooij, J. Peters, D. Janzing, J. Zscheischler, and B. Schölkopf.
R. Navigli, editors, Proceedings of the 59th Annual Meeting of the Distinguishing Cause from Effect Using Observational Data:
Association for Computational Linguistics and the 11th International Methods and Benchmarks. Journal of Machine Learning Research,
Joint Conference on Natural Language Processing (Volume 1: Long 17(32):1–102, 2016.
Papers), ACL-IJCNLP ’21, pages 6774–6786, Virtual Event, Aug. [105] P. Moritz, R. Nishihara, S. Wang, A. Tumanov, R. Liaw, E. Liang,
2021. Association for Computational Linguistics. M. Elibol, Z. Yang, W. Paul, M. I. Jordan, and I. Stoica. Ray:
[89] P. Lu, L. Qiu, K.-W. Chang, Y. N. Wu, S.-C. Zhu, T. Rajpuro- A Distributed Framework for Emerging AI Applications. In
hit, P. Clark, and A. Kalyan. Dynamic Prompt Learning via Proceedings of the 13th USENIX Symposium on Operating Systems
Policy Gradient for Semi-Structured Mathematical Reasoning. Design and Implementation, OSDI ’18, pages 561–577, Carlsbad,
In Proceedings of the Eleventh International Conference on Learning CA, Oct. 2018. USENIX Association.
Representations, ICLR ’23, Kigali, Rwanda, May 2023. [106] Y. Nie, A. Williams, E. Dinan, M. Bansal, J. Weston, and D. Kiela.
[90] P. Lu, L. Qiu, W. Yu, S. Welleck, and K.-W. Chang. A Survey Adversarial NLI: A New Benchmark for Natural Language Un-
of Deep Learning for Mathematical Reasoning. In A. Rogers, derstanding. In D. Jurafsky, J. Chai, N. Schluter, and J. Tetreault,
J. Boyd-Graber, and N. Okazaki, editors, Proceedings of the 61st editors, Proceedings of the 58th Annual Meeting of the Association
Annual Meeting of the Association for Computational Linguistics for Computational Linguistics, ACL ’20, pages 4885–4901, Virtual
(Volume 1: Long Papers), ACL ’23, pages 14605–14631, Toronto, Event, July 2020. Association for Computational Linguistics.
Canada, July 2023. Association for Computational Linguistics. [107] T. Niven and H.-Y. Kao. Probing Neural Network Comprehen-
[91] L. Luo, Y. Liu, R. Liu, S. Phatale, M. Guo, H. Lara, Y. Li, L. Shu, sion of Natural Language Arguments. In A. Korhonen, D. Traum,
Y. Zhu, L. Meng, J. Sun, and A. Rastogi. Improve Mathematical and L. Màrquez, editors, Proceedings of the 57th Annual Meeting of
42
the Association for Computational Linguistics, ACL ’19, pages 4658– vironments for Interactive Learning. In Proceedings of the Inter-
4664, Florence, Italy, July 2019. Association for Computational national Conference on Learning Representations, ICLR ’21, Virtual
Linguistics. Event, May 2021.
[108] OpenAI. Introducing ChatGPT. https://openai.com/index/cha [127] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den
tgpt/, Nov. 2022. accessed 2024-12-27. Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam,
[109] OpenAI. Hello GPT-4o. https://openai.com/index/hello-gpt-4 M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner,
o/, May 2024. accessed 2025-01-01. I. Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel,
[110] OpenAI. Introducing OpenAI o1. https://openai.com/o1/, and D. Hassabis. Mastering the Game of Go With Deep Neural
2024. accessed 2024-12-27. Networks and Tree Search. Nature, 529:484–489, Jan. 2016.
[111] R. Y. Pang, W. Yuan, H. He, K. Cho, S. Sukhbaatar, and J. E. [128] D. Silver, T. Hubert, J. Schrittwieser, I. Antonoglou, M. Lai,
Weston. Iterative Reasoning Preference Optimization. In Proceed- A. Guez, M. Lanctot, L. Sifre, D. Kumaran, T. Graepel, T. Lillicrap,
ings of the Thirty-eighth Annual Conference on Neural Information K. Simonyan, , and D. Hassabis. A General Reinforcement
Processing Systems (NeurIPS ’24), volume 37 of Advances in Neural Learning Algorithm that Masters Chess, Shogi, and Go Through
Information Processing Systems, Vancouver, Canada, Dec. 2024. Self-Play. Science, 362(6419):1140–1144, Dec. 2018.
Curran Associates. [129] D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang,
[112] S. Qiao, Y. Ou, N. Zhang, X. Chen, Y. Yao, S. Deng, C. Tan, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton, Y. Chen,
F. Huang, and H. Chen. Reasoning with Language Model T. Lillicrap, F. Hui, L. Sifre, G. van den Driessche, T. Graepel,
Prompting: A Survey. In A. Rogers, J. Boyd-Graber, and and D. Hassabis. Mastering the Game of Go without Human
N. Okazaki, editors, Proceedings of the 61st Annual Meeting of the Knowledge. Nature, 550:354–359, Oct. 2017.
Association for Computational Linguistics (Volume 1: Long Papers), [130] K. Sinha, S. Sodhani, J. Dong, J. Pineau, and W. L. Hamilton.
ACL ’23, pages 5368–5393, Toronto, Canada, July 2023. Associa- CLUTRR: A Diagnostic Benchmark for Inductive Reasoning from
tion for Computational Linguistics. Text. In K. Inui, J. Jiang, V. Ng, and X. Wan, editors, Proceedings
[113] Y. Qin, X. Li, H. Zou, Y. Liu, S. Xia, Z. Huang, Y. Ye, W. Yuan, of the 2019 Conference on Empirical Methods in Natural Language
H. Liu, Y. Li, and P. Liu. O1 Replication Journey: A Strategic Processing and the 9th International Joint Conference on Natural
Progress Report – Part 1, Oct. 2024. arXiv:2410.18982. Language Processing, EMNLP-IJCNLP ’19, pages 4506–4515, Hong
[114] Y. Qu, T. Zhang, N. Garg, and A. Kumar. Recursive Introspection: Kong, China, Nov. 2019. Association for Computational Linguis-
Teaching Language Model Agents How to Self-Improve, July tics.
2024. arXiv:2407.18219. [131] C. Snell, J. Lee, K. Xu, and A. Kumar. Scaling LLM Test-Time
[115] R. Rafailov, A. Sharma, E. Mitchell, C. D. Manning, S. Ermon, and Compute Optimally Can be More Effective than Scaling Model
C. Finn. Direct Preference Optimization: Your Language Model is Parameters, Aug. 2024. arXiv:2408.03314.
Secretly a Reward Model. In A. Oh, T. Naumann, A. Globerson, [132] A. Srivastava, A. Rastogi, A. Rao, A. A. M. Shoeb, A. Abid,
K. Saenko, M. Hardt, and S. Levine, editors, Proceedings of the A. Fisch, A. R. Brown, A. Santoro, A. Gupta, A. Garriga-
Thirty-seventh Annual Conference on Neural Information Processing Alonso, et al. Beyond the Imitation Game: Quantifying and
Systems (NeurIPS ’23), volume 36 of Advances in Neural Information Extrapolating the Capabilities of Language Models, June 2023.
Processing Systems, pages 53728–53741, New Orleans, LA, USA, arXiv:2206.04615.
Dec. 2023. Curran Associates.
[133] S. Srivastava, A. M. B, A. P. V, S. Menon, A. Sukumar, A. S. T,
[116] D. Rein, B. L. Hou, A. C. Stickland, J. Petty, R. Y. Pang, J. Dirani,
A. Philipose, S. Prince, and S. Thomas. Functional Benchmarks
J. Michael, and S. R. Bowman. GPQA: A Graduate-Level Google-
for Robust Evaluation of Reasoning Performance, and the Rea-
Proof Q&A Benchmark, Nov. 2023. arXiv:2311.12022.
soning Gap, Feb. 2024. arXiv:2402.19450.
[117] C. D. Rosin. Multi-Armed Bandits with Episode Context. Annals
[134] N. Stiennon, L. Ouyang, J. Wu, D. Ziegler, R. Lowe, C. Voss,
of Mathematics and Artificial Intelligence, 61(3):203–230, Mar. 2011.
A. Radford, D. Amodei, and P. F. Christiano. Learning to
[118] S. Roy and D. Roth. Solving General Arithmetic Word Problems.
Summarize with Human Feedback. In H. Larochelle, M. Ranzato,
In L. Màrquez, C. Callison-Burch, and J. Su, editors, Proceedings
R. Hadsell, M. Balcan, and H. Lin, editors, Proceedings of the
of the 2015 Conference on Empirical Methods in Natural Language
Thirty-fourth Annual Conference on Neural Information Processing
Processing, EMNLP ’15, pages 1743–1752, Lisbon, Portugal, Sept.
Systems (NeurIPS ’20), volume 33 of Advances in Neural Information
2015. Association for Computational Linguistics.
Processing Systems, pages 3008–3021, Virtual Event, Dec. 2020.
[119] K. Sakaguchi, R. Le Bras, C. Bhagavatula, and Y. Choi. Wino- Curran Associates.
Grande: An Adversarial Winograd Schema Challenge at Scale.
Proceedings of the AAAI Conference on Artificial Intelligence, [135] J. Sun, C. Zheng, E. Xie, Z. Liu, R. Chu, J. Qiu, J. Xu, M. Ding,
34(05):8732–8740, Apr. 2020. H. Li, M. Geng, et al. A Survey of Reasoning with Foundation
Models, Jan. 2024. arXiv:2312.11562.
[120] M. Sap, H. Rashkin, D. Chen, R. Le Bras, and Y. Choi. Social IQa:
Commonsense Reasoning about Social Interactions. In K. Inui, [136] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduc-
J. Jiang, V. Ng, and X. Wan, editors, Proceedings of the 2019 tion. MIT Press, 2015.
Conference on Empirical Methods in Natural Language Processing and [137] O. Tafjord, B. Dalvi, and P. Clark. ProofWriter: Generating
the 9th International Joint Conference on Natural Language Processing, Implications, Proofs, and Abductive Statements over Natural
EMNLP-IJCNLP ’19, pages 4463–4473, Hong Kong, China, Nov. Language. In C. Zong, F. Xia, W. Li, and R. Navigli, editors, Find-
2019. Association for Computational Linguistics. ings of the Association for Computational Linguistics: ACL-IJCNLP
[121] A. Saparov and H. He. Language Models Are Greedy Rea- 2021, pages 3621–3634, Virtual Event, Aug. 2021. Association for
soners: A Systematic Formal Analysis of Chain-of-Thought. In Computational Linguistics.
Proceedings of the Eleventh International Conference on Learning [138] A. Talmor, J. Herzig, N. Lourie, and J. Berant. Common-
Representations, ICLR ’23, Kigali, Rwanda, May 2023. senseQA: A Question Answering Challenge Targeting Common-
[122] W. Saunders, C. Yeh, J. Wu, S. Bills, L. Ouyang, J. Ward, and sense Knowledge. In J. Burstein, C. Doran, and T. Solorio, editors,
J. Leike. Self-Critiquing Models for Assisting Human Evaluators, Proceedings of the 2019 Conference of the North American Chapter
June 2022. arXiv:2206.05802. of the Association for Computational Linguistics: Human Language
[123] T. Sawada, D. Paleka, A. Havrilla, P. Tadepalli, P. Vidas, A. Kra- Technologies, Volume 1 (Long and Short Papers), NAACL ’19, pages
nias, J. J. Nay, K. Gupta, and A. Komatsuzaki. ARB: Advanced 4149–4158, Minneapolis, Minnesota, June 2019. Association for
Reasoning Benchmark for Large Language Models, July 2023. Computational Linguistics.
arXiv:2307.13692. [139] Z. Tang, X. Zhang, B. Wang, and F. Wei. MathScale: Scal-
[124] J. Schrittwieser, I. Antonoglou, T. Hubert, K. Simonyan, L. Sifre, ing Instruction Tuning for Mathematical Reasoning, Mar. 2024.
S. Schmitt, A. Guez, E. Lockhart, D. Hassabis, T. Graepel, T. Lil- arXiv:2403.02884.
licrap, and D. Silver. Mastering Atari, Go, Chess and Shogi by [140] Q. Team. QwQ: Reflect Deeply on the Boundaries of the Un-
Planning With a Learned Model. Nature, 588:604–609, Dec. 2020. known. https://qwenlm.github.io/blog/qwq-32b-preview/,
[125] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and Nov. 2024. accessed 2025-01-01.
O. Klimov. Proximal Policy Optimization Algorithms, Aug. 2017. [141] Y. Tian, B. Peng, L. Song, L. Jin, D. Yu, L. Han, H. Mi, and D. Yu.
arXiv:1707.06347. Toward Self-Improvement of LLMs via Imagination, Searching,
[126] M. Shridhar, X. Yuan, M.-A. Côté, Y. Bisk, A. Trischler, and and Criticizing. In Proceedings of the Thirty-eighth Annual Con-
M. Hausknecht. ALFWorld: Aligning Text and Embodied En- ference on Neural Information Processing Systems (NeurIPS ’24),
43
volume 37 of Advances in Neural Information Processing Systems, [157] J. Xiong, J. Shen, Y. Yuan, H. Wang, Y. Yin, Z. Liu, L. Li, Z. Guo,
Vancouver, Canada, Dec. 2024. Curran Associates. Q. Cao, Y. Huang, C. Zheng, X. Liang, M. Zhang, and Q. Liu.
[142] R. Tu, K. Zhang, B. Bertilson, H. Kjellstrom, and C. Zhang. Neu- TRIGO: Benchmarking Formal Mathematical Proof Reduction for
ropathic Pain Diagnosis Simulator for Causal Discovery Algo- Generative Language Models. In H. Bouamor, J. Pino, and K. Bali,
rithm Evaluation. In H. Wallach, H. Larochelle, A. Beygelzimer, editors, Proceedings of the 2023 Conference on Empirical Methods
F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Proceedings of in Natural Language Processing, EMNLP ’23, pages 11594–11632,
the Thirty-third Annual Conference on Neural Information Processing Singapore, Dec. 2023. Association for Computational Linguistics.
Systems (NeurIPS ’19), volume 32 of Advances in Neural Information [158] Y. Yan, J. Su, J. He, F. Fu, X. Zheng, Y. Lyu, K. Wang, S. Wang,
Processing Systems, pages 12793–12804, Vancouver, Canada, Dec. Q. Wen, and X. Hu. A Survey of Mathematical Reasoning in the
2019. Curran Associates. Era of Multimodal Large Language Model: Benchmark, Method
[143] J. Uesato, N. Kushman, R. Kumar, F. Song, N. Siegel, L. Wang, & Challenges, Dec. 2024. arXiv:2412.11936.
A. Creswell, G. Irving, and I. Higgins. Solving Math Word [159] K. Yang, A. Swope, A. Gu, R. Chalamala, P. Song, S. Yu, S. Godil,
Problems with Process-and Outcome-Based Feedback, Nov. 2022. R. J. Prenger, and A. Anandkumar. LeanDojo: Theorem Proving
arXiv:2211.14275. with Retrieval-Augmented Language Models. In A. Oh, T. Nau-
[144] A. Vijayakumar, M. Cogswell, R. Selvaraju, Q. Sun, S. Lee, mann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors,
D. Crandall, and D. Batra. Diverse Beam Search for Improved Proceedings of the Thirty-seventh Annual Conference on Neural Infor-
Description of Complex Scenes. Proceedings of the AAAI Conference mation Processing Systems (NeurIPS ’23), volume 36 of Advances
on Artificial Intelligence, 32(1):7371–7379, Apr. 2018. in Neural Information Processing Systems, pages 21573–21612, New
[145] J. Wang, M. Fang, Z. Wan, M. Wen, J. Zhu, A. Liu, Z. Gong, Orleans, LA, USA, Dec. 2023. Curran Associates.
Y. Song, L. Chen, L. M. Ni, L. Yang, Y. Wen, and W. Zhang. [160] S. Yao, H. Chen, J. Yang, and K. Narasimhan. WebShop: Towards
OpenR: An Open Source Framework for Advanced Reasoning Scalable Real-World Web Interaction with Grounded Language
with Large Language Models, Oct. 2024. arXiv:2410.09671. Agents. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave,
[146] K. Wang, H. Ren, A. Zhou, Z. Lu, S. Luo, W. Shi, R. Zhang, K. Cho, and A. Oh, editors, Proceedings of the Thirty-sixth Annual
L. Song, M. Zhan, and H. Li. MathCoder: Seamless Code Conference on Neural Information Processing Systems (NeurIPS ’22),
Integration in LLMs for Enhanced Mathematical Reasoning, Oct. volume 35 of Advances in Neural Information Processing Systems,
2023. arXiv:2310.03731. pages 20744–20757, New Orleans, LA, USA, Dec. 2022. Curran
[147] P. Wang, L. Li, Z. Shao, R. Xu, D. Dai, Y. Li, D. Chen, Y. Wu, Associates.
and Z. Sui. Math-Shepherd: Verify and Reinforce LLMs Step- [161] S. Yao, D. Yu, J. Zhao, I. Shafran, T. Griffiths, Y. Cao, and
by-Step without Human Annotations. In L.-W. Ku, A. Martins, K. Narasimhan. Tree of Thoughts: Deliberate Problem Solving
and V. Srikumar, editors, Proceedings of the 62nd Annual Meeting with Large Language Models. In A. Oh, T. Naumann, A. Glober-
of the Association for Computational Linguistics (Volume 1: Long son, K. Saenko, M. Hardt, and S. Levine, editors, Proceedings of the
Papers), ACL ’24, pages 9426–9439, Bangkok, Thailand, Aug. 2024. Thirty-seventh Annual Conference on Neural Information Processing
Association for Computational Linguistics. Systems (NeurIPS ’23), volume 36 of Advances in Neural Information
[148] X. Wang, Z. Hu, P. Lu, Y. Zhu, J. Zhang, S. Subramaniam, Processing Systems, pages 11809–11822, New Orleans, LA, USA,
A. Loomba, S. Zhang, Y. Sun, and W. Wang. SCIBENCH: Dec. 2023. Curran Associates.
Evaluating College-Level Scientific Problem-Solving Abilities of [162] N. Young, Q. Bao, J. Bensemann, and M. Witbrock. Abduction-
Large Language Models. In Proceedings of the 3rd Workshop on Rules: Training Transformers to Explain Unexpected Inputs. In
Mathematical Reasoning and AI, MATH-AI ’23, New Orleans, LA, S. Muresan, P. Nakov, and A. Villavicencio, editors, Findings of
USA, Dec. 2023. the Association for Computational Linguistics: ACL 2022, pages 218–
[149] X. Wang, L. Song, Y. Tian, D. Yu, B. Peng, H. Mi, F. Huang, 227, Dublin, Ireland, May 2022. Association for Computational
and D. Yu. Towards Self-Improvement of LLMs via MCTS: Linguistics.
Leveraging Stepwise Knowledge with Curriculum Preference [163] L. Yuan, W. Li, H. Chen, G. Cui, N. Ding, K. Zhang, B. Zhou,
Learning, Oct. 2024. arXiv:2410.06508. Z. Liu, and H. Peng. Free Process Rewards without Process
[150] X. Wang, J. Wei, D. Schuurmans, Q. V. Le, E. H. Chi, S. Narang, Labels, Dec. 2024. arXiv:2412.01981.
A. Chowdhery, and D. Zhou. Self-Consistency Improves Chain [164] Z. Yuan, H. Yuan, C. Tan, W. Wang, and S. Huang. How Well
of Thought Reasoning in Language Models. In Proceedings of the Do Large Language Models Perform in Arithmetic Tasks?, Mar.
Eleventh International Conference on Learning Representations, ICLR 2023. arXiv:2304.02015.
’23, Kigali, Rwanda, May 2023. [165] R. Zellers, Y. Bisk, R. Schwartz, and Y. Choi. SWAG: A Large-Scale
[151] Z. Wang, S. Zhou, D. Fried, and G. Neubig. Execution- Adversarial Dataset for Grounded Commonsense Inference. In
Based Evaluation for Open-Domain Code Generation, May 2023. E. Riloff, D. Chiang, J. Hockenmaier, and J. Tsujii, editors, Proceed-
arXiv:2212.10481. ings of the 2018 Conference on Empirical Methods in Natural Language
[152] J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, Processing, EMNLP ’18, pages 93–104, Brussels, Belgium, Nov.
E. Chi, Q. V. Le, and D. Zhou. Chain-of-Thought Prompting 2018. Association for Computational Linguistics.
Elicits Reasoning in Large Language Models. In S. Koyejo, [166] R. Zellers, A. Holtzman, Y. Bisk, A. Farhadi, and Y. Choi.
S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, HellaSwag: Can a Machine Really Finish Your Sentence? In
editors, Proceedings of the Thirty-sixth Annual Conference on Neu- A. Korhonen, D. Traum, and L. Màrquez, editors, Proceedings
ral Information Processing Systems (NeurIPS ’22), volume 35 of of the 57th Annual Meeting of the Association for Computational
Advances in Neural Information Processing Systems, pages 24824– Linguistics, ACL ’19, pages 4791–4800, Florence, Italy, July 2019.
24837, New Orleans, LA, USA, Dec. 2022. Curran Associates. Association for Computational Linguistics.
[153] P. Wiesner, I. Behnke, D. Scheinert, K. Gontarska, and L. Tham- [167] Z. Zeng, Q. Cheng, Z. Yin, B. Wang, S. Li, Y. Zhou, Q. Guo,
sen. Let’s Wait Awhile: How Temporal Workload Shifting Can X. Huang, and X. Qiu. Scaling of Search and Learning: A
Reduce Carbon Emissions in the Cloud. In Proceedings of the Roadmap to Reproduce o1 from Reinforcement Learning Per-
22nd International Middleware Conference, Middleware ’21, pages spective, Dec. 2024. arXiv:2412.14135.
260–272, Virtual Event, Dec. 2021. Association for Computing [168] D. Zhang, X. Huang, D. Zhou, Y. Li, and W. Ouyang. Accessing
Machinery. GPT-4 Level Mathematical Olympiad Solutions via Monte Carlo
[154] Z. Xi, Y. Ding, W. Chen, B. Hong, H. Guo, J. Wang, D. Yang, Tree Self-Refine with LLaMa-3 8B, June 2024. arXiv:2406.07394.
C. Liao, X. Guo, W. He, S. Gao, L. Chen, R. Zheng, Y. Zou, T. Gui, [169] D. Zhang, J. Wu, J. Lei, T. Che, J. Li, T. Xie, X. Huang, S. Zhang,
Q. Zhang, X. Qiu, X. Huang, Z. Wu, and Y.-G. Jiang. AgentGym: M. Pavone, Y. Li, W. Ouyang, and D. Zhou. LLaMA-Berry:
Evolving Large Language Model-based Agents across Diverse Pairwise Optimization for O1-like Olympiad-Level Mathematical
Environments, June 2024. arXiv:2406.04151. Reasoning, Nov. 2024. arXiv:2410.02884.
[155] Y. Xie, A. Goyal, W. Zheng, M.-Y. Kan, T. P. Lillicrap, [170] D. Zhang, S. Zhoubian, Z. Hu, Y. Yue, Y. Dong, and J. Tang.
K. Kawaguchi, and M. Shieh. Monte Carlo Tree Search ReST-MCTS*: LLM Self-Training via Process Reward Guided Tree
Boosts Reasoning via Iterative Preference Learning, June 2024. Search. In Proceedings of the Thirty-eighth Annual Conference on
arXiv:2405.00451. Neural Information Processing Systems (NeurIPS ’24), volume 37
[156] G. Xiong, Q. Jin, Z. Lu, and A. Zhang. Benchmark- of Advances in Neural Information Processing Systems, Vancouver,
ing Retrieval-Augmented Generation for Medicine, Feb. 2024. Canada, Dec. 2024. Curran Associates.
arXiv:2402.13178. [171] L. Zhang, A. Hosseini, H. Bansal, M. Kazemi, A. Kumar, and
44