
Reasoning Language Models: A Blueprint


Maciej Besta1†, Julia Barth1, Eric Schreiber1, Ales Kubicek1, Afonso Catarino1, Robert Gerstenberger1, Piotr Nyczyk2, Patrick Iff1, Yueling Li3, Sam Houliston1, Tomasz Sternal1, Marcin Copik1, Grzegorz Kwaśniewski1, Jürgen Müller3, Łukasz Flis4, Hannes Eberhard1, Hubert Niewiadomski2, Torsten Hoefler1

† Corresponding author; 1 ETH Zurich; 2 Cledar; 3 BASF SE; 4 Cyfronet AGH

arXiv:2501.11223v2 [cs.AI] 22 Jan 2025

Abstract—Reasoning language models (RLMs), also known as Large Reasoning Models (LRMs), such as OpenAI’s o1 and o3, DeepSeek-V3, and Alibaba’s QwQ, have redefined AI’s problem-solving capabilities by extending large language models (LLMs) with advanced reasoning mechanisms. Yet, their high costs, proprietary nature, and complex architectures—uniquely combining Reinforcement Learning (RL), search heuristics, and LLMs—present accessibility and scalability challenges. To address these, we propose a comprehensive blueprint that organizes RLM components into a modular framework, based on a survey and analysis of all RLM works. This blueprint incorporates diverse reasoning structures (chains, trees, graphs, and nested forms), reasoning strategies (e.g., Monte Carlo Tree Search, Beam Search), RL concepts (policy, value models and others), supervision schemes (Outcome-Based and Process-Based Supervision), and other related concepts (e.g., Test-Time Compute, Retrieval-Augmented Generation, agent tools). We also provide detailed mathematical formulations and algorithmic specifications to simplify RLM implementation. By showing how schemes like LLaMA-Berry, QwQ, Journey Learning, and Graph of Thoughts fit as special cases, we demonstrate the blueprint’s versatility and unifying potential. To illustrate its utility, we introduce x1, a modular implementation for rapid RLM prototyping and experimentation. Using x1 and a literature review, we provide key insights, such as multi-phase training for policy and value models, and the importance of familiar training distributions. Finally, we discuss scalable RLM cloud deployments and we outline how RLMs can integrate with a broader LLM ecosystem. Our work demystifies RLM construction, democratizes advanced reasoning capabilities, and fosters innovation, aiming to mitigate the gap between “rich AI” and “poor AI” by lowering barriers to RLM development and experimentation.

Index Terms—Reasoning Language Model, Large Reasoning Model, Survey of Reasoning Language Models, Survey of RLMs, RLM,
LRM, Reasoning LLMs, Reinforcement Learning for LLMs, MCTS for LLMs, Large Language Model, LLM, Generative AI.

1 INTRODUCTION

Reasoning Language Models (RLMs), such as OpenAI’s o1 [110], o3 [72], and Alibaba’s QwQ [140], also referred to as Large Reasoning Models (LRMs)¹, represent a transformative breakthrough in AI, on par with the advent of ChatGPT [108]. These advanced systems have fundamentally redefined AI’s problem-solving capabilities, enabling nuanced reasoning, improved contextual understanding, and robust decision-making across a wide array of domains. By extending the capabilities of standard large language models (LLMs) with sophisticated reasoning mechanisms, RLMs have emerged as the new cornerstone of cutting-edge AI, bringing us closer to AGI.

However, the high cost and proprietary nature of state-of-the-art RLMs, such as those developed by OpenAI, risk exacerbating the divide between “rich AI” and “poor AI”, raising significant concerns about accessibility and equity. Even the publicly available QwQ only comes with its model weights, and Alibaba does not disclose details about their training or data generation methodologies. Businesses and individuals unable to afford these advanced systems face a growing disadvantage, threatening to stifle innovation and reinforce systemic inequities. As RLMs become integral to critical applications, from healthcare to science, management, and beyond, it is imperative to address these disparities and ensure that the benefits of advanced reasoning capabilities are broadly accessible.

¹ We use the term “Reasoning Language Model” instead of “Large Reasoning Model” because the latter implies that such models are always large. This does not necessarily have to be the case – as a matter of fact, smaller RLMs can outperform larger LLMs [52].
Fig. 1: Summary of the contributions made by this paper. The x1 framework can be found at https://github.com/spcl/x1
Fig. 2: The history of RLMs. This class of models has been the result of the development of three lines of work: (1) Reinforcement Learning-based models such as
AlphaZero [128], (2) LLM and Transformer based models such as GPT-4o [109], and (3) the continuous growth of compute power and data processing capabilities of
supercomputers and high performance systems.

The technical foundations of RLMs remain opaque and complex, compounding the accessibility challenge. Emerging analyses suggest that their design likely integrates elements such as Monte Carlo Tree Search (MCTS) or Beam Search, reinforcement learning (RL), process-based supervision (PBS) [83], [143], and advanced in-context learning (ICL) techniques like Chain-of-Thought (CoT) [152] or Tree of Thoughts (ToT) [161], and possibly even retrieval-augmented generation (RAG) [13], [53], [78], [79].

Additionally, these architectures employ multiple specialized subcomponents—such as synthetic data generation engines and policy, value, and reward models—trained through some form of novel loss functions and possibly several fine-tuning schemes. However, the intricate interplay of these components and their integration into a cohesive and effective architecture remains poorly understood. Here, the “holy-grail question” is: what is the detailed design of an RLM and how to make it simultaneously achieve effectiveness (i.e., high accuracy in delivered answers), low cost, and scalability?

To help answer this question and to address the above challenges, we propose a comprehensive blueprint for constructing, analyzing, and experimenting with RLMs (contribution #1; a roadmap of all the contributions and the paper is in Figure 1). Our approach identifies and crystallizes the fundamental building blocks of RLMs, organizing them into a cohesive framework. This blueprint is presented with increasing levels of granularity, starting from a high-level overview and finishing at low-level details that can be directly harnessed when implementing. Further, to maximize clarity and comprehensiveness, we present the blueprint using three perspectives: (1) architecture diagrams and descriptions, (2) detailed mathematical formulations, and (3) in-depth algorithmic specifications. By employing these complementary perspectives, we aim to provide a clear and actionable guide for developing RLMs tailored to specific applications, settings, and constraints.

Our blueprint comprehensively encompasses the potential building blocks of RLMs, offering a flexible and modular framework. It incorporates a variety of reasoning structures, such as chains, trees, graphs, and even higher-order structures such as hierarchical (or nested) trees, along with numerous operations that transform and advance the reasoning process. The blueprint supports different granularities of reasoning steps, ranging from individual tokens to full sentences or structured segments. Additionally, it enables diverse training schemes, including Outcome-Based Supervision (OBS) and PBS, and the related Outcome & Process Reward Models (ORMs & PRMs). Next, in order to illustrate the capability of the blueprint to accommodate novel design ideas, we describe several novel schemes and how they fit within the blueprint. One such example is Trace-Based Supervision (TBS), which extends PBS by incorporating labeled traces of traversal paths through entire reasoning structures, rather than just linear chains of reasoning steps. By unifying all these components, our blueprint serves as a versatile toolbox for constructing RLMs—ranging from simple models to sophisticated designs—tailored to specific reasoning tasks and performance objectives.
Fig. 3: Hierarchy of language models (right) and the three pillars of RLMs (left).

We conduct a broad analysis of existing reasoning schemes (contribution #2), demonstrating how they fit into our blueprint as special cases. This analysis encompasses not only standard MCTS and reinforcement learning-based designs, such as LLaMA-Berry [169], but also models like QwQ [140]. Additionally, we include paradigms diverging from standard MCTS, such as Journey Learning [113] or Beam Search, which redefines reasoning through implicit long-chain structures, and advanced structured prompting techniques like CoT [152], ToT [161], and Graph of Thoughts [9]. We also consider reasoning utilities such as Retrieval-Augmented Generation (RAG) and data stores, tools, and others. By mapping these diverse approaches to one blueprint, we showcase its versatility and expressive power, highlighting its ability to unify a wide range of reasoning methodologies within a coherent framework.

To demonstrate the utility of our framework, we introduce x1, a modular and user-friendly implementation² designed to simplify the process of developing and experimenting with new RLM architectures, covering not only training and inference, but also synthetic data generation (contribution #3). We design x1 to facilitate various optimizations, design decisions, and overall scalability, such as batch processing, making it a well-suited foundation for experimentation infrastructure. We also discuss key aspects of deployment in cloud environments, ensuring that x1 can be seamlessly integrated into modern infrastructure for both research and production use cases.

By providing both theoretical insights and practical tools, this work aims to democratize access to advanced RLMs, enabling researchers and practitioners to design, train, and deploy sophisticated reasoning models with reduced complexity and cost. Our blueprint offers a clear and adaptable framework that lowers the barriers to entry, fostering broader experimentation and innovation. Additionally, the modular implementation of x1 serves as a foundation for rapid prototyping and large-scale experimentation, empowering users to explore new reasoning paradigms and optimize performance across diverse applications. By bridging the gap between conceptual advancements and practical implementations, this work seeks to accelerate progress in the field, unlock new possibilities for intelligent systems across research, industry, and education, and to mitigate the risk of the growing gap between “rich AI” and “poor AI”.

² https://github.com/spcl/x1

2 EVOLUTION & FOUNDATIONS OF RLMs

We first summarize the evolution and foundations of reasoning language models. Figure 2 shows an overview of the history of the development of these models.

2.1 Basic Pillars of Reasoning LMs

The development of reasoning-capable LLMs represents a convergence of three critical threads: (1) advances in LLMs such as GPT-4, (2) RL designs such as AlphaZero, and (3) High-Performance Computing (HPC) resources. Together, these threads have shaped models capable of efficient System 2 Thinking – a level of reasoning that combines explicit deliberation with novel problem-solving abilities, distinct from the intuitive, fast, and automatic heuristics of System 1 Thinking. Figure 2 compares example designs in these pillars while Figure 3 (left side) further discusses the details of these pillars.

2.1.1 Large Language Models: A Reservoir of Knowledge

LLMs such as GPT-4o [109] or Llama [50] represent an extraordinary leap in the field of AI, constituting a vast repository of world knowledge encoded directly in their weights. Trained on huge corpora of text from diverse sources, LLMs are capable of understanding and generating human language with remarkable fluency. However, their reasoning abilities largely align with the fast, automatic, and intuitive System 1 Thinking. While they can generate coherent responses and even perform simple reasoning tasks, LLMs have limitations. The reasoning they exhibit is often shallow, rooted in the simple mechanism of predicting the next most probable token in a sequence rather than engaging in explicit problem-solving or structured analysis.
While LLMs may generate plausible-sounding solutions to a problem, these outputs are the result of statistical language modeling rather than a deliberate, iterative reasoning process. This distinction highlights the need for integrating more advanced mechanisms capable of explicit reasoning into AI systems—paving the way for hybrid designs that combine the knowledge-rich foundation of LLMs with structured reasoning methodologies.

2.1.2 Reinforcement Learning: Exploring and Innovating

RL has historically provided a framework for decision-making and exploration in environments where an agent must learn optimal strategies through trial and error. Landmark systems like AlphaZero [128] and a long line of others such as AlphaGo [127] or MuZero [124] demonstrated the profound potential of RL by achieving superhuman performance in games such as chess, shogi, and Go. Unlike traditional AI systems, AlphaZero began with no embedded domain knowledge. Instead, it mastered these games purely through self-learning, discovering novel strategies that even human experts had not considered.

One of the most striking examples of RL’s innovative capacity came during an AlphaZero match, where the system made a move initially deemed a mistake by human observers. This move [100] later proved to be both surprising and strategically brilliant, illustrating the capacity of RL agents to explore unconventional solutions that lie outside the bounds of human intuition. Such capabilities are fundamentally rooted in RL’s ability to navigate vast search spaces effectively.

However, traditional RL systems lacked the ability to encode real-world knowledge or handle complex, multi-faceted reasoning tasks. This limitation spurred the integration of RL principles with LLMs, combining the structured exploration and optimization capabilities of RL with the knowledge-rich reasoning foundation of language models.

2.1.3 HPC: Scalability & Efficiency

The growth of LLM and RL systems has been propelled by advancements in High-Performance Computing (HPC). Initially driven by Moore’s Law, which enabled a doubling of transistor density approximately every two years, HPC benefited from both technological advancements and the economic feasibility of manufacturing smaller transistors. However, as the costs of further miniaturization have risen sharply, Moore’s Law has reached practical limits, necessitating alternative strategies like parallelism and heterogeneous computing.

Modern HPC systems rely heavily on GPUs, TPUs, and AI accelerators for their parallel processing capabilities, alongside CPUs for sequential and general-purpose tasks. Heterogeneous computing leverages these components to optimize task-specific performance. Distributed frameworks, employing techniques such as data, model, and pipeline parallelism [8], [12], [16], further enable the training of enormous models across thousands of compute nodes.

Energy efficiency innovations, including sparsity, quantization, and pruning, mitigate the growing energy demands of scaling AI systems. These advancements ensure that HPC remains a cornerstone for developing and deploying AI models, supporting the combination of vast knowledge, reasoning capabilities, and computational scalability – allowing AI evolution to continue beyond the limits of traditional Moore’s Law scaling.

2.2 The Convergence: System 2 Thinking in AI

The intersection of these three threads – LLMs, RL, and HPC – has culminated in the emergence of models capable of System 2 Thinking. These advanced systems combine the knowledge-rich foundation of LLMs with the exploratory and optimization capabilities of RL, all supported by the scalability and performance of modern HPC. The result is a new class of AI models that can engage in explicit, deliberate reasoning processes.

These models possess a world model encoded in the weights of their LLM components, allowing them to reason about complex scenarios and contexts. Their RL capabilities combined with the HPC capabilities enable them to navigate truly immense decision spaces, evaluate multiple strategies, and iteratively refine solutions.

2.3 Interpolation (LLMs) vs. Extrapolation (RLMs)

Standard LLMs, driven by their autoregressive token prediction mechanism, primarily perform interpolation within the vast search space of solutions. They excel at generating responses that align with patterns seen in their training data, effectively synthesizing knowledge from known contexts. However, this process limits them to producing outputs that remain within the boundaries of their training distribution.

In contrast, reasoning LMs enable extrapolation beyond these boundaries. Through structured exploration, reasoning LMs navigate uncharted areas of the solution space, generating novel insights and solutions that extend past the limits of their training data. This enables a shift from basic pattern completion to active problem-solving.

2.4 Hierarchy of Reasoning-Related Models

The evolution of RLMs can be understood as a hierarchical progression, with earlier models such as GPT-4o being less capable in terms of reasoning, and the o1-like architectures demonstrating increasing sophistication and explicit reasoning abilities. This hierarchy reflects the integration of System 1 (LLMs) and System 2 (RLMs) Thinking. RLMs can be further divided based on how reasoning is implemented into Implicit RLMs and Explicit RLMs; the details of this categorization can be found in Figure 3 (the right side).

2.4.1 Implicit Reasoning Models

In this subclass, the reasoning structure is embedded entirely within the model’s weights. Models such as QwQ [140] operate as “black boxes”, where reasoning is implicit and cannot be explicitly disentangled or manipulated. While these models exhibit improved reasoning capabilities compared to standard LLMs, their reasoning processes are opaque and rely on the internalized patterns learned during training.
2.4.2 Explicit Reasoning Models

These models introduce explicit reasoning mechanisms external to the model’s core weights. Examples include designs such as LLaMA-Berry [169], Marco-o1 [174], and potentially OpenAI’s o3, which incorporate mechanisms like explicit MCTS combined with RL for decision-making. This explicit structure enables the model to simulate, evaluate, and refine solutions iteratively, facilitating novel problem-solving and extrapolation. By separating reasoning from the static knowledge encoded in the weights, these models achieve greater flexibility and interpretability in their reasoning processes. Note that explicit reasoning can be internalized via training, making it implicit – we discuss this later in the blueprint.

3 ESSENCE OF REASONING LMs

We now describe the general architecture of RLMs, which we summarize in Figure 4. In the following sections, we generalize this description to the full RLM blueprint.

3.1 Basic Architecture, Pipelines, & Concepts

We now outline the foundational architecture, operational pipelines, and core concepts. Figure 4 offers three levels of detail. In general (the top-left part), the whole RLM architecture consists of three main pipelines: inference, training, and data generation. The inference pipeline serves user requests, using models (e.g., the value or policy model) provided by the training pipeline. Data generation mirrors the inference pipeline in its internal design; the main difference is that it runs independently of user requests, generating data that is then used to re-train the models. As such, training combined with data generation from various domains [121], [168] offers self-learning capabilities and is analogous to the self-play setting of AlphaZero [128].

3.1.1 Inference

The inference process begins when the user provides an input prompt 1, which typically describes the problem or question to be addressed by the RLM. This input serves as the root of the reasoning process and initiates the construction of a reasoning structure 2 that organizes the RLM’s progress. The structure is usually represented as a tree. The root of this tree corresponds to the user’s input, and subsequent nodes are generated to explore the search space – the domain of possible reasoning paths or solutions. The purpose of this reasoning structure is to systematically investigate potential solutions, progressively refining and extending reasoning paths to converge on an optimal or satisfactory answer.

An individual point in the search space, represented as a node in the reasoning structure, corresponds to a reasoning step 3. A reasoning step is defined as a coherent and self-contained unit of thought – a sequence of tokens that advances the solution by either exploring a new branch of the problem or building upon existing progress. These steps form the building blocks of the reasoning process.

The details of how the structure evolves are usually governed by the MCTS scheme, enhanced with policy and value models (we also distinguish other reasoning strategies, described below). This approach, inspired by methods used in AlphaZero, ensures that the search process is both efficient and directed toward promising solutions.

The policy model 4 is responsible for generating new reasoning steps at each node, predicting the next most likely and logical steps to expand the reasoning process. Meanwhile, the value model 5 evaluates the quality of a reasoning path starting at a given node, helping the system prioritize the most promising steps to follow. Sometimes, a reward model³ 6 is used instead, to assess the quality of an individual node and its corresponding reasoning step. In our blueprint, as detailed in the next section, we abstract these models into a more general notion of operators 7 to enable more flexibility in how they are implemented.

³ We use a naming scheme in which a model used to estimate the quality of a whole reasoning path starting at a given node is called the value model, while a model used to estimate the quality of a given reasoning step is called the reward model.

The search and reasoning processes continue iteratively until a terminal step is reached 8. This terminal step represents a completion of the reasoning chain that forms the final answer to the posed problem. It serves as a leaf node in the tree, concluding that particular reasoning path.

This architecture provides a unified framework that accommodates a wide range of reasoning tasks. Whether reasoning steps are fine-grained (e.g., individual token sequences) or coarse-grained (e.g., entire reasoning chains treated as single nodes), the architecture adapts seamlessly. By structuring the search space explicitly and guiding exploration with policy and value models, the RLM achieves a level of reasoning capability bridging intuitive pattern recognition and deliberate problem-solving.

A detailed specification of the inference pipeline can be found in Appendix C.1 and in Algorithm 1.

3.1.2 Training

Training details depend on which model is trained (value, policy, reward, ...). In general, we assume fine-tuning a model such as Llama. Here, we follow an approach where one first harnesses supervised data, usually coming from existing datasets such as PRM800K [83] 1, which becomes a part of the supervised training data 2 used in the supervised training pipeline 3 of the framework to train some, or all, of the models 4 considered in the blueprint. The second part of the overall training framework in RLMs is the unsupervised (self-learning) training pipeline, in which training data is continually generated 5 and used to improve the models. The data can be obtained from inference, assuming quality control [52], but also from a dedicated synthetic data generation pipeline that mirrors that of the inference. To collect the data, one executes the respective RLM pipeline for a given input task and gathers the results 6; depending on how detailed the gathering process is, the collected data can contain only outcome-based labels 7, process-based labels 8, or some other variant such as trace-based labels 9 suggested in our blueprint, which generalize process-based samples to samples that also contain information about the operators applied during the task solution process. All this data becomes a part of the replay buffer 10 and is used in the unsupervised training scheme 11, or it can also be used to train 12 a model that would become an Implicit RLM 13.

A detailed specification of the pipelines for different training phases and paradigms can be found in Appendices C.2 and C.3 as well as in Algorithms 2–7. The data generation pipeline is detailed in Appendix D.
Fig. 4: Overview of a general RLM design and core concepts. We provide a high-level overview (the top-left part), a more detailed medium-level overview (the
top-right part), and a very detailed diagram showing the inference and training pipelines (the bottom part). A detailed specification of the inference pipeline can be
found in Appendix C.1 and in Algorithm 1. Details on the pipelines for different training phases and paradigms can be found in Appendices C.2 and C.3 as well as in
Algorithms 2–7. The data generation pipeline is detailed in Appendix D.
3.2 Encompassing Diverse RLM Architectures

The above-described architecture is applicable to many RLM designs. However, there are numerous other variants of architectures, some of which do not fully conform to this framework. In this section, we discuss these variants, highlighting how our blueprint accommodates such variations.

In some RLM designs [169], a single node in the MCTS tree could represent an entire reasoning structure, such as a complete chain of reasoning steps. In this case, the action space involves transitioning between different reasoning structures rather than individual steps. This approach changes the nature of the search, as the focus shifts from iteratively constructing a single reasoning path to evaluating and refining entire structures within the search space. Our blueprint accommodates this with the concept of nesting, where a node in the reasoning structure can contain another reasoning structure.

Other architectures introduce even more novel paradigms. For instance, Journey Learning [113] adds an additional layer of complexity by incorporating a transformation step that “rewires” the search or reasoning structure. This transformation consolidates multiple paths in the tree, synthesizing them into a new form that is used as input for subsequent reasoning iterations.

Despite these variations, our blueprint is sufficiently general to encompass all these cases and beyond, as we illustrate more formally in the following. This generality ensures that the blueprint is not only applicable to existing designs but also provides a foundation for future innovations in RLM development.

3.3 Integration with Broader LLM Agent Ecosystems

The integration of RLMs into broader LLM agent ecosystems would enable these models to interact dynamically with external tools, databases, and resources during execution. This interaction can occur within the inference or data generation pipeline, leveraging value or policy models to extend the reasoning process through access to retrieval-augmented generation (RAG), web queries, and specialized tools. For example, during a reasoning task, the value or the reward model could query a database to verify intermediate steps, ensuring factual correctness or retrieving additional context to refine its reasoning. Similarly, these models could utilize computational tools for mathematical or symbolic computations, thereby expanding the scope and accuracy of their reasoning.
The integration of RLMs into broader LLM agent ecosys- variability in their granularity depends on the user design
tems would enable these models to interact dynamically choice. In existing schemes, a reasoning step is typically
with external tools, databases, and resources during exe- conceptualized as a “coherent and self-contained unit of
cution. This interaction can occur within the inference or thought”. For instance, in mathematical proofs, this may
data generation pipeline, leveraging value or policy models correspond to an individual logical argument or deduction.
to extend the reasoning process through access to retrieval- The flexibility in defining reasoning steps allows mod-
augmented generation (RAG), web queries, and specialized els to adapt to different problem domains, balancing fine-
tools. For example, during a reasoning task, the value or the grained and coarse-grained reasoning. Coarse steps, such
reward model could query a database to verify intermediate as logical arguments (or even complete reasoning path-
steps, ensuring factual correctness or retrieving additional ways [169]), simplify preparation and adoption of training
context to refine its reasoning. Similarly, these models could data, enhance interpretability, and – as we discuss in Sec-
utilize computational tools for mathematical or symbolic tion 8 – reduce computational overhead. On the other hand,
computations, thereby expanding the scope and accuracy single-token steps enable the utilization of concepts like
of their reasoning. token entropy [96] to incorporate the model’s uncertainty,
as well as the integration of advanced decoding schemes
(e.g., speculative decoding [77] or contrastive decoding [80])
4 B LUEPRINT FOR R EASONING LM S explicitly into the RLM design. Yet, while making the rea-
We now introduce our RLM blueprint that can be used to soning steps more fine-grained allows for a more detailed
develop novel reasoning models and to provide ground for exploration of solution paths, this increased flexibility re-
analysis, evaluation, and comparison of such designs. We sults in greater computational demands, particularly when
overview the blueprint in Figure 5. combined with search algorithms such as MCTS.
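A lightweight way to represent reasoning steps, and the structures they form (previewing Section 4.2.2, including nesting), is sketched below. This is an illustrative assumption about one possible data model, not the representation used by x1.

# Sketch of a reasoning-step node. Granularity is a design choice: `content`
# may hold a single token, a sentence, or a whole chain; `substructure`
# optionally nests another structure inside a node (cf. LLaMA-Berry).

from dataclasses import dataclass, field
from typing import Optional

@dataclass
class ReasoningNode:
    content: str                                            # the reasoning step itself
    children: list["ReasoningNode"] = field(default_factory=list)  # chain/tree/graph edges
    value: Optional[float] = None                           # score from a value/reward model
    substructure: Optional["ReasoningStructure"] = None     # nesting support

@dataclass
class ReasoningStructure:
    root: ReasoningNode                                     # the root holds the input task
    kind: str = "tree"                                      # "chain", "tree", or "graph"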
Fig. 5: A blueprint for reasoning LMs. It consists of four main toolboxes: the reasoning scheme (the top part), operators (the bottom-left part), and models (the
bottom-right part); pipelines are mentioned in the center and detailed in Appendix C.1 and in Algorithm 1 (the inference pipeline), Appendix C.2, Appendix C.3, and
in Algorithms 2–7 (the training pipelines), and in Appendix D (the data generation pipeline).
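Echoing Figure 5 and the composition view of Section 4.1, a concrete RLM instance can be thought of as one choice from each toolbox. The sketch below is purely illustrative; the keys and values are assumed names, not an actual configuration schema.

# Illustrative only: an RLM "recipe" as a composition of blueprint components
# (reasoning scheme, operators, models, pipelines), cf. Section 4.1 and Fig. 5.

example_rlm = {
    "reasoning_scheme": {
        "step_granularity": "thought",       # e.g., "token" or "thought"
        "structure": "tree",                 # chain / tree / graph / nested
        "strategy": "mcts",                  # mcts / beam_search / ensemble
        "decoding": "nucleus_sampling",
    },
    "operators": ["generate", "evaluate", "select", "backtrack", "backpropagate"],
    "models": {
        "policy": "fine-tuned LLM",
        "value": "LLM with a value head",
    },
    "pipelines": ["inference", "training", "data_generation"],
}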
4.2.2 Reasoning Structure

The reasoning structure specifies how individual reasoning steps are connected and organized. Common structures include chains (linear sequences), trees (hierarchical branching), and graphs (arbitrary connections).

Chains are sequential reasoning flows, where each step builds directly on the preceding one. Chain structures are prevalent in CoT-based models, where each reasoning step follows logically from the previous step in a linear progression. In tree structures, each reasoning step can branch into multiple continuations, forming a decision tree. This structure is commonly used in MCTS-based frameworks, where multiple potential paths are explored before selecting a branch that will be further investigated. It enables more effective exploration of the space of reasoning steps, but simultaneously makes the RLM design more complex and costly. Finally, graph structures allow for arbitrary dependencies between reasoning steps, enabling graph-based reasoning, such as that found in the Graph of Thoughts (GoT) framework [9].

Further generalization involves nested structures, where reasoning nodes themselves may contain substructures. For example, a node in a tree structure may represent a CoT chain, as proposed in LLaMA-Berry [169]. This hierarchical organization could be particularly useful for multi-step tasks where high-level decisions guide low-level computations, such as meta-reasoning frameworks [169]. One could harness any other higher-order structures, such as hypergraphs, motifs, and others [10], [11], [14], [17].

4.2.3 Reasoning Strategy

The reasoning strategy governs how the reasoning structure evolves, specifying the process by which new reasoning steps are added and integrated. Example strategies include:
• MCTS [73] A popular approach that balances exploration and exploitation by simulating multiple reasoning paths and selecting the most promising one based on a scoring function.
• Beam Search [131] A breadth-limited search that keeps a fixed number of top-ranked continuations at each step. While commonly used for decoding token sequences, beam search can also apply to reasoning steps.
• Ensemble Methods These methods involve aggregating multiple independent reasoning strategies, such as combining chains and trees to enhance robustness and accuracy. One example is Best-of-N [45], [150] – a strategy where multiple independent reasoning paths are generated, and the most effective solution is selected based on predefined criteria, e.g., accuracy or completeness (a minimal sketch is given at the end of this subsection). Another example is the tree ensemble (Forest) [18] where, instead of a single reasoning tree, a reasoning “forest” consists of multiple disconnected trees, which may eventually converge at a shared solution node. This approach supports diverse reasoning pathways that parallelize exploration.

Reasoning Strategy vs. Decoding Strategy. It is crucial to distinguish reasoning strategies from token-level decoding strategies. While decoding strategies, such as greedy search and nucleus sampling [60], generate the internal token sequences within a reasoning step, reasoning strategies focus on the higher-level process of integrating and expanding reasoning steps within the reasoning structure.
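A minimal sketch of the Best-of-N ensemble strategy mentioned above follows; generate_path and score_path are hypothetical stand-ins for a policy-model rollout and a value/reward evaluation, respectively.

# Best-of-N (cf. the Ensemble Methods above): sample N independent reasoning
# paths and return the one preferred by the scoring function.
def best_of_n(task, generate_path, score_path, n=8):
    candidates = [generate_path(task) for _ in range(n)]   # independent rollouts
    return max(candidates, key=score_path)                 # keep the best-scoring path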
4.3 Operators

Operators specify operations that can be applied to various parts of the reasoning structure to progress the reasoning process. We now provide an extensive toolbox of operators. Many of them have been widely used in RLM-related designs, but some – to the best of our knowledge – are still unexplored; we include them to foster innovation and propel the design of more effective and more efficient RLMs.

4.3.1 Structure Operators

Structure operators transform the reasoning structure by taking it as input and producing a modified version, typically through the addition or refinement of reasoning steps. For instance, they may add new children to a specific node, facilitating the exploration of alternative reasoning paths.
• Generate The Generate operator adds one or more new reasoning steps to a reasoning structure. Within the MCTS reasoning strategy, this operator is typically implemented as a policy model that generates new steps. In other strategies, the generation operator may involve sequentially appending steps (CoT) or exploring multiple candidate steps in parallel (Beam Search).
• Refine The Refine operator enhances a given individual reasoning step. For example, it could address ambiguities, correct errors, and optimize inefficiencies, resulting in a more robust version of the step [94]. It could also integrate suggestions from self-critique [122] (evaluates steps to identify weaknesses and suggest targeted improvements), summarization [178] (condenses key elements into concise representations to streamline the reasoning structure), or rephrasing [42] (reformulates steps to improve clarity and coherence while preserving their logical integrity).
• Aggregate This operator combines multiple reasoning steps, paths, or structures into the next individual step. This enables consolidating information or improving coherence. It is used in Ensemble Methods [18] or in Graph of Thoughts [9].
• Prune This operator removes nodes or reasoning steps from the structure that are deemed suboptimal or irrelevant based on evaluation metrics. It enables optimizing the reasoning structure in order to, e.g., reduce token costs.
• Restructure The Restructure operator applies arbitrary transformations to the reasoning structure, enabling flexible reorganization of its components. A notable example is the conversion of a reasoning tree into a linear chain by rearranging its branches into a sequential series of steps, as done in Journey Learning [113]. This restructuring facilitates the integration of insights from diverse branches into a cohesive flow, “flattening” it and making it easier for the model to process and utilize information within a single, unified context.

Discussion on Diversity. In structure operators, there is a notion of how diverse the outcomes of the operator are. For example, when generating k new reasoning steps, one may want to make the contents of these steps as different from one another as possible. While different mechanisms to steer diversity exist, a typical approach is the use of the policy model temperature. We additionally propose to consider Diverse Beam Search [144], which promotes diversity by maintaining multiple diverse candidate sequences during decoding. In MCTS, there is also a distinction between exploitation (expanding the structure by applying generation operators within an already established tree branch) and exploration (generating new branches). Here, one impacts diversity by manipulating the exploitation-exploration trade-off, as determined by the Upper Confidence Bound for Trees (UCT) formula [73] or its variants.
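For reference, the standard UCT rule [73] that governs this trade-off, and the AlphaZero-style PUCT variant with a policy prior P(s, a), can be written as follows; the exact variant and constants used in a concrete RLM are a design choice.

UCT(s, a) = Q(s, a) + c \sqrt{\frac{\ln N(s)}{N(s, a)}}, \qquad
PUCT(s, a) = Q(s, a) + c \, P(s, a) \, \frac{\sqrt{N(s)}}{1 + N(s, a)},

where Q(s, a) is the current value estimate of taking reasoning step a from node s, N(s) and N(s, a) are visit counts, and c controls the strength of exploration.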
4.3.2 Traversal Operators

Traversal operators define how the reasoning process navigates through the existing reasoning structure. These operators play a crucial role in shaping the flow of reasoning by determining which paths to pursue.
• Select The Select operator determines which reasoning steps to pick for further exploration, evaluation, or refinement within the reasoning process. It evaluates existing elements based on predefined criteria, such as heuristic scores, likelihood estimates, performance metrics, or search strategies like PUCT [117] or UCT [73], selecting the most promising candidates to guide the next stages of reasoning. By balancing exploration (considering diverse alternatives) and exploitation (focusing on high-potential paths), the selection operator optimizes resource allocation and ensures efficient reasoning progression.
• Backtrack The Backtrack operator enables the model to explicitly return to a previous reasoning step and continue along a different reasoning path. This operator supports error correction, divergence handling, and hypothesis revision by abandoning unproductive directions in favor of alternative trajectories. The QwQ model output indicates that the reasoning structures used as training data for this model harnessed Backtrack.

4.3.3 Update Operators

The Update operator enhances specific parts of the reasoning structure without altering the structure itself. A common example is the backpropagation phase in MCTS, where evaluation scores are propagated and updated along existing reasoning steps to inform future decisions. Another form of update involves refining the content of individual nodes or subsets of nodes, replacing their original versions with improved iterations, such as the “enhance” thought transformation in Graph of Thoughts [9].
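A minimal sketch of such a backpropagation-style update is shown below. It assumes each node stores a visit count, a running value estimate, and a parent pointer, which is one common convention rather than a requirement of the blueprint.

# Update operator in the style of MCTS backpropagation: propagate the
# evaluation of a leaf up the path, refreshing visit counts and running
# mean values of all ancestors.
def backpropagate(leaf, reward):
    node = leaf
    while node is not None:
        node.visits += 1
        node.value += (reward - node.value) / node.visits  # incremental mean
        node = node.parent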
4.3.4 Evaluate Operators

Evaluate operators take as input a segment of the reasoning structure and output a value without any modifications to the structure. They are widely used with reasoning strategies such as MCTS.

One important type of evaluation occurs when the reasoning structure reaches a terminal state, allowing the full reasoning sequence to be assessed against a known solution—applicable to tasks with definitive answers, such as mathematical problems. This terminality evaluation verifies whether the final step provides a correct and complete solution.

One can also evaluate intermediate steps (i.e., non-terminal ones). This can involve estimating the reward associated with specific reasoning steps, using heuristics, aggregated simulation outcomes, or a trained reward model for more efficient assessments. Other methods such as embedding-based verification could also potentially be harnessed [15].

Another form of evaluation employs a value estimator, which judges a given reasoning step based on its expected contribution to a correct final outcome. This method evaluates both the correctness of the step and its alignment with the overall solution goal. Such evaluations can be performed through simulations, as in the original MCTS algorithm, or more efficiently using a learned value model [129].

A critical aspect of evaluation is the selection of appropriate metrics. For instance, in value estimation, an ideal metric considers both the correctness of a reasoning step and the extent of progress it represents toward the final solution, ensuring a balanced assessment of its contribution.

4.3.5 Discussion: Test-Time Compute

One of the recent trends in next-generation LLMs [95], [145] is to shift from merely increasing model sizes to enhancing computational strategies during inference, a concept known as test-time compute (TTC). This approach allocates additional computational resources during a model’s execution to improve performance, particularly in complex reasoning tasks. This methodology mirrors human cognitive processes, where increased deliberation is applied to more challenging problems.

Recent studies [131] indicate that optimizing test-time compute can be more effective than merely increasing model size. For instance, employing a compute-optimal strategy—where computational resources are adaptively allocated based on the problem’s complexity—can enhance efficiency by over four times compared to traditional methods. Moreover, in scenarios where smaller base models achieve moderate success rates, augmenting test-time compute enables them to outperform models up to 14 times larger.

While test-time compute offers significant benefits, it also presents challenges related to – among others – resource allocation (determining the optimal amount of computational resources for each inference task requires sophisticated strategies to balance performance gains against computational costs), dynamic scaling (implementing adaptive compute strategies necessitates models capable of assessing problem difficulty in real time and adjusting their computational efforts accordingly) [97], and hardware implications (the shift towards increased test-time computation may influence hardware requirements, putting more pressure on delivering specialized inference-focused hardware solutions).

Test-Time Compute in the Context of the Blueprint. Our blueprint offers mechanisms to dynamically allocate computational resources during inference to improve performance, particularly for more complex problems. By leveraging the modular structure of the blueprint, TTC can be effectively implemented through specific operators designed for reasoning tasks. We now provide several examples.
• The Generate operator can be used to implement TTC by dynamically increasing the number of next reasoning steps generated for harder problems. For simpler tasks, the operator may only generate a minimal set of continuations. However, for more complex problems, the operator can be used to create a larger set of potential reasoning steps, thereby expanding the search space.
11

be used to create a larger set of potential reasoning steps, RL-based methods such as Proximal Policy Optimization
thereby expanding the search space. (PPO) [125], Direct Preference Optimization (DPO) [115],
• The Refine operator provides another avenue for imple- and reasoning-specific variants like Reasoning Policy Op-
menting TTC by enhancing a given reasoning step multi- timization (RPO) [111]. Several training paradigms also
ple times for harder problems. In this approach, the oper- incorporate self-learning, where the model iteratively im-
ator iteratively improves the quality of a reasoning step, proves by generating and evaluating its own reasoning
addressing ambiguities, rectifying errors, or improving sequences, thereby simulating competitive or cooperative
clarity. For simpler tasks, the operator might only refine reasoning scenarios.
a step once, while for more complex reasoning, it can per-
form multiple enhancement iterations to ensure the output
meets a higher standard of precision and robustness. 4.4.2 Training Data Scope
• The Traversal operators, such as Select, enable the explo-
ration of multiple reasoning paths at test time, offering The training data for RLMs can vary significantly in terms
another key mechanism for implementing TTC [171]. By of how much of the reasoning structure it captures. We
using Select on several next reasoning steps, the model now outline two established approaches, outcome-based
can dynamically expand its search tree for more challeng- supervision (OBS) and process-based supervision (PBS).
ing problems, thereby increasing the diversity and depth More details regarding both OBS and PBS can be found in
of reasoning paths under consideration. For example, in Appendix B.1.
a complex task, the model might select multiple high- In outcome-based supervision (also known as a sparse
probability steps and explore their corresponding contin- training signal) [35], [143] each training sample consists
uations in parallel. This approach facilitates broader ex- solely of the input and the corresponding output. For exam-
ploration of the reasoning space, ensuring that promising ple, in mathematical problem-solving, a sample may include
paths are not prematurely discarded. the task statement and the final solution, labeled as correct
• To efficiently manage the expanded set of possibilities, the or incorrect. This approach is straightforward to implement,
blueprint allows integration with the Aggregate operator. and the required data is relatively easy to collect. However,
This operator evaluates the generated reasoning paths it can limit the model’s reasoning accuracy, as it provides
and selects the most promising ones based on prede- minimal insight into the intermediate steps that led to the
fined criteria, such as the likelihood of correctness or the solution [83].
quality of intermediate steps. This combination ensures
An alternative approach is process-based supervision
that while more computational resources are allocated
(also known as a dense training signal) [83], [147], where
for challenging tasks, only the most relevant paths are
a training sample reflects the entire reasoning structure. In
explored further, optimizing both accuracy and efficiency.
this case, the sample contains not only the input and final
output but also all intermediate reasoning steps, annotated
4.4 Models with labels indicating the quality of each step. This richer
Models are used to implement various types of operators. training data allows the model to learn more granular rea-
Most common are the value model (implementing the value soning patterns, improving its ability to generate accurate
evaluation operator) and the policy model (implementing and interpretable solutions by understanding the reasoning
the generate operator). process in detail. However, such data is much more chal-
Models are further categorized and discussed in detail in lenging to generate or gather [83].
Appendix B; we discuss the variants of the value model (Q OBS vs. PBS By varying the training data scope,
Value model, V Value model), we compare Process Reward developers can strike a balance between ease of data col-
and Outcome Reward models, and we formally identify lection and the depth of reasoning insights provided to the
a new variant of models, the Outcome-Driven Process model, with dense supervision generally offering improved
Reward Model. performance at the cost of increased data complexity. We
detail these, and additional aspects of ORMs and PRMs in
4.4.1 Training Paradigm Pipelines for different training phases and paradigms can be
Each model must be trained according to a specified found in Appendix B, Appendix C.2, Appendix C.3, and in
paradigm, which outlines the methodology for optimizing Algorithms 2–7.
its performance. This paradigm defines key training compo- Trace-based supervision (TBS) is a potential way to
nents such as the loss function, data generation and labeling extend PBS by incorporating detailed information about
procedures, and other critical training details. the sequence of applied operators, including traversal op-
A wide range of training schemes has been developed erators, within the reasoning structure. By capturing the
for models used in RLMs, with early foundational work full trace of how reasoning steps are generated, refined, or
stemming from advancements related to AlphaZero. These revisited, TBS would provide richer supervision that teaches
schemes have since evolved to support the complex require- the model to internalize not just the reasoning steps but also
ments of reasoning tasks within LLMs. Common training the process of navigating and manipulating the reasoning
paradigms include supervised fine-tuning (SFT), where structure itself. This approach could enable the training of
models are trained on reasoning sequences labeled with more powerful Implicit RLMs by guiding them to replicate
q-values; rejection sampling [22], [134], which involves the reasoning dynamics of explicit structures, improving
filtering generated outputs based on quality criteria; and their ability to reason flexibly and efficiently.
12

4.5 Pipelines reasoning chains from it and combining them together into
A pipeline is a detailed specification of operations that an individual long chain. This way, the scheme attempts to
orchestrates the details of the interaction between the rea- harness insights from different tree branches. By maintain-
soning scheme and the operators and models to achieve a ing a chain-based structure, Journey Learning preserves the
specific objective. Typically, an RLM would incorporate a simplicity of linear reasoning while embedding the capacity
single pipeline for inference and a separate pipeline for for self-correction and exploration of multiple hypotheses.
training each model used in an RLM. Moreover, there could Additionally, Journey Learning introduces a pipeline for
also be pipelines for synthetic data generation used for the internalization of such long reasoning chains into its
training models. One can also distinguish a pipeline that weights. This enables the final model to generate such long
trains an Implicit RLM using the provided reasoning traces reasoning chains, possibly containing different reasoning
from the Explicit RLM. branches, directly from its weights, making it an implicit
The details of pipelines depend on arbitrary design RLM.
choices. In Section 3, we provided a general description
of how these pipelines work. In Appendix C, we present 5.2 Implicit RLMs
detailed algorithmic specifications of our pipelines, along Qwens’s QwQ [140] embodies a fully implicit reasoning
with insights into the reasoning behind these design choices. model, characterized by an implicit reasoning structure that
Specifically, the inference pipeline can be found in Ap- is generated autoregressively directly by the model weights.
pendix C.1 and in Algorithm 1. Pipelines for different train- The reasoning strategy in QwQ – as indicated by the
ing phases and paradigms can be found in Appendix C.2, model output – harnesses next-step generation, backtrack-
Appendix C.3, and in Algorithms 2–7. The data generation ing, summarization, and critique generation to derive the
pipeline is detailed in Appendix D. final solution. At each step, the model implicitly generates a
new node within the chain by employing one of these four
implicit generate operators, presumably implemented using
5 E XPRESSING E XISTING S CHEMES
special tokens.
We now showcase the expressivity of our blueprint, by
illustrating how it can be used to model a broad scope of
5.3 Structured Prompting Schemes
existing RLMs and other related works. We summarize the
outcomes of the analysis in Table 1. We start with typical and Finally, we also illustrate that advanced structured prompt-
most prevalent Explicit RLM architectures based on MCTS ing schemes, such as CoT, ToT, and GoT, constitute a fully
and policy and/or value models, where a single reasoning explicit RLM structure without any implicit reasoning than
step is an individual logical argument (Section 5.1). We also what is originally presented in the used LLM, i.e., no models
discuss there schemes that generalize this typical design, nor training or data generation pipelines.
by harnessing nesting or Linearization Structure operators. CoT [152] utilizes an implicit reasoning structure con-
Finally, we study Implicit RLMs (Section 5.2) and various sisting of a chain of reasoning steps. The reasoning strategy
structured prompting schemes such as Cot or ToT (Sec- employed in CoT is oriented towards constructing a single
tion 5.3), showing that they also fit our blueprint. coherent chain of reasoning, culminating in a solitary solu-
tion, thus only needing the generation operator. CoT serves
as the foundational framework for a range of advanced rea-
5.1 Explicit RLMs soning strategies, including prompting methodologies such
We start with the most widespread variant of RLMs that fol- as Self-Consistency and Self-Refinement, among others.
lows the architecture outlined in Section 3.1. These reason- Self-Consistency (SC) [150] extends the CoT frame-
ing models such as TS-LLM [45], AlphaLLM [141], MCTS- work by introducing redundancy into the reasoning pro-
DPO [155], and others [23], [52], [145], [169], [170], [174] cess. It generates multiple reasoning chains and employs a
generally employ an explicit tree structure in which a node majority-voting mechanism to determine the most consis-
represents a distinct reasoning step. The reasoning strategy tent solution, which implements a Select operator from our
is based on the MCTS and focuses on iterative exploration, blueprint.
expansion and evaluation of nodes within the tree. By incor- ToT [161] adopts an explicit reasoning structure orga-
porating value mechanisms—such as prompt-based evalu- nized in a hierarchical, tree-based format. Within this frame-
ation or dedicated value models, the system identifies and work, each node corresponds to a distinct reasoning step,
prioritizes promising branches, facilitating more informed and branching facilitates exploration across multiple infer-
decision-making and refinement of the reasoning process. ential pathways (the Generate operator). Additionally, an
All MCTS based reasoning models implement at least a evaluation operator, implemented via a specialized prompt
next-step generation operator, an evaluation operator, and and the LLM itself, assesses branches of the tree.
the update operator for back-propagating the values. In ad- GoT [9] introduces a more intricate reasoning structure
dition, ReST-MCTS*, LLaMA-Berry, and Marco-o1 support a by employing an explicit graph-based representation. In this
refinement operator to further improve produced reasoning framework, nodes represent individual reasoning steps, and
steps. the graph architecture supports non-linear, interdependent
Journey Learning [113] exhibits two main differences to relationships between these steps. The reasoning strategy in
typical MCTS-based RLMs. First, it harnesses the Lineariza- GoT is orchestrated by an external controller, realized as a
tion Structure operator, in which the tree reasoning structure separate LLM, which guides the exploration, refinement and
is transformed into a chain, by extracting several selected aggregation of the graph’s nodes.
13

Reasoning Reasoning Operator Models Pipeline


Structure Traversal Update Evaluation
Scheme Remarks
Structure Step Strategy Gen. Ref. Agg. Pr. Res. Sel. BT Bp. Inter. Final. PM VM Inf. Tr. DG

Explicit RLMs (Section 5.1)

rStar-Math [52] E Tree C Thought + Code Block E MCTS é é é é é


PRIME [38], [163] E Multiple Chains F Token E Best-of-N é é é é é é
C Thought
Marco-o1 [174] E Tree F Token Sequence E MCTS é é é é é
C Thought
Journey Learning (Tr.) [113] E Tree E Thought E Tree Search é é é * *Separate Entry
OpenR [145] E Tree C Thought E Best-of-N é é é é
E Beam Search
E MCTS
LLaMA-Berry [169] E Tree of Chains C Solution E MCTS é é é é é
ReST-MCTS* [170] E Tree C Thought E MCTS * é é é é *Advice by critic
AlphaMath Almost Zero [23] E Tree F Thought E MCTS é é é é é * * *Single model
MCTS-DPO [155] E Tree F Token Sequence E MCTS é é é é é * * *Single model
AlphaLLM [141] E Tree C Option E MCTS é é é é
TS-LLM [45] E Tree F Token E MCTS é é é é
F Sentence E Tree Search

Implicit RLMs (Section 5.2)

QwQ [140] I Chain* F Token é é é é é é é é é é é é é *Linearized Tree


Journey Learning (Inf.) [113] I Chain* C Thought I DFS é é é é é é é é é é é é *Linearized Tree

Structured Prompting Schemes (Section 5.3)

Graph of Thoughts (GoT) [9] E Graph* C Thought E Controller é é é é é é *DAG


Tree of Thoughts (ToT) [161] E Tree C Thought E Tree Search é é é é é é é
Self-Consistency (SC) [150] E Multiple Chains C Thought E Majority Voting é é é é é é é é é é é é
Chain of Thought (CoT) [152] I Chain C Thought é é é é é é é é é é é é é é

TABLE 1: Comparison of RLMs with respect to the provided taxonomy (Section 4 and Figure 5). “Reasoning”: Details of the reasoning approach, specifically
what is its Structure and its Strategy? “Reasoning Operator”: Does a given scheme support operators on the reasoning structure? If yes, which classes (and specific
functionalities) are supported Structure (“Gen.”: generate, “Ref.”: refine, “Agg.”: aggregate, “Pr.”: prune, “Res.”: restructure), Traversal (“Sel”: select, “BT”: backtrack),
Update (“Bp.”: backpropagate), and Evaluation of “Inter.”: intermediate steps and “Final.”: final steps? “Model“: Does a given scheme use models to implement
its operators and if so, which ones (“PM”: policy model, “VM”: value model)? “Pipeline”: Which pipelines are harnessed by a given scheme (“Inf.”: inference, Tr.”:
training, “DG”: data generation)? When describing representations, we use the following abbreviations: “E”: explicit, “I”: implicit. “F”: fine-grained. “C”: coarse-
grained. “ ”: full support (i.e., YES), “ ”: partially [supported], “é”: no support (i.e., NO).

6 H OW TO U SE T HE B LUEPRINT decoding strategy, scoring functions, and step evaluation


We now outline how to use our blueprint for the user’s methods. These choices will significantly impact the model’s
application; we keep this section in a tutorial style. reasoning dynamics, scalability, and overall effectiveness.
Each decision at this stage lays the foundation for tailoring
6.1 Part 1: Define the Reasoning Scheme the RLM to your specific application requirements.
The first step in using the blueprint is to define the rea-
soning scheme, which specifies the foundational structure
and strategy of your RLM. Start by selecting the reasoning 6.2 Part 2: Define the Operators
structure. Chains are the most affordable in terms of token
costs, at least when it comes to ICL [14]. Trees, while The next step is to specify the set of operators that will
the most expensive, offer rich branching that enhances govern the reasoning process. For an MCTS-based design,
exploratory reasoning. Graphs, though slightly cheaper than the simplest approach is to implement the core operators:
trees, introduce additional challenges in implementation but Generate (often called Expand for MCTS), Select, and Back-
can yield significant accuracy gains due to their flexibility. propagate. These fundamental operations suffice for many
Next, decide on the granularity of reasoning steps. scenarios, providing a straightforward framework for rea-
Coarse-grained steps, such as thoughts or sentences, are soning.
widely used due to their simplicity and ease of scaling. Beyond the basics, consider whether you want to incor-
However, token-based granularity, which operates at the porate less mainstream operators, such as Backtrack. By ex-
level of individual tokens, offers the potential for greater plicitly including Backtrack, you enable a clearer tracking of
precision and unexplored accuracy improvements. This ap- progress within the search tree, making it potentially easier
proach, while promising, demands significantly more com- to revisit and refine earlier reasoning steps. This approach
putational resources and careful design. This decision de- also facilitates advanced training schemes, like Trace-based
fines your action space (possible operations) and state space Supervision, by generating richer and more structured data.
(configuration of the reasoning structure). Consider using this and other operators within our toolbox.
Another decision is choosing a reasoning strategy to gov- You will also need to determine the implementation
ern how the reasoning structure evolves. MCTS combined details for each operator. Decide which operators will be
with some variants of policy and value models remains the implemented as neural models—such as using a policy
most widely adopted approach due to its balance of ex- model to guide selection or a value model for backpropa-
ploration and exploitation. However, alternative strategies gation—and which will rely on non-neural methods. This
that have not been deeply studied, such as ensembles of choice affects both the computational complexity and the
reasoning structures, may offer untapped potential. flexibility of the system, so it’s important to align these
Finally, determine the specific details of your chosen decisions with your reasoning scheme and performance
strategy, including parameters like exploration coefficients, goals.
14

6.3 Part 3: Determine the Training Details 7.2 Operators


In this phase, you need to outline the specifics of training The Generate operator plays a crucial role in expanding the
for the models that will implement operators. For an MCTS- tree by adding new children to a selected node. To improve
based design, consider the typical approach of using the the diversity of these newly generated nodes, we employ
policy model to implement Generate (Expand) and the value diverse beam search [144], which ensures variability among
model for Simulate. If necessary, you might also train a the children. Alternatively, high-temperature sampling can
separate model to calculate the reward at individual nodes, be used to introduce stochasticity into the generation pro-
enhancing the precision of the reward signals. cess, fostering the exploration of different reasoning paths.
Identify the application or training domain in order to Traversal of the reasoning tree is managed by the Select
address generalization requirements. This step ensures that operator, which uses the PUCT function to identify the next
your models are trained on data representative of the tasks node to expand. This operator balances a trade-off between
you want them to handle. exploration, favoring less-visited nodes, and exploitation,
Define the models, including their architectures and the reinforcing nodes with higher potential based on previous
selection of suitable base models. Consider how the design evaluations. Always starting from the root node, the traver-
of these models—such as transformer-based architectures sal mechanism ensures that the system can dynamically
or more specialized designs—aligns with your reasoning explore alternative paths and recover from suboptimal deci-
structure and overall objectives. sions by backtracking and selecting new branches.
Collect training data for both the policy and value mod- The Backpropagation Update operator refines the q-
els. For the policy model, consider generating data auto- values which can be used as guidance for the select operator
matically with our pipeline or using a scheme such as CoT along the path from an expanded node back to the root. This
prompting, and include a special end-of-step token to en- process incorporates new information from downstream
sure clean segmentation. For the value model, generate data nodes, leading to progressively more accurate q-values for
through MCTS full simulations, which provide rich, struc- the intermediate nodes. These refined q-values subsequently
tured information about reasoning paths and outcomes. inform future decisions, making the reasoning process in-
Fine-tune the models as needed. If using coarse reason- creasingly robust over time.
ing steps, perform supervised fine-tuning (SFT) on the pol- The framework implements two different Evaluate Op-
icy model to teach it how to reason step-by-step. Similarly, erators. First, the Reasoning Path Evaluation operator pre-
apply SFT to the value model to initialize it as a reliable dicts the discounted expected future reward for a chain
evaluator. extending from the root to a specific node. This prediction
Run MCTS with initialized models to collect additional is derived from the q-value model, offering a quantitative
data. You might filter this data to keep only high-quality measure of the path’s quality. Second, when the ground
reasoning paths (terminal states) or strong signals (high truth is available, the Ground Truth-Based Reward operator
absolute advantages) for further training. directly evaluates leaf nodes for correctness, assigning fixed
Finally, train both models either by additional SFT rewards to verified solutions. These rewards are incorpo-
rounds or with reinforcement learning methods such as rated into the q-values of upstream nodes, ensuring that the
Proximal Policy Optimization (PPO). This ensures that the reasoning process is informed by both model predictions
models are optimized not only for accuracy but also for and objective validation.
the efficiency and robustness needed in complex reasoning
tasks.
7.3 Models & Training Paradigms
7 F RAMEWORK X 1: D ESIGN & I MPLEMENTATION Both the value and the policy model in x1 are fine-tuned
versions of an LLM6 , without reliance on prompting, which
We now introduce x14 , an extensible and minimalist frame-
is used in several other RLM architectures [52], [170]. This
work that can serve as ground to design and experiment
design decision aims to maximize the quality of results. We
with RLMs, and currently provides one example of the
now outline briefly selected key aspects of how we train
blueprint.5 An overview of the framework is in Figure 6.
these models, full details can be found in Appendix B, C,
and D.
7.1 Reasoning Scheme
The x1 framework employs a tree reasoning structure in 7.3.1 Training the Policy Model
conjunction with MCTS as the reasoning strategy. This
The policy model also leverages an LLM to generate new
combination allows for a systematic exploration of reason-
nodes during the MCTS. It is fine-tuned to output an in-
ing paths while balancing exploration of new possibilities
dividual next reasoning step instead of a whole chain of
and exploitation of promising solutions judged by a value
thoughts towards a completion (which LLMs commonly
model. The framework achieves this alignment through
do). We achieve this by introducing a novel token, the end of
the implementation of a series of operators that guide the
intermediate step (eois) token, which denotes the completion
construction, traversal, evaluation, and updating of the rea-
of each reasoning step. The eois token complements the
soning tree.
standard end of sequence (eos) token, which indicates the
4
https://github.com/spcl/x1 conclusion of an entire reasoning chain. By incorporating
5
We are working continuously on expanding the framework as well as
6
adding more RLMs. We currently use Llama-3.1-8B-Instruct as base model.
15

the eois token, the framework enables the explicit identifi- • Resource Optimization The independent allocation of
cation of intermediate reasoning steps, allowing for greater computational resources to the value and policy models
interpretability and precise determination of whether the is inherently supported by the framework’s architecture,
reasoning process is complete or ongoing. This dual-token enhancing efficient resource utilization.
strategy enhances the LLM’s capability to decompose com- • Replication and Distribution The separation of value
plex problems into manageable substeps while ensuring the and policy models facilitates the application of distinct
model recognizes when a solution has been reached. replication and distribution strategies.

7.3.2 Training the Value Model Figure 6 illustrates the implementation of the framework
as a server architecture, demonstrating how these structural
The value model is designed to estimate the sum of the
enhancements contribute to improved scalability and effi-
expected discounted future rewards for a sequence of rea-
ciency. Building on these architectural enhancements, we
soning steps and a newly proposed reasoning step, quanti-
employ the following strategies to further optimize the
fying the value of the node modeling this step. For a given
framework’s efficiency and scalability, focusing on inference
node in the MCTS tree, its value (referred to in the MCTS
and parallelization.
literature as state action value or q-value) is defined as the
In the framework, we incorporate the standard optimiza-
expected cumulative reward discounted by the number of
tions of batching, quantization, and KV caching. Inference
steps required to achieve it. Formally, the q-value Qπ (st , at )
calls are batched in the policy model, enabling simultaneous
for traversing the edge to node st+1 when taking action at
processing of multiple queries. To expedite the reasoning
from st at depth t in the MCTS tree is expressed as
process, the framework creates multiple child nodes in par-
h i allel during the node expansion phase. Specifically, N new
Qπ (st , at ) = E γ T −t r(sT , aT ) | st , at (1) nodes are generated concurrently in each expansion step, re-
N ducing computational overhead and enhancing overall sys-
1 X T −t (i) (i) tem performance. Further optimization of inference speed is
≈ γ r(sT , aT ) (2)
N i=1 achieved through KV caching and quantization. KV caching
where γ is the discount factor, T marks the last reasoning mechanisms mitigate redundant computations, while quan-
step aT that is added resulting the terminal state sT +1 tization techniques reduce the memory consumption of both
containing the complete reasoning structure and rewards policy and value models.
are modeled sparse. The terminal state sT +1 is defined as the
state in which no additional reasoning steps can be added. It
7.5 Blueprint for Efficient Scaling
typically represents the state containing the final solution to
the problem at hand. Accordingly, r(sT , aT ) is the terminal Our blueprint can be deployed to AI HPC systems and
reward. We chose to model rewards sparse, where only the clouds, as both systems provide the performance and re-
final reasoning step receives a non-zero reward, since for sources necessary to scale RLMs. Deployment on HPC
most reasoning tasks, only the final answer can be evaluated systems is straightforward: compute tasks are distributed
against the true solution. As a result, one can only obtain across statically allocated nodes, connected with a low-
a reward signal when the last step is reached. We can latency and high-bandwidth interconnect, and with train-
approximate the q-value by sampling N reasoning chains ing data being available on a high-performance parallel
until the terminal state, as in 2, and averaging the terminal filesystem. On the other hand, the cloud provides many
rewards discounted by the depth required. configurable services that offer different trade-offs between
The q-value model is trained using data from completed performance, cost, and reliability. There, it becomes the
MCTS searches. Initially, when the q-value model is unavail- user’s responsibility to choose the storage options and com-
able, N simulations (complete rollouts) are performed, and pute granularity that provides the best match for expected
the average discounted reward is used to initialize the q- performance and cost. The architecture of our blueprint fits
values for each node. More information can be found in the into the microservice architecture, with a clear separation of
Appendix D.2. compute tasks, data storage, and coordination. This archi-
tecture helps to ease the configuration process, as different
7.4 Enabling Scalability and Efficiency components of the system can be deployed, scaled, and
The current implementation is built to scale to multiple optimized independently. In particular, the separation of
GPUs on multiple nodes. To further enhance the scalabil- value and policy servers allows them to be scaled sepa-
ity and computational efficiency, several architectural and rately according to the complexity of reasoning steps that
operational improvements have been implemented. might require different resource allocations to handle task
One design decision involves the decoupling of the value generation and evaluation.
and policy models. The deployment of dedicated Value and First, we outline the major decisions users must make
Policy servers confers several advantages: before deploying the x1 scaling blueprint:
• Scalability The decoupling of Value and Policy servers • Deployment Training and inference tasks are typically
from the MCTS instance facilitates scalability and the allocated to virtual machines and containers, with the
execution of multiple parallel MCTS instances. latter typically deployed as managed services with an
• Batch Processing The policy server incorporates batching orchestrator such as Kubernetes. There, x1 can benefit
capabilities, allowing the concurrent processing of multi- from modern frameworks like Ray [105] that hide the
ple queries, thereby enhancing throughput. complexity of managing a service in a Kubernetes cluster.
16

Phase 1 Training: Ini�alize models Phase 2 Training: Reinforcement Learning


Alternate between 1 and 2 to improve models ( 1 ) and data ( 2 )
LLaMA 3.1 Policy Model
RL Training Data 1 Training Policy Server
generates generates (data generator) context + reasoning step, advantage
Mul�ple CoT examples Mul�ple MCTS trees PPO
Input Output
Input Input Input Input Trainer
Input Output
Input Output
2 Data Genera�on Policy
Output Output Output Output batched
context & synced
Model
algorithmically insert extract one training Input new reasoning step
Buffer
\\eois tokens between sample per node
reasoning steps
Output
Training Data Training Data
....\\eois..........\\eois........\\eois....... context + reasoning step + MCTS q-value
context + reasoning step + MCTS q-value
Input 1 Training Value Server
..\\eois.......\\eois..........\\eois...........
context + reasoning step, q-value
.......\\eois........\\eois.......\\eois...... context + reasoning step + MCTS q-value
Output
MSE
LLaMA 3.1 Trainer
2 Data Genera�on
Value
LLaMA 3.1 SFT Training Stack of Training Input Model
linear layers
context + reasoning step
value
Policy Model Value Model Output

Fig. 6: An overview of the x1 framework is presented, highlighting its two-phase training process. In phase 1, the models are initialized, while in phase 2, the models
are iteratively refined by alternating between constructing a sufficient number of MCTS trees and training the models on data derived from these trees.

• Data Storage In the cloud, object storage provides auto- communication protocols [36], [70].
matic bandwidth scalability that allows scale computa- • GPU Management Cloud rental of GPU devices is partic-
tions operating on the same data. To overcome latency and ularly expensive, and procuring a sufficient number of de-
power constraints, data can also be placed in in-memory vices can be challenging, specifically when constrained to
caches like Redis and hybrid solutions that combine disks a single cloud region. Given the large compute and mem-
with flash memory [172]. ory requirements of base models, space-sharing might
• Communication Requirements of the x1 blueprint differ not be feasible. On the other hand, time-sharing of GPU
from classical microservices, that rely on high-level ab- devices between different x1 services could be a viable
stractions like RPC and REST interfaces. RLM must uti- alternative, but it is currently constrained by large memory
lize high-performance network fabrics offered by modern allocations and the cost of swapping model checkpoints
clouds, such as InfiniBand on Azure and Elastic Fab- between CPU and GPU memory. To increase resource
ric Adapter (FBA) on AWS, both capable of achieving utilization, new techniques for efficient GPU checkpoint
throughput of 400 Gb/s [39]. These are also available and restore are needed [47].
to training processes distributed across many GPUs, e.g., • Parameter-Efficient Resource Sharing Resource-sharing
through specializations of the NVIDIA collectives library can be further enhanced by utilizing a shared base model
NCCL. architecture for the policy and value models, while dy-
• Parallelism We apply parallelism at multiple blueprint namically swapping task-specific parameter layers - such
levels, including the classic data, model, and pipeline as Low-Rank Adaptation [62], prefix tuning [81], or other
parallelism. These can scaled horizontally across a larger adapter layers - on the GPU during inference. These
number of virtual machines and containers. On the other modular strategies keep the base model loaded in device
hand, reasoning steps can benefit from elastic scaling, like memory and replace only the lightweight task-specific
in distributed MCTS and Beam Search, where each path layers, eliminating redundant loading and reducing both
can be explored in parallel. There, containers can be allo- latency and memory usage. An example of an RLM, which
cated on the fly to support new paths and deallocated as uses a shared base model with separate additional linear
soon as the parallelism scale of the computation decreases. layers for policy and value model, is AlphaMath [23].
New developments in the machine learning infrastruc- • Cross-Region Deployment Cloud applications are often
ture can significantly impact RLM deployment strategies: deployed in a single region to avoid the performance and
cost of cross-region data access. However, workloads can
• Elastic Compute Computing tasks can be executed on be scheduled globally, suspended, and migrated across re-
ephemeral resources that trade the guaranteed lifetime gions to avoid hardware resource exhaustion and achieve
and reliability for lower costs, such as spot virtual ma- lower carbon emissions [33], [153].
chines [101]. Serverless functions provide elasticity scal-
ability with fine-grained pricing models [37], which can
be a good fit for dynamically generated reasoning steps. 7.6 Example Analysis: Token Probability Distributions
However, serverless functions are stateless and suffer As an illustrative example, we use the framework to directly
from cold starts, which requires optimization techniques leverage the token probability distribution, thereby facilitating
dedicated to LLMs [47]. Furthermore, restricted network the use of associated properties—such as entropy and vari-
communication in functions forces the adoption of new ance—for guiding subsequent reasoning decisions. By fo-
17

(a) 1st example (b) 2nd example

(c) 3rd example (d) 4th example

Fig. 7: Four examples of model output with highlighted tokens indicating uncertainty levels. The outputs have been color-coded to reflect the confidence levels of
the model’s token predictions. Tokens are highlighted in purple when the highest probability is below 0.8 (indicating lower certainty without significant contention),
in blue when the second-highest probability exceeds 0.1 (indicating contention, where another token is a close alternative), and in red when both conditions are
met (indicating high uncertainty). These examples illustrate varying levels of prediction confidence and contention in reasoning steps, emphasizing regions of high
ambiguity or competition between plausible continuations. This type of visual analysis is useful for identifying points in the reasoning process where the model
lacks confidence or is torn between alternatives, guiding refinements in reasoning strategies and model design. It also helps pinpoint critical areas where additional
supervision or context may improve model performance.

cusing on these probabilistic characteristics, the framework well-supported continuation, this confidence can stream-
can help identify when to expand a given reasoning step. line decision-making and reduce computational overhead.
Using token probability distributions can be used for navi- However, if the model’s confidence is misplaced—perhaps
gating the reasoning based on both coarse and fine steps. due to biases in the training data or a lack of con-
To support this analysis, the x1 implementation includes text—relying on a single dominant token may cause the
scripts that provide insights into token-level metrics, such reasoning process to follow a suboptimal path. In such
as entropy fluctuations and distribution patterns, to inform cases, it’s crucial to assess whether the high-probability
reasoning strategies. token genuinely represents the most logical next step or if
additional validation is needed.
7.6.1 Relevance of Token Probability Distribution • Skewed Distribution with Multiple High-Probability

The token probability distribution provides critical informa- Tokens. In some cases, the distribution may be skewed
tion about the likelihood of different next-step candidates in with a small set of tokens receiving much higher prob-
a reasoning process. By examining this distribution, we can abilities than others. This indicates that the model sees
gain insight into how certain tokens dominate or diversify several plausible continuations, each with a reasonable
the reasoning space, and in turn, guide more informed chance of being correct. While this is generally a posi-
decisions about which step to take next. tive sign—offering a diversity of credible options—it also
We now list a few scenarios where different token distri- complicates the decision-making process. The reasoning
butions offer insights into which reasoning decision is best strategy must weigh the trade-offs between these top can-
to take at a given step. didates, considering not only their individual probabilities
but also how each choice impacts the subsequent reason-
• Flat Token Distribution. A flat probability distribution
ing trajectory. This scenario highlights the need for effec-
occurs when all tokens have roughly equal probabilities. In tive evaluation metrics (like entropy or Gini coefficient) to
this scenario, there is significant uncertainty about which help select the step that contributes most to reaching the
step is the best to choose because no single token stands correct or desired outcome.
out as a clear candidate. This can make the reasoning
process more exploratory, as the model may need to By analyzing token probability distribution and identi-
consider multiple tokens equally and rely on additional fying the cases above and others, reasoning strategies can,
strategies—such as external heuristics or learned poli- for example, improve efficiency (identifying when a distri-
cies—to identify the most promising step. While this can bution is flat allows the reasoning algorithm to focus on
foster exploration, it may also lead to inefficiencies since diversification or introduce additional constraints to narrow
the model might need to evaluate many equally plausible down choices), enhance decision confidence (recognizing
paths before finding an optimal solution. Another decision when one token is dominant can help expedite decisions,
that could be taken in such a scenario, is to delay initiating provided the model’s confidence is well-founded), or foster
a reasoning step till the token distribution changes to be balanced exploration (detecting multiple high-probability
more skewed. tokens facilitates exploring various credible paths without
• Skewed Distribution with One Dominant Token. When
being overly committed to a single option).
one token has a much higher probability than others, the
distribution is highly skewed. This often signals that the 7.6.2 Analyzing Token Probability Distribution
model is confident about the next step in the reasoning To understand the form of a token probability distribution,
process. If the dominant token corresponds to a logical or we examine variance, entropy, VarEntropy, and the Gini
18

coefficient as key metrics that offer distinct perspectives on In Figures 8a and 8d, specific regions emerge where the
the distribution’s shape and characteristics. top two probabilities are very close, while the remaining
Variance provides a broad measure of uncertainty by probabilities are significantly smaller. Such regions likely
reflecting how spread out the probabilities are across the vo- indicate scenarios where forking the reasoning process (e.g.,
cabulary. When variance is low, the probabilities are nearly exploring multiple paths) could disproportionately benefit
uniform, indicating a flat distribution. However, variance future outcomes, as the competing high-probability tokens
alone does not capture the specific structure or shape of the suggest alternative plausible continuations. Conversely, in
distribution. For example, two distributions can have the instances where the first probability is notably high, with
same variance but differ in their overall form, such as one much lower second and remaining probabilities, the model
having multiple minor peaks versus another being nearly exhibits strong confidence in a single continuation. These
uniform with a single dominant token. To address this, we cases are conducive to more deterministic reasoning, as
consider further measures below. forking may be unnecessary.
Entropy has long been a standard measure of uncer- Additionally, regions with a relatively high sum of the re-
tainty and information content in a probability distribu- maining probabilities (close to the top two) highlight flatter
tion. Higher entropy corresponds to greater unpredictabil- distributions with high uncertainty. These scenarios signal
ity—requiring more information to describe the system’s a need for cautious reasoning, where clarification or addi-
state. For instance, if all tokens have nearly equal proba- tional contextual refinement may help reduce ambiguity. For
bilities, the entropy is high, reflecting a flat distribution. In instance, such uncertainty may suggest that the model has
contrast, low entropy occurs when a small number of tokens not yet committed to a specific path and could benefit from
dominate, resulting in a skewed distribution.
P The entropy revisiting earlier reasoning steps to address potential errors
of a distribution is given by H = − i pi log2 (pi ), where or misalignments.
pi is the probability of the i-th token. This metric provides Figure 9 further analyzes these results using metrics such
valuable insight into whether the distribution is diffuse and as variance, entropy, VarEntropy, and the Gini coefficient. In
exploratory or concentrated and decisive. Figure 9a, a zero-shot prompt demonstrates lower uncer-
VarEntropy extends this analysis by measuring the vari- tainty overall, suggesting that it yields more confident pre-
ability of entropy itself, thus offering a dynamic view of how dictions and potentially higher-quality outputs. However,
uncertainty changes. A high VarEntropy combined with low the presence of specific high-probability tokens (e.g., “472”)
entropy often indicates a sharp, focused distribution with a raises concerns about potential data leakage into the training
few dominant outcomes. Conversely, low VarEntropy and set or the tokenizer, which could bias the results. Another
high entropy typically reflect a flat, uniform distribution notable observation is the high uncertainty associated with
where
P no single token stands out. The VarEntropy is defined <thought>tokens, which appear challenging for the model
2
as i pi (| log(pi )| − |H|) . This metric captures the nu- to predict accurately. This highlights the complexity intro-
anced shifts in distribution shape, helping to pinpoint how duced by token granularity, where most words correspond
tightly probabilities cluster around certain tokens versus to single tokens, resulting in a roughly even distribution for
how broadly they spread. the next token across the vocabulary in some contexts.
The Gini Coefficient, traditionally used to measure in- The uncertainty metrics provide actionable insights for
equality, provides another lens on the form of the distribu- reasoning strategy design. For example, cases with high
tion. A perfectly equal distribution has a Gini coefficient of VarEntropy and low entropy indicate a distribution where a
0, signifying that all tokens have identical probabilities. A few outcomes dominate, making tree-based search strate-
Gini coefficient closer to 1 indicates high inequality, where a gies effective. These strategies prioritize exploring high-
few tokens hold most of the probability mass. By visualizing probability outcomes while avoiding unnecessary evalua-
the cumulative distribution of sorted probabilities, the Gini tions of less probable branches. In contrast, low VarEntropy
coefficient highlights how the probability is concentrated or and high entropy reflect a flat distribution where no clear
dispersed. outcome dominates. Such cases could benefit from clarifica-
Together, these metrics—variance, entropy, VarEntropy, tion mechanisms or intermediate step refinements to reduce
and Gini—enable a detailed examination of token prob- ambiguity before proceeding further.
ability distributions. By leveraging each metric’s unique Interestingly, the Gini coefficient often highlights critical
strengths, we can effectively characterize whether a distri- regions more effectively than other metrics. In vital reason-
bution is flat, skewed with a dominant token, or skewed ing areas, it captures the inequality in token probabilities,
across several highly probable tokens, ultimately guiding helping to identify tokens that significantly influence the
more informed decisions in reasoning and model develop- reasoning process. This contrasts with metrics like entropy
ment. and VarEntropy, which may also flag tokens related to
formatting or stylistic choices, providing less task-specific
7.6.3 Example Results utility.
Figure 7 and 8 illustrate example model outputs and their Overall, these visualizations and metrics emphasize the
respective token probability distributions. By analyzing the importance of analyzing token probability distributions to
highest probabilities, the second-highest probabilities, and design effective reasoning strategies. By leveraging the nu-
the sum of the remaining probabilities, we gain valuable anced patterns revealed by these metrics, models can better
insights into the underlying token distribution, which can adapt to uncertainty, balance exploration and exploitation,
subsequently be quantified through the uncertainty metrics and optimize decision-making during the reasoning pro-
discussed earlier. cess.
19

1.0

2nd Highest Probability


0.4
Highest Probability 0.3

Sum of the Rest


0.8 0.3
0.2 0.2
0.6
0.1 0.1
0.4
0.0 0.0
lua o
te
eva |T>

$\
ceil
l
sq{r\
t
{
20
}}
\
ceir
l
^
2
$,
newee
d
foll to
ow
o the
ope rder
rat of
s
PE(
MD
AS
):
Eva 1
lua .
te
squthe

4
47 .
2$
2
Rou .
nd
up
nea theo

usgi er
are
t
of
20
:
sq$r\
t
{
20
}
app \

int rest
ng
funeiline
ctiog
n
...
rox
roo
ion

c th
xt

e
f_te
n_o
egi
<|b

√ √
(a) To evaluate ⌈ 20⌉2 , we need to follow the order of operations (PEMDAS):1. Evaluate the square root of 20: 20 ≈ 4.472
2. Round up to the nearest integer using the ceiling function: ⌈4.472⌉ = 5
3. Square the result: 52 = 25Therefore, the final answer is 25 .

1.0 0.6

2nd Highest Probability


0.4
Highest Probability

Sum of the Rest


0.8 0.3 0.4
0.6 0.2
0.2
0.4 0.1
0.0 0.0
ht
>
ou<gth
Firs>
t
,
newee
d
finto
d
squthe

,
so
let
's
cal use
cul a
r
app geto
rox a t
ima n
ue
of
20

4
47 .
are
t
of
20

2
ug /
h>t
ougth
ht
Sin >
ce
dea 'ree
wintg
h
funeiline
ctiog
n
we,
sid d
sm ther
in llese
grteegert

n
equor
al
to
thtaer

squthe
...
val te

are
tho .<
roo

ato

c th

con oul
xt|

<

li

a
a
f_te

s h
n_o
egi
<|b


(b) <thought>First, we need to find the square root of 20, so let’s use a calculator to get an approximate value of 20 ≈ 4.472.<thought>
<thought>Since we’re dealing with the ceiling function, we should consider
√ the smallest integer greater than or equal to the square root of 20, which is the ceiling of
20. </thought> √
<thought>We can also consider the perfect squares that are closest to 20, such as 16 and 25, to estimate the value of 20 and then apply the ceiling
function.</thought>
1.0

2nd Highest Probability


0.8
0.4
Highest Probability

Sum of the Rest


0.8 0.6
0.3
0.6 0.4
0.2
0.4
0.1 0.2
0.2
0.0 0.0
eva cae
>
ou<gth
ht
Firs>
t
,
newee
d
finto
d
squthe

20
app wh ,
rox ich
ima is

4
47 .

tho .<5
ug /
h>t
ou<gth
>Wht
are
t
of

2
,
so
we
roucan

nea theo

usbi er
funeiline
ctiog
n

l n
exp uate
res the
y

nd
up

nu holet
ng

wh ,
giv h
es
us

n
$\
ceil
l
sq{r\
t
{
20
}}
\
...
roo

res
tel

c th

sio
xt|

i
m
w
f_te
n_o
egi
<|b

(c) <thought>First, we need to find the square root of 20, which is approximately 4.472, so we can round up to the nearest whole number using the ceiling function,
√ which gives us 5.</thought>
<thought>We can evaluate the expression ⌈ 20⌉2 by first finding the square root of 20, then rounding up to the nearest whole number, and finally squaring the
result.</thought> √
<thought>Since ⌈x⌉ is the ceiling function, we can rewrite the expression as ⌈ 20⌉2 = ⌈4.472⌉2 = 52 = 25.</thought>
1.0 0.5
2nd Highest Probability

0.6
Highest Probability

0.4
Sum of the Rest

0.8
0.3 0.4
0.6
0.2
0.4 0.2
0.1
0.2 0.0 0.0
ce
>
ou<gth
ht
Firs>
t
,
newee
cal d
cul to
ate
squthe

tho .<0
ug /
h>t
ou<gth
>Wht
staan
fin byt

of
20
app wh ,
rox ch
ima is

4
47 .
are
t
of

g
squthe
are
t

2
a ,
t nd
rouhen
y

nd
up
nea theo

usbi er
nu holet
ng
funeiline
ctiog
tho .<n
ug /
h>t
ou<gth
A ht
appnoth >
roa er
ch
is
to
...
r
roo

roo

res
tel

c th
2

din
xt|

m
w
_te
_of
n
egi
<|b

(d) <thought>First, we need to calculate the square root of 20.</thought>


<thought>We can start by finding the square root of 20, which is approximately 4.472, and then round up to the nearest whole number using the ceiling
function.</thought>
<thought>Another approach is to recognize that 20 is between the perfect squares 16 and 25, so we can use this information to estimate the ceiling of the square root
of 20.</thought>

Fig. 8: Probabilities of the first 64 tokens of example model outputs. We show the two highest probabilities as well as the sum of the other probabilities.
20

0 1.5 1.000

Gini Coefficient
Variance (1e-6) 4

VarEntropy
0 1.0

Entropy
0.999
0 0.5 2
0.998
0 0.0 0
luaTo
te
eva t|>

$\
ceil
l
sq{r\
t
{
20
}}
\
ceir
l
^
2
$,
newee
d
foll to
ow
ope ordeer
rat of
s
P(
MDE
AS
):
Eva 1
lua .
squthee

:
sq$r\
t
{
20
app }\

4
47 .
2$
2
Rou .
nd
up
nea thteo
usgi er
are
t
of
20

int rest

tg
fucneilinhe
ctiog
n
...
rox
roo
ion
th

n
x

e
f_te
n_o
egi
<|b

√ √
(a) To evaluate ⌈ 20⌉2 , we need to follow the order of operations (PEMDAS):1. Evaluate the square root of 20: 20 ≈ 4.472
2. Round up to the nearest integer using the ceiling function: ⌈4.472⌉ = 5
3. Square the result: 52 = 25Therefore, the final answer is 25 .

0 1.0000
4

Gini Coefficient
Variance (1e-6)

0 2

VarEntropy
Entropy
0.9995
0 1 2
0 0.9990
0 0
>
ou<gth
ht
Firs>
t
,
newee
d
finto
squthde

ima n
ue
of
20

4
4 .
roroe
t
of
20
so,
let
's
cal use
cul a
r
app gto
rox aet

tho .7<2
ug /
h>t
ou<gth
ht
Sin >
ce
dea 'ree
ling
th
fucneilinhe
ctiog
n
,
cosnhouwe
sid ld
sm ther
inalle e
grteegest
a r
n
equor
al
squtheo
thtaer

...
val te

are
ato

t
w

wit
xt|

a
f_te
n_o
egi
<|b


(b) <thought>First, we need to find the square root of 20, so let’s use a calculator to get an approximate value of 20 ≈ 4.472.<thought>
<thought>Since we’re dealing with the ceiling function, we should consider
√ the smallest integer greater than or equal to the square root of 20, which is the ceiling of
20. </thought> √
<thought>We can also consider the perfect squares that are closest to 20, such as 16 and 25, to estimate the value of 20 and then apply the ceiling
function.</thought>

0 4 1.0000
3

Gini Coefficient
Variance (1e-6)

0 3

VarEntropy
2 0.9995

Entropy
0 2
1 0.9990
0 1
0 0 0.9985
eva cWae
>
ou<gth
ht
Firs>
t
,
newee
d
finto
squthde

of

app wh 0,
rox ich
tel s

4
47 .
roroe
t

2
s,
weo
rouan

mb e
usi er
fun ilinhe
ctiog
n

es
nd
up
nea thteo
nuwhosl t

tg

wh ,
givich
us
tho .<5
ug /
h>t
ou<gth
> ht
exp uatne
res the
n
$\
ceil
l
sq{r\
t
{
20
}}
\
...
y
ima i
2

sio
xt|

re
a

e
f_te

l
c
n_o
egi
<|b

(c) <thought>First, we need to find the square root of 20, which is approximately 4.472, so we can round up to the nearest whole number using the ceiling function,
√ which gives us 5.</thought>
<thought>We can evaluate the expression ⌈ 20⌉2 by first finding the square root of 20, then rounding up to the nearest whole number, and finally squaring the
result.</thought> √
<thought>Since ⌈x⌉ is the ceiling function, we can rewrite the expression as ⌈ 20⌉2 = ⌈4.472⌉2 = 52 = 25.</thought>

0 1.0000
Gini Coefficient
Variance (1e-6)

0 2 4 0.9995
VarEntropy
Entropy

0 0.9990
1 2
0
0 0 0.9985
ce
>
ou<gth
ht
Firs>
t
,
newee
cal d
cul to
squthee

ou<gth
>Wht
staan
fin bryt
squthge

app wh 0,
rox ich
tel s

4
47 .
are
t
of
tho .2<0
ug /
h>t

roroe
t
of

2
a ,
t nd
rouhen

nea thteo
mb e
usi er
y

nd
up

nuwhosl t

tg
fucneilinhe
cti g
tho .o<n
ug /
h>t
ou<gth
ht
apApnoth >
roa er

to
ch
is
...
roo

ima i
at

din

n
xt|

re
a
_te
_of
n
egi
<|b

(d) <thought>First, we need to calculate the square root of 20.</thought>


<thought>We can start by finding the square root of 20, which is approximately 4.472, and then round up to the nearest whole number using the ceiling
function.</thought>
<thought>Another approach is to recognize that 20 is between the perfect squares 16 and 25, so we can use this information to estimate the ceiling of the square root
of 20.</thought>

Fig. 9: Uncertainty metrics (variance, entropy, VarEntropy, and the Gini coefficient) plotted against the first 64 tokens of the output token sequence.

[Figure 10: plot of the estimated 95%-confidence interval length (0.02–0.10) against the question set size (0–1000) for Qwen2.5 Math 1.5b Instruct and LLama 3.1 8b Instruct.]
Fig. 10: Estimated 95%-confidence interval length for different question set sizes using sampled generated answers from a subset of 1000 questions with eight generated answers per question at temperature 1. The confidence interval is calculated over the eight different pass@1 subsets of each question, with 32 sets randomly sampled with replacement for each set size.

7.7 Benchmarking RLMs
Our experience with benchmarking RLMs highlights critical considerations for ensuring fair and reliable performance comparisons. Incorporating multiple models within a reasoning scheme often increases output variance, emphasizing the need for benchmarking on sufficiently large sample sizes. Benchmarks with limited sample sizes, such as AIME or AMC, which often provide only a two-digit number of samples, risk selective reporting. This occurs when researchers focus on subsets of results where their models perform well, rather than reflecting the true variability of their systems.
Experimental findings (Figure 10) demonstrate that achieving low error variability, within a single-digit percentage range, requires evaluation across at least 500 samples. Given the inherent complexity of RLMs, which often exhibit greater variability than simpler LLM setups, these results suggest specific sample size thresholds. We recommend that individual benchmarks contain at least 200 samples per category, with a minimum of 500 samples evaluated across all categories to ensure statistically robust comparisons. Adhering to these guidelines would in many cases mitigate variability-driven biases and facilitate more transparent assessments of RLM performance across different approaches.
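To make this recommendation concrete, the following minimal Python sketch estimates how the pass@1 confidence-interval length shrinks as the question-set size grows, roughly mirroring the bootstrap procedure behind Figure 10; the synthetic data, array layout, and function names are illustrative assumptions rather than the actual x1 evaluation code.

```python
# Sketch of the sample-size analysis behind Figure 10 (assumed data layout:
# `correct` is a (num_questions x num_answers) 0/1 matrix of verified answers).
import numpy as np

rng = np.random.default_rng(0)
num_questions, num_answers = 1000, 8
# Hypothetical verification results: 1 if a sampled answer was judged correct.
correct = rng.integers(0, 2, size=(num_questions, num_answers))

def ci_length(correct, set_size, num_sets=32, alpha=0.05):
    """Average 95%-CI length of pass@1 over num_sets question subsets,
    each drawn with replacement from the full question pool."""
    lengths = []
    for _ in range(num_sets):
        idx = rng.integers(0, correct.shape[0], size=set_size)
        pass_at_1 = correct[idx].mean(axis=0)   # one pass@1 value per answer column
        lo, hi = np.quantile(pass_at_1, [alpha / 2, 1 - alpha / 2])
        lengths.append(hi - lo)
    return float(np.mean(lengths))

for size in (50, 200, 500, 1000):
    print(size, round(ci_length(correct, size), 4))
```

Running such a sweep on real evaluation data makes the variance-versus-sample-size trade-off explicit before committing to a benchmark subset.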
8 EXAMPLE INSIGHTS FOR EFFECTIVE RLMS
We provide example insights gathered from the literature and from our analyses of design decisions using x1.
Use Process-Based Evaluation Process-based evaluation, in which the reasoning structure as a whole is assessed, has been shown to be more reliable than alternative methods such as Outcome-Based Reward Models (ORMs). By examining the reasoning steps and their relationships within the structure, process-based evaluation provides a richer signal that helps models refine their reasoning paths and improve overall accuracy. This approach ensures that each intermediate step contributes positively to the final outcome, resulting in more robust reasoning and better generalization across tasks.
Use Two Phases for Training Adopting a two-phase training strategy—splitting SFT and RL—has proven effective in several contexts. This phased approach allows the model to first learn a solid foundation of reasoning patterns in phase one, followed by fine-tuning under more complex, adaptive conditions in phase two. For instance, research on Process Reinforcement through Implicit Rewards demonstrates that models trained with a dedicated SFT phase can maintain performance on standard benchmarks while achieving improved reasoning capabilities during RL. This separation also helps mitigate instability and ensures that each phase targets specific learning objectives, ultimately leading to more robust RLMs.
Train on Familiar Distributions Training on familiar data distributions can significantly influence a model's initial performance and subsequent improvements. For example, PRIME [38], [163] shows that training on a carefully curated token sequence (such as the eois token approach) avoids performance degradation. Similarly, in tasks like rStar-Math [52], models trained on well-defined, familiar distributions tend to stabilize more quickly and produce higher-quality reasoning outputs. By focusing on familiar distributions, researchers can ensure that the models effectively internalize the fundamental reasoning patterns before moving on to more diverse or challenging tasks.
Be Careful with Prompting LLMs to Critique and Evaluate Relying on prompting alone to encourage large language models to critique and evaluate their own outputs often leads to instability. Research indicates that models struggle to self-correct reliably when prompted to refine their reasoning without external guidance. For example, a recent study [64] illustrates that such prompting typically fails to produce consistently improved results. Another work [114] demonstrates that explicitly training the model to output better responses through iterative refinement outperforms simple prompting. These findings highlight the importance of structured training approaches and careful operator design when aiming for self-improvement capabilities in RLMs.

9 BENCHMARKS FOR RLMS
We now outline benchmarks related to RLMs. Sun et al. [135] provide a clear distinction between various types of reasoning, including mathematical, logical, causal, and commonsense. Below, we highlight a selection of benchmarks for each category. We also include additional categories related to the realm of RLMs, namely coding-related benchmarks and benchmarks that involve reasoning utilities such as tools or RAG. We show the benchmarks in Figure 11.

9.1 Mathematical Reasoning
Mathematical reasoning benchmarks involve arithmetic, geometry, and other mathematical tasks that use logical constructs and symbolic computation. They can be further categorized into benchmarks with fixed datasets and template-based benchmarks [103], [133].
GSM8K [35] consists of a train set (7,473 samples) and a test set (1,319 samples) of high-quality grade school-level mathematical word problems.

[Figure 11 depicts the taxonomy of reasoning benchmarks: Mathematical Reasoning (§9.1), Logical Reasoning (§9.2), Coding (§9.3), Causal Reasoning (§9.4), Commonsense Reasoning (§9.5), and Reasoning Utilities (§9.6), each with its representative benchmarks.]
Fig. 11: Overview of benchmarks for RLMs.

Early breakthroughs in mathematical problem-solving by language models were achieved by training on the training subset of this benchmark.
GSM Symbolic [103] introduces a generator that can use 100 templated questions, which are derived from the questions of the GSM8K dataset. This approach emphasizes the limited generalization capabilities of current RLMs and highlights the importance of templated benchmarks in evaluating LLMs' performance in mathematical reasoning.
The MATH [59] benchmark contains questions ranging in difficulty from high school to competition-level mathematics, containing 12,500 problems, split into 7,500 for training and 5,000 for testing. These problems are sourced from various mathematics competitions such as the AMC 10, AMC 12, and AIME (Level 5).
Functional MATH [133] builds upon the MATH dataset by introducing templated problem formats designed to assess the functional understanding of mathematical concepts by LLMs. However, the code and templates remain inaccessible to the public, limiting its broader adoption.
AIME [4], AMC [3], and GaoKao [82] feature mathematical tasks ranging from Olympiad level to college entrance level difficulty. The AMC is generally easier and the GaoKao offers a broader range of difficulty levels, while the AIME is likely the most challenging. AIME consists of 30 problems, the AMC includes 40 problems, and the GaoKao contains around 300 questions.
OlympiadBench [56] is a more advanced benchmark that spans Olympiad-level mathematics and physics problems, comprising 8,476 problems sourced from international and Chinese Olympiad competitions, as well as the Chinese College Entrance Exam (GaoKao).
CollegeMATH [139] is designed for evaluating college-level mathematics, with a dataset that contains 1,281 training problems and 2,818 test problems. These problems are sourced from textbooks, extracted with the help of LLMs.
The U-MATH [31] benchmark features 880 university-level test problems without images sourced from ongoing courses across various institutions, currently available through the Gradarius platform. This benchmark presents unpublished, open-ended problems balanced across six core subjects.
FrontierMath [49] is an expert-level benchmark containing exceptionally challenging mathematics problems covering a wide array of modern mathematical domains. The dataset size remains undisclosed, but the problems have been carefully crafted and tested by expert mathematicians. Notably, current state-of-the-art models can solve less than 2% of the problems, revealing a still significant gap between AI capabilities and human expertise in the field of mathematics.
In general, it is recommended to utilize templated versions of these benchmarks where available, rather than relying solely on question-answer (QA) pairs. Templated benchmarks minimize the likelihood of contamination from prior exposure during model training, thus providing a more accurate measure of performance [103], [133].
Other related benchmarks include MATH-401 [164], MultiArith [118], AddSub [61], CHAMP [98], MathQA [5], ARB [123], FIMO [85], Geometry3K [88], GeoQA [26], UniGeo [24], miniF2F [175], LeanDojo [159], TheoremQA-MATH [29], TRIGO [157], LISA [69], MathVista [87], ChartQA [99], TABMWP [89], MultiHiertt [173], and SCIBENCH [148].
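As a simple illustration of the templated approach (in the spirit of GSM Symbolic and Functional MATH, but with a made-up template rather than one taken from those benchmarks), the following sketch regenerates numeric variants of a question so that exact question strings are unlikely to have been seen during training.

```python
# Illustrative sketch of a templated benchmark item (hypothetical template,
# not an actual GSM Symbolic or Functional MATH template).
import random

TEMPLATE = ("{name} buys {n} notebooks for {price} dollars each. "
            "How much does {name} pay in total?")

def instantiate(seed: int) -> tuple[str, int]:
    """Return a (question, answer) pair with freshly sampled values."""
    rng = random.Random(seed)
    name = rng.choice(["Ava", "Ben", "Chen", "Dana"])
    n, price = rng.randint(2, 12), rng.randint(3, 9)
    return TEMPLATE.format(name=name, n=n, price=price), n * price

for s in range(3):
    question, answer = instantiate(s)
    print(question, "->", answer)
```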
9.2 Logical Reasoning
Logical reasoning emphasizes formal processes, from propositional and predicate logic to automated theorem proving.
PrOntoQA [121] generates ontology graphs, similar to causality graphs, which do not necessarily reflect natural patterns. From these graphs, it constructs statements and poses questions that necessitate logical reasoning for resolution. Due to the abstract and artificial nature of some ontology graphs, models must focus more on step-by-step logical reasoning rather than relying on commonsense inference to derive correct conclusions.
BIG-Bench [132] is one of the most extensive benchmarks for reasoning tasks, encompassing over 200 tasks, each potentially comprising numerous questions. It covers a broad range of domains and employs templated question formats, enabling a systematic evaluation of reasoning capabilities across diverse contexts.

ARC Challenge [32] assesses the ability to understand formal patterns, rules, and transformations within structured, grid-based environments. Tasks focus on identifying logical structures such as conditional relationships and sequences. For instance, deducing transformations between grids based on abstract rules exemplifies the application of formal logical reasoning paradigms.
Other benchmarks include ProofWriter [137], FOLIO [54], WANLI [84], CLUTRR [130], Adversarial NLI [106], AbductionRules [162], and Adversarial ARCT [107].

9.3 Coding
There also exist benchmarks related to how well a given model can code. These include ODEX [151], SWE-bench [71], DS-1000 [76], APPS [57], MBPP [6], and HumanEval [27].

9.4 Causal Reasoning
Causal reasoning involves understanding and analyzing cause-effect relationships, including counterfactual reasoning and causal inference. This domain challenges models to predict or reason about events based on causal dynamics.
The Tübingen Cause-Effect Pairs Dataset [104] comprises 108 cause-effect pairs drawn from diverse domains such as meteorology, biology, medicine, engineering, and economics. It serves as a comprehensive benchmark for assessing causal reasoning across various contexts.
The Neuropathic Pain Dataset [142] captures complex relationships between nerve function and symptoms in patients. It requires domain-specific knowledge and causal inference to accurately interpret the data.
The Arctic Sea Ice Dataset [67] consists of a 12-variable graph that models the dynamics of Arctic sea ice based on satellite data generated since 1979. It provides a structured environment to explore causal relationships within climatological systems.
The CRASS Benchmark [46] focuses on counterfactual reasoning tasks using 274 sample multiple-choice questions. It evaluates models' abilities to answer counterfactual questions, using top-k accuracy as the primary performance metric.
Many of these benchmarks have either been largely solved by current state-of-the-art models, or their applicability in real-world language model tasks remains limited, rendering them unsuitable for benchmarking current RLMs.

9.5 Commonsense Reasoning
Commonsense reasoning encompasses tasks that require the application of everyday knowledge, including questions that rely on implicit cultural, social, or contextual understanding. This category also extends to specialized domain knowledge tasks.
GPQA (Diamond) [116] is a multiple-choice benchmark spanning disciplines such as chemistry, genetics, biology, and physics. The questions are designed to be solvable by experts (PhDs) within their respective fields but remain challenging for experts from unrelated domains. The diamond subset contains 198 samples.
MMLU (STEM) [58] incorporates questions across a spectrum of difficulty, ranging from general commonsense reasoning to highly specialized domain knowledge.
Other related benchmarks include Social IQa [120], SWAG [165], HellaSWAG [166], CommonsenseQA [138], PIQA [19], PHYRE [7], OpenBookQA [102], CConS [74], WinoGrande [119], and FactCC [75].

9.6 Reasoning Utilities
Benchmarking capabilities of RLMs related to reasoning utilities involves testing how an RLM acts as an agent. This includes benchmarks such as GAIA [66], WebArena [177], Mind2Web [41], WebShop [160], ALFWorld [126], AgentBench [86], AgentGym [154], and AgentBoard [21]. Another line of related benchmarks tests the RAG capabilities [25], [44], [93], [156].

10 RELATED ANALYSES
RLMs have been explored from several angles in prior works, yet significant gaps remain in providing a systematic blueprint and open-sourced framework for their construction. Below, we categorize prior efforts and describe how our work advances the field.

10.1 Reasoning with Standard LLMs
Several works explore techniques for enhancing the reasoning capabilities of standard LLMs. These approaches use straightforward mechanisms applied during pre-training, fine-tuning, or inference.
Enhancing Reasoning with Training Huang and Chang [63] outline pre-training and fine-tuning on reasoning datasets, and advanced prompting strategies. Sun et al. [135] contribute additional insights, including techniques such as alignment training and the integration of Mixture of Experts architectures. Furthermore, Huang et al. [65] demonstrate the possibility of self-improvement on reasoning tasks with additional training on self-generated labels.
Reasoning with Prompting & In-Context Learning Qiao et al. [112] provide an overview of prompting-only techniques, classifying prompting methods into two main categories: strategy-enhanced reasoning and knowledge-enhanced reasoning. Besta et al. [14] provide a taxonomy of different advanced in-context reasoning topologies. These include the Chain-of-Thought (CoT) [152], Tree of Thoughts (ToT) [161], and Graph of Thoughts (GoT) [9].
Some of these works further provide overviews of different reasoning tasks, reasoning datasets, and reasoning benchmarks [63], [112], [135]. Others focus on enhancing domain-specific reasoning, such as mathematical [2], [90], [158] or logical reasoning [92].
These studies remain largely limited to reviewing existing literature. Therefore, they lack code implementation and rarely employ formal language. Most importantly, they rarely cover explicit reasoning models. Our blueprint integrates most of these techniques within a broader, modular structure.

10.2 Explicit Reasoning Models
The following works explore techniques that extend beyond basic mechanisms applied during pre-training or inference. These methods involve additional computation to iteratively refine reasoning paths, often increasing computational demands during training and/or inference.
Dong et al. [43] provide a taxonomy and survey of inference-time self-improvement methods, including independent, context-aware, and model-aided approaches. Guan et al. [51] propose verifier engineering, a post-training paradigm for foundation models involving three stages: Search, Verify, and Feedback, to enhance model outputs with scalable supervision signals. Zeng et al. [167] provide a comprehensive roadmap for reproducing OpenAI's o1 reasoning model from a reinforcement learning perspective. Although the work thoroughly examines all core components: policy initialization, reward design, search, and learning, no implementation is provided. Various specific implementations of RLMs exist; we provide a summary in Table 1. There are also other works related to Explicit RLMs, considering both coarse reasoning steps [149], [155] and fine reasoning steps [40], [149], [155].
Our blueprint provides a more foundational and universally applicable framework for RLMs. We further supplement the theoretical and algorithmic overview with a modular and scalable implementation to enable practical development and experimentation.

11 CONCLUSION
This work introduces a comprehensive blueprint for reasoning language models (RLMs), providing a flexible and modular toolbox that demystifies the intricate design and operation of these advanced systems. By encompassing diverse reasoning structures, operations, and training schemes, the blueprint establishes a robust foundation for constructing, analyzing, and extending RLMs tailored to various applications. The accompanying x1 implementation enhances this contribution, offering a modular, minimalist, and user-friendly platform for experimentation and rapid prototyping of novel RLM architectures.
Our blueprint and x1 pave the way for several exciting avenues of future research and development in reasoning AI. One example is Trace-Based Supervision (TBS), which extends process-based supervision by incorporating labeled traces of traversal through reasoning structures. TBS has the potential to train more powerful implicit RLMs capable of internalizing reasoning structures and improving generalization.
The work also explores new directions in value and reward modeling, introducing a hierarchy of models and formally identifying several recent designs as instances of a new class of models, namely the Outcome-Driven Process Reward Model. This model class bridges the gap between outcome-based evaluation and process-based supervision by dynamically connecting intermediate reasoning steps to terminal outcomes, enabling more granular feedback during training without the need for extensive manual annotation of intermediate steps.
Additionally, the blueprint's extensive set of operators can inspire the development of innovative reasoning strategies, such as advanced tree-based searches, multi-step refinement processes, or hybrid search algorithms that adapt dynamically to the task's complexity. These strategies can be tailored using the token probability distribution analysis tools provided, leading to more effective generation strategies that optimize reasoning steps through probabilistic insights. The blueprint also provides a foundation for developing nested architectures where reasoning structures such as trees and graphs are embedded hierarchically. These designs can address multi-layered reasoning tasks, expanding the scope of RLM applications to domains requiring deep, structured reasoning processes.
Scalability remains a key focus of this work. The blueprint's modular design supports future scalable cloud deployments that enable efficient distribution of compute-intensive tasks across cloud infrastructures. These deployments will not only enhance scalability but also optimize cost and resource utilization, making RLMs more accessible for real-world applications.
By exploring and integrating these ideas, this work aims to empower the next generation of reasoning language models, democratize access to advanced reasoning capabilities, and foster innovation across research and industry. The blueprint's versatility, combined with the x1 platform, will make it a driving factor in the progress of RLM research and applications.

ACKNOWLEDGEMENTS
We thank Nicolas Dickenmann for writing the initial MCTS codebase. We thank Hussein Harake, Colin McMurtrie, Mark Klein, Angelo Mangili, and the whole CSCS team for granting access to the Ault, Piz Daint and Alps machines, and for their excellent technical support. We thank Timo Schneider for help with infrastructure at SPCL. This project received funding from the European Research Council (Project PSAP, No. 101002047), and the European High-Performance Computing Joint Undertaking (JU) under grant agreement No. 955513 (MAELSTROM). This project received funding from the European Union's HE research and innovation programme under the grant agreement No. 101070141 (Project GLACIATION). We gratefully acknowledge Polish high-performance computing infrastructure PLGrid (HPC Center: ACK Cyfronet AGH) for providing computer facilities and support within computational grant no. PLG/2024/017103.

APPENDIX A
MATHEMATICAL FOUNDATION OF MARKOV DECISION PROCESSES FOR REASONING TASKS
In this section, we provide a rigorous mathematical framework for RLMs. We achieve this by integrating the theory of Markov Decision Processes (MDPs) with the Monte Carlo Tree Search (MCTS) algorithm. The MDP serves as a foundational formulation for modeling various types of processes, and it can be applied to model reasoning chains, which constitute the reasoning structure of the RLMs. Simultaneously, MCTS serves as an efficient search algorithm for exploring and navigating the extensive space of possible reasoning chains. The resulting state space is then used as a basis for modeling the RLM. An overview of the notation used in this section is provided in Table 2.

A.1 Markov Decision Process
A Markov Decision Process (MDP) is defined as a 5-tuple M = (S, A, p, r, γ), where S is the state space, A is the action space with As ⊆ A denoting the set of actions which can be taken in the state s, p represents the dynamics of transitions between states, i.e., p : S × A × S → [0, 1] where p(s, a, s′) is the probability of transitioning to the state s′ when action a was selected in the state s, r : S × A × S → R is the reward function, i.e., r(s, a, s′) represents the reward for arriving in the state s′ after selecting the action a in the state s, and γ ∈ [0, 1] is a discount factor.

A.1.1 Solving an MDP
Before stating what it means formally to solve an MDP, we first need several definitions.
A trajectory τπ = (s0, a0, . . . , sT, aT, sT+1) is a sequence of interleaved states and actions, selected according to the policy π (see below for the policy definition). Each trajectory starts at an initial state s0 ∈ S and ends with sT+1 ∈ S, which represents the terminal state where no further actions can be taken.
A policy π(s) is a function assigning a probability distribution over the action space to a given state s; π : S → ∆(A), where ∆(A) is the set of probability distributions over the action space A. The expression π(a | s) denotes the probability of selecting the action a in the state s according to the policy π.
The state value function Vπ(st) represents the expected cumulative future reward for a given state st under policy π:

$$V_\pi(s_t) = \mathbb{E}\left[\sum_{k=t}^{T} \gamma^{k-t}\, r(s_k, a_k, s_{k+1}) \,\Big|\, s_t\right] \quad (3)$$

where T is a predefined time horizon. Note that, in order to obtain the state sk+1, an action ak is first derived by sampling from a distribution π(sk). Once the action ak is chosen, the environment dynamics p(sk+1 | sk, ak) determine the probability distribution of the next state sk+1.
The goal of solving an MDP is to find a policy π∗ which maximizes the value function as defined above for all states s ∈ S: π∗ = arg max_π Vπ(s).
The state-action value function Q(st, at) Oftentimes, it is useful to use the state-action value function Q(st, at) instead of the state value function. Specifically, the state-action value function Q(st, at) extends the state value function so that the function value is defined on a state and a specific action at:

$$Q_\pi(s_t, a_t) = \mathbb{E}_\pi\left[\sum_{k=t}^{T} \gamma^{k-t}\, r(s_k, a_k, s_{k+1}) \,\Big|\, s_t, a_t\right] = r(s_t, a_t) + \gamma\, \mathbb{E}_{s_{t+1}}\left[V_\pi(s_{t+1}) \mid s_t, a_t\right],$$

where Bellman's equation is used in the second equality.

A.1.2 MDPs in the RLM Setting
In the context of RLMs, a state s ∈ S is typically defined as a sequence of reasoning steps s = (z0, . . . , zn), where each reasoning step zi is a sequence of Mi tokens zi = (t_i^0, . . . , t_i^{Mi}). Each t_i^j is a token from the RLM's vocabulary, and the total number of tokens per reasoning step Mi can vary. One can use a special token t_i^{Mi} = t_end to indicate the end of the reasoning step. Typically, the initial query q is used as the first reasoning step z0 = q. In the study of RLMs, an action a ∈ As usually represents appending a new reasoning step z^(a) to the current state s = (z0, ..., zn), resulting in a new state s′ = (z0, ..., zn, z^(a)). Since every action a is uniquely associated with exactly one reasoning step z^(a), for every s = (z0, ..., zn) and s′ = (z0, ..., zn, zn+1) we have

$$p(s, a, s') = \begin{cases} 1 & \text{if } z_{n+1} = z^{(a)} \\ 0 & \text{if } z_{n+1} \neq z^{(a)} \end{cases}$$

The definition of the reward function depends on the specific task. A reward commonly seen in reasoning tasks assigns a non-zero reward only in the terminal states and hence only at the final reasoning step. This approach reflects the fact that for most tasks, only the final answer can be evaluated against the ground-truth solution to the original query. We call such reward functions sparse to clearly distinguish them from other settings in which intermediate rewards can be observed by the algorithm in the non-terminal states. The discount factor γ determines how future rewards influence the current decision-making process. A higher discount factor (γ → 1) places greater emphasis on long-term reasoning success, allowing the model to generate long reasoning sequences, while a lower discount factor prioritizes immediate rewards, incentivizing faster progress and shorter reasoning sequences.
In the RLM setting, a trajectory τπ = (s0, a0, . . . , sT, aT, sT+1) represents the progression of states st and actions at, ending with a terminal state sT+1 in which no further reasoning steps can be added. The final reasoning step contains the RLM's answer to the original query.
The policy π(a | s) in the context of RLMs defines the probability of selecting an action a that corresponds to appending a reasoning step z^(a) to the current reasoning sequence represented by the state s. Since there exists a bijective mapping f : A → Z between the action space A and the reasoning step space Z, the probability distributions can be equated using the change of variables. Formally: π(a | s) = π(z | s), where z = f(a).
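The following minimal Python sketch captures the essence of this formulation: states are tuples of reasoning steps, an action deterministically appends a step, and the reward is sparse, using the ±1 terminal convention that Appendix C.1 later adopts; the class and method names are illustrative and are not part of the x1 codebase.

```python
# Minimal sketch of the reasoning MDP of Appendix A.1.2 (illustrative names;
# the step generator and the answer verifier are placeholders).
from dataclasses import dataclass

State = tuple[str, ...]          # s = (z0, ..., zn), where z0 is the query

@dataclass
class ReasoningMDP:
    gamma: float = 0.99

    def transition(self, s: State, z_new: str) -> State:
        """Deterministic dynamics: appending step z^(a) yields s' with p = 1."""
        return s + (z_new,)

    def reward(self, terminal: bool, correct: bool) -> float:
        """Sparse reward: non-zero only once the final answer is reached."""
        if not terminal:
            return 0.0
        return 1.0 if correct else -1.0

    def discounted_return(self, rewards: list[float]) -> float:
        """sum_k gamma^k * r_k, the quantity V_pi estimates in Eq. (3)."""
        return sum(self.gamma ** k * r for k, r in enumerate(rewards))

mdp = ReasoningMDP()
s = ("Evaluate ceil(sqrt(20))^2.",)
s = mdp.transition(s, "sqrt(20) is about 4.472.")
s = mdp.transition(s, "ceil(4.472) = 5, so the answer is 25.")
print(len(s), mdp.reward(terminal=True, correct=True))
print(mdp.discounted_return([0.0, 0.0, 1.0]))   # discounted sparse return
```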
TABLE 2: Overview of mathematical notation used in the paper

Symbol: Description
M = (S, A, p, r, γ): Markov Decision Process (MDP) definition.
s ∈ S: A state in the state space, representing a sequence of reasoning steps.
a ∈ A: An action in the action space, corresponding to selecting the next reasoning step.
As ⊆ A: The set of actions available in state s.
p(s′ | s, a): The probability of transitioning to state s′ when taking action a in state s.
r(s): The reward received when arriving in state s.
γ ∈ [0, 1]: Discount factor, determining the present value of future rewards.
πθ(a | s): Policy parameterized by θ, representing the probability of taking action a in state s.
Vπ(s): Value function under policy π, representing the expected return starting from state s.
Qπ(s, a): State-action value function under policy π, representing the expected return of taking action a in state s.
τπ: A trajectory (s0, a0, s1, . . . , sT+1) consisting of states and actions, following policy π.

Based on the definition of the reasoning step and applying the chain rule, we can then rewrite the policy as:

$$\pi(z_{t+1} \mid s_t) = \prod_{j=0}^{M_{t+1}} \pi\left(t_{t+1}^{j} \mid s_t, z_{t+1}^{0}, \ldots, z_{t+1}^{j-1}\right).$$

In the RLM setting, the state value function V(st) assesses the expected cumulative reward of a partial reasoning sequence st, estimating its overall potential to lead to a successful solution. The state-action value function Q(st, at) extends this by quantifying the expected cumulative reward for taking a specific action at (e.g., appending a reasoning step zt+1) to the current state st and then following the policy π. It incorporates both the immediate reward for appending the reasoning step and the anticipated future rewards from completing the reasoning sequence. Together, these functions inform and guide the policy π to prioritize actions that maximize the expected cumulative reward. By leveraging V(st) or Q(st, at), the policy can be trained to select reasoning steps that progress toward correct and complete solutions, transforming an LLM into an RLM.

A.2 Monte Carlo Tree Search (MCTS)
Monte Carlo Tree Search (MCTS) is a heuristic search algorithm used for solving MDP problems. MCTS iteratively builds a search tree, representing the underlying MDP state-action space, by aggregating the information obtained from executed MDP trajectories. Let T = (N, E) denote the MCTS search tree, where N ⊆ S is the set of nodes and E ⊆ N × A × N is the set of directed edges between the nodes. Every node in the MCTS search tree corresponds to a single state in the MDP and every edge corresponds to a single action. Every path from the root to a leaf of the search tree T corresponds to a single trajectory in the underlying MDP.
Edge statistics The MCTS algorithm stores the following values for every edge (s, a) in the search tree:
• N(s, a) - the visit count of the edge (s, a) by the algorithm,
• q(s, a) - the estimated state-action value of (s, a),
• r(s, a) = r(s, a, s′) - the reward received after taking the action a in the state s leading to the state s′,
• β(s, a) - the terminality function indicating if the action a leads to a terminal state.
The Algorithm At the high level, MCTS begins by initializing the tree with a single starting state s0 as a root node and performing the following three phases in a loop:
1) Selection - a leaf node in the current tree is selected for expanding its child (children).
2) Expansion - if the selected node does not correspond to a terminal state, it is expanded by taking an action (or multiple actions) in the underlying MDP and by adding the resulting state (states) to the tree as children of the current node. A trajectory unroll is performed for every added node to obtain a reward. "Unroll" refers to simulating a sequence of steps from a newly added node in the tree down to a terminal state. This simulated trajectory represents a hypothetical path the system might take if it continued from the current node. Once the simulation reaches a terminal state, a reward value is calculated based on the outcome of that path.
3) Backpropagation - update the value estimates and the visit counts for the selected node and all its ancestors based on the obtained reward.
The MCTS algorithm finishes when a stop criterion, such as the number of iterations, a predefined computational budget, or a convergence criterion, is met.

APPENDIX B
VALUE AND REWARD MODELS
We now proceed to discuss details of value and reward models.

B.1 Outcome-Based Reward Models (ORM) vs. Process-Based Reward Models (PRM)
In reinforcement learning environments, reward models estimate the reward for taking an action a in state s which leads to state s′. For reasoning tasks and algorithms like MCTS, which rely on evaluating intermediate steps, it is essential to have models capable of estimating the quality of each step. Two primary families of reward models for such process-based tasks are Outcome-Based Reward Models (ORMs) and Process-Based Reward Models (PRMs). Figure 12 compares both classes of models.
Outcome-Based Reward Models (ORMs), first introduced by Uesato et al. [143], evaluate the reasoning process solely based on the final outcome. These models estimate the reward of the final step in the chain, often modeled in the literature as the likelihood of a correct final answer given the entire reasoning chain P(correct(zT+1) | z0, ..., zT+1) [83], [143], where sT+1 := (z0, ..., zT+1) is the complete reasoning chain consisting of reasoning steps zi and T + 1 marks the last reasoning step. ORMs are particularly ill-suited for evaluating intermediate steps for several reasons.

First, the training data and objective are inherently misaligned with step-wise evaluation, as they focus exclusively on final outcomes. Second, ORM evaluations tend to be overly pessimistic for intermediate steps, since a subsequent erroneous step can obscure the correctness of earlier steps. This observation aligns with Havrilla et al. [55], who noted that ORMs often underestimate the solvability of a problem from an intermediate state and are prone to a high false-negative rate. Furthermore, ORMs lack robustness against false positives, potentially favoring erroneous reasoning steps and misleading the evaluation process.
Process-Based Reward Models (PRMs), introduced by Lightman et al. [83] and Uesato et al. [143], evaluate reasoning on a step-by-step basis. These models estimate the reward of a step, which can be seen as the likelihood of correctness for the t-th step given its preceding context P(correct(zt) | z0, ..., zt), where st := (z0, ..., zt) is a potentially incomplete reasoning chain, zi are reasoning steps, and z0 is the query. PRMs provide more fine-grained feedback and can pinpoint errors in the chain. This step-wise evaluation provides dense rewards given partial responses and helps identify where reasoning deviates from correctness, offering improved interpretability and enabling more targeted improvements in reasoning processes.
However, PRMs are computationally expensive to train and require extensive annotations of reasoning steps. These annotations, whether provided by humans or other LLMs, often suffer from limitations: human annotations are scarce, costly, and prone to bias, while prompted LLM-generated annotations [146] are typically of lower quality due to the models' limited self-evaluation capabilities [94]. Automated methods, for example those using MCTS [91], [147], introduce large computational costs and are prone to false negatives.

B.2 Outcome-Driven Process-Based Reward Models
Motivated by the need for process-based reward models but constrained by the lack of annotated step-wise labels, certain models that we will refer to as Outcome-Driven Process-Based Reward Models (O-PRMs) have been proposed; they combine outcome-based signals with process-based objectives. We show these models in Figure 12. These models rely on process-based data, often automatically generated using MCTS algorithms, where simulations starting from a given step st are performed. The final correctness of these simulated paths is aggregated to create step-wise labels [91], [147] (for other, non-MCTS approaches see [55]). This automation enables scalable data generation for O-PRMs, eliminating the need for extensive human annotation. Although O-PRMs can be categorized as process-based models due to their approximation of step-wise rewards, they remain inherently tied to outcome signals. Some authors [143] suggest that, under certain conditions, outcome signals in mathematical domains can approximate intermediate labels. However, O-PRMs inherit many limitations of ORMs, including susceptibility to false negatives, false positives, and an over-reliance on terminal outcomes. While the aggregation of multiple simulations helps reduce variance, the backtracking process may still oversimplify complex dependencies within reasoning chains.

[Figure 12 contrasts Outcome-Based Models, Process-Based Models, and Outcome-Driven Process-Based Models, indicating for each whether an is_correct label (from a human or a model) is available for the node being evaluated.]
Fig. 12: Comparison of Outcome vs. Process-Based label generation, and the introduction of Outcome-Driven Process-Based Reward Models (O-PRMs). Gray nodes mark terminal nodes.

B.3 Reward Models vs. Value Models
While the distinction between reward models and value models is often blurred in the literature—and their terminology is sometimes used interchangeably—we explicitly differentiate between these model types for evaluating reasoning steps. Additionally, we distinguish two variants of value models: v-value and q-value models. This differentiation arises from the distinct roles these models play in reinforcement learning environments.

B.3.1 Reward Model (RM)
A reward model predicts immediate rewards. In RL, this corresponds to the reward obtained for a transition (s, a, s′) from state s when taking action a, which results in state s′. For reasoning, this corresponds to adding a new reasoning step a to the structure. The new structure is then represented by s′. Specifically, PRMs – which are preferred over ORMs for MCTS due to the need for action-based evaluation – learn these rewards and can be used to evaluate states (or the transition into a state). This formulation provides a localized, step-level evaluation independent of the overall outcome of the reasoning chain. The reward model is typically trained using labeled data where individual reasoning steps are associated with reward values. While this localized view is advantageous for step-by-step evaluation, it lacks the ability to consider how the current step contributes to the long-term success of the reasoning process. This limitation motivates the introduction of value models.

B.3.2 Value Model (VM)
Value models provide a more abstract, global evaluation of states and actions by estimating their contribution to future rewards. Unlike reward models, which focus on immediate outcomes, value models consider both current and future rewards, enabling a broader perspective on reasoning quality. For example, in reinforcement learning and MCTS, value models play a critical role in guiding the search process. By providing estimates of state or state-action values, they enable more informed decisions about which nodes to expand and explore. We now discuss variants of value models.

V-Value Model (V-VM). One variant of a value model is the v-value model, which predicts the expected cumulative future reward of a state, denoted as V(s). This is equivalent to the state value function in reinforcement learning, which evaluates the long-term potential of the current state s. A key advantage of V-VMs is their global perspective, as they aggregate future rewards across all possible trajectories originating from the current state. However, V-VMs do not explicitly evaluate individual actions, which may limit their utility in step-level decision-making. Additionally, v-values are often ill-defined at terminal states, where rewards may substitute for state values during training.
Q-Value Model (Q-VM). Another variant of a value model is the q-value model. Q-VMs predict the expected cumulative future reward of taking a specific action a in a given state s, denoted as Q(s, a). Unlike V-VMs, Q-VMs explicitly associate values with state-action pairs, offering a more granular evaluation. This granularity makes Q-VMs particularly useful for MCTS, where decisions about which edge (action) to expand at a given node (state) are critical. By directly evaluating actions, Q-VMs align naturally with the selection mechanisms in MCTS, guiding the search toward promising paths. Similar to V-VMs, Q-VMs can also be categorized as PQVMs (Process-based Q-Value Models), OQVMs (Outcome-based Q-Value Models), and O-PQVMs (Outcome-driven Process-based Q-Value Models).
The choice between V-VMs and Q-VMs depends on the reasoning task and the specific requirements of the evaluation framework. While V-VMs provide a broader, state-centric evaluation, Q-VMs enable more precise, action-specific guidance. In practice, MCTS often benefits from the use of Q-VMs due to their compatibility with edge-based selection.

[Figure 13 shows, for a chain of reasoning states ending in a terminal node, the targets predicted by a V-value model (v at every state), a reward model (r = 0 at non-terminal transitions, 1/-1 at the terminal one), and a Q-value model (Q at every transition).]
Fig. 13: Comparison of reward, v-value and q-value models in a sparse reward setting (only terminal states receive non-zero rewards). Gray nodes mark terminal nodes. The reward model should predict the rewards for transitioning from one state to another, which is 0 for non-terminal states and thus provides no information. V-VMs and Q-VMs, however, predict a global value and are therefore informative for non-terminal states.

B.3.3 Example: Solving a Mathematical Equation
To illustrate the differences between reward models, value models, and q-value models, consider the task of solving x² + y² = 1 step-by-step.
• Reward Model (RM): A process-based reward model (PRM) might assign a reward r(st, at, st+1) for the reasoning step at = "Substitute y = √(1 − x²)". This reward quantifies the quality of the resulting state st+1, independent of whether it leads to a correct solution. However, in sparse reward settings (only final steps receive a reward), this reward would be 0.
• V-Value Model (V-VM): A V-VM estimates V(st), representing the expected cumulative reward for the entire expected solution process starting from st. For instance, if st = ("Start with x² + y² = 1"), V(st) considers the long-term potential of all reasoning paths originating from this state.
• Q-Value Model (Q-VM): A Q-VM evaluates Q(st, at), predicting the cumulative reward of taking a specific action at (e.g., substituting y = √(1 − x²)) in state st. This value directly informs whether the action at is likely to lead to a high-quality solution, providing a more granular evaluation compared to the V-VM.
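The following toy computation makes these targets concrete for a short chain under the sparse-reward convention of Figure 13; the number of steps and the discount factor are chosen purely for illustration.

```python
# Toy numeric illustration of reward, V-value, and Q-value targets under a
# sparse terminal reward (three reasoning steps; gamma chosen for illustration).
gamma = 0.99
num_steps = 3            # s_3 = (z0, z1, z2, z3) is the terminal state
terminal_reward = 1.0    # an external verifier judged the final answer correct

for t in range(num_steps + 1):
    # Reward-model target: zero everywhere except at the terminal transition.
    r = terminal_reward if t == num_steps else 0.0
    # V-VM target: discounted value of completing the (single) path from s_t.
    v = gamma ** (num_steps - t) * terminal_reward
    # Q-VM target: immediate reward plus discounted value of the next state.
    q = r if t == num_steps else gamma * gamma ** (num_steps - t - 1) * terminal_reward
    print(f"t={t}: r={r:.2f}  V={v:.3f}  Q={q:.3f}")
```

As the printout shows, the reward signal carries no information at intermediate steps, whereas the value targets remain informative along the whole chain.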
B.3.4 Summary
By differentiating reward models and value models, and further categorizing value models into V-VMs and Q-VMs, we provide a nuanced framework for evaluating reasoning steps. Reward models offer localized evaluations, while value models incorporate global, long-term perspectives. This global evaluation enables the model to better prioritize reasoning paths that are likely to lead to correct solutions while mitigating the challenges posed by sparse or delayed rewards. Therefore, we advocate for the use of a process-based value model due to the sparsity of reward signals for reasoning tasks. Among value models, Q-VMs are particularly well-suited for MCTS due to their action-specific granularity, which aligns naturally with the tree's edge-based exploration mechanism. We will demonstrate the practical implications of these distinctions in Appendix D.3.

B.4 Evaluation Schemes
We also provide additional categorizations and details regarding overall evaluation.

B.4.1 Evaluation Types
Evaluating reasoning steps in RLMs involves assessing their quality and contribution toward solving a task. Numerical evaluations can be categorized as relative or absolute.
Relative evaluations compare multiple steps, often using ranking mechanisms, and can be created with, for example, the Bradley-Terry model [20], which is optimized based on pairwise preferences by maximizing the reward gap between chosen and rejected steps.
Absolute evaluations assign scalar values to each step, assessing aspects such as coherence, correctness, or helpfulness, using regression-based models. Moreover, evaluation dimensions can also be modeled as binary with classification models. While regression models provide more information, classification models capture correctness more naturally, since a statement is usually correct or incorrect. On the other hand, the former are more suitable for measuring quality, such as the degree of coherence. Depending on the specific quality being evaluated, the choice between regression and classification models should align with the evaluation's goals. Additionally, absolute scores can be transformed into rankings if needed, providing flexibility across various applications.

In addition to numerical evaluations, there are text-based evaluations, which are commonly used to provide detailed feedback and guidance for refining reasoning steps. Examples include "LLM-as-a-Judge" [176] (which uses a larger LLM to provide a pairwise comparison or a single graded answer with an explanation) and self-critique approaches [122] that allow models to reflect on and evaluate their own reasoning. These textual evaluations, often including rationales, are particularly useful for structural transformations rather than numerical guidance, enhancing interpretability by offering context and detail.

B.4.2 Evaluation of Reasoning Steps
Step-wise evaluations are vital for integrating reasoning into MCTS. Numerical evaluations, whether relative or absolute, provide straightforward metrics to compare nodes and steer exploitation and exploration. Text-based evaluations, in contrast, are better suited for guiding structural refinements rather than directly influencing search paths.
Given that reasoning steps are typically textual sequences, language models are a natural fit for such evaluation tasks. LLM-based approaches can involve external model approaches, where a dedicated value model is trained to predict scores, or internal model approaches, which leverage existing policy models.
External model approaches include value models that predict scalar reward signals (reward models) [34], [83], [143], reinforcement learning values like state values (v-value models) [128], state-action values (q-value models), or pairwise models like the Bradley-Terry and PairRM frameworks. A more detailed comparison of reward models, v-value, and q-value models can be found in Appendix B.3.2.
There exists a large range of internal model approaches as substitutes for value models. They typically rely on methods like prompting the policy to output scores. Examples include MCT Self-Refine (MCTSr) [168], querying for binary feedback (e.g., "Is the answer correct? Answer 'yes' or 'no'") [171], evaluating the probability of the output by leveraging uncertainty metrics such as token entropy or aggregated probabilities [174], and others [170].
Heuristics may also serve as substitutes for evaluations in resource-constrained scenarios.
Simulating reasoning steps to terminal states for evaluation against golden answers is another option, as done for example in MCTS, though often computationally prohibitive.
External tools provide an alternative path for evaluation, especially in domain-specific tasks. For programming, compilers can supervise tasks, as seen in Codex [27], self-debugging [30], and similar methods. Program-of-Thought [28] and Program-aided-Language (PAL) [48] use a formal language and Python interpreters to evaluate solutions. In mathematical tasks, ensemble approaches like MathPrompter [68] generate multiple algebraic expressions or Python functions to validate steps. These tool-based approaches excel at detecting errors due to their reliance on precise domain-specific rules, such as compilers for programming or interpreters for mathematics. While their applicability is limited to well-defined domains, they provide objective and verifiable feedback that complements language models. By injecting precise knowledge into the evaluation process, external tools mitigate model-specific limitations like hallucinations and offer actionable feedback for iterative refinement. This hybrid approach enhances reliability and ensures that the evaluation benefits from both the flexibility of language models and the precision of formal systems.

APPENDIX C
ALGORITHMIC DESCRIPTIONS

C.1 Reasoning with Monte Carlo Tree Search

C.1.1 Setup and Notation
We will now present the details of the training pipeline of x1.
MDP Design x1 assumes the MDP following the definition presented in Appendix A.1, with γ values in [0.95, 1] to avoid over-penalizing long reasoning sequences. In the RLM setup, the state space and action space of the underlying MDP constitute a tree in which every state s other than the starting state s0 has exactly one action as leading to it. This allows us to simplify the notation by omitting actions wherever it is clear from the context that we are referring to the only action leading to a given state. For every action a leading from the state s to the state s′ we will write:
π(s′ | s) := π(a_{s′} | s),
r(s′) := r(s, a, s′),
q(s′) := q(s, a),
τ := (s0, s1, . . . , sT+1).
The final reasoning step in the terminal state contains the RLM's answer to the original query. The final answer is compared to the ground truth solution, commonly referred to as the golden answer. This matches the common setup in many reasoning tasks and math problems, where no ground truth and no reward source is available for the intermediate reasoning steps.
Consider a trajectory τ := (s0, s1, . . . , sT+1). We assign a reward of r(sT+1) = 1 if the last reasoning step in the final state sT+1 contains the correct answer and r(sT+1) = −1 otherwise. The state value function simplifies to

$$V_\pi(s_t) = \mathbb{E}_\pi\left[\gamma^{T-t}\, r(s_{T+1})\right] \in [-1, 1] \quad (4)$$

and the state-action value function can be rewritten as:

$$Q_\pi(s_t) = \begin{cases} r(s_{T+1}), & \text{if } t = T + 1 \\ \gamma\, V_\pi(s_{t+1}), & \text{otherwise} \end{cases} \in [-1, 1] \quad (5)$$

hence both the value and the state-action value functions are bounded between -1 and 1 for all states and state-action pairs.
MCTS Design We define the MCTS tree as in Appendix A.2 as T = (N, E), where N is a set of nodes, and E is the set of edges. We use the notation of a node-edge-node relationship denoted by (s, a′, s′), where s represents the origin node, a′ describes the action corresponding to an edge, and s′ denotes the target node. This notation symbolically ties the action and the target state together, as the action uniquely identifies the target state and is therefore indicative of it.

The policy model We use a pretrained LM with parameters θ as a policy model and denote it πθ. The model autoregressively generates a sequence of tokens. We use a special token 'End of Intermediate Step' (eois) to indicate the end of a reasoning step. We use a standard end-of-sequence (eos) token to indicate the end of the final reasoning step concluding the reasoning trajectory.
The value model A parametric value model is used to evaluate the quality of states. While MCTS traditionally approximates these values through extensive simulations, such an approach is computationally expensive and impractical in the RLM context. Inspired by AlphaZero [128], which replaces simulations with a parameterized value model, we estimate state-action values (q-values for short) for reasoning sequences using a value model, effectively employing a process-based q-value model Qφ (see Appendix B.3). The value model is instantiated as a pretrained transformer-based LM, modified by adding three linear layers and a shifted, rescaled sigmoid activation to align the output domain with the state-action function domain [−1, 1] (see Eq. 5). This setup proved more stable than alternatives, such as a tanh activation or a cropped linear layer. We will show in the following how such a model can be trained and provide a description of the data generation process in Appendix D. During training, we assume access to a final answer verifier, which evaluates the correctness of the model's final answer and provides the true reward.
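A minimal PyTorch-style sketch of such a q-value head is shown below; the hidden sizes, the last-token pooling, and the way the backbone's hidden states are obtained are assumptions for illustration, not the exact x1 architecture.

```python
# Sketch of a q-value head with three linear layers and a shifted, rescaled
# sigmoid mapping to [-1, 1]; sizes and pooling are illustrative assumptions.
import torch
import torch.nn as nn

class QValueHead(nn.Module):
    def __init__(self, hidden_size: int = 4096):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(hidden_size, 1024), nn.ReLU(),
            nn.Linear(1024, 256), nn.ReLU(),
            nn.Linear(256, 1),
        )

    def forward(self, last_hidden: torch.Tensor) -> torch.Tensor:
        # Pool the hidden state of the final token of the reasoning sequence.
        pooled = last_hidden[:, -1, :]
        # Shifted, rescaled sigmoid: maps to (-1, 1), matching Eq. (5)'s range.
        return 2.0 * torch.sigmoid(self.mlp(pooled)) - 1.0

head = QValueHead()
dummy = torch.randn(2, 16, 4096)   # (batch, tokens, hidden) from a backbone LM
print(head(dummy).shape)           # torch.Size([2, 1])
```

In practice the head would sit on top of the frozen or fine-tuned transformer backbone that encodes the reasoning sequence.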
C.1.2 MCTS Algorithm
We now present the algorithmic steps of a Monte Carlo Tree Search variant similar to AlphaZero as implemented in the x1 reasoning framework. The MCTS search operates in two distinct modes: training and inference. The core difference is that, during training, a final answer verifier evaluates and scores the final reasoning steps, providing a true reward signal that is backpropagated through the MCTS tree. This reward serves as a reliable learning signal for the value model Qφ. During inference, however, the verifier is unavailable, and decisions rely solely on the value model.
Notation. We chose to store all values in nodes instead of edges, which defines the following set of statistics saved for each node s:
• N(s) - the visit count of node s,
• q(s) - the running estimate of the q-value of the transition leading to state s,
• β(s) - the binary terminality function, which returns 1 if the node s is terminal and 0 otherwise.
Selection. The selection phase iteratively identifies the most promising child node with a selection policy. We use the following selection policy, which is the node-based variant of the PUCT algorithm in AlphaZero [129] (which is defined on edge-based values) without a prior, for selecting a child of s:

$$\arg\max_{s_c \in C(s)} \left[ q(s_c) + \frac{\sqrt{N(s) - 1}}{1 + N(s_c)} \cdot \left( c_1 + \log\frac{N(s) + c_2}{c_2} \right) \right]$$

where c1 and c2 are hyperparameters controlling the exploration bias, and the other values can be taken from the node statistics.
Expansion. We append M nodes to the selected leaf, M being a hyperparameter. One of the major challenges in applying RLMs is maintaining the diversity of reasoning paths. By adding M nodes, we increase the exploration of alternative reasoning paths.
Backpropagation. The backpropagation step serves to propagate information from the terminal nodes back to their ancestors. In our implementation, we update the running estimates of the q-values using the following formula:

$$q(s) \leftarrow (1 - \alpha)\, q(s) + \alpha \gamma \left( \sum_{s_c \in C(s)} w_s(s_c) \cdot q(s_c) \right),$$

where we look at the node-edge-node tuples (s, ac, sc) and sc ∈ C(s). The weights ws(sc) for combining the children q-values are defined over the visit counts of the nodes as follows:

$$w_s(s_c) = \frac{N(s_c)}{\sum_{\tilde{s}_c \in C(s)} N(\tilde{s}_c)}.$$

True Reward Propagation. We improve the quality of the q-values by propagating the real final rewards back through the tree when a terminal state sT+1 is reached. During training, terminal nodes can be evaluated against a reference golden answer g∗ using an external verifier. For actions leading to terminal states, the associated reward is equal to the q-value (see Eq. 5). Therefore, instead of using the prediction of the q-value model, we initialize q(sT+1) with the true reward r(sT+1) based on the evaluation of the external verifier. The reward is then backpropagated via the q-values through the tree with our backpropagation operator. This adjustment anchors the q-value model predictions with real reward signals and prevents the q-value model predictions from diverging.
Best Path Selection. After N iterations, MCTS will have formed a tree in which every path corresponds to one of the explored reasoning trajectories. The final reasoning step in a path with the highest terminal value estimate is returned as the final solution.
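A compact Python sketch of the selection score and the q-value backpropagation update described above is given below; the node fields, hyperparameter values, and class names are illustrative rather than the x1 defaults.

```python
# Sketch of the node-based PUCT-style selection score (without a prior) and
# the weighted q-value backpropagation update; all names are illustrative.
import math
from dataclasses import dataclass, field

@dataclass
class Node:
    q: float = 0.0                       # running q-value estimate q(s)
    visits: int = 1                      # visit count N(s)
    children: list["Node"] = field(default_factory=list)

def select_child(parent: Node, c1: float = 1.25, c2: float = 19652.0) -> Node:
    """Pick the child maximizing q(sc) + exploration bonus."""
    def score(child: Node) -> float:
        explore = math.sqrt(max(parent.visits - 1, 0)) / (1 + child.visits)
        return child.q + explore * (c1 + math.log((parent.visits + c2) / c2))
    return max(parent.children, key=score)

def backpropagate(node: Node, alpha: float = 0.5, gamma: float = 0.99) -> None:
    """q(s) <- (1 - alpha) q(s) + alpha * gamma * sum_c w_s(c) q(c)."""
    total_visits = sum(c.visits for c in node.children)
    weighted = sum(c.visits / total_visits * c.q for c in node.children)
    node.q = (1 - alpha) * node.q + alpha * gamma * weighted
    node.visits += 1

root = Node(visits=5, children=[Node(q=0.5, visits=3), Node(q=-0.5, visits=1)])
print(select_child(root) is root.children[0])   # True: higher q wins here
backpropagate(root)
print(round(root.q, 3))
```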

Algorithm 1 MCTS for Reasoning (Training mode in blue)
Input: Policy model πθ, value model Qφ, question z0, golden answer g*, binary correctness verifier Γ, number of MCTS iterations N, number of children expanded in every selection phase M, exploration constants c1, c2, backpropagation weight α.
Output: Search tree T = (N, E) containing the best path τ*.

1: s0 ← (z0) {Initialize root node}
2: N(s0) = 0
3: N ← {s0} {Initialize node set}
4: E ← ∅ {Initialize edge set}
5: i ← 1
6: while i ≤ N or β(s) ≠ 1 do
7:   s ← s0 {Start from root node}
8:   ----- Selection -----
9:   while s is not a leaf node do
10:    {Select child sc ∈ C(s) with highest selection score}
11:    sc ← argmax_{sc ∈ C(s)} [ q(sc) + (√(N(s) − 1) / (1 + N(sc))) · (c1 + log((N(s) + c2) / c2)) ]
12:    s ← sc {Move to the selected child}
13:  end while
14:  ----- Expansion -----
15:  for j = 1 to M do
16:    zc ← (t1, ..., t_{M_{zc}}) ∼ πθ {Sample a new reasoning step}
17:    sc ← s ⌢ zc {Append zc to the current state s}
18:    q(sc) ← Qφ(sc) {Predict with the Q-VM}
19:    N(sc) ← 1 {Initialize visit count}
20:    β(sc) ← 0 {Initialize terminality function}
21:    if sc terminal then
22:      β(sc) ← 1 {Mark as terminal}
23:      r(sc) ← 1 if Γ(sc, g*) = 1, −1 if Γ(sc, g*) = 0 {Check for correctness to determine the reward}
24:      q(sc) ← r(sc) {Overwrite by true reward}
25:    end if
26:    N ← N ∪ {sc} {Add the node to the tree}
27:    E ← E ∪ {(s, sc)} {Add the edge to the tree}
28:  end for
29:  ----- Backpropagation -----
30:  while s ≠ s0 do
31:    N(s) ← N(s) + 1 {Update the visit count}
32:    q(s) ← (1 − α) q(s) + α γ Σ_{sc ∈ C(s)} ws(sc) q(sc)
33:    {Update the value}
34:    s ← sp {Move to the parent}
35:  end while
36:  i ← i + 1
37: end while
38: Best Path Selection:
39: Select the best reasoning sequence s*_T.
40:
41: return s*_T, all reasoning sequences {s_j^(i)}_j
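For illustration, a small Python sketch of the selection rule in line 11 of Algorithm 1, reusing the Node fields from the earlier sketch. The exact form of the exploration bonus is our reading of the formula above, and the exploration constants c1, c2 are left as parameters.

import math

def selection_score(parent, child, c1, c2):
    # q(s_c) + sqrt(N(s) - 1) / (1 + N(s_c)) * (c1 + log((N(s) + c2) / c2))
    bonus = (math.sqrt(max(parent.visits - 1, 0)) / (1 + child.visits)) * (
        c1 + math.log((parent.visits + c2) / c2))
    return child.q + bonus

def select_child(parent, c1, c2):
    # Algorithm 1, line 11: pick the child with the highest selection score.
    return max(parent.children, key=lambda c: selection_score(parent, c, c1, c2))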
C.2 Training Phase 1

Overall Training Pipeline. To adequately employ the MCTS-based reasoning scheme introduced in Appendix C.1, the policy model must be fine-tuned to generate responses in the format of semantically-relevant reasoning steps. The value model – a q-value model in our case – must be trained to accurately estimate the values of the sequences of reasoning steps.

We propose a two-phase training approach designed to let the policy effectively leverage the structured exploration and iterative refinement capabilities of the search process to generate optimal sequences of reasoning steps. A detailed algorithmic description of the pipeline is in Figure 14.

Phase 1: Supervised Fine-Tuning. The first phase focuses on preparing the policy and value models to generate and evaluate reasoning trajectories effectively. This is achieved by supervised fine-tuning (SFT) on a dataset of example sequences of reasoning steps (where intermediate reasoning steps are terminated by an "End of Intermediate Step" eois token). The objective is twofold: (1) to fine-tune the policy model πθ to produce semantically coherent reasoning steps, and (2) to train the q-value model Qφ to accurately assign scalar scores to reasoning trajectories, distinguishing between high-quality and suboptimal reasoning paths.

This supervised fine-tuning phase ensures that the policy can generate reasoning steps consistent with the structured format required for downstream MCTS-based exploration, while the q-value model provides reliable evaluations of intermediate and terminal states. Together, these components form the foundation for the subsequent online reinforcement learning in Phase 2, where the policy and q-value models are further refined through interaction with the reasoning framework.

C.2.1 Datasets Generation and Preparation

Dataset for SFT of the Policy. Performing SFT of the policy requires a dataset of high-quality reasoning sequences, denoted as D_SFT = {(x_SFT^(i), y_SFT^(i))}. Each pair in the dataset consists of a prompt x_SFT^(i) composed of a sequence of reasoning steps (for example x_SFT^(i) = (z_0^(i), ..., z_j^(i))), and a target completion y_SFT^(i) = z_{j+1}^(i), which is the subsequent reasoning step or final answer. Appendix D contains a detailed account of the dataset creation and processing. It covers how the special eois token is appended to reasoning steps to mark the end of a step during inference.

Dataset for Q-Value Model Training. Similarly to SFT, training the q-value model requires a supervised dataset of reasoning sequences and corresponding scores. We denote this dataset D_QVM-train = {(x_QVM-train^(i), y_QVM-train^(i))}, with reasoning sequences x_QVM-train^(i) = (z_0^(i), ..., z_t^(i)) and target q-value y_QVM-train^(i). Appendix D explains how this dataset can be generated using an initial list of questions, a base LLM for querying, and a verifier program to label reasoning sequences as conducive to a correct final answer or not.
[Figure 14 depicts the two training phases: in Phase 1, the policy model (an LLM) and the value model (an LLM with a linear layer and activation) undergo process-based SFT on training data generated via CoT/MCTS with simulations and solution filtering; in Phase 2, both models are refined with process-based RL/SFT training using MCTS with the value model and a replay buffer.]
Fig. 14: The two phases of the training pipeline.
Algorithm 2 SFT of Policy Model πθ (completion-only)
Input: Policy model πθ, tokenized dataset D_SFT = {(x^(i), y^(i))}, training hyperparameters (optimizer, learning rate η, batch size B, and maximum number of epochs E).
Output: Fine-tuned policy model πθ.

1: for epoch e = 1 to E do
2:   Shuffle dataset D_SFT.
3:   Divide D_SFT into batches {Bk} of size B.
4:   for each batch Bk do
5:     Initialize batch loss: Lbatch = 0.
6:     for each sample (x^(i), y^(i)) ∈ Bk do
7:       Iteratively predict completion tokens: ŷ_t^(i) ∼ πθ(x_{1:t−1}^(i)), where x_{1:t−1}^(i) represents the context (prompt + previously predicted tokens).
8:       Compute the CE loss over the completion tokens: L^(i) = − Σ_{t=1}^{|y^(i)|} log P(ŷ_t^(i) = y_t^(i) | x^(i), πθ).
9:       Accumulate the loss: Lbatch += L^(i).
10:    end for
11:    Normalize the batch loss: Lbatch = Lbatch / |Bk|.
12:    Backpropagate gradients, update θ via the optimizer.
13:  end for
14: end for
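The 'completion-only' objective of Algorithm 2 can be sketched in PyTorch as follows: prompt tokens are masked out of the labels so that only the target completion contributes to the cross-entropy. The tensor layout and the -100 ignore-index convention are illustrative assumptions, not the exact x1 training code.

import torch
import torch.nn.functional as F

def completion_only_loss(logits, input_ids, prompt_len):
    # logits: (seq_len, vocab_size) next-token logits; input_ids: (seq_len,)
    labels = input_ids.clone()
    labels[:prompt_len] = -100       # ignore prompt tokens in the loss
    shifted_logits = logits[:-1]     # position t predicts token t + 1
    shifted_labels = labels[1:]
    return F.cross_entropy(shifted_logits, shifted_labels, ignore_index=-100)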
C.2.2 SFT of the Policy
Supervised fine-tuning (SFT) of the policy is performed on the dataset D_SFT of prompts and target completions of the next reasoning step. The policy πθ is instantiated as a general pretrained LLM. Specifically, we perform 'completion-only' SFT such that for every (prompt, target completion) pair, the base model is trained to minimize the cross-entropy loss between its predicted token probabilities and the ground-truth target completion.

C.2.3 Q-Value Model Training
The q-value model Qφ is trained on D_QVM-train to assign appropriate scalar scores to the candidate reasoning trajectories. It is instantiated as a pre-trained LLM to which an additional linear layer and a shifted and rescaled classification head are added; we denote all of its trainable weights as φ. Depending on the reward design, the q-value model can be trained via scalar (least squares) regression if continuous rewards are chosen, or with a classification objective such as the Binary Cross-Entropy (BCE) loss if trajectories are labelled with binary rewards or as chosen-rejected preference pairs.

By the end of training, Qφ should output accurate q-value scores, which will later guide policy refinement in Phase 2 and will improve the search accuracy when used in the MCTS.

Algorithm 3 Fine-Tuning the Q-Value Model Qφ
Input: Q-value model Qφ (QVM), dataset D_QVM-train = {(x^(i), y^(i))}, training hyperparameters (optimizer, learning rate η, batch size B, and maximum number of epochs E).
Output: Fine-tuned q-value model Qφ.

1: for epoch e = 1 to E do
2:   Shuffle the dataset D_QVM-train.
3:   Divide D_QVM-train into batches {Bk} of size B.
4:   for each batch Bk do
5:     for each sample (x^(i), y^(i)) ∈ Bk do
6:       Predict the q-value with the QVM: ŷ^(i) = Qφ(x^(i)).
7:       {Compute the loss:}
8:       if Regression Loss then
9:         L = (1/B) Σ_{(x^(i), y^(i))} (ŷ^(i) − y^(i))².
10:      end if
11:      if Classification Loss then
12:        L = (1/B) Σ_{(x^(i), y^(i))} BCE(ŷ^(i), y^(i)).
13:      end if
14:      Backpropagate gradients, update φ via the optimizer.
15:    end for
16:  end for
17: end for
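A possible realization of Algorithm 3 is sketched below, assuming a qvm module that maps a tokenized reasoning sequence to a scalar (for regression) or to a probability (for classification); the optimizer choice and hyperparameter values are placeholders.

import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader

def finetune_qvm(qvm, dataset, objective="regression", epochs=1, lr=1e-5, batch_size=8):
    # dataset yields (x, y) pairs; x is a tokenized reasoning sequence, y the target score.
    optimizer = torch.optim.AdamW(qvm.parameters(), lr=lr)
    loader = DataLoader(dataset, batch_size=batch_size, shuffle=True)
    for _ in range(epochs):
        for x, y in loader:
            pred = qvm(x)
            if objective == "regression":
                loss = F.mse_loss(pred, y.float())                # continuous q-value labels
            else:
                loss = F.binary_cross_entropy(pred, y.float())    # binary labels; pred in (0, 1)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return qvm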
C.3 Training Phase 2: RL Tuning of Policy with MCTS
Phase 2 involves generating reasoning sequences from the policy with MCTS and the q-value model, and fine-tuning the policy with an RL-based alignment algorithm to generate better completions. The q-value model must also be continually updated in this training loop to keep it in-distribution with the policy's outputs. Sufficient Phase 1 pre-training of the policy and q-value model is crucial to ensure stable training of the models in Phase 2. The MCTS structure, which provides a balanced exploration-exploitation search, combined with repeated sampling of the policy, ensures sufficient exploration during this online-RL phase. This final training phase returns the finetuned policy and q-value model.

C.3.1 Phase 2 Algorithm
Phase 2 uses a set Dp = {p^(i)} of prompt questions; these questions may be isolated from the Phase 1 dataset D_SFT. The training process (Algorithm 4) involves a repetition of an MCTS rollout phase followed by a training (reinforcement) phase.

Data generation: MCTS rollout. To obtain data for the training, an MCTS tree T^(i) is built w.r.t. each question p^(i) using Algorithm 1 in training mode. The set of MCTS hyperparameters Ξ_MCTS denotes the number of MCTS iterations N (per question), the number of children expanded in every selection phase M, the exploration constants c1, c2, and the backpropagation weight α. To enhance the quality of the data, we prune the generated MCTS tree, T̃^(i) = (Ñ^(i), Ẽ^(i)), to only include paths that reached a terminal state, since only these paths received the reward. Then, we extract all nodes and a set of node characteristics from the pruned tree. The dataset comprises the state, action and q-value triplets of the pruned tree: {(s_j^(i), z_j^(i), q(s_j^(i)))}_{s_j ∈ Ñ^(i)}. The data is stored in a replay buffer R.

The training: RL phase. The reinforcement phase samples a batch of reasoning sequences from the replay buffer. From each trajectory, it extracts the constituent states, actions and value estimates, and uses the corresponding values attributed during MCTS to perform RL training (for example with PPO or REINFORCE). Alternative schemes may involve selecting preference pairs among trajectories and then aligning the policy using DPO, or simply selecting the most desirable trajectory per question and performing further SFT training. During this reinforcement phase, the value model is updated to mimic the (backpropagated) values from the MCTS process (Algorithm 7).

Algorithm 4 Phase 2: RL of the Policy and Q-Value Model
Input: Policy πθ, q-value model Qφ, dataset Dp = {p^(i)}, MCTS hyperparameters Ξ_MCTS.
Output: Trained πθ and updated Qφ.

1: for each training iteration do
2:   ----- Rollout -----
3:   for each question p^(i) ∈ Dp do
4:     {Generate the MCTS tree with πθ and Qφ (Algorithm 1)}
5:     T^(i) ← MCTS(p^(i), Qφ, πθ, Ξ_MCTS)
6:     {Remove incomplete paths from the tree}
7:     T̃^(i) ← Prune(T^(i))
8:     {Extract nodes and values, store them in the replay buffer}
9:     R ← R ∪ {(s_j^(i), z_j^(i), q(s_j^(i)))}_{s_j ∈ Ñ^(i)}
10:  end for
11:  ----- Training -----
12:  for each epoch do
13:    Sample a batch B from the replay buffer R.
14:    Update the policy πθ (Algorithm 5).
15:    Update the q-value model Qφ (Algorithm 7).
16:  end for
17: end for
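The rollout stage of Algorithm 4 can be sketched as follows; run_mcts and prune are placeholders standing in for Algorithm 1 (training mode) and the path-pruning step, and the node fields mirror the replay-buffer triplets above.

def phase2_rollout(questions, run_mcts, prune, replay_buffer):
    # Sketch of the rollout stage: build a tree per question, keep only paths
    # that reached a terminal state, and store (state, step, q-value) triplets.
    for question in questions:
        tree = run_mcts(question)    # behaves like Algorithm 1 in training mode
        tree = prune(tree)           # T~: only terminated paths remain
        for node in tree:            # iterate over the remaining nodes
            replay_buffer.append((node.state, node.step, node.q))
    return replay_buffer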
C.3.2 Policy Update
The policy update is performed on a batch D of reasoning sequences. As mentioned above, the reasoning sequences can be decomposed into state-action-value triplets to then perform RL training. We distinguish between three reinforcement methods: standard RL, preference-based RL, or SFT training.

Standard Policy Gradient RL Methods. Standard policy gradient methods such as Proximal Policy Optimization (PPO) [125] or REINFORCE [1], [136] are particularly suited for tasks where trajectories are collected (online) and reliably evaluated by the q-value model Qφ.

PPO relies on the computation of trajectory (reasoning sequence) advantages Â(st), which quantify how much better or worse an action taken in a given state is compared to the expected baseline value of that state. The advantage function is estimated by:

    Â(st) = Rt + γ V(st+1) − V(st),

where Rt is the immediate environment reward at step t, V(st) is the state value of state st, and γ is the discount factor. Since rewards are sparse, we can derive the state value easily from the q-values obtained via the q-value model or from the running estimates in the MCTS as follows:

    V(st+1) = (1/γ) Qφ(st, at).

The standard PPO approach trains the critic model from scratch on bootstrapped rewards for this purpose. We introduce an alternative advantage computation scheme that leverages the backpropagated values from Monte Carlo Tree Search (MCTS) in conjunction with Qφ, as detailed in Algorithm 6. This integration combines MCTS's exploration and evaluation capabilities with the RL update, enhancing robustness and efficiency in reasoning tasks.
Further regularization can be imposed on the PPO training procedure. To align the policy πθ with a reference policy π_ref (usually instantiated as πθ before Phase 2) during training, the KL divergence KL(πθ || π_ref) between the two distributions can be added to the training loss. Additionally, to maintain the diversity of policy generations (and exploration during training), the entropy of the policy distribution can be enhanced by subtracting it from the loss. The entropy penalty is estimated over a batch D of state-action pairs (s, a), where s denotes a reasoning sequence and a the next reasoning step. The entropy of a single completion a is computed by summing the entropies of its individual tokens a_{1:|a|}:

    LH = −(1/|D|) Σ_{(s,a)∈D} Σ_{ai ∈ a} πθ(ai | [s, a_{1:i−1}]) log πθ(ai | [s, a_{1:i−1}]).
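As an illustration, a sketch of these regularization terms under the assumption that per-token log-probabilities of the current and reference policies, and the probabilities assigned to the generated tokens, are available for a sampled batch; the λ coefficients are placeholders and the sign conventions follow Algorithm 5, line 8.

import torch

def entropy_term(sampled_token_probs):
    # L_H = -(1/|D|) sum_(s,a) sum_i pi(a_i | .) log pi(a_i | .),
    # computed from the probabilities assigned to each generated token (list of 1-D tensors).
    total = sum((p * p.clamp_min(1e-12).log()).sum() for p in sampled_token_probs)
    return -total / len(sampled_token_probs)

def regularized_ppo_loss(l_ppo, logp_theta, logp_ref, sampled_token_probs,
                         lam_kl=0.1, lam_h=0.01):
    kl = (logp_theta - logp_ref).mean()     # Monte Carlo estimate of KL(pi_theta || pi_ref)
    l_h = entropy_term(sampled_token_probs)
    # L_PPO <- L_PPO + lambda_KL * KL + lambda_H * L_H (cf. Algorithm 5, line 8)
    return l_ppo + lam_kl * kl + lam_h * l_h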
Direct Preference Optimization (DPO). DPO [115] aligns the policy to user preferences expressed as pairwise comparisons between reasoning sequences, given pairs (s+, s−) where s+ is preferred over s−. This method may not require a process reward/value model. The loss involves the sigmoid function, which we denote as σ.
Supervised Fine-Tuning (SFT). As a straightforward alternative to RL, high-value reasoning sequences can be selected to perform SFT, i.e., to train the policy to maximize the likelihood of these reasoning steps. The high-value reasoning sequences may be selected as the terminal nodes having the highest q-value, or the highest aggregated intermediate-step values. This approach is inspired by AlphaZero-like frameworks, focusing on iteratively refining the policy to generate high-quality reasoning trajectories without requiring explicit rewards.

C.3.3 Advantage Calculation (for PPO Policy Updates)
While standard advantage computation in PPO (e.g., via Generalized Advantage Estimation (GAE) [125]) is widely applicable, we propose an alternative approach tailored to our reasoning framework in Algorithm 6. Specifically, for each state/node s, we leverage the q-value estimates q(s) obtained during the MCTS process. They were updated in the backpropagation phase to provide a more informed estimate of the q-values, incorporating the estimates of the children and potentially true reward signals from terminal paths in the tree. We expect these MCTS-derived values to be more reliable as they incorporate the ground-truth terminal reward, propagated back through the tree, ensuring that a node's value reflects both its immediate reward and the aggregated values of subsequent child states.

Algorithm 5 Policy Update (PPO, DPO, or SFT)
Input: Batch D, policy πθ, reference policy π_ref, learning rate η, clipping parameter ε, preference data D_pref for DPO.
Output: Updated policy πθ.

1: ----- Train via PPO -----
2: Select state-action-value triplets from the sequences in D
3: for each (st, at, qt) ∈ D do
4:   Compute the policy ratio: rθ = πθ(at | st) / π_ref(at | st).
5:   Compute the advantages Â(st) (Algorithm 6).
6:   Compute the PPO loss: LPPO = min(rθ Â(st), clip(rθ, 1 − ε, 1 + ε) Â(st)).
7: end for
8: Optional: add KL divergence or entropy regularization: LPPO ← LPPO + λKL KL(πθ || π_ref) + λH LH.
9: Perform a gradient update to refine πθ.
10:
11: ----- Train via DPO (pairwise preferences) -----
12: Select preference pairs of reasoning sequences in D
13: for each pair (s+, s−) ∈ D_pref do
14:   Compute the DPO objective: LDPO = (1/|D_pref|) Σ_{(s+, s−)} log σ(β log(πθ(s+) / πθ(s−))).
15: end for
16: Perform a gradient update to refine πθ.
17:
18: ----- Train via SFT (single target sequence) -----
19: Select high-value reasoning sequences s+ from D
20: for each reasoning sequence s+ do
21:   Perform SFT on s+
22: end for

Algorithm 6 Advantage Calculation in the MCTS Framework
Input: MCTS tree T = (N, E), node statistics (rewards and q-values), q-value model Qφ, discount factor γ, and λ.
Output: Advantages {Â(st)}.

1: for each node si ∈ N do
2:   Compute the state value: v^MCTS_{s_{i+1}} = (1/γ) q^MCTS(si)
3:   Compute the state value: v^MCTS_{s_i} = (1/γ) q^MCTS(s_{i−1})
4:   Compute the advantage based on the TD error: Â(si) = r(si, ai) + γ v^MCTS_{s_{i+1}} − v^MCTS_{s_i}.
5: end for
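A Python sketch of the advantage computation in Algorithm 6, assuming a path of nodes with MCTS-backpropagated q-values and sparse rewards; the treatment of the root state value is our assumption.

def mcts_advantages(q_mcts, rewards, gamma=1.0):
    # q_mcts[i]: backpropagated q-value of the transition taken from state s_i;
    # rewards[i]: immediate reward r(s_i, a_i), 0 for non-terminal steps.
    # Returns A(s_i) = r(s_i, a_i) + gamma * v_{s_{i+1}} - v_{s_i}.
    advantages = []
    for i in range(len(q_mcts)):
        v_next = q_mcts[i] / gamma                        # v_{s_{i+1}} = q_MCTS(s_i) / gamma
        v_curr = q_mcts[i - 1] / gamma if i > 0 else 0.0  # root value set to 0 (assumption)
        advantages.append(rewards[i] + gamma * v_next - v_curr)
    return advantages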

C.3.4 Q-Value Model Update
During Phase 2, the q-value model Qφ is also updated to track the MCTS-backtracked value estimates q^MCTS(st), which should be of higher quality (thanks to the final answer verifier and the score aggregation from child nodes). For each state-action pair (s, a), we train the q-value model Qφ via squared error minimization, to match its q-value Qφ(s, a) as closely as possible to the corresponding MCTS value q^MCTS(s′), which stores the updated q-value of action a taken in state s leading to state s′.
This has the benefit of both improving the accuracy of the value model and keeping it "in-distribution" with the new policy outputs during this online-RL training.

Algorithm 7 Q-Value Model Update
Input: Batch D, q-value model Qφ, learning rate η.
Output: Updated Qφ.
1: Compute the loss: Lq = (1/|D|) Σ_{(s,a,s′)} (Qφ(s, a) − q^MCTS(s′))².
2: Perform a gradient update on Lq.
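A sketch of Algorithm 7 in PyTorch, assuming the q-value model is a module that maps a tokenized state-action sequence to a scalar and that the batch carries the MCTS target values.

import torch
import torch.nn.functional as F

def update_qvm(qvm, optimizer, batch):
    # batch: list of (state_action_sequence, q_mcts_target) pairs from the replay buffer.
    preds = torch.stack([qvm(x) for x, _ in batch])
    targets = torch.tensor([y for _, y in batch], dtype=preds.dtype)
    loss = F.mse_loss(preds, targets)     # L_q of Algorithm 7
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()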
APPENDIX D
DATA GENERATION

D.1 Generating Data for Phase 1 Policy Model Training
The objective of this training process is to introduce a new 'End of Intermediate Step' (EOIS) token that serves to delimit individual reasoning steps while preserving the original distribution of the model as much as possible. To achieve this, the model is trained on data generated by itself using greedy decoding.

The training data are derived from eight chain-of-thought (CoT) completions generated for 1,000 questions sampled from the training split of the MATH dataset [59]. These completions are produced, with greedy decoding, by the same model intended for subsequent training. During this generation process, the reasoning steps in the data are observed to be separated by two consecutive newlines ('\n\n'). This observation informs the method of delimitation used to construct pairs of questions and their corresponding sequences of reasoning steps.

For each data point, consisting of a question prompt and its associated target response comprising multiple reasoning steps (q^(i), [s_1^(i), ..., s_n^(i)]), additional tokens are introduced to explicitly mark the boundaries of the reasoning steps. Specifically, the 'End of Intermediate Step' (EOIS) token is defined and inserted after each reasoning step s_j^(i), resulting in a modified step s_j^(i)*. Additionally, the 'End of Sequence' (EOS) token is appended to the final reasoning step s_n^(i), yielding s_n^(i)* = [s_n^(i); eos]. This augmentation ensures that the model can consistently identify when a final solution has been reached during inference.

For Llama models, it has been empirically observed that introducing an 'assistant' token after each reasoning step enhances the model's effective utilization of the EOIS token. However, this behavior may not generalize to other base models, necessitating careful consideration when applying this approach.

Accordingly, the target sequence for supervised fine-tuning (SFT) is constructed as:

    y_SFT^(i) = [s_1^(i), eois, assistant, s_2^(i), ..., s_n^(i), eos].

This approach yields a training dataset comprising pairs of prompts and their corresponding target completions, formally represented as:

    D_SFT = {(q^(i), y_SFT^(i))}.
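A sketch of this target-sequence construction; the concrete token strings are placeholders, since the actual eois, assistant, and eos tokens depend on the tokenizer in use.

EOIS = "<|eois|>"            # hypothetical 'End of Intermediate Step' token string
EOS = "<|eos|>"              # end-of-sequence token
ASSISTANT = "<|assistant|>"  # optional marker observed to help Llama models

def build_sft_target(completion, use_assistant=True):
    # Reasoning steps are delimited by two consecutive newlines in the raw CoT data.
    steps = [s for s in completion.split("\n\n") if s.strip()]
    sep = EOIS + (ASSISTANT if use_assistant else "")
    # y_SFT = [s_1, eois, assistant, s_2, ..., s_n, eos]
    return sep.join(steps) + EOS

# Example pair (prompt q^(i), target y_SFT^(i)):
question = "What is 2 + 2?"
target = build_sft_target("First, note that 2 + 2 means adding two twos.\n\nThe answer is 4.")
d_sft_pair = (question, target)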
D.2 Generating Data for Phase 1 Value Model Training
The original MCTS framework relies on simulations to evaluate a state. Given the state, n rollouts are performed till a terminal state is reached. The terminal states usually can be evaluated (e.g., in math, by comparing them with the golden answer). This enables the distribution of terminal rewards based on their success, which are then aggregated to provide a value estimate of the state. These Monte Carlo simulations serve as an estimate of a state's ability to lead to a correct answer. The value estimated in this manner corresponds to the expected cumulative future reward for a given state:

    Vπθ(s) = E_{τ∼πθ}[ Σ_{t=i}^{T} γ^{t−i} r(st, at) | si = s ],

where T is the terminal step of the (sub-)reasoning chain τ = (si, ai, ri, si+1, ..., sT, aT, rT, sT+1).

Since rewards are sparse (i.e., r(st, at) = 0 for all t < T), the value function simplifies to:

    Vπθ(st) = E_{πθ}[ γ^{T−t} r(sT, aT) | st ].

This represents the expected terminal reward, which can be empirically estimated using Monte Carlo (MC) estimates:

    Vπθ(st) ≈ (1/N) Σ_{i=1}^{N} γ^{T−t} r(s_T^(i), a_T^(i)) := V̂(st),

where N is the number of sampled reasoning chains, and s_T^(i), a_T^(i), s_{T+1}^(i) denote the last transition of the simulation trajectory τ^(i) = (s_t^(i), a_t^(i), s_{t+1}^(i), ..., s_T^(i), a_T^(i), s_{T+1}^(i)) for i ∈ {1, ..., N}.

To avoid sample inefficiencies and high computational burdens, AlphaGo Zero [129] and AlphaZero [128] introduce a value model to replace simulations by using its predictions for a state. We follow this approach by defining a process-based value model Vφ. Notably, we train this model with simulation data (instead of true value functions), thereby building a model that predicts state value function estimates V̂. We denote this model as V̂φ, parameterized by φ.

Given that the input of a value model is a sequence of reasoning steps, and therefore a sequence of tokens, the natural value model architecture is to use an LLM on top of which one adds linear layer(s) and a suitable output activation function. Typically, it is designed to output a scalar value V̂φ(st) ∈ C ⊆ R.

The core distinction between different modeling approaches to state value functions lies in how rewards are modeled. Depending on whether a binary reward setting or a continuous (bounded) one is used, the aggregation mechanism, model architecture, training loss, and interpretation of the predictions vary. We provide an overview of both scenarios and, although this is often omitted for simplicity, we consider both γ = 1 and γ ∈ (0, 1] for continuous rewards in our analysis.

D.2.1 Binary Rewards: Modeling the Likelihood of a Correct Terminal State
For this approach the rewards are modeled as binary, therefore r(sT, aT) = +1 for correct solutions and r(sT, aT) = 0 for
incorrect solutions. We will adopt a discount factor of γ = 1, which, as we will see, aligns more with the interpretation this reward model provides and is widely adopted in the literature. This approach corresponds to the value model proposed in AlphaGo Zero [129].

D.2.1.1 State Value Estimation: The value function then further simplifies to:

    Vπθ(st) = Eπθ[ r(sT, aT) | st ] = Pπθ( r(sT, aT) = 1 | st ).

This formulation represents the probability of reaching a correct terminal state from a given state st. Empirically, this probability is estimated using simulations as follows:

    Vπθ(st) ≈ #correct simulations / #simulations := V̂(st).

D.2.1.2 Data Generation: To generate labels for estimating the state-value function during the training of a value model, we use MCTS with simulations until a terminal node is reached and calculate the ratio between the number of correct simulations and the total number of simulations. There is one very important detail for a trajectory τ = (si, ai, ri, si+1, ..., sT+1) where sT+1 is a terminal state. By definition, the true state value function at sT+1 is zero. However, in training the value model, we avoid instructing it to output zero for terminal states. Instead, in a supervised learning setting, we can identify terminal states and directly compare the model's predictions against the known correct outcomes (referred to here as "golden answers"). This comparison negates the need to rely solely on the value model to estimate the value of terminal states or to determine the reward associated with transitioning into these states. During inference, while we can still recognize terminal states, we cannot evaluate them by comparing the model's output to a golden answer. Therefore, an alternative metric is necessary. We train the value model to predict whether transitioning to sT+1 leads to a correct terminal outcome. By learning the relationship between a node's content and the correctness of the resulting terminal state, the model can estimate the likelihood that a terminal state leads to a correct answer. To approximate the terminal reward during inference, we define:

    r(sT, aT, sT+1) ≈ 1_{[0.5,1]}(V̂φ(sT+1)).

Here V̂φ(sT+1) represents the value predicted by the value model for the terminal state sT+1. If this predicted likelihood exceeds a threshold (e.g., 0.5), we assign a terminal reward of 1; otherwise, we assign a reward of 0. This approach allows the value model to indirectly influence the terminal reward by predicting the likelihood of a correct outcome. Consequently, during training, terminal rewards serve as labels for terminal states in the value model. It is important to note that V̂φ(sT+1) is not used in any other context but solely to estimate the terminal reward:

    V̂φ(sT+1) ≠ V̂(sT+1).

This distinction clarifies that the predicted value for the terminal state V̂φ(sT+1) differs from the standard value function's definition V̂(sT+1) = 0.

D.2.1.3 Model Training V̂φ: S → [0, 1]: When trained with these labels, we obtain a value model V̂φ, parameterized by φ, that represents the likelihood of a correct terminal state emanating from state st. Therefore, the model will output values between 0 and 1. To accommodate the binary classification nature of this task, the model should employ a sigmoid activation function in the output layer. The training objective is then to minimize the binary cross-entropy (BCE) loss between the predicted probabilities and the empirical estimates derived from the simulations:

    L(φ) = −(1/N) Σ_{i=1}^{N} [ yi log V̂φ(s_t^(i)) + (1 − yi) log(1 − V̂φ(s_t^(i))) ],

where yi ∈ {0, 1} denotes the binary label indicating whether the i-th simulation resulted in a correct terminal state.
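A sketch of this label-generation step for the binary case; rollout and is_correct stand in for the policy-driven simulation and the golden-answer verifier, and the number of simulations is a placeholder.

def binary_value_label(state, rollout, is_correct, n_sims=8):
    # Monte Carlo label for V_hat(s_t) under binary rewards:
    # the fraction of simulations from `state` that reach a correct terminal state.
    correct = 0
    for _ in range(n_sims):
        terminal_state = rollout(state)       # simulate until a terminal node is reached
        correct += int(is_correct(terminal_state))
    return correct / n_sims                   # #correct simulations / #simulations

The resulting labels can then be plugged directly into the BCE objective above as the targets yi.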
Employing a binary reward structure offers several benefits. First of all, simplicity: binary rewards simplify the learning process, reducing the complexity associated with continuous reward signals. Moreover, the clear distinction between correct and incorrect states facilitates faster convergence during training, making this approach effective. In addition, binary classification is less susceptible to noise in reward signals, ensuring more stable value estimates. Furthermore, this approach aligns with the objectives of reinforcement learning in achieving clear and unambiguous rewards, thereby streamlining the optimization of the policy πθ.

D.2.2 Continuous and Bounded Rewards: Modeling the Expected Future Reward
We model the rewards to be continuous and bounded by allowing values in [a, b]:

    Vπθ(st) ∈ [a, b].

A common design is to set the bounds to −1 and 1, such that the terminal reward is r(sT, aT) = +1 for correct terminal states and r(sT, aT) = −1 for incorrect states. This approach models the expected future reward as a continuous and bounded value, capturing the degree of correctness or quality of the terminal state. In contrast to the binary reward structure, continuous and bounded rewards provide a more nuanced representation of the outcomes in reasoning tasks. Note that, without discounting, this approach resembles the value model proposed in AlphaZero [128].

D.2.2.1 Bounded rewards: By constraining rewards within a predefined interval [a, b], we effectively create a correctness scale where the extremities represent the definitive outcomes of the reasoning process. Specifically, the lower bound a corresponds to reaching an incorrect terminal state, while the upper bound b signifies a correct terminal state. This bounded framework mirrors the spectrum of possible correctness, allowing the model to capture varying degrees of solution quality between these extremes. Such a scale facilitates a more nuanced evaluation of intermediate states, reflecting partial correctness or varying levels of reasoning quality. Moreover, this approach ensures that the reward signals remain interpretable and consistent, fostering a clear distinction between successful and unsuccessful outcomes.
D.2.2.2 State Value Estimation: With a discount factor γ ∈ (0, 1], the value function is defined as:

    Vπθ(st) = E[ γ^{T−t} r(sT, aT) | st ],

where r(sT, aT) = b for correct terminal states and r(sT, aT) = a for incorrect ones. Empirically, this expectation is approximated by averaging the rewards of the simulations:

    Vπθ(st) ≈ (1/N) Σ_{i=1}^{N} γ^{T−t} r(s_T^(i), a_T^(i)) := V̂(st),

where N denotes the number of sampled reasoning chains, and (s_T^(i), a_T^(i), s_{T+1}^(i)) represents the final transition of the i-th simulation trajectory τ^(i) = (s_t^(i), a_t^(i), s_{t+1}^(i), ..., s_T^(i), a_T^(i), s_{T+1}^(i)) for i ∈ {1, ..., N}. If a discount factor γ ∈ (0, 1) is applied, then each terminal reward is discounted proportionally to the number of steps needed to reach the terminal state. This corresponds to the soft estimation proposed by Wang et al. [147]. We want to note that this estimator typically underestimates V due to its proneness to false negatives [55], [163].

D.2.2.3 Data Generation: Therefore, to generate labels for state-value function estimate pairs to train a value model, we use MCTS with simulations and average the outcomes of the simulations. At each newly generated node s, we simulate until a terminal node is reached and record the depth, i.e., the number of steps needed starting from s (since T is not identical per trajectory). We then record the terminal reward, which in our case is r(sT, aT) = 1 for correct and r(sT, aT) = −1 for incorrect answers. Discounted by the depth, we can average these rewards and obtain an estimate of the node value, which serves as a label for the initial value model training.

D.2.2.4 Model Training V̂φ: S → [a, b]: The value model V̂φ, parameterized by φ, is designed to predict the expected terminal reward from any given state st. To accommodate the continuous and bounded nature of this task, the model employs a scaled and shifted sigmoid activation function in the output layer, ensuring that the predictions remain within the range [a, b]. The training objective is to minimize the mean squared error (MSE) loss between the predicted values and the empirical estimates derived from the simulations:

    L(φ) = (1/N) Σ_{i=1}^{N} ( V̂φ(s_t^(i)) − γ^{T−t} r(s_T^(i), a_T^(i)) )².

We also experimented with a tanh activation output and with a linear layer with clipping of the values. However, both methods proved to be unstable in training, in contrast to the scaled and shifted sigmoid layer. A tanh and a sigmoid layer naturally bound the output but also push values towards the extremes, enhancing the separation between high and low value estimates. This characteristic can improve the model's ability to distinguish between highly correct and highly incorrect states, which is why we are particularly interested in these activation functions.
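A sketch of such a scaled and shifted sigmoid head and of the MSE objective against discounted terminal rewards; the bounds a = −1, b = 1, the discount value, and the final-token pooling are placeholders, not the exact x1 architecture.

import torch
import torch.nn as nn
import torch.nn.functional as F

class BoundedValueHead(nn.Module):
    # Predicts V_hat_phi(s_t) in (a, b) via a scaled and shifted sigmoid.
    def __init__(self, hidden_size, a=-1.0, b=1.0):
        super().__init__()
        self.linear = nn.Linear(hidden_size, 1)
        self.a, self.b = a, b

    def forward(self, last_hidden_state):
        h = last_hidden_state[:, -1, :]                  # final-token representation (assumption)
        p = torch.sigmoid(self.linear(h)).squeeze(-1)    # in (0, 1)
        return self.a + (self.b - self.a) * p            # shifted/rescaled into (a, b)

def bounded_value_loss(pred, terminal_reward, steps_to_terminal, gamma=0.95):
    # Target: gamma^(T - t) * r(s_T, a_T); both arguments are tensors of the same shape.
    target = terminal_reward * gamma ** steps_to_terminal
    return F.mse_loss(pred, target)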
D.2.2.5 Discounting: Introducing a discount factor γ aligns the value function with the incremental nature of reasoning tasks. Unlike traditional games, where all moves contribute indirectly and trajectories are not penalized for length, reasoning benefits from discouraging unnecessary or redundant steps. The inclusion of the discount factor γ ensures that rewards achieved sooner have a greater impact on the value function; the model thus incentivizes reaching correct solutions with fewer steps, which ultimately enhances efficiency and suppresses redundancies. Moreover, this models the uncertainty decay in the trajectories: the further into the future a reward lies, the more uncertain its prediction becomes. Discounting naturally reduces the reliance on these uncertain long-term rewards, thereby stabilizing the learning process by focusing on more predictable and immediate outcomes. However, the model's performance becomes sensitive to the choice of γ, requiring careful tuning to balance the influence of immediate versus long-term rewards. Balancing the discount factor is essential to ensure that the model effectively captures the importance of both progress and the final correctness of the reasoning chain.

Employing a continuous and bounded reward structure offers several benefits. Unlike binary rewards, continuous rewards provide a finer distinction between varying degrees of correctness, allowing the model to capture subtle differences in terminal states. Continuous rewards can encode more information about the quality of solutions, facilitating more informed decision-making during the search process. Bounded rewards prevent extreme values, promoting numerical stability and consistent training dynamics. However, this also shows that the choice of reward values and their scaling can significantly impact the learning process, necessitating careful calibration to ensure effective training.

D.3 State Action Value Function Modeling
The state-action value function, commonly denoted as Qπθ(st, at), represents the expected cumulative reward of taking action at in state st under policy πθ. Formally, it is defined in our framework as:

    Qπθ(st, at) = E_{τ∼πθ}[ Σ_{i=t}^{T} γ^{i−t} r(si, ai) | st, at ]
                = r(st, at) + γ E_{τ∼πθ}[ Σ_{i=t+1}^{T} γ^{i−(t+1)} r(si, ai) | st, at ]
                = r(st, at) + γ E_{st+1}[ Vπθ(st+1) | st, at ]
                = r(st, at) + γ Vπθ(st+1),    (deterministic transitions)

where T denotes the terminal step of the (sub-)reasoning chain τ = (st, at, rt, st+1, ..., sT, aT, rT, sT+1). In environments characterized by sparse rewards, where r(st, at) = 0 for all t < T, the q-value simplifies to:

    Qπθ(st, at) = γ Vπθ(st+1).

At terminal states, where the state value Vπθ(sT+1) = 0, the q-value further reduces to:

    Qπθ(sT, aT) = r(sT, aT).
D.3.1 Process-Based Q-Value Modeling
A process-based q-value model utilizes the same architecture as a process-based value model, typically leveraging an LLM enhanced with additional linear layers and an appropriate output activation function. The output is a scalar value Q̂φ(st, at) ∈ C ⊆ R. Specifically, the q-value model takes a state-action pair (comprising a sequence of past steps and the current action) and predicts the corresponding q-value based on the aforementioned formulation.

D.3.1.1 Training Data Generation: To train the q-value model, it is essential to compute the q-values for various state-action pairs. For t < T, q-values can be estimated using N Monte Carlo simulations as follows:

    Qπθ(st, at) = r(st, at) + γ Vπθ(st+1)
                = γ Vπθ(st+1)    (since r(st, at) = 0)
                ≈ γ · (1/N) Σ_{i=1}^{N} γ^{T−(t+1)} r(s_T^(i), a_T^(i))
                = (1/N) Σ_{i=1}^{N} γ^{T−t} r(s_T^(i), a_T^(i)) := Q̂(st, at),

where N is the number of sampled reasoning chains, and τ^(i) = (s_t^(i), a_t^(i), s_{t+1}^(i), ..., s_T^(i), a_T^(i), s_{T+1}^(i)) represents the i-th simulation trajectory for i ∈ {1, ..., N}. This estimation aligns with the state value estimation under the sparse reward formulation:

    Q̂(st, at) = V̂(st).

For t = T, the q-value is directly given by the immediate reward:

    Qπθ(sT, aT) = r(sT, aT) ≠ V̂(sT+1) = 0.

D.3.1.2 Reward Modeling: For q-value models, the same discussion about reward modeling applies, since the models are trained very similarly. This is why we omit it here.

D.3.2 The Difference between Value and Q-Value Models
The difference between VMs and QVMs can be easily shown in how they are used in the evaluation processes of an MCTS algorithm. QVMs predict Q̂φ(st, at), which evaluates the action at taken in state st that deterministically transitions to st+1. Thus, the value Q̂(st, at) is used to evaluate adding the node st+1 to the tree. On the other hand, for VMs, adding a node st+1 to the tree is determined by V̂(st+1) = (1/γ) Q̂φ(st, at), where γ is the discount factor.

This distinction makes the training processes different. Note that st ⌢ at = st+1. For QVMs, the training tuples are ((st, at), Q̂(st, at)) = (st+1, Q̂(st, at)) due to the deterministic transition. For VMs, the corresponding training tuples are (st+1, V̂(st+1)). Since we propose training VMs on terminal rewards for terminal states instead of assigning a label of 0, VMs and QVMs become equivalent, for evaluating the addition of node st+1, under the following transformation for any t ∈ {0, ..., T}:

    V̂(st+1) = (1/γ) Q̂φ(st, at).

We introduced q-value models since they address a critical inconsistency of value models in terminal states. Specifically, while value models assign a flat value of zero to terminal states, q-value models provide a meaningful evaluation of the final action's correctness through Qπθ(sT, aT) = r(sT, aT). This distinction is essential for accurately assessing whether a terminal step leads to a correct or incorrect response during inference.
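A sketch of the Monte Carlo q-value estimate above and of the V̂(st+1) = (1/γ) Q̂φ(st, at) relation discussed in Appendix D.3.2; rollout and terminal_reward are placeholder callables standing in for the policy-driven simulation and the verifier-based reward.

def mc_q_value(state, action, rollout, terminal_reward, gamma=1.0, n_sims=8):
    # Q_hat(s_t, a_t) ~= (1/N) sum_i gamma^(T - t) * r(s_T^(i), a_T^(i)),
    # estimated from N simulated continuations of the pair (s_t, a_t).
    total = 0.0
    for _ in range(n_sims):
        steps_to_terminal, terminal_state = rollout(state, action)  # returns (T - t, s_{T+1})
        total += (gamma ** steps_to_terminal) * terminal_reward(terminal_state)
    return total / n_sims

def value_from_q(q_hat, gamma=1.0):
    # V_hat(s_{t+1}) = (1/gamma) * Q_hat_phi(s_t, a_t), since s_t followed by a_t equals s_{t+1}.
    return q_hat / gamma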
REFERENCES
[1] A. Ahmadian, C. Cremer, M. Gallé, M. Fadaee, J. Kreutzer, O. Pietquin, A. Üstün, and S. Hooker. Back to Basics: Revisiting REINFORCE-Style Optimization for Learning from Human Feedback in LLMs. In L.-W. Ku, A. Martins, and V. Srikumar, editors, Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL ’24, pages 12248–12267, Bangkok, Thailand, Aug. 2024. Association for Computational Linguistics.
[2] J. Ahn, R. Verma, R. Lou, D. Liu, R. Zhang, and W. Yin. Large Language Models for Mathematical Reasoning: Progresses and Challenges. In N. Falk, S. Papi, and M. Zhang, editors, Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics: Student Research Workshop, EACL ’24, pages 225–237, St. Julian’s, Malta, Mar. 2024. Association for Computational Linguistics.
[3] AI-MO. Aime 2024. https://huggingface.co/datasets/AI-MO/aimo-validation-aime, July 2024. Accessed 2025-01-19.
[4] AI-MO. Amc 2024. https://huggingface.co/datasets/AI-MO/aimo-validation-amc, July 2024. Accessed 2025-01-19.
[5] A. Amini, S. Gabriel, P. Lin, R. Koncel-Kedziorski, Y. Choi, and H. Hajishirzi. MathQA: Towards Interpretable Math Word Problem Solving with Operation-Based Formalisms, May 2019. arXiv:1905.13319.
[6] J. Austin, A. Odena, M. Nye, M. Bosma, H. Michalewski, D. Dohan, E. Jiang, C. Cai, M. Terry, Q. Le, and C. Sutton. Program Synthesis with Large Language Models, Aug. 2021. arXiv:2108.07732.
[7] A. Bakhtin, L. van der Maaten, J. Johnson, L. Gustafson, and R. Girshick. PHYRE: A New Benchmark for Physical Reasoning. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Proceedings of the Thirty-third Annual Conference on Neural Information Processing Systems (NeurIPS ’19), volume 32 of Advances in Neural Information Processing Systems, pages 5082–5093, Vancouver, Canada, Dec. 2019. Curran Associates.
[8] T. Ben-Nun and T. Hoefler. Demystifying Parallel and Distributed Deep Learning: An In-depth Concurrency Analysis. ACM Comput. Surv., 52(4):65:1–65:43, Aug. 2019.
[9] M. Besta, N. Blach, A. Kubicek, R. Gerstenberger, L. Gianinazzi, J. Gajda, T. Lehmann, M. Podstawski, H. Niewiadomski, P. Nyczyk, and T. Hoefler. Graph of Thoughts: Solving Elaborate Problems with Large Language Models. Proceedings of the AAAI Conference on Artificial Intelligence, 38(16):17682–17690, Mar. 2024.
[10] M. Besta, A. C. Catarino, L. Gianinazzi, N. Blach, P. Nyczyk, H. Niewiadomski, and T. Hoefler. HOT: Higher-Order Dynamic Graph Representation Learning with Efficient Transformers. In S. Villar and B. Chamberlain, editors, Proceedings of the Second Learning on Graphs Conference (LOG ’23), volume 231 of Proceedings of Machine Learning Research, pages 15:1–15:20, Virtual Event, Nov. 2023. PMLR.
[11] M. Besta, R. Grob, C. Miglioli, N. Bernold, G. Kwaśniewski, G. Gjini, R. Kanakagiri, S. Ashkboos, L. Gianinazzi, N. Dryden, and T. Hoefler. Motif Prediction with Graph Neural Networks. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD ’22, pages 35–45, Washington DC, USA, Aug. 2022. Association for Computing Machinery.
[12] M. Besta and T. Hoefler. Parallel and Distributed Graph Neural Networks: An In-Depth Concurrency Analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(5):2584–2606, May 2024.
[13] M. Besta, A. Kubicek, R. Niggli, R. Gerstenberger, L. Weitzendorf, M. Chi, P. Iff, J. Gajda, P. Nyczyk, J. Müller, et al. Multi-Head RAG: Solving Multi-Aspect Problems with LLMs, Nov. 2024. arXiv:2406.05085.
[14] M. Besta, F. Memedi, Z. Zhang, R. Gerstenberger, N. Blach, Dataset. In H. Bouamor, J. Pino, and K. Bali, editors, Proceedings
P. Nyczyk, M. Copik, G. Kwaśniewski, J. Müller, L. Gianinazzi, of the 2023 Conference on Empirical Methods in Natural Language
et al. Demystifying Chains, Trees, and Graphs of Thoughts, Apr. Processing, EMNLP ’23, pages 7889–7901, Singapore, Dec. 2023.
2024. arXiv:2401.14295. Association for Computational Linguistics.
[15] M. Besta, L. Paleari, A. Kubicek, P. Nyczyk, R. Gerstenberger, [30] X. Chen, M. Lin, N. Schärli, and D. Zhou. Teaching Large
P. Iff, T. Lehmann, H. Niewiadomski, and T. Hoefler. Check- Language Models to Self-Debug, Oct. 2023. arXiv:2304.05128.
Embed: Effective Verification of LLM Solutions to Open-Ended [31] K. Chernyshev, V. Polshkov, E. Artemova, A. Myasnikov,
Tasks, June 2024. arXiv:2406.02524. V. Stepanov, A. Miasnikov, and S. Tilga. U-MATH: A University-
[16] M. Besta, P. Renc, R. Gerstenberger, P. Sylos Labini, A. Ziogas, Level Benchmark for Evaluating Mathematical Skills in LLMs,
T. Chen, L. Gianinazzi, F. Scheidl, K. Szenes, A. Carigiet, P. Iff, Jan. 2025. arXiv:2412.03205.
G. Kwaśniewski, R. Kanakagiri, C. Ge, S. Jaeger, J. Was, F. Vella, [32] F. Chollet. On the Measure of Intelligence, Nov. 2019.
and T. Hoefler. High-Performance and Programmable Atten- arXiv:1911.01547.
tional Graph Neural Networks with Global Tensor Formulations. [33] A. Choudhury, Y. Wang, T. Pelkonen, K. Srinivasan, A. Jain,
In Proceedings of the International Conference for High Performance S. Lin, D. David, S. Soleimanifard, M. Chen, A. Yadav, R. Tijori-
Computing, Networking, Storage and Analysis, SC ’23, Denver, CO, wala, D. Samoylov, and C. Tang. MAST: Global Scheduling of
USA, Nov. 2023. Association for Computing Machinery. ML Training Across Geo-Distributed Datacenters at Hyperscale.
[17] M. Besta, Z. Vonarburg-Shmaria, Y. Schaffner, L. Schwarz, In Proceedings of the 18th USENIX Symposium on Operating Systems
G. Kwaśniewski, L. Gianinazzi, J. Beranek, K. Janda, T. Holen- Design and Implementation, OSDI ’24, pages 563–580, Santa Clara,
stein, S. Leisinger, P. Tatkowski, E. Ozdemir, A. Balla, M. Copik, CA, USA, July 2024. USENIX Association.
P. Lindenberger, M. Konieczny, O. Mutlu, and T. Hoefler. Graph- [34] P. Christiano, J. Leike, T. B. Brown, M. Martic, S. Legg, and
MineSuite: Enabling High-Performance and Programmable D. Amodei. Deep Reinforcement Learning from Human Pref-
Graph Mining Algorithms with Set Algebra. Proc. VLDB Endow., erences, Feb. 2023. arXiv:1706.03741.
14(11):1922–1935, July 2021.
[35] K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser,
[18] Z. Bi, K. Han, C. Liu, Y. Tang, and Y. Wang. Forest-of-Thought: M. Plappert, J. Tworek, J. Hilton, R. Nakano, C. Hesse, and
Scaling Test-Time Compute for Enhancing LLM Reasoning, Dec. J. Schulman. Training Verifiers to Solve Math Word Problems,
2024. arXiv:2412.09078. Nov. 2021. arXiv:2110.14168.
[19] Y. Bisk, R. Zellers, R. Le bras, J. Gao, and Y. Choi. PIQA: Reason-
[36] M. Copik, R. Böhringer, A. Calotoiu, and T. Hoefler. FMI: Fast and
ing about Physical Commonsense in Natural Language. Proceed-
Cheap Message Passing for Serverless Functions. In Proceedings
ings of the AAAI Conference on Artificial Intelligence, 34(05):7432–
of the 37th International Conference on Supercomputing, ICS ’23,
7439, Apr. 2020.
pages 373–385, Orlando, FL, USA, June 2023. Association for
[20] R. A. Bradley and M. E. Terry. Rank Analysis of Incomplete Computing Machinery.
Block Designs: I. The Method of Paired Comparisons. Biometrika,
[37] M. Copik, G. Kwaśniewski, M. Besta, M. Podstawski, and T. Hoe-
39(3/4):324–345, Dec. 1952.
fler. SeBS: A Serverless Benchmark Suite for Function-as-a-
[21] M. Chang, J. Zhang, Z. Zhu, C. Yang, Y. Yang, Y. Jin, Z. Lan,
Service Computing. In Proceedings of the 22nd International Mid-
L. Kong, and J. He. AgentBoard: An Analytical Evaluation
dleware Conference, Middleware ’21, pages 64–78, Virtual Event,
Board of Multi-turn LLM Agents. In Proceedings of the Thirty-
Dec. 2021. Association for Computing Machinery.
eighth Annual Conference on Neural Information Processing Systems
(NeurIPS ’24), volume 37 of Advances in Neural Information Process- [38] G. Cui, L. Yuan, Z. Wang, H. Wang, W. Li, B. He, Y. Fan, T. Yu,
ing Systems, Vancouver, Canada, Dec. 2024. Curran Associates. Q. Xu, W. Chen, et al. Process Reinforcement through Implicit
Rewards. https://curvy-check-498.notion.site/Process-Reinfor
[22] E. Charniak and M. Johnson. Coarse-to-Fine n-Best Parsing and
cement-through-Implicit-Rewards-15f4fcb9c42180f1b498cc9b2ea
MaxEnt Discriminative Reranking. In K. Knight, H. T. Ng, and
f896f, Jan. 2025.
K. Oflazer, editors, Proceedings of the 43rd Annual Meeting of the
Association for Computational Linguistics, ACL ’05, pages 173–180, [39] D. De Sensi, T. De Matteis, K. Taranov, S. Di Girolamo, T. Rahn,
Ann Arbor, MI, USA, June 2005. Association for Computational and T. Hoefler. Noise in the Clouds: Influence of Network
Linguistics. Performance Variability on Application Scalability. Proc. ACM
[23] G. Chen, M. Liao, C. Li, and K. Fan. AlphaMath Almost Zero: Meas. Anal. Comput. Syst., 6(3):49:1–49:27, Dec. 2022.
Process Supervision without Process. In Proceedings of the Thirty- [40] M. DeLorenzo, A. B. Chowdhury, V. Gohil, S. Thakur, R. Karri,
eighth Annual Conference on Neural Information Processing Systems S. Garg, and J. Rajendran. Make Every Move Count: LLM-
(NeurIPS ’24), volume 37 of Advances in Neural Information Process- based High-Quality RTL Code Generation Using MCTS, Feb.
ing Systems, Vancouver, Canada, Dec. 2024. Curran Associates. 2024. arXiv:2402.03289.
[24] J. Chen, T. Li, J. Qin, P. Lu, L. Lin, C. Chen, and X. Liang. [41] X. Deng, Y. Gu, B. Zheng, S. Chen, S. Stevens, B. Wang, H. Sun,
UniGeo: Unifying Geometry Logical Reasoning via Reformulat- and Y. Su. Mind2Web: Towards a Generalist Agent for the Web.
ing Mathematical Expression. In Y. Goldberg, Z. Kozareva, and In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt,
Y. Zhang, editors, Proceedings of the 2022 Conference on Empirical and S. Levine, editors, Proceedings of the Thirty-seventh Annual
Methods in Natural Language Processing, EMNLP ’22, pages 3313– Conference on Neural Information Processing Systems (NeurIPS ’23),
3323, Abu Dhabi, United Arab Emirates, Dec. 2022. Association volume 36 of Advances in Neural Information Processing Systems,
for Computational Linguistics. pages 28091–28114, New Orleans, LA, USA, Dec. 2023. Curran
[25] J. Chen, H. Lin, X. Han, and L. Sun. Benchmarking Large Lan- Associates.
guage Models in Retrieval-Augmented Generation. Proceedings of [42] Y. Deng, W. Zhang, Z. Chen, and Q. Gu. Rephrase and Respond:
the AAAI Conference on Artificial Intelligence, 38(16):17754–17762, Let Large Language Models Ask Better Questions for Them-
Mar. 2024. selves, Apr. 2024. arXiv:2311.04205.
[26] J. Chen, J. Tang, J. Qin, X. Liang, L. Liu, E. Xing, and L. Lin. [43] X. Dong, M. Teleki, and J. Caverlee. A Survey on LLM Inference-
GeoQA: A Geometric Question Answering Benchmark Towards Time Self-Improvement, Dec. 2024. arXiv:2412.14352.
Multimodal Numerical Reasoning. In C. Zong, F. Xia, W. Li, and [44] S. Es, J. James, L. Espinosa-Anke, and S. Schockaert. RAGAS:
R. Navigli, editors, Findings of the Association for Computational Automated Evaluation of Retrieval Augmented Generation, Sept.
Linguistics: ACL-IJCNLP 2021, pages 513–523, Virtual Event, Aug. 2023. arXiv:2309.15217.
2021. Association for Computational Linguistics. [45] X. Feng, Z. Wan, M. Wen, S. M. McAleer, Y. Wen, W. Zhang, and
[27] M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. de Oliveira Pinto, J. Wang. AlphaZero-Like Tree-Search Can Guide Large Language
J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman, et al. Model Decoding and Training, Feb. 2024. arXiv:2309.17179.
Evaluating Large Language Models Trained on Code, July 2021. [46] J. Frohberg and F. Binder. CRASS: A Novel Data Set and
arXiv:2107.03374. Benchmark to Test Counterfactual Reasoning of Large Language
[28] W. Chen, X. Ma, X. Wang, and W. W. Cohen. Program of Models. In N. Calzolari, F. Béchet, P. Blache, K. Choukri, C. Cieri,
Thoughts Prompting: Disentangling Computation from Reason- T. Declerck, S. Goggi, H. Isahara, B. Maegaard, J. Mariani,
ing for Numerical Reasoning Tasks. Transactions on Machine H. Mazo, J. Odijk, and S. Piperidis, editors, Proceedings of the
Learning Research, Nov. 2023. Thirteenth Language Resources and Evaluation Conference, LREC
[29] W. Chen, M. Yin, M. Ku, P. Lu, Y. Wan, X. Ma, J. Xu, X. Wang, ’22, pages 2126–2140, Marseille, France, June 2022. European
and T. Xia. TheoremQA: A Theorem-driven Question Answering Language Resources Association.
[47] Y. Fu, L. Xue, Y. Huang, A.-O. Brabete, D. Ustiugov, Y. Patel, Processing, EMNLP ’14, pages 523–533, Doha, Qatar, Oct. 2014.
and L. Mai. ServerlessLLM: Low-Latency Serverless Inference Association for Computational Linguistics.
for Large Language Models. In Proceedings of the 18th USENIX [62] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang,
Symposium on Operating Systems Design and Implementation, OSDI and W. Chen. LoRA: Low-Rank Adaptation of Large Language
’24, pages 135–153, Santa Clara, CA, USA, July 2024. USENIX Models. In Proceedings of the Tenth International Conference on
Association. Learning Representations, ICLR ’22, Virtual Event, Apr. 2022.
[48] L. Gao, A. Madaan, S. Zhou, U. Alon, P. Liu, Y. Yang, J. Callan, [63] J. Huang and K. C.-C. Chang. Towards Reasoning in Large
and G. Neubig. PAL: Program-Aided Language Models, Jan. Language Models: A Survey. In A. Rogers, J. Boyd-Graber, and
2023. arXiv:2211.10435. N. Okazaki, editors, Findings of the Association for Computational
[49] E. Glazer, E. Erdil, T. Besiroglu, D. Chicharro, E. Chen, A. Gun- Linguistics: ACL 2023, pages 1049–1065, Toronto, Canada, July
ning, C. F. Olsson, J.-S. Denain, A. Ho, E. de Oliveira Santos, 2023. Association for Computational Linguistics.
O. Järviniemi, M. Barnett, R. Sandler, M. Vrzala, J. Sevilla, Q. Ren, [64] J. Huang, X. Chen, S. Mishra, H. S. Zheng, A. W. Yu, X. Song, and
E. Pratt, L. Levine, G. Barkley, N. Stewart, B. Grechuk, T. Grechuk, D. Zhou. Large Language Models Cannot Self-Correct Reasoning
S. V. Enugandla, and M. Wildon. FrontierMath: A Benchmark for Yet. In Proceedings of the Twelfth International Conference on Learning
Evaluating Advanced Mathematical Reasoning in AI, Dec. 2024. Representations, ICLR ’24, Vienna, Austria, May 2024.
arXiv:2411.04872. [65] J. Huang, S. S. Gu, L. Hou, Y. Wu, X. Wang, H. Yu, and
[50] A. Grattafiori, A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al- J. Han. Large Language Models Can Self-Improve, Oct. 2022.
Dahle, A. Letman, A. Mathur, A. Schelten, A. Vaughan, et al. The arXiv:2210.11610.
Llama 3 Herd of Models, Nov. 2024. arXiv:2407.21783. [66] S. Huang, W. Zhong, J. Lu, Q. Zhu, J. Gao, W. Liu, Y. Hou,
[51] X. Guan, Y. Liu, X. Lu, B. Cao, B. He, X. Han, L. Sun, J. Lou, X. Zeng, Y. Wang, L. Shang, X. Jiang, R. Xu, and Q. Liu. Planning,
B. Yu, Y. Lu, and H. Lin. Search, Verify and Feedback: Towards Creation, Usage: Benchmarking LLMs for Comprehensive Tool
Next Generation Post-Training Paradigm of Foundation Models Utilization in Real-World Complex Scenarios. In L.-W. Ku,
via Verifier Engineering, Nov. 2024. arXiv:2411.11504. A. Martins, and V. Srikumar, editors, Findings of the Association for
[52] X. Guan, L. L. Zhang, Y. Liu, N. Shang, Y. Sun, Y. Zhu, Computational Linguistics: ACL 2024, pages 4363–4400, Bangkok,
F. Yang, and M. Yang. rStar-Math: Small LLMs Can Master Thailand, Aug. 2024. Association for Computational Linguistics.
Math Reasoning with Self-Evolved Deep Thinking, Jan. 2025. [67] Y. Huang, M. Kleindessner, A. Munishkin, D. Varshney, P. Guo,
arXiv:2501.04519. and J. Wang. Benchmarking of Data-Driven Causality Discovery
Approaches in the Interactions of Arctic Sea Ice and Atmosphere.
[53] K. Guu, K. Lee, Z. Tung, P. Pasupat, and M.-W. Chang. REALM:
Frontiers in Big Data, 4(32):642182:1–642182:19, Aug. 2021.
Retrieval-Augmented Language Model Pre-Training, Feb. 2020.
arXiv:2002.08909. [68] S. Imani, L. Du, and H. Shrivastava. MathPrompter: Math-
ematical Reasoning using Large Language Models, Mar. 2023.
[54] S. Han, H. Schoelkopf, Y. Zhao, Z. Qi, M. Riddell, L. Benson, arXiv:2303.05398.
L. Sun, E. Zubova, Y. Qiao, M. Burtell, D. Peng, J. Fan, Y. Liu,
[69] A. Q. Jiang, W. Li, J. M. Han, and Y. Wu. LISA: Language models
B. Wong, M. Sailor, A. Ni, L. Nan, J. Kasai, T. Yu, R. Zhang,
of ISAbelle proofs. In Proceedings of the 6th Conference on Artificial
S. Joty, A. R. Fabbri, W. Kryscinski, X. V. Lin, C. Xiong, and