Softcot: Soft Chain-Of-Thought For Efficient Reasoning With Llms

Chain of thought reasoning for LLMS

Uploaded by

Rohit Kumar

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

33 views13 pages

Softcot: Soft Chain-Of-Thought For Efficient Reasoning With Llms

Chain of thought reasoning for LLMS

Uploaded by

Rohit Kumar

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 13

SoftCoT: Soft Chain-of-Thought for Efficient Reasoning with LLMs

Yige Xu1,2∗, Xu Guo1,∗†, Zhiwei Zeng1†, Chunyan Miao1,2

1
Joint NTU-UBC Research Centre of Excellence in Active Living for the Elderly
2
College of Computing and Data Science
Nanyang Technological University, Singapore
{yige002,xu008}@e.ntu.edu.sg, {zhiwei.zeng,ascymiao}@ntu.edu.sg

Abstract but also improves a range of reasoning-intensive

tasks (Zhang et al., 2023; Sprague et al., 2024).
Chain-of-Thought (CoT) reasoning enables It has inspired many advanced prompting frame-
Large Language Models (LLMs) to solve com-
works, marking a paradigm shift from scaling
arXiv:2502.12134v1 [cs.CL] 17 Feb 2025

plex reasoning tasks by generating intermediate

reasoning steps. However, most existing ap- training-time compute (Kojima et al., 2022) to scal-
proaches focus on hard token decoding, which ing inference-time compute (Wang et al., 2023; Yao
constrains reasoning within the discrete vocab- et al., 2023) to further boost LLM performance.
ulary space and may not always be optimal. Nevertheless, CoT’s effectiveness depends on
While recent efforts explore continuous-space the quality of intermediate thoughts, as the auto-
reasoning, they often require full-model fine-
regressive generation process can propagate er-
tuning and suffer from catastrophic forgetting,
limiting their applicability to state-of-the-art rors. To mitigate this challenge, methods like self-
LLMs that already perform well in zero-shot consistency (Wang et al., 2023) generate multiple
settings with a proper instruction. To address reasoning paths, while Tree-of-Thought (Yao et al.,
this challenge, we propose a novel approach 2023) and Graph-of-Thought (Besta et al., 2024)
for continuous-space reasoning that does not frameworks organize these paths to select higher-
require modifying the LLM. Specifically, we quality steps. Despite these improvements, such
employ a lightweight fixed assistant model to methods are computationally inefficient due to the
generate instance-specific soft thought tokens
need for extensive thought sampling.
speculatively as the initial chain of thoughts,
which are then mapped into the LLM’s repre- To enhance CoT efficiency, recent research ex-
sentation space via a trainable projection mod- plores skipping the decoding of hard tokens at in-
ule. Experimental results on five reasoning termediate steps. Methods like Continuous CoT
benchmarks demonstrate that our method en- (Cheng and Durme, 2024) and Coconut (Hao et al.,
hances LLM reasoning performance through 2024) conduct reasoning in a continuous space by
supervised, parameter-efficient fine-tuning.
using latent representations instead of discrete to-
1 Introduction ken sequences. Their results have shown to be supe-
rior to long-sequence discrete reasoning chains us-
In recent years, Large Language Models (LLMs) ing only a short length of continuous representation.
have become a cornerstone in Natural Language Yet, these methods require full-model fine-tuning,
Processing (NLP), exhibiting advanced natural lan- which incurs substantial computational costs, risks
guage understanding and generation (Brown et al., catastrophic forgetting, and limits their transferabil-
2020; Chowdhery et al., 2023; OpenAI, 2023; ity across tasks.
Dubey et al., 2024; Yang et al., 2024). Scaling We empirically observe that supervised fine-
model sizes has not only improved instruction- tuning of the LLaMA3.1-8B (Dubey et al., 2024)
following (Kojima et al., 2022) but also triggered model with a language modeling objective on rea-
emergent reasoning abilities, as first evidenced soning tasks (which is employed by both Coconut
by chain-of-thought (CoT) prompting (Wei et al., and CCoT) can lead to performance degradation
2022). CoT prompts LLMs to generate interme- compared with the zero-shot settings. We conjec-
diate reasoning steps before providing the final ture that this is due to catastrophic forgetting, a
answer, which not only enhances interpretability phenomenon also observed by Kalajdzievski (2024)
∗
The first two authors contributed equally. and Lobo et al. (2024). Thus, the methodologies of
†
Corresponding authors. Coconut, which is based on GPT-2 (Radford et al.,
2019), and CCoT, which is built upon LLaMA2- LLM’s reasoning performance. Moreover, Soft-
7B (Touvron et al., 2023), may not be directly ap- CoT employs parameter-efficient fine-tuning, ef-
plicable to more recent models such as LLaMA3.1- fectively mitigating catastrophic forgetting seen in
8B. Therefore, it is crucial to explore alternative full-model fine-tuning. Our results highlight the
methodologies that mitigate catastrophic forgetting effectiveness of using an assistant model to gen-
while effectively leveraging continuous reasoning erate soft thoughts for enhancing LLMs’ reason-
techniques in large-scale, instruction-tuned models, ing capabilities while preserving their pre-trained
which is the main research question of this work. knowlegde.
To mitigate catastrophic forgetting, a straight-
forward approach is to freeze the backbone LLM 2 Related Works
and instead optimize an external model for reason-
Early research on chain-of-thought (CoT) reason-
ing. Inspired by prompt tuning (Lester et al., 2021)
ing can be traced back to Wei et al. (2022), who
and speculative decoding (Leviathan et al., 2023),
first introduced a prompting strategy that guides
we propose an approach that utilizes an auxiliary
LLMs through decomposed intermediate reason-
small assistant model to generate a sequence of
ing steps using few-shot exemplars. Concurrently,
“thought” tokens conditioned on a task instruction
Kojima et al. (2022) demonstrated that LLMs are
followed by a specific instance (Li et al., 2023;
capable of zero-shot CoT reasoning by simply ap-
Shao et al., 2023). These tokens serve as instance-
pending the phrase “Let’s think step by step” to
specific prompts to boost LLM’s inference. Such
the prompt template. This discovery underscored
an auxiliary prompting mechanism dynamically
the latent reasoning abilities of LLMs, even in the
adapts to different reasoning tasks, thereby improv-
absence of explicit demonstrations.
ing generalization while preserving the pre-trained
Building upon these foundational works, the
knowledge of the LLM.
NLP community has extensively explored the po-
To facilitate reasoning in a continuous space, we tential of CoT reasoning. As summarized by Chu
use soft thought tokens (i.e., the last-layer hidden et al. (2024), recent advancements in CoT meth-
states from the small assistant model before map- ods can be broadly categorized into three areas:
ping to the vocabulary space) instead of discrete (1) Prompt Construction, which aims to optimize
tokens. This ensures reasoning remains within the prompts for improved CoT reasoning (Wei et al.,
continuous latent space. However, a representa- 2022; Kojima et al., 2022; Zhang et al., 2023); (2)
tional gap between the assistant model and the Topological Variants, which leverage structured
LLM may hinder effective knowledge transfer. To representations such as trees and graphs to enhance
bridge this gap, we train a projection module to CoT reasoning (Yao et al., 2023; Besta et al., 2024);
map the soft thought tokens generated by the as- and (3) Enhancement Methods, which introduce ex-
sistant model to the LLM’s representation space. ternal strategies to further improve CoT reasoning,
Training the projection module for each task can such as question decomposition (Zhou et al., 2023)
be seen as soft prompt tuning for the LLM. The and self-consistency decoding (Wang et al., 2023).
overall Soft thoughts for CoT (SoftCoT) reasoning Despite the effectiveness of these approaches, the
framework is illustrated in Figure 1. majority of existing CoT methods rely on discrete
To evaluate our proposed SoftCoT, we conduct token-by-token generation, which imposes inherent
experiments on five reasoning benchmarks and two constraints and limits their expressiveness.
state-of-the-art LLM architectures. The five bench- To address the limitations of discrete language
marks we choose include mathematical reasoning, space, an effective approach is to leverage con-
commonsense reasoning, and symbolic reasoning. tinuous representation space for reasoning. Co-
For further exploration, we create a hard version conut (Hao et al., 2024) introduces a Chain-of-
of the ASDiv dataset (Miao et al., 2020), which Continuous-Thought, while CCoT (Cheng and
requires stronger mathematical reasoning ability. Durme, 2024) employs Compressed Chain-of-
The new dataset is named “ASDiv-Aug” in this pa- Thought, generating content-rich and continuous
per. Experimental results show that SoftCoT con- contemplation tokens. Heima (Shen et al., 2025)
sistently improves accuracy on both public datasets further advances this idea by utilizing a single con-
and our augmented ASDiv-Aug dataset, demon- tinuous vector to represent compressed reasoning
strating the effectiveness of SoftCoT in enhancing tokens in multi-modal tasks. However, both Co-
(a) SoftCoT (Ours) (b) Chain-of-Thought

Large Language Model Large Language Model

Instruction for Assistant Projection Instruction for LLM

Reasoning Question Module Soft Thoughts for LLM (c) Coconut

Special Token Intermediate Reasoning Step

Soft Thoughts Generated by Assistant Answer Large Language Model

Assistant Language Model Computation in Continuous Space

Decode in Language Space

Figure 1: A comparison of SoftCoT, vanilla Chain-of-Thought, and Coconut.

conut and CCoT rely on a language modeling objec- framework answer Â from the answer Ā generated
tive for supervised fine-tuning, which is infeasible by the LLM. Then we compute the accuracy of Â
for state-of-the-art LLMs due to the catastrophic comparing with the ground-truth answer A.
forgetting problem. Moreover, Heima underper-
forms compared to its backbone model, LLaVA- 3.2 Overview of the SoftCoT Framework
CoT (Xu et al., 2024). These challenges under- SoftCoT is a novel framework designed to enhance
score the need to develop methodologies that mit- reasoning capabilities in large language models
igate catastrophic forgetting in the application of (LLMs). Given an input question Q, the frame-
continuous-space CoT reasoning. work produces both a sequence of reasoning steps
R and the final answer A. SoftCoT consists of
3 Methodology
three key components: the soft thought token gen-
3.1 Problem Definition and Notations eration module, the projection module, and the
Given a question Q = [q1 , q2 , · · · , q|Q| ], the CoT CoT reasoning module. The overall architecture is
reasoning will solve the problem on the following illustrated in Figure 1(a).
two steps: (1) Auto-regressively generate a list of The soft thought token generation module is in-
rationale steps R = [r1 , r2 , · · · , r|R| ] according spired by prompt tuning techniques (Lester et al.,
to the question; (2) Auto-regressively generate the 2021). In conventional prompt tuning, learnable
answer A = [a1 , a2 , · · · , a|A| ] according to the prompts facilitate the retrieval of knowledge stored
question as well as the rationale steps. The genera- within the LLM (Xu et al., 2023). In SoftCoT, soft
tion process can be described as: thought tokens are generated by an assistant lan-
guage model, which is typically smaller than the
ri+1 = LLM(Q; R≤i ), (1) backbone LLM (e.g., LLaMA-3.2-1B-Instruct as
aj+1 = LLM(Q; R; A≤j ), the assistant model and LLaMA-3.1-8B-Instruct as
the backbone model).
where LLM(·) indicates a large language model, A key challenge in this setup is that the assistant
R≤i = [r1 , · · · , ri ] indicates the previous gener- model can only generate discrete token sequences
ated i reasoning tokens, and A≤j = [a1 , · · · , aj ] as input to the backbone LLM, which imposes con-
indicates the previous generated j answer tokens. straints and may not always yield optimal prompts.
The majority of recent works (Zhang et al., To address this limitation, we introduce continuous
2023; Zhou et al., 2023; Yao et al., 2023) focus soft thought tokens that enable more expressive
on generating discrete hard tokens in R, which is and flexible prompting. However, a representa-
named as “Hard-CoT” in this paper. On contrast, tion gap exists between the assistant model and
some recent works (Hao et al., 2024; Cheng and the backbone LLM, necessitating an intermediate
Durme, 2024) focus on the continuous representa- transformation.
tions (a.k.a latent space) of R, which is named as To bridge this gap, the projection module maps
“Soft-CoT” in this paper. the soft thought tokens’ representations into a space
In this paper, we mannually define some rules more compatible with the backbone LLM. This en-
(e.g., regular expression matching) to extract the sures that the soft thought tokens effectively guide
the reasoning process. LLM’s reasoning process while maintaining adapt-
Finally, the CoT reasoning module leverages ability to various contexts.
both the learned soft thought tokens and word em-
beddings to generate intermediate reasoning steps 3.4 Soft Thought Tokens for CoT Reasoning
R̄ and the final answer Ā. The model is trained One of the advantages of Hard-CoT is that the
using a language modeling objective, optimizing generated discrete tokens can be tokenized by the
the learnable parameters across the rationale steps LLMs, which does not require a external mapping
and the answer spans. module. However, there are two main limitations of
Hard-CoT: (1) The decoded token space is discrete,
3.3 Prompt Tuning for CoT Reasoning
which is constrained and sometimes not optimal;
Prompt tuning for CoT reasoning aims to optimize (2) The gradient cannot backpropagate to the as-
the structure and content of the prompt template sistant model since the decoding process cut off
to enhance the reasoning capabilities of a large the gradient information. A convincing solution is
language model (LLM). This process can be math- replace the hard tokens to soft thought tokens.
ematically formulated as follows:
Generating Soft Thought Tokens with an As-

ŷ = LLM Pp (x) , (2) sistant Model To generate instance-specific soft
∗
p = arg min L(ŷ, y), thoughts, we utilize an auxiliary assistant model
p that produces soft thoughts based on the given rea-
soning task. The input to the assistant model, de-
where ŷ represents the predicted output, x denotes
noted as xassist , consists of three main components:
the input sequence, and Pp (x) is the input aug-
mented with a prompt p. The objective function
xassist = I, Q, [UNK]1:N , (3)
L(·) measures the discrepancy between the model’s
prediction ŷ and the ground-truth label y. The pri-
mary goal of prompt tuning is to determine an op-
timal prompt configuration that effectively guides where
the LLM to perform CoT reasoning with improved
accuracy and interpretability. • I represents the instructional context provided
A straightforward yet effective approach to opti- to the assistant model, guiding it on how to
mizing prompts involves leveraging an auxiliary as- generate relevant thoughts.
sistant model to generate instance-specific prompts,
• Q denotes the reasoning question that the pri-
which provide contextual hints or question sum-
mary LLM will solve, which has been defined
maries to facilitate reasoning (Li et al., 2023; Shao
in Section 3.1.
et al., 2023). In this framework, the prompt p can
be decomposed into two components: (1) a fixed,
• N [UNK] tokens serve as placeholders for the
task-specific prompt p , which remains constant
soft thoughts.
across all instances and encodes general problem-
solving heuristics, and (2) a learnable, instance-
Once the input sequence is constructed, the assis-
specific prompt p , which dynamically adapts to
tant model processes it, and the soft thought tokens
each input instance to provide tailored guidance.
are obtained as follows:
Given the rapid advancements in LLMs, many
LLMs are capable of solving complex reasoning
hassist = Assistant(xassist ), (4)
tasks under zero-shot settings. Instead of fine-
tuning the assistant model for each task, we adopt tassist = hassist
|I|+|Q|+1:|I|+|Q|+N .
a more efficient strategy by employing a relatively
small, frozen language model to generate p . This Here hassist denotes the final-layer hidden states
approach not only reduces computational costs but of the assistant model, and tassist corresponds to
also ensures stability and generalization across dif- the segment of hassist associated with the N [UNK]
ferent problem domains. By systematically in- tokens. This extracted representation serves as the
tegrating instance-specific prompting with fixed instance-specific soft thoughts, dynamically adapt-
task-specific instructions, this method enhances the ing to the input reasoning question.
Projection Module Since there exist both a rep- or problem-solving heuristics, ultimately leading
resentation gap and a dimensional gap between the to the generation of the final answer:
assistant language model and the primary LLM, a
direct utilization of tassist may lead to suboptimal R̄ = LLM(xLLM ), (7)
performance. The assistant model and the LLM of- Ā = LLM(xLLM , R̄),
ten operate in different embedding spaces, with dis-
Â = E(Ā),
tinct hidden state distributions and dimensionalities.
To bridge this gap, we introduce a projection mod-
where E(·) is mannual rules for answer extraction.
ule that maps the assistant-generated soft thoughts
By integrating both fixed task-specific instruc-
tassist from the assistant model’s embedding space
tions and instance-specific soft thought tokens, our
to the LLM’s embedding space:
approach enables the LLM to systematically de-
t = Linearθ (tassist ), (5) compose complex reasoning tasks while leverag-
ing auxiliary knowledge provided by the assistant
where Linearθ : Rdassist → RdLLM is a trainable model. The structured input ensures that the LLM
projection layer parameterized by θ. This layer en- benefits from both general domain knowledge and
sures that the assistant-generated soft thoughts are tailored instance-level guidance, ultimately improv-
transformed into a suitable format for the LLM, pre- ing its reasoning effectiveness.
serving relevant semantic information while adapt-
ing to the LLM’s feature space. Parameter-Efficient Training In this work, we
By incorporating this projection module, we ef- focus on reasoning tasks that include annotated rea-
fectively mitigate discrepancies between the assis- soning steps, which provide explicit intermediate
tant model and the LLM, enabling smooth inte- reasoning trajectories leading to the final answer.
gration of instance-specific thoughts into the CoT To effectively train the model, we employ the stan-
reasoning process. This design ensures that the dard language modeling objective (also known as
learned thought tokens are both informative and next-token prediction) to supervise the generation
compatible, thereby enhancing the overall reason- of soft thoughts. During the training stage, the
ing performance of the LLM. input sequence is structured as follows:
LLM Reasoning with Soft CoT With instance-
xtrain = p , Q, t , R, A . (8)
specific soft thought tokens generated by the assis-
tant model and mapped to the LLM’s embedding
To effectively learn the soft thoughts, we apply
space, we proceed to the final step: applying these
the negative log-likelihood (NLL) loss over the rea-
soft thoughts to aid LLMs in CoT reasoning.
soning steps and the answer span. Specifically, we
The input to the LLM, denoted as xLLM , follows
mask the tokens before the intermediate reasoning
a structure similar to that of xassist :
steps to prevent the model from directly relying on
xLLM = p , Q, t , (6) them during loss computation. Instead, the model
is trained to generate the reasoning steps R and
where
final answer A in an autoregressive manner.
• p is the task-specific instruction, which is
a fixed prompt shared across all instances of 4 Experiments
the same task. It provides general problem-
4.1 Datasets
solving heuristics and instructions relevant to
the reasoning task. We conduct experiments on five reasoning datasets
spanning three categories of reasoning: mathemati-
• t is the instance-specific soft thoughts com- cal reasoning, commonsense reasoning, and sym-
puted by Eq (5). This component dynamically bolic reasoning. For mathematical reasoning, we
adapts soft thought tokens to each input ques- utilize GSM8K (Cobbe et al., 2021), ASDiv (Miao
tion, enhancing contextual understanding. et al., 2020), and AQuA (Ling et al., 2017).
With this structured input, the LLM generates For commonsense reasoning, we employ Strate-
step-by-step reasoning chains, following the princi- gyQA (Geva et al., 2021). For symbolic reasoning,
ples of CoT reasoning. The reasoning process un- we use Date Understanding (BIG.Bench.authors,
folds by systematically applying logical deductions 2023) from the BIG-benchmark.
Given that LLaMA-3.1-8B-Instruct is a well- with findings from prior studies, including Kala-
trained LLM, we augment the ASDiv dataset to jdzievski (2024) and Lobo et al. (2024), which have
ensure that the model encounters novel instances. reported similar issues.
Specifically, we replicate each instance five times (2) Incorporating [UNK] tokens mitigates per-
and systematically extract and replace numerical formance variance: We examined the effect of
values in the questions with randomly selected alter- directly adding [UNK] tokens as thoughts t in
natives. This augmentation is designed to evaluate Eq. (6). The results demonstrate a slight improve-
the model’s reasoning capability rather than its abil- ment in overall performance and a reduction in vari-
ity to recognize patterns from memorized data. The ance. The [UNK] token, also known as the “pause
augmented dataset is named as “ASDiv-Aug” in token” (Goyal et al., 2024), appears to expand the
the following part of this paper. All detail statistics model’s computational capacity, leading to more
of the datasets is shown in Table 1. stable and consistent outputs.
(3) Assistant model is effective to facilitate CoT
4.2 Baselines reasoning: We utilize instruction to require the
We consider the following baselines: assistant model generate some hard prompts, which
can be regarded as the initial thoughts for CoT
Coconut (Hao et al., 2024). It enables LLMs reasoning. Experiment results show that although
to reason in a continuous latent space by itera- it has a larger variance than CoT-Unk, it facilitates
tively feeding hidden states as input embeddings, the LLM for more diverse CoT generation, which
allowing for more efficient and flexible multi-path leads to the performance improvement from 68.49
reasoning compared to traditional language-based to 69.67 in average.
chain-of-thought methods. (4) SoftCoT consistently benefits from the su-
pervised fine-tuning: Overall, our proposed Soft-
Zero-Shot CoT To evaluate whether our trained CoT consistently outperforms baselines across all
model has performance degration after supervised five reasoning datasets, involving the mathemat-
fine-tuning, we apply zero-shot CoT based on the ical reasoning, the commonsense reasoning, and
prompt templates from Sprague et al. (2024). the symbolic reasoning. The experimental result
highlights that our SoftCoT benefits from the su-
Zero-Shot CoT-Unk We directly add some
pervised fine-tuning and mitigates the catastrophic
[UNK] tokens to represent the un-tuned prompts
forgetting problems in state-of-the-art LLMs.
for CoT reasoning. This baseline exams the effec-
tiveness of the tuned prompts.
5.2 Generalization to Other LLM Backbones
Zero-Shot Assist-CoT We directly require the In addition to LLaMA-3.1, we evaluate Soft-
assistant model generates some hard prompt tokens CoT on another state-of-the-art LLM family:
for CoT reasoning. This baseline exams the effec- Qwen2.5 (Yang et al., 2024). Specifically, we se-
tiveness of the tuned soft prompts. lect Qwen2.5-7B-Instruct as the backbone LLM to
assess the generalization capability of SoftCoT. As
5 Results and Discussions shown in Table 3, our analysis yields the following
three key findings:
5.1 Comparison with Baselines
(1) SoftCoT is effective across different back-
To evaluate SoftCoT, we compare its performance bone models: Experimental results on Qwen2.5-
against the baselines introduced in Section 4.2. The 7B-Instruct show that SoftCoT consistently im-
results are summarized in Table 2: proves performance across all reasoning datasets,
(1) Coconut is not applicable to larger language underscoring its robustness. These findings suggest
models: We modify and run the official imple- that SoftCoT serves as a generalizable framework
mentation of Coconut, adapting it to LLaMA-3.1- that can be effectively adapted to diverse state-of-
8B-Instruct. Our findings indicate that Coconut the-art LLM architectures.
exhibits performance degradation following super- (2) SoftCoT enhances LLMs’ weaker areas
vised fine-tuning with the language modeling ob- while preserving their strengths: Experiments
jective, which can be attributed to the catastrophic on both LLaMA and Qwen LLMs reveal that Soft-
forgetting phenomenon. This observation aligns CoT yields the most significant improvements in
Dataset Task Type Answer Type # Train samples # Evaluation samples
GSM8K Number 7,473 1,319
ASDiv-Aug Mathematical Number 4,183 1,038
AQuA Option 97,467 254
StrategyQA Commonsense Yes/No 1,832 458
DU Symbolic Option - 250

Table 1: Summary statistics of fiva datasets we used. “-” indicates that there is no training samples available.

GSM8K ASDiv-Aug AQuA StrategyQA DU Avg.

Model
Mathematical Commonsense Symbolic
GPT-2
Coconut (Hao et al., 2024) 34.10∗±1.50 38.92†±0.00 22.83†±0.00 - - -

LLaMA-3.1-8B-Instruct
Zero-Shot CoT 79.61±0.81 86.78±0.63 54.65±2.43 65.63±3.31 54.40±2.40 68.21
Zero-Shot CoT-Unk 79.95±0.59 86.90±0.41 55.28±1.88 66.16±2.70 54.16±1.46 68.49
Zero-Shot Assist-CoT 80.76±1.53 86.96±0.46 55.83±2.98 66.55±3.99 58.24±3.56 69.67
Coconut (Hao et al., 2024)† 76.12±0.00 86.80±0.00 53.15±0.00 - - -
SoftCoT (Ours) 81.03±0.42 87.19±0.40 56.30±1.67 69.04±1.23 59.04±1.93 70.52

Table 2: Model comparison with baselines. “DU” indicates the Date Understanding (BIG.Bench.authors, 2023)
dataset. The first row is under the backbone of GPT-2 (Radford et al., 2019) as backbone. The following rows are
under the backbone of LLaMA-3.1-8B-Instruct (Dubey et al., 2024). The last two rows are models trained via the
language modeling objective. We run for 5 random seeds and report the average accuracy as well as the standard
variance. “*” indicates that the accuracy is reported by Hao et al. (2024). “† ” indicates the results that we modify
and run the official code of Coconut. ±0.00 indicates that we only run once for baseline results.

GSM8K ASDiv-Aug AQuA StrategyQA DU Avg.

Model
Mathematical Commonsense Symbolic
Zero-Shot CoT 83.70±0.78 87.19±0.28 64.53±3.27 49.65±3.18 66.40±2.26 70.29
Zero-Shot CoT-Unk 84.12±0.71 86.94±0.89 64.72±2.06 50.74±1.90 66.48±1.43 70.60
Zero-Shot Assist-CoT 84.85±0.35 88.63±1.05 64.96±2.83 52.71±2.65 67.04±2.84 71.64
SoftCoT (Ours) 85.81±1.82 88.90±1.01 72.44±2.19 60.61±1.55 67.52±2.92 75.06

Table 3: Model performance using Qwen2.5-7B-Instruct.

commonsense reasoning tasks, where LLMs typi- ing samples, we train the model on other similar
cally underperform compared to mathematical rea- datasets and apply zero-shot transfer to evaluate its
soning. This advantage may stem from SoftCoT’s generalization on Date Understanding. The results
ability to generate contextually relevant continu- indicate that SoftCoT consistently enhances per-
ous thought processes, effectively activating the formance in zero-shot domain transfer scenarios,
corresponding knowledge areas within the model. further demonstrating its adaptability.
Furthermore, SoftCoT helps mitigate catastrophic
forgetting in domains where LLMs already excel, 5.3 Model Analysis and More Studies
such as mathematical reasoning, thereby preserv- 5.3.1 Model-Related Factors
ing and reinforcing existing capabilities. To better understand SoftCoT, we conduct exper-
(3) SoftCoT facilitates domain transfer: Given iments to examine the impact of the number of
that the Date Understanding dataset lacks train- thought tokens. The results, presented in Figure 2,
SoftCoT Model N =1 N = 10
Zero-Shot Assist-CoT Zero-Shot CoT 79.61 90.37
87.5
Accuracy (%)

Zero-Shot CoT Zero-Shot CoT-Unk 79.95 90.43

Zero-Shot Assist-CoT 80.76 90.54
87
SoftCoT 81.03 90.98
86.5
Table 4: Self Consistency for SoftCoT on LLaMA-3.1-
8B-Instruct. “N ” indicates the number of reasoning
86 chains.
12 4 6 8 16 24 32
Number of Thought Tokens
To further assess the effectiveness of Soft-
Figure 2: The impact of thought token numbers in CoT, we conduct experiments incorporating self-
ASDiv-Aug using LLaMA-3.1-8B-Instruct. consistency. As shown in Table 4, SoftCoT con-
sistently outperforms baseline models, demonstrat-
ing that its benefits are complementary to those
lead to the following key observations:
of self-consistency rather than being redundant or
(1) Soft thoughts reduce the required number of
conflicting. This suggests that SoftCoT introduces
thought tokens: We observe that SoftCoT achieves
an independent improvement mechanism, which
optimal performance with only six thought tokens,
can be effectively combined with self-consistency
whereas Zero-Shot Assist-CoT requires 24 thought
for enhanced reasoning performance.
tokens to reach a similar level of effectiveness. This
A key advantage of SoftCoT in this context is its
suggests that soft thoughts, which operate in a con-
ability to provide a more expressive and compact
tinuous space, exhibit a stronger representational
representation of intermediate reasoning steps in
capacity than hard thoughts expressed in the dis-
continuous space. Unlike traditional CoT reason-
crete language space. Our experiments indicate
ing, where discrete thought tokens may introduce
that the optimal number of hard thought tokens is
inconsistencies or redundant reasoning paths, Soft-
approximately four times that of soft thought to-
CoT enables more efficient reasoning trajectories
kens, aligning with the 5x ratio reported by Cheng
with fewer tokens. This allows self-consistency
and Durme (2024).
methods to aggregate results from higher-quality
(2) SoftCoT mitigates the catastrophic forgetting
reasoning paths, leading to a more robust and accu-
problem: Experimental results show that SoftCoT
rate final prediction.
consistently outperforms Zero-Shot CoT across all
tested numbers of soft thought tokens. In contrast,
6 Conclusion
Zero-Shot Assist-CoT underperforms compared to
Zero-Shot CoT when the number of thought tokens In this paper, we introduce SoftCoT, a soft chain-
is insufficient. This is likely because the assistant of-thought prompting approach for efficient LLM
model fails to generate a sufficiently informative reasoning. SoftCoT consists of three steps: (1) an
set of thought tokens under these constraints, intro- assistant model generates soft thought tokens, (2) a
ducing noise and leading to confusion in the LLM’s projection module trained to map the soft thoughts
reasoning process. to LLM’s representation space, and (3) the LLM
applies soft thoughts for reasoning. To enhance
5.3.2 Model-Orthogonal Factors efficiency, SoftCoT speculatively generates all the
Self-Consistency (Wang et al., 2023) is a widely soft thought tokens in a single forward pass. To mit-
adopted technique for enhancing Chain-of-Thought igate the catastrophic forgetting, SoftCoT freezes
(CoT) reasoning by expanding the search space. the backbone LLM and only tunes the projection
One of the most straightforward implementations module. Experiments on five datasets across three
involves generating multiple CoT reasoning paths types of reason tasks demonstrate the effectiveness
and determining the final answer through majority of our proposed SoftCoT. Experiments on multi-
voting. This approach is effective in mitigating ple LLMs as well as orthogonal method such as
errors in reasoning steps by leveraging the diversity self-consistency shows the robustness of SoftCoT,
of generated thought processes. which can be adapted in widely scenarios.
Limitations Annual Conference on Neural Information Process-
ing Systems 2020, NeurIPS 2020, December 6-12,
While SoftCoT represents a promising advance- 2020, virtual.
ment in Chain-of-Thought (CoT) reasoning within
Jeffrey Cheng and Benjamin Van Durme. 2024. Com-
a continuous space, several limitations must be ac- pressed chain of thought: Efficient reasoning
knowledged. through dense representations. arXiv preprint
arXiv:2412.13171.
SoftCoT Cannot Fully Replace the Reasoning
Aakanksha Chowdhery, Sharan Narang, Jacob Devlin,
Path : Although SoftCoT employs soft thought
Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul
tokens for reasoning, it does not entirely replace Barham, Hyung Won Chung, Charles Sutton, Sebas-
the reasoning path. The decoding stage functions tian Gehrmann, et al. 2023. PaLM: Scaling language
as a search process, which is a crucial component modeling with pathways. Journal of Machine Learn-
of CoT reasoning. Soft thought tokens primarily ing Research, 24(240):1–113.
serve to enrich the probability space for exploration Zheng Chu, Jingchang Chen, Qianglong Chen, Weijiang
rather than acting as the search mechanism itself. Yu, Tao He, Haotian Wang, Weihua Peng, Ming Liu,
Bing Qin, and Ting Liu. 2024. Navigate through
Need for Further Empirical Evidence on Scala- enigmatic labyrinth A survey of chain of thought rea-
bility : SoftCoT has been evaluated on LLaMA- soning: Advances, frontiers and future. In Proceed-
ings of the 62nd Annual Meeting of the Association
3.1-8B-Instruct and Qwen2.5-7B-Instruct. How- for Computational Linguistics (Volume 1: Long Pa-
ever, larger backbone LLMs exist within the same pers), ACL 2024, Bangkok, Thailand, August 11-16,
model families. While its success on models with 2024, pages 1173–1203. Association for Computa-
approximately 7–8 billion parameters suggests po- tional Linguistics.
tential applicability to larger-scale models, its scal- Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian,
ability to extremely large LLMs remains an open Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias
question and requires thorough empirical valida- Plappert, Jerry Tworek, Jacob Hilton, Reiichiro
Nakano, Christopher Hesse, and John Schulman.
tion. 2021. Training verifiers to solve math word prob-
lems. arXiv preprint arXiv:2110.14168.

References Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey,

Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman,
Maciej Besta, Nils Blach, Ales Kubicek, Robert Ger- Akhil Mathur, Alan Schelten, Amy Yang, Angela
stenberger, Michal Podstawski, Lukas Gianinazzi, Fan, et al. 2024. The llama 3 herd of models. arXiv
Joanna Gajda, Tomasz Lehmann, Hubert Niewiadom- preprint arXiv:2407.21783.
ski, Piotr Nyczyk, and Torsten Hoefler. 2024. Graph
of thoughts: Solving elaborate problems with large Mor Geva, Daniel Khashabi, Elad Segal, Tushar Khot,
language models. In Thirty-Eighth AAAI Conference Dan Roth, and Jonathan Berant. 2021. Did aristotle
on Artificial Intelligence, AAAI 2024, Thirty-Sixth use a laptop? A question answering benchmark with
Conference on Innovative Applications of Artificial implicit reasoning strategies. Trans. Assoc. Comput.
Intelligence, IAAI 2024, Fourteenth Symposium on Linguistics, 9:346–361.
Educational Advances in Artificial Intelligence, EAAI
2014, February 20-27, 2024, Vancouver, Canada, Sachin Goyal, Ziwei Ji, Ankit Singh Rawat, Aditya Kr-
pages 17682–17690. AAAI Press. ishna Menon, Sanjiv Kumar, and Vaishnavh Nagara-
jan. 2024. Think before you speak: Training lan-
BIG.Bench.authors. 2023. Beyond the imitation game: guage models with pause tokens. In The Twelfth
Quantifying and extrapolating the capabilities of lan- International Conference on Learning Representa-
guage models. Trans. Mach. Learn. Res., 2023. tions, ICLR 2024, Vienna, Austria, May 7-11, 2024.
OpenReview.net.
Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Shibo Hao, Sainbayar Sukhbaatar, DiJia Su, Xian Li,
Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Zhiting Hu, Jason Weston, and Yuandong Tian. 2024.
Neelakantan, Pranav Shyam, Girish Sastry, Amanda Training large language models to reason in a contin-
Askell, Sandhini Agarwal, Ariel Herbert-Voss, uous latent space. arXiv preprint arXiv:2412.06769.
Gretchen Krueger, Tom Henighan, Rewon Child,
Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Damjan Kalajdzievski. 2024. Scaling laws for forget-
Clemens Winter, Christopher Hesse, Mark Chen, Eric ting when fine-tuning large language models. arXiv
Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, preprint arXiv:2401.05605.
Jack Clark, Christopher Berner, Sam McCandlish,
Alec Radford, Ilya Sutskever, and Dario Amodei. Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yu-
2020. Language models are few-shot learners. In Ad- taka Matsuo, and Yusuke Iwasawa. 2022. Large lan-
vances in Neural Information Processing Systems 33: guage models are zero-shot reasoners. In Advances
in Neural Information Processing Systems 35: An- 2023, 23-29 July 2023, Honolulu, Hawaii, USA, vol-
nual Conference on Neural Information Processing ume 202 of Proceedings of Machine Learning Re-
Systems 2022, NeurIPS 2022, New Orleans, LA, USA, search, pages 30706–30775. PMLR.
November 28 - December 9, 2022.
Xuan Shen, Yizhou Wang, Xiangxi Shi, Yanzhi
Brian Lester, Rami Al-Rfou, and Noah Constant. 2021. Wang, Pu Zhao, and Jiuxiang Gu. 2025. Efficient
The power of scale for parameter-efficient prompt reasoning with hidden thinking. arXiv preprint
tuning. In Proceedings of the 2021 Conference on arXiv:2501.19201.
Empirical Methods in Natural Language Processing,
EMNLP 2021, Virtual Event / Punta Cana, Domini- Zayne Sprague, Fangcong Yin, Juan Diego Rodriguez,
can Republic, 7-11 November, 2021, pages 3045– Dongwei Jiang, Manya Wadhwa, Prasann Singhal,
3059. Association for Computational Linguistics. Xinyu Zhao, Xi Ye, Kyle Mahowald, and Greg Dur-
rett. 2024. To cot or not to cot? chain-of-thought
Yaniv Leviathan, Matan Kalman, and Yossi Matias. helps mainly on math and symbolic reasoning. arXiv
2023. Fast inference from transformers via spec- preprint arXiv:2409.12183.
ulative decoding. In International Conference on
Machine Learning, ICML 2023, 23-29 July 2023, Hugo Touvron, Louis Martin, Kevin Stone, Peter Al-
Honolulu, Hawaii, USA, volume 202 of Proceedings bert, Amjad Almahairi, Yasmine Babaei, Nikolay
of Machine Learning Research, pages 19274–19286. Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti
PMLR. Bhosale, et al. 2023. Llama 2: Open founda-
tion and fine-tuned chat models. arXiv preprint
Zekun Li, Baolin Peng, Pengcheng He, Michel Galley, arXiv:2307.09288.
Jianfeng Gao, and Xifeng Yan. 2023. Guiding large
language models via directional stimulus prompting. Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V.
In Advances in Neural Information Processing Sys- Le, Ed H. Chi, Sharan Narang, Aakanksha Chowd-
tems 36: Annual Conference on Neural Information hery, and Denny Zhou. 2023. Self-consistency
Processing Systems 2023, NeurIPS 2023, New Or- improves chain of thought reasoning in language
leans, LA, USA, December 10 - 16, 2023. models. In The Eleventh International Conference
on Learning Representations, ICLR 2023, Kigali,
Wang Ling, Dani Yogatama, Chris Dyer, and Phil Blun- Rwanda, May 1-5, 2023. OpenReview.net.
som. 2017. Program induction by rationale genera-
tion: Learning to solve and explain algebraic word Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten
problems. In Proceedings of the 55th Annual Meet- Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V. Le,
ing of the Association for Computational Linguistics, and Denny Zhou. 2022. Chain-of-thought prompting
ACL 2017, Vancouver, Canada, July 30 - August 4, elicits reasoning in large language models. In Ad-
Volume 1: Long Papers, pages 158–167. Association vances in Neural Information Processing Systems 35:
for Computational Linguistics. Annual Conference on Neural Information Process-
ing Systems 2022, NeurIPS 2022, New Orleans, LA,
Elita A. Lobo, Chirag Agarwal, and Himabindu USA, November 28 - December 9, 2022.
Lakkaraju. 2024. On the impact of fine-tuning
on chain-of-thought reasoning. arXiv preprint Guowei Xu, Peng Jin, Hao Li, Yibing Song, Lichao
arXiv:2411.15382. Sun, and Li Yuan. 2024. Llava-cot: Let vision lan-
guage models reason step-by-step. arXiv preprint
Shen-Yun Miao, Chao-Chun Liang, and Keh-Yih Su. arXiv:2411.10440.
2020. A diverse corpus for evaluating and developing
english math word problem solvers. In Proceedings Yige Xu, Zhiwei Zeng, and Zhiqi Shen. 2023. Efficient
of the 58th Annual Meeting of the Association for cross-task prompt tuning for few-shot conversational
Computational Linguistics, ACL 2020, Online, July emotion recognition. In Findings of the Association
5-10, 2020, pages 975–984. Association for Compu- for Computational Linguistics: EMNLP 2023, pages
tational Linguistics. 11654–11666, Singapore. Association for Computa-
tional Linguistics.
OpenAI. 2023. GPT-4 technical report. arXiv preprint
arXiv:2303.08774. An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui,
Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu,
Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jian-
Dario Amodei, Ilya Sutskever, et al. 2019. Language hong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang,
models are unsupervised multitask learners. OpenAI Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu,
blog, 1(8):9. Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng
Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tian-
Zhihong Shao, Yeyun Gong, Yelong Shen, Min- hao Li, Tingyu Xia, Xingzhang Ren, Xuancheng
lie Huang, Nan Duan, and Weizhu Chen. 2023. Ren, Yang Fan, Yang Su, Yichang Zhang, Yu Wan,
Synthetic prompting: Generating chain-of-thought Yuqiong Liu, Zeyu Cui, Zhenru Zhang, and Zihan
demonstrations for large language models. In Inter- Qiu. 2024. Qwen2.5 technical report. arXiv preprint
national Conference on Machine Learning, ICML arXiv:2412.15115.
Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran,
Tom Griffiths, Yuan Cao, and Karthik Narasimhan.
2023. Tree of thoughts: Deliberate problem solving
with large language models. In Advances in Neural
Information Processing Systems 36: Annual Confer-
ence on Neural Information Processing Systems 2023,
NeurIPS 2023, New Orleans, LA, USA, December 10
- 16, 2023.
Zhuosheng Zhang, Aston Zhang, Mu Li, and Alex
Smola. 2023. Automatic chain of thought prompting
in large language models. In The Eleventh Inter-
national Conference on Learning Representations,
ICLR 2023, Kigali, Rwanda, May 1-5, 2023. Open-
Review.net.

Denny Zhou, Nathanael Schärli, Le Hou, Jason Wei,

Nathan Scales, Xuezhi Wang, Dale Schuurmans,
Claire Cui, Olivier Bousquet, Quoc V. Le, and Ed H.
Chi. 2023. Least-to-most prompting enables com-
plex reasoning in large language models. In The
Eleventh International Conference on Learning Rep-
resentations, ICLR 2023, Kigali, Rwanda, May 1-5,
2023. OpenReview.net.
A Prompt Templates Place Holder.
We release the examples for GSM8K for reference.

Example Input-Output for Zero-Shot CoT (GSM8K)

Input:
Solve the following math problem efficiently and clearly:
- For simple problems (2 steps or fewer): Provide a concise solution with minimal equation.
- For complex problems (3 steps or more):
Use this step-by-step format:
## Step 1: [Brief calculations]
## Step 2: [Brief calculations]
...
Regardless of the approach, always conclude with:
Therefore, the final answer is: $ \boxed{{answer}}$. I hope it is correct.
Where [answer] is just the final number or expression that solves the problem.
Problem: John runs 60 miles a week. He runs 3 days a week. He runs 3 hours the first day and
half as much the other two days he runs. How fast does he run?
Answer:
## Step 1: Calculate the total hours John runs in a week
To find out how much he runs in total in a week, we multiply the number of hours he runs on
the first day by 3 (for the 3 days in a week) and add the same multiplied by 0.5 (since he runs
half as much on the other two days).
## Step 2: Calculate the number of hours John runs in a week
John runs 3 hours on the first day, so he runs (3 * 3) + (3 * 0.5 * 2) hours in a week.
## Step 3: Calculate John’s total hours and miles run in a week
3 * 3 = 9 hours for the first day, and 3 * 0.5 * 2 = 3 hours for the other two days. His total
hours are 9 + 3 = 12 hours in a week.
## Step 4: Calculate John’s speed in miles per hour
John runs 60 miles in 12 hours. So, we divide 60 / 12 to find his speed.
60 / 12 = 5 miles per hour.
Therefore, the final answer is: $\boxed{5}$.
Evaluate:
Incorrect, the ground-truth is $\boxed{10}$.
Example Input-Output for Zero-Shot Assist-CoT (GSM8K)

Input:
Solve the following math problem efficiently and clearly:
- For simple problems (2 steps or fewer): Provide a concise solution with minimal equation.
- For complex problems (3 steps or more):
Use this step-by-step format:
## Step 1: [Brief calculations]
## Step 2: [Brief calculations]
...
Regardless of the approach, always conclude with:
Therefore, the final answer is: $ \boxed{{answer}}$. I hope it is correct.
Where [answer] is just the final number or expression that solves the problem.
Problem: John runs 60 miles a week. He runs 3 days a week. He runs 3 hours the first day and
half as much the other two days he runs. How fast does he run?
There are some prompts generated by a weaker assistant model. Some prompts maybe useful
while others maybe unuseful for your reasoning. If the prompts are correct, you can use it as
reference. If the prompts are not correct, you can ignore them and focus back to solving the
problem.
Here are prompts:
1. a
2. miles
3. week
4. days
5. 3-mile
6. run
Answer:
## Step 1: Calculate the total hours John runs in a week
John runs 3 hours on the first day. On the other two days, he runs half as much, which is 3 / 2
= 1.5 hours per day. So, the total hours John runs in a week is 3 + 1.5 + 1.5 = 6 hours.
## Step 2: Calculate John’s speed
To find John’s speed, divide the total distance he runs in a week (60 miles) by the total hours
he runs in a week (6 hours). Speed = Distance / Time = 60 miles / 6 hours = 10 miles per hour.
Therefore, the final answer is: $\boxed{10}$.
Evaluate:
Correct!