SoftCoT: Soft Chain-of-Thought for Efficient Reasoning with LLMs
Coconut and CCoT rely on a language modeling objective for supervised fine-tuning, which is infeasible for state-of-the-art LLMs due to the catastrophic forgetting problem. Moreover, Heima underperforms compared to its backbone model, LLaVA-CoT (Xu et al., 2024). These challenges underscore the need to develop methodologies that mitigate catastrophic forgetting in the application of continuous-space CoT reasoning.

3 Methodology

3.1 Problem Definition and Notations

Given a question Q = [q_1, q_2, ..., q_|Q|], CoT reasoning solves the problem in the following two steps: (1) autoregressively generate a list of rationale steps R = [r_1, r_2, ..., r_|R|] according to the question; (2) autoregressively generate the answer A = [a_1, a_2, ..., a_|A|] according to the question as well as the rationale steps. The generation process can be described as:

    r_{i+1} = LLM(Q; R_{≤i}),    (1)
    a_{j+1} = LLM(Q; R; A_{≤j}),

where LLM(·) denotes a large language model, R_{≤i} = [r_1, ..., r_i] denotes the first i generated reasoning tokens, and A_{≤j} = [a_1, ..., a_j] denotes the first j generated answer tokens.

The majority of recent works (Zhang et al., 2023; Zhou et al., 2023; Yao et al., 2023) focus on generating discrete hard tokens in R, which we refer to as "Hard-CoT" in this paper. In contrast, some recent works (Hao et al., 2024; Cheng and Durme, 2024) focus on continuous representations (a.k.a. the latent space) of R, which we refer to as "Soft-CoT" in this paper.

In this paper, we manually define some rules (e.g., regular expression matching) to extract the final answer Â from the answer Ā generated by the LLM. We then compute the accuracy of Â by comparing it with the ground-truth answer A.
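The paper does not list these rules explicitly, so the following is a minimal sketch of what such an extraction rule set might look like for the math datasets, assuming outputs follow the \boxed{...} template shown in the appendix; the numeric fallback is our own assumption, not the paper's exact rule.

```python
import re

def extract_answer(generated: str) -> str | None:
    """Extract the final answer (A-hat) from the raw LLM output (A-bar).

    Hypothetical rule set: prefer the last \\boxed{...} span (matching the
    prompt template shown in the appendix), then fall back to the last
    number in the text.
    """
    boxed = re.findall(r"\\boxed\{([^{}]*)\}", generated)
    if boxed:
        return boxed[-1].strip()
    numbers = re.findall(r"-?\d+(?:\.\d+)?", generated.replace(",", ""))
    return numbers[-1] if numbers else None

def accuracy(predictions: list[str | None], ground_truths: list[str]) -> float:
    """Accuracy of extracted answers against the ground-truth answers A."""
    correct = sum(p == g for p, g in zip(predictions, ground_truths))
    return correct / len(ground_truths)
```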
3.2 Overview of the SoftCoT Framework

SoftCoT is a novel framework designed to enhance reasoning capabilities in large language models (LLMs). Given an input question Q, the framework produces both a sequence of reasoning steps R and the final answer A. SoftCoT consists of three key components: the soft thought token generation module, the projection module, and the CoT reasoning module. The overall architecture is illustrated in Figure 1(a).

The soft thought token generation module is inspired by prompt tuning techniques (Lester et al., 2021). In conventional prompt tuning, learnable prompts facilitate the retrieval of knowledge stored within the LLM (Xu et al., 2023). In SoftCoT, soft thought tokens are generated by an assistant language model, which is typically smaller than the backbone LLM (e.g., LLaMA-3.2-1B-Instruct as the assistant model and LLaMA-3.1-8B-Instruct as the backbone model).

A key challenge in this setup is that the assistant model can only generate discrete token sequences as input to the backbone LLM, which imposes constraints and may not always yield optimal prompts. To address this limitation, we introduce continuous soft thought tokens that enable more expressive and flexible prompting. However, a representation gap exists between the assistant model and the backbone LLM, necessitating an intermediate transformation.

To bridge this gap, the projection module maps the soft thought tokens' representations into a space more compatible with the backbone LLM. This ensures that the soft thought tokens effectively guide the LLM's reasoning process while maintaining adaptability to various contexts. Finally, the CoT reasoning module leverages both the learned soft thought tokens and word embeddings to generate intermediate reasoning steps R̄ and the final answer Ā. The model is trained using a language modeling objective, optimizing the learnable parameters across the rationale steps and the answer spans.
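To make the data flow concrete, the following sketch composes the three modules into one inference pass; the class and method names are ours, with the real definitions given in Section 3.4.

```python
from dataclasses import dataclass
from typing import Any

@dataclass
class SoftCoTPipeline:
    """One inference pass through the three SoftCoT components."""
    assistant: Any   # frozen small LM producing soft thought representations
    projection: Any  # trainable map into the backbone's embedding space
    llm: Any         # frozen backbone LLM performing CoT reasoning

    def answer(self, question: str) -> str:
        t_assist = self.assistant.soft_thoughts(question)  # Sec. 3.4, Eq. (4)
        t = self.projection(t_assist)                      # Eq. (5)
        return self.llm.reason(question, t)                # Eqs. (6)-(7)
```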
3.3 Prompt Tuning for CoT Reasoning
Prompt tuning for CoT reasoning aims to optimize the structure and content of the prompt template to enhance the reasoning capabilities of a large language model (LLM). This process can be mathematically formulated as follows:

    ŷ = LLM(P_p(x)),    (2)
    p* = argmin_p L(ŷ, y),

where ŷ represents the predicted output, x denotes the input sequence, and P_p(x) is the input augmented with a prompt p. The objective function L(·) measures the discrepancy between the model's prediction ŷ and the ground-truth label y. The primary goal of prompt tuning is to determine an optimal prompt configuration that effectively guides the LLM to perform CoT reasoning with improved accuracy and interpretability.

A straightforward yet effective approach to optimizing prompts involves leveraging an auxiliary assistant model to generate instance-specific prompts, which provide contextual hints or question summaries to facilitate reasoning (Li et al., 2023; Shao et al., 2023). In this framework, the prompt p can be decomposed into two components: (1) a fixed, task-specific prompt p_task, which remains constant across all instances and encodes general problem-solving heuristics, and (2) a learnable, instance-specific prompt p_inst, which dynamically adapts to each input instance to provide tailored guidance.

Given the rapid advancements in LLMs, many LLMs are capable of solving complex reasoning tasks under zero-shot settings. Instead of fine-tuning the assistant model for each task, we adopt a more efficient strategy by employing a relatively small, frozen language model to generate p_inst. This approach not only reduces computational costs but also ensures stability and generalization across different problem domains. By systematically integrating instance-specific prompting with fixed task-specific instructions, this method enhances the LLM's reasoning process.
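As a concrete illustration of Eq. (2), the sketch below implements P_p(x) for the soft-prompt case in PyTorch: p is a set of learnable embeddings prepended to the embedded input, and p* is obtained by gradient descent on L. All names and dimensions are illustrative, not the paper's implementation.

```python
import torch
from torch import nn

class SoftPrompt(nn.Module):
    """P_p(x) for soft prompts: prepend learnable embeddings p to input x."""

    def __init__(self, n_prompt: int, d_model: int):
        super().__init__()
        # p: the learnable prompt, optimized while the LLM stays frozen.
        self.prompt = nn.Parameter(torch.randn(n_prompt, d_model) * 0.02)

    def forward(self, input_embeds: torch.Tensor) -> torch.Tensor:
        batch_size = input_embeds.size(0)
        p = self.prompt.unsqueeze(0).expand(batch_size, -1, -1)
        return torch.cat([p, input_embeds], dim=1)

# p* = argmin_p L(yhat, y): only the prompt parameters receive gradients, e.g.
# optimizer = torch.optim.AdamW(soft_prompt.parameters(), lr=1e-3)
```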
3.4 Soft Thought Tokens for CoT Reasoning

One of the advantages of Hard-CoT is that the generated discrete tokens can be consumed directly through the LLM's tokenizer, requiring no external mapping module. However, Hard-CoT has two main limitations: (1) the decoded token space is discrete, which is constrained and sometimes suboptimal; (2) the gradient cannot backpropagate to the assistant model, since the decoding process cuts off the gradient information. A natural solution is to replace the hard tokens with soft thought tokens.

Generating Soft Thought Tokens with an Assistant Model  To generate instance-specific soft thoughts, we utilize an auxiliary assistant model that produces soft thoughts based on the given reasoning task. The input to the assistant model, denoted as x_assist, consists of three main components:

    x_assist = [I; Q; [UNK]_{1:N}],    (3)

where

• I represents the instructional context provided to the assistant model, guiding it on how to generate relevant thoughts.

• Q denotes the reasoning question that the primary LLM will solve, as defined in Section 3.1.

• N [UNK] tokens serve as placeholders for the soft thoughts.

Once the input sequence is constructed, the assistant model processes it, and the soft thought tokens are obtained as follows:

    h_assist = Assistant(x_assist),    (4)
    t_assist = h_assist[|I|+|Q|+1 : |I|+|Q|+N].

Here, h_assist denotes the final-layer hidden states of the assistant model, and t_assist corresponds to the segment of h_assist associated with the N [UNK] tokens. This extracted representation serves as the instance-specific soft thoughts, dynamically adapting to the input reasoning question.
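A minimal sketch of Eqs. (3)–(4) with HuggingFace transformers is given below; the model name, the instruction string, and the use of the tokenizer's unk_token are our assumptions (in practice the assistant's chat template and special tokens would need care).

```python
import torch
from transformers import AutoModel, AutoTokenizer

ASSISTANT = "meta-llama/Llama-3.2-1B-Instruct"  # assistant model from Sec. 3.2
tokenizer = AutoTokenizer.from_pretrained(ASSISTANT)
assistant = AutoModel.from_pretrained(ASSISTANT).eval()  # frozen

def soft_thoughts(instruction: str, question: str, n: int) -> torch.Tensor:
    """Eqs. (3)-(4): run [I; Q; [UNK]_{1:N}] through the assistant and return
    the final-layer hidden states over the N [UNK] placeholder positions."""
    unk = tokenizer.unk_token or "<unk>"
    inputs = tokenizer(instruction + question + unk * n, return_tensors="pt")
    with torch.no_grad():  # the assistant is frozen; no gradients needed here
        h_assist = assistant(**inputs).last_hidden_state  # (1, |I|+|Q|+N, d)
    # The placeholders occupy the last n positions of the sequence.
    return h_assist[:, -n:, :]  # t_assist: (1, N, d_assist)
```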
Projection Module  Since there exist both a representation gap and a dimensional gap between the assistant language model and the primary LLM, a direct utilization of t_assist may lead to suboptimal performance. The assistant model and the LLM often operate in different embedding spaces, with distinct hidden state distributions and dimensionalities. To bridge this gap, we introduce a projection module that maps the assistant-generated soft thoughts t_assist from the assistant model's embedding space to the LLM's embedding space:

    t = Linear_θ(t_assist),    (5)

where Linear_θ : R^{d_assist} → R^{d_LLM} is a trainable projection layer parameterized by θ. This layer ensures that the assistant-generated soft thoughts are transformed into a suitable format for the LLM, preserving relevant semantic information while adapting to the LLM's feature space.

By incorporating this projection module, we effectively mitigate discrepancies between the assistant model and the LLM, enabling smooth integration of instance-specific thoughts into the CoT reasoning process. This design ensures that the learned thought tokens are both informative and compatible, thereby enhancing the overall reasoning performance of the LLM.
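Eq. (5) amounts to a single trainable linear layer; a sketch follows, with the hidden sizes of LLaMA-3.2-1B (2048) and LLaMA-3.1-8B (4096) used purely for illustration.

```python
import torch
from torch import nn

d_assist, d_llm = 2048, 4096             # illustrative hidden sizes
projection = nn.Linear(d_assist, d_llm)  # Linear_theta in Eq. (5)

t_assist = torch.randn(1, 4, d_assist)   # N = 4 soft thoughts from the assistant
t = projection(t_assist)                 # t: (1, 4, d_llm), in the LLM's space
```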
LLM Reasoning with Soft CoT  With instance-specific soft thought tokens generated by the assistant model and mapped to the LLM's embedding space, we proceed to the final step: applying these soft thoughts to aid LLMs in CoT reasoning. The input to the LLM, denoted as x_LLM, follows a structure similar to that of x_assist:

    x_LLM = [p_task; Q; t],    (6)

where

• p_task is the task-specific instruction, which is a fixed prompt shared across all instances of the same task. It provides general problem-solving heuristics and instructions relevant to the reasoning task.

• t is the instance-specific soft thoughts computed by Eq. (5). This component dynamically adapts soft thought tokens to each input question, enhancing contextual understanding.

With this structured input, the LLM generates step-by-step reasoning chains, following the principles of CoT reasoning. The reasoning process unfolds by systematically applying logical deductions or problem-solving heuristics, ultimately leading to the generation of the final answer:

    R̄ = LLM(x_LLM),    (7)
    Ā = LLM(x_LLM; R̄),
    Â = E(Ā),

where E(·) denotes the manual rules for answer extraction. By integrating both fixed task-specific instructions and instance-specific soft thought tokens, our approach enables the LLM to systematically decompose complex reasoning tasks while leveraging auxiliary knowledge provided by the assistant model. The structured input ensures that the LLM benefits from both general domain knowledge and tailored instance-level guidance, ultimately improving its reasoning effectiveness.
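Because the backbone LLM can consume embeddings directly, Eq. (6) can be realized by concatenating the projected soft thoughts after the embedded discrete tokens. A sketch under assumed HuggingFace-style interfaces (the function and argument names are ours):

```python
import torch

def build_llm_input(llm, llm_tokenizer, p_task: str, question: str,
                    t: torch.Tensor) -> torch.Tensor:
    """Eq. (6): x_LLM = [p_task; Q; t], assembled in embedding space."""
    ids = llm_tokenizer(p_task + question, return_tensors="pt").input_ids
    word_embeds = llm.get_input_embeddings()(ids)  # (1, |p|+|Q|, d_llm)
    return torch.cat([word_embeds, t], dim=1)      # (1, |p|+|Q|+N, d_llm)

# Eq. (7): generate the rationale R-bar and answer A-bar from x_LLM, e.g.
# out = llm.generate(inputs_embeds=x_llm, max_new_tokens=512)
```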
Parameter-Efficient Training  In this work, we focus on reasoning tasks that include annotated reasoning steps, which provide explicit intermediate reasoning trajectories leading to the final answer. To effectively train the model, we employ the standard language modeling objective (also known as next-token prediction) to supervise the generation of soft thoughts. During the training stage, the input sequence is structured as follows:

    x_train = [p_task; Q; t; R; A].    (8)

To effectively learn the soft thoughts, we apply the negative log-likelihood (NLL) loss over the reasoning steps and the answer span. Specifically, we mask the tokens before the intermediate reasoning steps to prevent the model from directly relying on them during loss computation. Instead, the model is trained to generate the reasoning steps R and the final answer A in an autoregressive manner.
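The loss masking described above can be sketched as follows, using the -100 ignore-index convention of HuggingFace-style causal LMs; reasoning_start (the index where R begins) is an assumed bookkeeping value.

```python
import torch

IGNORE_INDEX = -100  # label value skipped by the cross-entropy / NLL loss

def build_labels(input_ids: torch.Tensor, reasoning_start: int) -> torch.Tensor:
    """Mask everything before R in Eq. (8) so the NLL loss covers only the
    rationale steps R and the answer span A."""
    labels = input_ids.clone()
    labels[:, :reasoning_start] = IGNORE_INDEX
    return labels

# The model is then supervised autoregressively, e.g.
# loss = llm(inputs_embeds=x_train_embeds, labels=labels).loss
```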
4 Experiments

4.1 Datasets

We conduct experiments on five reasoning datasets spanning three categories: mathematical, commonsense, and symbolic reasoning. For mathematical reasoning, we use GSM8K (Cobbe et al., 2021), ASDiv (Miao et al., 2020), and AQuA (Ling et al., 2017). For commonsense reasoning, we employ StrategyQA (Geva et al., 2021). For symbolic reasoning, we use Date Understanding (BIG.Bench.authors, 2023) from the BIG-bench benchmark.

Given that LLaMA-3.1-8B-Instruct is a well-trained LLM, we augment the ASDiv dataset to ensure that the model encounters novel instances. Specifically, we replicate each instance five times and systematically extract and replace numerical values in the questions with randomly selected alternatives. This augmentation is designed to evaluate the model's reasoning capability rather than its ability to recognize patterns from memorized data. The augmented dataset is named "ASDiv-Aug" in the remainder of this paper. Detailed statistics of the datasets are shown in Table 1.
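A minimal sketch of this augmentation step; the sampling range for replacement values is our assumption, since the paper does not specify how alternatives are drawn.

```python
import random
import re

def augment_question(question: str, rng: random.Random) -> str:
    """Replace each integer in the question with a random alternative of
    comparable magnitude (the range below is an illustrative choice)."""
    def swap(match: re.Match) -> str:
        value = int(match.group())
        return str(rng.randint(max(value // 2, 2), value * 2 + 1))
    return re.sub(r"\d+", swap, question)

rng = random.Random(0)
question = "John runs 60 miles a week. He runs 3 days a week."
# Replicate each instance five times with fresh numbers, as described above.
augmented = [augment_question(question, rng) for _ in range(5)]
```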
4.2 Baselines

We consider the following baselines:

Coconut (Hao et al., 2024)  Coconut enables LLMs to reason in a continuous latent space by iteratively feeding hidden states back as input embeddings, allowing for more efficient and flexible multi-path reasoning than traditional language-based chain-of-thought methods.

Zero-Shot CoT  To evaluate whether our trained model suffers performance degradation after supervised fine-tuning, we apply zero-shot CoT based on the prompt templates from Sprague et al. (2024).

Zero-Shot CoT-Unk  We directly add [UNK] tokens to represent un-tuned prompts for CoT reasoning. This baseline examines the effectiveness of the tuned prompts.

Zero-Shot Assist-CoT  We directly require the assistant model to generate hard prompt tokens for CoT reasoning. This baseline examines the effectiveness of the tuned soft prompts.
5 Results and Discussions

5.1 Comparison with Baselines

To evaluate SoftCoT, we compare its performance against the baselines introduced in Section 4.2. The results are summarized in Table 2.

(1) Coconut is not applicable to larger language models: We modify and run the official implementation of Coconut, adapting it to LLaMA-3.1-8B-Instruct. Our findings indicate that Coconut exhibits performance degradation following supervised fine-tuning with the language modeling objective, which can be attributed to the catastrophic forgetting phenomenon. This observation aligns with findings from prior studies, including Kalajdzievski (2024) and Lobo et al. (2024), which have reported similar issues.

(2) Incorporating [UNK] tokens mitigates performance variance: We examined the effect of directly adding [UNK] tokens as the thoughts t in Eq. (6). The results demonstrate a slight improvement in overall performance and a reduction in variance. The [UNK] token, also known as the "pause token" (Goyal et al., 2024), appears to expand the model's computational capacity, leading to more stable and consistent outputs.

(3) The assistant model effectively facilitates CoT reasoning: We instruct the assistant model to generate hard prompts, which can be regarded as initial thoughts for CoT reasoning. Experimental results show that although this baseline has a larger variance than CoT-Unk, it enables the LLM to produce more diverse CoT generations, improving the average performance from 68.49 to 69.67.

(4) SoftCoT consistently benefits from supervised fine-tuning: Overall, our proposed SoftCoT consistently outperforms the baselines across all five reasoning datasets, covering mathematical, commonsense, and symbolic reasoning. The experimental results highlight that SoftCoT benefits from supervised fine-tuning and mitigates the catastrophic forgetting problem in state-of-the-art LLMs.

5.2 Generalization to Other LLM Backbones

In addition to LLaMA-3.1, we evaluate SoftCoT on another state-of-the-art LLM family: Qwen2.5 (Yang et al., 2024). Specifically, we select Qwen2.5-7B-Instruct as the backbone LLM to assess the generalization capability of SoftCoT. As shown in Table 3, our analysis yields the following three key findings:

(1) SoftCoT is effective across different backbone models: Experimental results on Qwen2.5-7B-Instruct show that SoftCoT consistently improves performance across all reasoning datasets, underscoring its robustness. These findings suggest that SoftCoT serves as a generalizable framework that can be effectively adapted to diverse state-of-the-art LLM architectures.

(2) SoftCoT enhances LLMs' weaker areas while preserving their strengths: Experiments on both LLaMA and Qwen LLMs reveal that SoftCoT yields the most significant improvements in commonsense reasoning tasks, where LLMs typically underperform compared to mathematical reasoning. This advantage may stem from SoftCoT's ability to generate contextually relevant continuous thought processes, effectively activating the corresponding knowledge areas within the model. Furthermore, SoftCoT helps mitigate catastrophic forgetting in domains where LLMs already excel, such as mathematical reasoning, thereby preserving and reinforcing existing capabilities.

(3) SoftCoT facilitates domain transfer: Given that the Date Understanding dataset lacks training samples, we train the model on other similar datasets and apply zero-shot transfer to evaluate its generalization on Date Understanding. The results indicate that SoftCoT consistently enhances performance in zero-shot domain transfer scenarios, further demonstrating its adaptability.
Dataset      Task Type      Answer Type   # Train samples   # Evaluation samples
GSM8K        Mathematical   Number        7,473             1,319
ASDiv-Aug    Mathematical   Number        4,183             1,038
AQuA         Mathematical   Option        97,467            254
StrategyQA   Commonsense    Yes/No        1,832             458
DU           Symbolic       Option        -                 250

Table 1: Summary statistics of the five datasets we used. "-" indicates that no training samples are available.
Model                         GSM8K        ASDiv-Aug    AQuA         StrategyQA   DU           Avg.
LLaMA-3.1-8B-Instruct
Zero-Shot CoT                 79.61±0.81   86.78±0.63   54.65±2.43   65.63±3.31   54.40±2.40   68.21
Zero-Shot CoT-Unk             79.95±0.59   86.90±0.41   55.28±1.88   66.16±2.70   54.16±1.46   68.49
Zero-Shot Assist-CoT          80.76±1.53   86.96±0.46   55.83±2.98   66.55±3.99   58.24±3.56   69.67
Coconut (Hao et al., 2024)†   76.12±0.00   86.80±0.00   53.15±0.00   -            -            -
SoftCoT (Ours)                81.03±0.42   87.19±0.40   56.30±1.67   69.04±1.23   59.04±1.93   70.52

Table 2: Model comparison with baselines. "DU" indicates the Date Understanding (BIG.Bench.authors, 2023) dataset. The first row is reported with GPT-2 (Radford et al., 2019) as the backbone; the following rows use LLaMA-3.1-8B-Instruct (Dubey et al., 2024) as the backbone. The last two rows are models trained via the language modeling objective. We run 5 random seeds and report the average accuracy as well as the standard deviation. "*" indicates accuracy reported by Hao et al. (2024). "†" indicates results obtained by modifying and running the official code of Coconut. ±0.00 indicates that the baseline was run only once.
5.3 Model Analysis and More Studies

5.3.1 Model-Related Factors

To better understand SoftCoT, we conduct experiments to examine the impact of the number of thought tokens. The results are presented in Figure 2.
[Figure 2: Accuracy (%) as the number of thought tokens varies, comparing SoftCoT and Zero-Shot Assist-CoT; Zero-Shot CoT reference values: 79.61 (N = 1), 90.37 (N = 10).]
Example Input-Output for Zero-Shot CoT (GSM8K)
Input:
Solve the following math problem efficiently and clearly:
- For simple problems (2 steps or fewer): Provide a concise solution with minimal equation.
- For complex problems (3 steps or more):
Use this step-by-step format:
## Step 1: [Brief calculations]
## Step 2: [Brief calculations]
...
Regardless of the approach, always conclude with:
Therefore, the final answer is: $ \boxed{{answer}}$. I hope it is correct.
Where [answer] is just the final number or expression that solves the problem.
Problem: John runs 60 miles a week. He runs 3 days a week. He runs 3 hours the first day and
half as much the other two days he runs. How fast does he run?
Answer:
## Step 1: Calculate the total hours John runs in a week
To find out how much he runs in total in a week, we multiply the number of hours he runs on
the first day by 3 (for the 3 days in a week) and add the same multiplied by 0.5 (since he runs
half as much on the other two days).
## Step 2: Calculate the number of hours John runs in a week
John runs 3 hours on the first day, so he runs (3 * 3) + (3 * 0.5 * 2) hours in a week.
## Step 3: Calculate John’s total hours and miles run in a week
3 * 3 = 9 hours for the first day, and 3 * 0.5 * 2 = 3 hours for the other two days. His total
hours are 9 + 3 = 12 hours in a week.
## Step 4: Calculate John’s speed in miles per hour
John runs 60 miles in 12 hours. So, we divide 60 / 12 to find his speed.
60 / 12 = 5 miles per hour.
Therefore, the final answer is: $\boxed{5}$.
Evaluate:
Incorrect, the ground-truth is $\boxed{10}$.
Example Input-Output for Zero-Shot Assist-CoT (GSM8K)
Input:
Solve the following math problem efficiently and clearly:
- For simple problems (2 steps or fewer): Provide a concise solution with minimal equation.
- For complex problems (3 steps or more):
Use this step-by-step format:
## Step 1: [Brief calculations]
## Step 2: [Brief calculations]
...
Regardless of the approach, always conclude with:
Therefore, the final answer is: $ \boxed{{answer}}$. I hope it is correct.
Where [answer] is just the final number or expression that solves the problem.
Problem: John runs 60 miles a week. He runs 3 days a week. He runs 3 hours the first day and
half as much the other two days he runs. How fast does he run?
There are some prompts generated by a weaker assistant model. Some prompts maybe useful
while others maybe unuseful for your reasoning. If the prompts are correct, you can use it as
reference. If the prompts are not correct, you can ignore them and focus back to solving the
problem.
Here are prompts:
1. a
2. miles
3. week
4. days
5. 3-mile
6. run
Answer:
## Step 1: Calculate the total hours John runs in a week
John runs 3 hours on the first day. On the other two days, he runs half as much, which is 3 / 2
= 1.5 hours per day. So, the total hours John runs in a week is 3 + 1.5 + 1.5 = 6 hours.
## Step 2: Calculate John’s speed
To find John’s speed, divide the total distance he runs in a week (60 miles) by the total hours
he runs in a week (6 hours). Speed = Distance / Time = 60 miles / 6 hours = 10 miles per hour.
Therefore, the final answer is: $\boxed{10}$.
Evaluate:
Correct!