
Guiding Large Language Models with Divide-and-Conquer Program for Discerning Problem Solving

Yizhou Zhang 1, Lun Du 2, Defu Cao 1, Qiang Fu 2, Yan Liu 1

arXiv:2402.05359v1 [cs.AI] 8 Feb 2024

Abstract

Foundation models, such as Large Language Models (LLMs), have attracted a significant amount of interest due to their large number of applications. Existing works show that appropriate prompt design, such as Chain-of-Thoughts, can unlock an LLM's powerful capacity in diverse areas. However, when handling tasks involving repetitive sub-tasks and/or deceptive contents, such as arithmetic calculation and article-level fake news detection, existing prompting strategies either suffer from insufficient expressive power or from intermediate errors triggered by hallucination. To make LLMs more discerning of such intermediate errors, we propose to guide the LLM with a Divide-and-Conquer program that simultaneously ensures superior expressive power and disentangles the task decomposition, sub-task resolution, and resolution assembly processes. Theoretic analysis reveals that our strategy can guide LLMs beyond the expressive power of fixed-depth Transformers. Experiments indicate that our proposed method achieves better performance than typical prompting strategies on tasks bothered by intermediate errors and deceptive contents, such as large integer multiplication, hallucination detection, and misinformation detection.

1 University of Southern California, 2 Microsoft Research Asia. Correspondence to: Yizhou Zhang <[email protected]>.

Preprint. Under Review.

1. Introduction

Large Language Models (LLMs) based on the Transformer architecture have led to major breakthroughs in natural language processing and other related fields of artificial intelligence (Brown et al., 2020; Radford et al.; Touvron et al., 2023). State-of-the-art general-purpose language models have demonstrated remarkable advancements in various domains, including question answering, graph learning, reading comprehension, text generation, and machine translation (Chen et al., 2023b; Tan et al., 2023; Hendy et al., 2023; Mao et al., 2023; Zong & Krishnamachari, 2023). These developments pave the way towards general-purpose problem solvers (Bubeck et al., 2023).

However, as pointed out in (Wei et al., 2022), significant challenges arise when scaled-up models are applied to tasks involving long solution paths, such as those requiring mathematical or knowledge reasoning. A series of theoretic works attribute this challenge to the Parallelism Tradeoff (Merrill & Sabharwal, 2023a), a fundamental limitation of Transformers. Specifically, unlike Recurrent Neural Networks, whose computational depth is linear in the input sequence length (i.e., the depth is O(n), where n is the input sequence length), the Transformer does not contain any recurrent structure. This design, while achieving superior parallelizability over RNNs, leaves Transformers with limited expressive power. Merrill & Sabharwal proved that the expressive power of fixed-depth log-precision Transformers, which are very close to the most commonly applied Transformer architectures for LLMs, is bounded by constant-depth logspace-uniform threshold circuits. Thus, they fail to accurately tackle tasks requiring long solution paths.

To address this challenge, carefully designed prompting strategies have been developed to tackle tasks that require stronger expressive power (Feng et al., 2023). A series of works focus on prompting the LLM to output the intermediate steps that derive the final answer in an autoregressive manner, such as Chain-of-Thoughts (CoT) (Wei et al., 2022; Wang et al., 2022; Yao et al., 2023; Zhou et al., 2022; Chen et al., 2023a). Theoretically, these prompting strategies convert the role of the Transformer from a complete problem solver to a sub-problem solver in a dynamic programming or tree searching algorithm (Merrill & Sabharwal, 2023b). In this way, these prompting strategies expand the expressive power of LLMs and successfully improve their reasoning and searching abilities (Feng et al., 2023). However, in most of these prompting strategies, the processes of sub-problem decomposition, sub-problem resolution, and sub-resolution assembly are intertwined during autoregressive token decoding. Such design leaves the sub-problem generation process without control from the human or agent side and susceptible to disruptions from the task resolution process.

Task: Verify the truthfulness of a summary.

#Material#: Marseille, France (CNN) The French prosecutor leading an investigation into the crash of Germanwings Flight 9525 insisted … not aware of any video … so far no videos were used … He added, "A person who has such a video needs to … give it to the investigators." … Jean-Marc Menichini … that the reports were "completely wrong" and …

Example of Entangled Problem Solving
#Summary#: A video showing the final moments of Germanwings Flight 9525 has been recovered by investigators from the wreckage site. Marseille prosecutor Brice Robin urged anyone who might have more footage to turn it over immediately. Andreas Lubitz, the co-pilot accused of deliberately crashing the plane, had a history of severe depression and suicidal tendencies.
Q: You are given the above material and a summary. Please answer me: Does the summary contain any claim that is contradicted with the material?
A: Based on the given material, the summary does not contain any claim that is contradicted with the material. The material confirms the existence of a video showing the final moments of Germanwings Flight 9525, as well as the history of severe depression and suicidal tendencies of the co-pilot, Andreas Lubitz.

Example of Disentangled Problem Solving
Sub-task 1:
Q: Does the claim 1 contradict with the material? #Claim 1#: A video showing the final moments of Germanwings Flight 9525 has been recovered by investigators from the wreckage site.
A: Yes, the claim is contradicted by the material. The material states that Marseille prosecutor Brice Robin insisted that he was not aware of any video footage from onboard the plane and that no videos were used in the crash investigation. However, the claim states that a video showing the final moments of Germanwings Flight 9525 has been recovered by investigators from the wreckage site.
Sub-task 2:
Q: Does the claim 2 contradict with the material? #Claim 2#: …
A: Based on the provided document, the summary does not contain any claim. …
Sub-task 3:
Q: Does the claim 3 contradict with the material? #Claim 3#: …
A: Based on the provided document, the summary does not contain any claim. …

Figure 1. An illustrative example of hallucination detection with entangled problem solving (i.e., directly forwarding all inputs into the LLM) and disentangled problem solving (i.e., dividing the problem inputs into parallel sub-tasks and tackling them in parallel). The sentence marked in red in the material is the evidence that contradicts the first claim in the summary (also marked in red).
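The disentangled pipeline in Figure 1 boils down to splitting the summary into claim-level sub-tasks and posing each verification question independently. A minimal sketch of that decomposition, with a naive sentence splitter and illustrative prompt wording (neither is the paper's exact implementation):

```python
def decompose_summary(summary: str) -> list[str]:
    # Naive splitter: one claim per period-terminated sentence.
    return [s.strip() + "." for s in summary.split(".") if s.strip()]

def make_subtask_prompts(material: str, summary: str) -> list[str]:
    # One independent verification question per claim, as in Figure 1;
    # each prompt can be sent to the LLM in parallel.
    return [
        f"#Material#: {material}\n"
        f"Q: Does the claim contradict the material? #Claim#: {claim}"
        for claim in decompose_summary(summary)
    ]
```

Each returned prompt contains only the material and a single claim, so no sub-task sees another sub-task's reasoning, which is exactly the disentanglement the figure illustrates.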

As a result, when the task requires a large quantity of repetitive sub-tasks (e.g. large integer multiplication and article-level verification), the LLM is prone to intermediate errors, such as missing some inputs or generating wrong sub-tasks, leading to problematic answers (Amaro et al., 2023). This phenomenon is especially serious when the task input contains deceptive information or contents that could trigger hallucination (Chen & Shu, 2023; Li et al., 2023a). An illustrative example is presented in Fig. 1. To alleviate this issue, some program-guided prompting strategies such as Least-to-Most (LtM) prompting (Zhou et al., 2022) and Decomposed Prompting (Khot et al., 2022) propose to disentangle sub-task generation and resolution. However, they implement the task decomposer through multi-round conversation (or question-answering) with an LLM that raises and solves the sub-problems in an alternating, sequential manner. When tackling deceptive text such as hallucination and fake news, this intertwined process often guides the LLM to tackle the corpus sequentially and thus follow the context's flow, making the LLM prone to deception.

In fact, human brains also suffer from similar hallucination issues (Liu & Sajda, 2023), especially when the tasks are too hard or too complex. For example, when reviewing a long academic paper, some reviewers produce low-quality reviews (Garcia et al., 2021; Tennant & Ross-Hellauer, 2020; Cortes & Lawrence, 2021) containing hallucination-like intermediate errors, such as pointing out 'missing baselines' that have already been sufficiently discussed by the authors, or requiring ablation studies for non-existent sub-modules. To avoid such mistakes, experienced reviewers usually think slowly (Kahneman, 2011) and follow a Divide-and-Conquer paradigm: they decompose the paper review into examinations of multiple central opinions, and then retrieve supporting corpus to verify each of them respectively.

Inspired by this human experience, in this paper we explore guiding the LLM with a Divide-and-Conquer program to unlock its ability to handle tasks with repetitive sub-tasks. This strategy breaks the whole task resolution process down into three distinct sub-processes: task decomposition, sub-task resolution, and solution merge. The task decomposition process prompts the model to separate the whole task into multiple parallelly solvable sub-tasks (i.e., solving one sub-task does not require the result of another) and to list them explicitly, recursively if necessary. After that, the sub-task resolution process prompts the LLM to output the answer for every sub-task. Finally, the solution merge process assembles the solutions recursively along the decomposition path. These sub-processes follow a key principle: every upstream sub-process (e.g. sub-task resolution) forwards only its final answer to its downstream sub-processes (e.g. solution merge), not its input or intermediate steps. We denote this principle as the disentangled-sub-process principle. An illustrative figure explaining the difference between our method and previous works is provided in Fig. 2.

To validate the expressive power of our proposed prompting strategy, we provide a theoretic analysis showing that the proposed strategy can expand the expressive power of fixed-depth log-precision Transformers. To further validate the advantage of our proposed method empirically, we evaluate it against representative baselines on three tasks that are challenging for existing prompting strategies even on state-of-the-art LLMs: Large Integer Multiplication, Hallucination Detection, and Article-level Fact Verification (Cheng & Zhang, 2023; Li et al., 2023a; Wadden et al., 2020; Hu et al., 2023; Wu et al., 2023). These tasks either require very long reasoning paths (e.g. large integer multiplication) or contain deceptive contents (e.g. hallucination detection and fact verification), making existing methods like Chain-of-Thought

Figure 2. Comparison between our method and existing prompting methods: (A) CoT, (B) Least-to-Most, (C) CoT-SC, (D) Tree of Thoughts, (E) Divide-and-Conquer (ours). The ellipses represent sub-tasks, the right-angled rectangles represent sub-task solutions, and the rounded rectangles represent intermediate steps that entangle sub-tasks and sub-solutions. The different shades in Tree of Thoughts (subfigure D) indicate the ratings of different search directions. In CoT (Chain-of-Thoughts), CoT-SC, and ToT, the Large Language Model must simultaneously generate and resolve sub-tasks. Least-to-Most (and likewise Decomposed Prompting) disentangles sub-task generation and resolution; however, as shown in the figure, its sub-task resolution and resolution assembly processes are intertwined, as it sequentially attaches each new sub-task onto the previous resolution. Different from all of them, our method fully disentangles the sub-task generation, sub-task resolution, and resolution assembly processes.

prompting prone to missing intermediate steps. Meanwhile, these tasks share a common nature: the correctness of the final answer substantially relies on the thoroughness of all steps. Our experimental results show that the proposed method outperforms the baselines on all three tasks.

2. Related Work

2.1. Limitations of Transformer in Expressive Power

As discussed in previous works (Merrill & Sabharwal, 2023a; Feng et al., 2023), the expressive power of fixed-depth log-precision transformers, which are widely applied in modern pre-trained Large Language Models, is actually much more limited than people expect. Merrill & Sabharwal give a theoretic proof that the expressive power of fixed-depth log-precision transformers is upper-bounded by TC0. Feng et al. further extend this analysis to show that many common problems, such as evaluating arithmetic expressions, exceed the expressive power of fixed-depth log-precision transformers. These results explain why powerful Large Language Models may make some ridiculous mistakes and why CoT can improve their performance.

2.2. Prompting Strategies for Large Language Models

In this sub-section, we introduce the existing prompting strategies and discuss their limitations and drawbacks. Following the notation of (Yao et al., 2023), we denote the Large Language Model with parameters θ as pθ and use lower-case letters x, y, z to denote the input sequence, the result, and intermediate steps, respectively.

Input-Output (IO) Prompting is the standard prompting strategy that attaches input x to instructions and/or few-shot in-context-learning examples to acquire a prompt, denoted as prompt(x) (Yao et al., 2023). The LLM takes prompt(x) as input and predicts the result, i.e. y ∼ pθ(y | prompt(x)). The main drawback of IO prompting is its poor expressive power: (Merrill & Sabharwal, 2023a) prove that a log-precision-transformer-based LLM equipped with a prompt of polynomial length remains within the TC0 class.

Chain-of-Thought (CoT) Prompting (Wei et al., 2022) aims at simulating the human thinking process of handling complicated tasks (e.g. combinatorial reasoning and mathematical calculation) in a step-by-step manner. More specifically, the LLM is guided to output a series of intermediate steps z1, z2, ..., zn (also known as thoughts) autoregressively, i.e. zi ∼ pθ(zi | prompt(x), z1, ..., zi−1). Then the LLM outputs the prediction of the result y based on the thoughts, i.e. y ∼ pθ(y | prompt(x), z1, ..., zn).

Exploration-of-Thought (EoT) Prompting includes a series of CoT variants, such as Self-Consistency with CoT (CoT-SC) prompting (Wang et al., 2022) and Tree-of-Thoughts (ToT) prompting (Yao et al., 2023), which aim at addressing CoT's limitation in exploration. Their common central idea is to generate multiple chains of thought through sampling or proposal prompting and then ensemble them to acquire a final prediction.
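In this notation, each strategy is just a different way of wrapping the input x before a single sampling call. A schematic sketch, assuming a hypothetical `llm(prompt)` completion function and illustrative prompt templates (not any paper's actual wording):

```python
def llm(prompt: str) -> str:
    # Stand-in for sampling from p_theta; a real implementation
    # would call a model API here.
    return "stub completion"

def io_prompting(x: str) -> str:
    # IO prompting: y ~ p_theta(y | prompt(x)), answer only.
    return llm(f"Question: {x}\nAnswer:")

def cot_prompting(x: str) -> str:
    # CoT prompting: the model emits thoughts z_1, ..., z_n before y;
    # autoregressive decoding interleaves them in one completion.
    return llm(f"Question: {x}\nLet's think step by step, then answer:")
```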

Program-guided Prompting aims at controlling the LLM's generation process with symbolic programs or pre-defined procedures (Zhu et al., 2022; Jung et al., 2022; Zhou et al., 2022; Khot et al., 2022; Creswell & Shanahan, 2022; Gao et al., 2023). Among them, Least-to-Most (LtM) Prompting (Zhou et al., 2022) and Decomposed Prompting (Khot et al., 2022) are closest to this work. They are the earliest attempts to explicitly prompt the LLM to decompose a task into a series of sub-tasks and tackle them sequentially. LtM prompts an LLM to iteratively raise sub-tasks and sequentially solve them to acquire the final resolution. Decomposed Prompting can be regarded as an upgraded version of LtM: it introduces special notations into the prompt to represent program states and thus can call itself (i.e., recursion) or other modules (i.e., hierarchical decomposition), endowing it with stronger expressive power. Such design increases the compositional generalization ability of LLMs in different areas, such as symbolic manipulation and multi-hop QA (Khot et al., 2022).

The aforementioned CoT and EoT families endow the LLM with stronger expressive power than IO prompting. However, a critical issue is that they can miss or ignore important intermediate steps or contents (Liu et al., 2023). This problem is even worse when handling tasks with long inputs (e.g. long documents and large numbers); typical examples include large-number arithmetic calculation and fact verification over long documents. Compared to them, Least-to-Most prompting and Decomposed Prompting introduce explicit task decomposition to enumerate the sub-tasks. However, their task decomposers are based on multi-round conversation or question-answering, which often guides the LLM to tackle the sub-tasks sequentially. When tackling deceptive contents, e.g. hallucination and fake news detection, such design navigates the LLM through the deceptive content's flow and thus makes the LLM prone to deception.

3. Proposed Method

To keep the task decomposition and task resolution from interweaving and interrupting each other, we propose to guide the LLM with a Divide-and-Conquer (DaC) program that consists of three distinct stages: a task decomposition stage, a sub-task resolution stage, and a solution merge stage. In the task decomposition stage, the LLM is prompted to explicitly decompose the task into a series of parallel, homogeneous sub-tasks with smaller problem sizes (e.g. dividing a long paragraph into sentences). This design avoids the multi-round conversation or question-answering of LtM and Decomposed Prompting, making the model less prone to deception. After that, in the sub-task resolution stage, the LLM is prompted to provide the solution for every sub-task. Finally, in the solution merge stage, the LLM is prompted to assemble the solutions of the sub-tasks and acquire the final answer. In this process, all three stages are isolated to avoid interruption, and they are all guided by a program rather than an LLM, to avoid hallucination or deception from the input context. To tackle tasks of different sizes, we propose two variants: the Single-Level DaC Solver and the Multi-Level DaC Solver.

Algorithm 1 Single-Level Divide-and-Conquer Solver T(S, m, t, d, L)
Require: Input sequence S, prompt m (for solution merge), prompt t (for sub-task tackling), prompt d (for task decomposition), LLM L
Ensure: Result of the task on input sequence S
1: {S1, S2, ..., Sk} ← L(d, S)
2: Result ← ∅
3: for i = 1, 2, ..., k do
4:   Result ← Result + [SEP] + L(t, Si)
5: end for
6: Return L(m, Result)

The Single-Level Divide-and-Conquer Solver decomposes the task in one call to the LLM, which expands the original task as a one-level tree. The algorithm is presented in Alg. 1. The advantage of this variant is its simplicity and efficiency. However, when the original input is too long, the single-level solver may produce sub-tasks whose problem sizes are still large enough to trigger intermediate errors. In such a case, following (Khot et al., 2022), we can recursively expand the task as a multi-level tree. More specifically, we repeat the aforementioned steps to further divide the sub-tasks hierarchically until they are easy enough to be handled by the LLM. This can be done with a recursive program, as presented in Alg. 2.

Algorithm 2 Multi-Level Divide-and-Conquer Solver Recursion T(S, m, t, d, f, w, L)
Require: Input sequence S, problem size metric function f(·) (a function that measures the problem size), hyper-parameter w, prompt m (for solution merge), prompt t (for sub-task tackling), prompt d (for task decomposition), LLM L
Ensure: Result of the task on input sequence S
1: {S1, S2, ..., Sk} ← L(d, S)
2: Result ← ∅
3: for i = 1, 2, ..., k do
4:   if f(Si) > w then
5:     Result ← Result + [SEP] + T(Si, m, t, d, f, w, L)
6:   else
7:     Result ← Result + [SEP] + L(t, Si)
8:   end if
9: end for
10: Return L(m, Result)
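Algorithms 1 and 2 translate almost line-for-line into code. A sketch, where `L` is a hypothetical callable standing in for the LLM invoked with a prompt and an input:

```python
SEP = "[SEP]"

def single_level_dac(S, m, t, d, L):
    # Alg. 1: one decomposition call, then solve every sub-task and merge.
    subtasks = L(d, S)                     # task decomposition
    results = [L(t, s) for s in subtasks]  # sub-task resolution
    return L(m, SEP.join(results))         # solution merge

def multi_level_dac(S, m, t, d, f, w, L):
    # Alg. 2: recurse on any sub-task whose size f(s) still exceeds w.
    results = []
    for s in L(d, S):
        if f(s) > w:
            results.append(multi_level_dac(s, m, t, d, f, w, L))
        else:
            results.append(L(t, s))
    return L(m, SEP.join(results))
```

Note that each `L(...)` call receives only its own input: sub-task answers reach the merge step, but sub-task inputs and intermediate text do not, which is the disentangled-sub-process principle.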

4. Theoretic Analysis on the Proposed Method

In this section, we provide a theoretic analysis indicating that the proposed Divide-and-Conquer program-guided problem solving strategy helps expand the expressive capacity of Transformers. Specifically, we exploit the 2-color Binary Subtree Isomorphism (2-BSI) problem, which is NC1-hard, for the proof. Under a widely-accepted assumption in parallel computing, we theoretically prove two facts: (1) log-precision Transformers with the standard IO-prompting strategy cannot solve the 2-BSI problem, and (2) there exists a log-precision Transformer that, guided by a Divide-and-Conquer program, can solve the 2-BSI problem. In this way, we prove that the Divide-and-Conquer strategy expands the expressive power of Large Language Models.

4.1. Fixed-Depth Transformer Cannot Solve 2-color Binary Subtree Isomorphism (2-BSI)

In this section, we present a problem that exceeds the expressive power of fixed-depth log-precision transformers: 2-color Binary Subtree Isomorphism (also known as tree matching), an important algorithmic problem to which many tasks, such as semantic similarity analysis, text matching, and structured text database querying, can be reduced (Kilpeläinen & Mannila, 1992; Marsi & Krahmer, 2010; He et al., 2018). More specifically, we first give a theoretical proof that 2-color Binary Subtree Isomorphism is not solvable by fixed-depth log-precision transformers.

Definition 1. The 2-color Binary Subtree Isomorphism problem: given a pattern 2-color¹ binary tree tp and a base 2-color binary tree tb, a solver is required to judge whether the pattern tree is isomorphic to a sub-tree of tb.

¹2-color means that each node in the tree can be one of two colors; we can understand it as each node carrying a 0-1 label.

In (Jenner et al., 2003), the authors pointed out that the encoding of the problem influences its hardness. In this paper, we focus on the pointer list encoding of 2-BSI; detailed information about this encoding can be found in the Appendix. For the pointer list encoding of 2-BSI, we have the following theorem:

Theorem 4.1. Assume that TC0 ≠ NC1. For any depth L and any polynomial Q, there exists a size n of the pattern tree such that no log-precision Transformer with depth L, hidden dimension d < Q(n), and fixed prompt p can directly output the solution (Yes or No) of the 2-color Binary Subtree Isomorphism problem (2-BSI).

Proof. The core point of the proof is that, based on (Merrill & Sabharwal, 2023a), the expressive power of log-precision Transformers lies in TC0. As proved by (Jenner et al., 2003), any L problem (NC1 ⊆ L) can be converted to a 2-color binary tree isomorphism problem (2-BTI, comparing whether two trees are exactly isomorphic or not) under AC0 reduction. Therefore, 2-color binary tree isomorphism is NC1-hard. Because any solver of 2-BSI can solve 2-color binary tree isomorphism by simply adding a layer that checks whether the two trees have the same size, 2-BSI is at least as hard as 2-BTI. Since TC0 ⊆ NC1 and we assume that TC0 ≠ NC1, for any fixed-depth polynomial-parameter log-precision transformer there exists a size of the 2-BSI problem that the transformer cannot handle.

4.2. Fixed-Depth Transformer Guided by a Divide-and-Conquer Program Solves 2-BSI

Theorem 4.2. There exists a log-precision transformer with fixed depth L and hidden dimension d that can solve 2-BSI of any size with fixed-length prompts m (for solution merge), t (for sub-task tackling), and d (for task decomposition).

Proof Sketch: The detailed proof of this theorem is provided in Appendix A.1; here we give its brief flow. We first show that there exist a problem size function f(·), a hyper-parameter w, a merge function m(·), a sub-task tackling function t(·), and a decomposition function d(·) that together solve the problem with the divide-and-conquer strategy. We then prove that there exists one log-precision transformer with fixed depth L and hidden dimension d that can express m(·), t(·), and d(·) respectively with different but fixed-length prompts. In this way, we prove the theorem.

5. Experiments

We evaluate the capacity of our prompting strategy on three kinds of tasks that are troubled by missing intermediate steps due to long inputs, even for state-of-the-art Language Models (e.g. GPT-3.5 and GPT-4): Multiplication and Addition of Large Numbers, Hallucination Detection in Long Context, and Fact-Checking for Misinformation Detection (Yang et al., 2023; Li et al., 2023a; Wadden et al., 2020). With these experiments, we show how Divide-and-Conquer (DaC) prompting handles these tasks better and suggest that DaC could be a promising new paradigm for handling long documents or large inputs.

5.1. Multiplication of Long Integers

General-purpose Large Language Models, e.g. ChatGPT-3.5 and ChatGPT-4, have been bothered by poor performance on arithmetic with big numbers, even though their expressive capacity has been proved sufficient for accurately calculating large number multiplications (Yang et al., 2023). In this experiment, we show that on zero-shot long integer multiplication, DaC outperforms other baseline prompting strategies.
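Concretely, the decomposition used for multiplication (splitting each operand and merging the four partial products with shifts and additions, as described in the task setup that follows) is the classic split recurrence. A plain-Python sketch of the arithmetic itself, not of the LLM prompts; for simplicity it splits on the longer operand's digit count rather than exactly as the paper's prompts do:

```python
def dac_multiply(a: int, b: int) -> int:
    # Base case mirrors w = 2: multiply directly once both operands
    # have at most two digits.
    if a < 100 and b < 100:
        return a * b
    half = max(len(str(a)), len(str(b))) // 2
    a1, a2 = divmod(a, 10 ** half)  # a = a1 * 10**half + a2
    b1, b2 = divmod(b, 10 ** half)  # b = b1 * 10**half + b2
    # Merge the four partial products with shifts (powers of ten) and adds.
    return (dac_multiply(a1, b1) * 10 ** (2 * half)
            + (dac_multiply(a1, b2) + dac_multiply(a2, b1)) * 10 ** half
            + dac_multiply(a2, b2))
```

Each recursive call sees only its own pair of operands, and the merge step sees only the sub-products, mirroring the disentangled-sub-process principle.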

Figure 3. Accuracy of our proposed method and different baseline prompting strategies on GPT-3.5 and GPT-4.

Task Setup: For this task, we randomly generated 200 pairs of 5-digit integers. We choose 5 for the digit length because, according to previous works, ChatGPT-3.5 gets 0% accuracy on 4-digit multiplication (Cheng & Zhang, 2023) and ChatGPT-4 gets close to 0% accuracy on 5-digit multiplication (Yang et al., 2023). We evaluate the performance with two metrics: Accuracy and Edit Distance (Marzal & Vidal, 1993; Schaeffer et al., 2023). For accuracy, we count a result as correct only if every digit of it matches the ground-truth answer. For edit distance, we calculate the minimum number of operations required to transform the output into the ground truth.

Figure 4. Edit distances of DaC and other strategies on GPT-3.5 and GPT-4. Lower edit distance indicates better performance.

Setup of baselines and our method: In this task, our baselines include IO prompting, Chain of Thought (CoT), CoT-SC, Least-to-Most (LtM), and Decomposed Prompting (DeP). We also include our method's variant, DaC without the Disentangled-sub-process Principle (DaC w/o DP), for an ablation study. Tree-of-Thoughts is not applicable here because multiplication is a deterministic calculation that does not require search in a tree. For our method, we apply the Multi-Level Divide-and-Conquer program-guided solver. We set f(a, b) as the minimum length of the two integers a, b, and set w = 2 so that the recursion returns once both integers are no longer than 2 digits. The decomposition prompt asks the model to separate the two n-digit numbers into four n/2-digit numbers (e.g., separate a, b into a1, a2, b1, b2). The sub-task tackling prompt asks the model to calculate the multiplication of the four pairs of n/2-digit numbers, and the solution merge prompt merges them with bit shifts and additions.

Results: Experimental results are shown in Fig. 3 and 4. As we can see, under all settings, our proposed prompting strategy outperforms all the baselines, indicating the advantage of our proposed method. Specifically, the thought-based methods, i.e. IO-Prompting, Chain-of-Thoughts, and CoT-SC, all get accuracy close to 0 for both GPT-4 and GPT-3.5. Least-to-Most prompting acquires better accuracy on GPT-3.5 and GPT-4 due to its advantage in disentangling task decomposition and task resolution. However, compared to our method, which achieves accuracy better than 40% on both GPT-3.5 and GPT-4, these baselines still perform significantly worse. For the edit distance metric, the results are similar: for both GPT-3.5 and GPT-4, our method's distances are lower than half of the baselines'. Although Least-to-Most and DeP get edit distances on GPT-4 that are significantly better than the other baselines, they are still not comparable to our method. We attribute this advantage to our method's disentanglement of the task decomposition, task resolution, and resolution assembly processes. Thought-based methods like CoT and CoT-SC intertwine task decomposition and task resolution; as a result, the sub-task generation process is prone to interruption and produces wrong sub-tasks. Least-to-Most disentangles the task decomposition and task resolution, and thus acquires much better results; however, its task resolution and resolution assembly processes are still intertwined, making the model prone to mistakes when assembling the sub-task results. Also, DaC outperforms DaC w/o DP, indicating the importance of the disentangled-sub-process principle.

5.2. Hallucination Detection in Long Context

Although Large Language Models have achieved impressive performance on various NLP tasks, they are bothered by the hallucination problem (Manakul et al., 2023), especially when the generated content or the input context is too long for the user to review thoroughly (Zhang et al., 2023). In this paper, we focus on evaluating how well different strategies guide the LLM to recognize inconsistency between a given context and a model response containing hallucination.

Task Setup: We use the HaluEval-Summary dataset, one of the three datasets in the HaluEval benchmark for hallucination detection, which contains hallucinations generated by ChatGPT-3.5. HaluEval-Summary has the longest contexts and generated contents among the three tasks in this benchmark (Li et al., 2023a). Thus, detecting hallucination on this dataset requires repeatedly verifying each sentence in the response, making standard prompting strategies acquire their worst accuracy across all three tasks. We report the Accuracy, F1 score², Precision and Recall.

²We consider the hallucination pairs as positive samples.
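With hallucinated pairs as the positive class (see the footnote), the reported scores follow the standard confusion-matrix definitions. A quick sketch:

```python
def classification_scores(tp: int, fp: int, fn: int, tn: int):
    # Precision/recall/F1 treat hallucinated pairs as positives.
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return f1, accuracy, precision, recall
```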

GPT-3.5-Turbo GPT-4
Strategies
F1 Acc Prec Recall F1 Acc Prec Recall
IO-prompting 61.69 61.27 62.11 61.28 64.07 72.66 93.41 48.76
Chain-of-Thoughts 46.85 64.26 91.36 31.50 71.05 76.10 90.08 58.66
CoT-SC 47.70 64.25 88.83 32.60 71.39 76.36 90.41 58.98
Tree-of-Thoughts 70.40 59.91 55.83 95.34 69.41 71.73 75.53 64.28
Least-to-Most 56.43 64.91 74.42 45.44 72.51 77.11 90.74 60.38
Divide-and-Conquer w/o DP 67.66 67.51 67.32 68.03 73.95 76.63 83.54 66.33
Divide-and-Conquer 74.84 75.55 77.41 72.03 76.92 78.99 85.36 70.01

Table 1. Performance of different prompting methods on HaluEval dataset. We report the F1 score, Accuracy, Precision and Recall.
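For reference, the single-level Divide-and-Conquer procedure used for this task can be sketched as below. The `llm` callable and the prompt wording are placeholders of our own; the actual prompts are given in the Appendix.

```python
from typing import Callable, List

def verify_summary(context: str, response: str,
                   llm: Callable[[str], str]) -> bool:
    """Single-level Divide-and-Conquer hallucination check (sketch).

    Decompose the response into sentences, verify each sentence
    against the context independently, then merge the conclusions:
    the response is faithful only if every sentence is supported.
    """
    # Decomposition: one sub-task per sentence of the response.
    sentences: List[str] = [s.strip() for s in response.split(".") if s.strip()]
    # Sub-task tackling: each sentence is checked in isolation.
    verdicts = []
    for sentence in sentences:
        prompt = (f"Context: {context}\nClaim: {sentence}\n"
                  "Is the claim supported by the context? Answer yes or no.")
        verdicts.append(llm(prompt).strip().lower().startswith("yes"))
    # Merging: the response passes iff all sentences are supported.
    return all(verdicts)
```

A response is then flagged as hallucinated when `verify_summary` returns False; checking sentences in isolation is what prevents the model from being carried along by the flow of a deceptive summary.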

Setup of baselines, ablation variants and our method: In this task, our baselines include IO prompting, Chain of Thought, CoT-SC, Tree-of-Thoughts, Least-to-Most, and Decomposed Prompting. We also include our method's variant, DaC without the Disentangled-sub-process Principle (DaC w/o DP), for the ablation study. In this task, the sub-tasks are verifying fragments of the summary, which are homogeneous and do not require recursion. In such a setting, Decomposed Prompting is equivalent to LtM. For this task, we apply a single-level Divide-and-Conquer solver to decompose the summary into multiple sentences, handle them separately and then merge the conclusions for all sentences. The detailed prompts are provided in the Appendix.

Results: Experimental results are shown in Tab. 1. As we can see, for all baselines the performance on GPT-3.5 is substantially worse than on GPT-4, as the hallucinations are generated by GPT-3.5 and are thus more deceptive to GPT-3.5. For our method, however, when we replace GPT-4 with GPT-3.5 the performance still decreases, but the drop is significantly smaller than for all other baselines, indicating the robustness of our proposed method toward deceptive contents. Also, for both GPT-3.5 and GPT-4, our proposed prompting strategy outperforms the baselines. More specifically, compared to IO-prompting, our model achieves better performance in general, indicating the advantage brought by stronger expressive power. Compared to CoT and CoT-SC, our method clearly achieves much better recall. Tree-of-Thoughts, benefiting from its searching ability, acquires a significantly better recall score than the other baselines; however, its much lower precision substantially harms its overall performance and leads to an accuracy even worse than standard IO-prompting. Least-to-Most, which explicitly decomposes the task and enumerates all sub-tasks, achieves better recall than CoT and CoT-SC; however, it follows the flow of the hallucination, leading to worse recall than Tree-of-Thoughts and our method. In contrast, our method carefully checks all sentences, locates the ones containing factual errors and merges the answers, thus balancing the overall performance (F1 and Accuracy). In addition, although DaC w/o DP outperforms the other baselines in general, it still performs worse than DaC. This phenomenon indicates the importance of the disentangled-sub-process principle. Meanwhile, DaC w/o DP can be regarded as an upgraded version of LtM/DeP that replaces sequential sub-tasks with parallel sub-tasks. Thus, the experimental results suggest that the parallel sub-task strategy makes the LLM more discerning for hallucination detection.

5.3. Fact-Verification for Misinformation Detection

The increasing abuse of misinformation to manipulate public opinion on social media has been observed in different areas, such as healthcare (e.g., the recent COVID-19 pandemic) (Sharma et al., 2020; 2022). This threat is increasingly serious due to LLMs' capacity for content generation (Li et al., 2023b; Weidinger et al., 2021; Zhang et al., 2022). This challenge raises the importance of fact-verification, which aims at judging the authenticity of an article based on a collection of evidence from verified sources (Whitehouse et al., 2022; Zhang & Gao, 2023). In this experiment, we show that DaC can outperform the other baselines in fact-verification on news articles.

Task Setup: In this experiment, we mainly adopt the SciFact dataset (Wadden et al., 2020), in which each sample is a pair of news and evidence: the evidence is the abstract of a peer-reviewed paper, and the news is a claim summarized by human annotators from fake or true news. To better simulate the real-world scenario where news usually appears as an article, following Chen & Shu, we generate a dataset of article-level misinformation based on SciFact. Specifically, for a given claim, we apply a Large Language Model (i.e., ChatGPT-4) to extend the claim into an article based on the evidence. For this task, similar to hallucination detection, we apply a single-level Divide-and-Conquer solver to decompose the news article into multiple sentences, handle them separately and then merge the conclusions for all sentences. Also, the baselines in this experiment are the same as in hallucination detection and, for the same reason, Decomposed Prompting is equivalent to

GPT-3.5-Turbo GPT-4
Strategies
F1 G-Mean Prec Recall F1 G-Mean Prec Recall
IO-Prompting 72.12 72.77 83.22 63.64 69.15 71.77 94.44 54.55
Chain-of-Thoughts 56.09 60.64 90.48 40.64 74.03 75.79 94.21 60.96
CoT-SC 56.83 61.44 91.67 41.18 70.09 73.45 100.0 53.95
Tree-of-Thoughts 69.91 73.30 53.74 100.0 77.34 78.00 88.89 68.45
Least-to-Most 54.08 54.15 51.46 56.99 73.56 74.25 85.21 64.71
Divide-and-Conquer w/o DP 61.32 63.66 83.81 48.35 74.15 75.68 92.68 61.79
Divide-and-Conquer 76.88 77.13 83.65 71.12 81.11 81.24 76.67 86.10

Table 2. Performance of different prompting methods on SciFact dataset. We report the F1 score, G-Mean score, Precision and Recall.
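As a quick reference, the metrics reported above can be computed from the positive-class confusion counts as follows (a minimal sketch; G-Mean denotes the geometric mean of precision and recall):

```python
from math import sqrt

def classification_metrics(tp: int, fp: int, fn: int):
    """Precision, Recall, F1 and G-Mean from positive-class counts.

    G-Mean (the geometric mean of precision and recall) replaces
    accuracy when the positive and negative classes are imbalanced.
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    g_mean = sqrt(precision * recall)
    return precision, recall, f1, g_mean
```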

LtM. The evaluation metrics include the F1 score, G-Mean score (the geometric mean of precision and recall), Precision score and Recall score. We do not report accuracy, as the positive and negative classes are not balanced.

Results: Experimental results are shown in Tab. 2. Notably, GPT-3.5 incorporated with our proposed prompting strategy even outperforms GPT-4 incorporated with IO-prompting, Least-to-Most, CoT or CoT-SC, all of which have significantly lower recall scores, indicating their proneness to deception. Only Tree-of-Thoughts, benefiting from its advantage in exploring various options, acquires the best results among the baselines, but it is still surpassed by our proposed method. Moreover, on GPT-4 the performance of CoT-SC is even worse than that of CoT, which is supposed to be a special case of CoT-SC without exploration. These results suggest that, when facing deceptive contents generated on purpose, the improvements of existing incremental works may not be robust.

6. Discussions and Limitations

In summary, the proposed method has the following advantages:

Comparison with IO-Prompting: Superiority in Expressive Power. As we proved in Sec. 4, compared to IO-prompting, our proposed method has stronger expressive power and can thus solve harder problems.

Comparison with CoT and EoT: Disentangling Task Decomposition and Task Resolution. Compared to the prompting family of CoT and EoT, our proposed method explicitly separates the task decomposition stage from the task resolution stage. Therefore, we acquire explicitly decomposed sub-tasks rather than intermediate thoughts proposed during decoding. Consequently, we can explicitly enumerate all sub-tasks output by the decomposition module and prevent the model from missing important sub-tasks.

Comparison with LtM and Decomposed Prompting: Parallel versus Sequential Sub-task Handling. Similar to our proposed method, program-guided prompting strategies like LtM and Decomposed Prompting also explicitly separate the task decomposition stage from the task resolution stage. However, they are mainly designed for multi-step reasoning on complex tasks, so they tackle the sub-tasks sequentially and assemble the resolutions. As a result, they tend to follow the flow of deceptive contents, which makes them prone to deception. More details about this comparison can be found in Appendix A.4.

Although our proposed method DaC surpasses the baselines on the proposed tasks, it still has some limitations. The first is that the application scope of DaC is still limited. More specifically, CoT, EoT, LtM and DaC are based on different algorithmic paradigms, leading to different application scopes. As pointed out by Feng et al., CoT and LtM can be considered as neural dynamic programming algorithms; thus, CoT is more suitable for tasks that can be bridged to dynamic programming, such as multi-step question answering. Differently, EoT is based on exploration and search, which suits planning and search tasks such as the Game of 24 (Yao et al., 2023). Our proposed method is based on the Divide-and-Conquer algorithm and is thus more suitable for tasks that can be decomposed into a series of sub-tasks that are disjoint or only slightly overlapping. Our future work will focus on further expanding the application scope of DaC to more areas, such as question answering.

7. Conclusions

In this paper, we proposed a novel Divide-and-Conquer program-guided problem solving strategy. To guide large language models in tackling tasks involving a large quantity of repetitive sub-tasks and/or deceptive contents, DaC disentangles the processes of task decomposition, sub-task resolution and resolution assembly. In this way, we prevent the intertwined processes from interrupting each other and alleviate the intermediate errors in generating the resolution path. The experimental results show that this improvement leads to better performance on a range of tasks including large integer multiplication, hallucination detection and misinformation detection. Moreover, our theoretical analysis reveals that the proposed method guides the LLM to expand its expressive power beyond that of the original Transformer.

References

Amaro, I., Della Greca, A., Francese, R., Tortora, G., and Tucci, C. AI unreliable answers: A case study on ChatGPT. In International Conference on Human-Computer Interaction, pp. 23–40. Springer, 2023.

Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901, 2020.

Bubeck, S., Chandrasekaran, V., Eldan, R., Gehrke, J., Horvitz, E., Kamar, E., Lee, P., Lee, Y. T., Li, Y., Lundberg, S., et al. Sparks of artificial general intelligence: Early experiments with GPT-4. arXiv preprint arXiv:2303.12712, 2023.

Chen, C. and Shu, K. Can LLM-generated misinformation be detected? arXiv preprint arXiv:2309.13788, 2023.

Chen, Y., Fu, Q., Yuan, Y., Wen, Z., Fan, G., Liu, D., Zhang, D., Li, Z., and Xiao, Y. Hallucination detection: Robustly discerning reliable answers in large language models. In Proceedings of the 32nd ACM International Conference on Information and Knowledge Management, pp. 245–255, 2023a.

Chen, Z., Mao, H., Li, H., Jin, W., Wen, H., Wei, X., Wang, S., Yin, D., Fan, W., Liu, H., et al. Exploring the potential of large language models (LLMs) in learning on graphs. arXiv preprint arXiv:2307.03393, 2023b.

Cheng, V. and Zhang, Y. Analyzing ChatGPT's mathematical deficiencies: Insights and contributions. In Wu, J.-L. and Su, M.-H. (eds.), Proceedings of the 35th Conference on Computational Linguistics and Speech Processing (ROCLING 2023), pp. 188–193, Taipei City, Taiwan, October 2023. The Association for Computational Linguistics and Chinese Language Processing (ACLCLP). URL https://aclanthology.org/2023.rocling-1.22.

Cormen, T. H., Leiserson, C. E., Rivest, R. L., and Stein, C. Introduction to Algorithms. 2022.

Cortes, C. and Lawrence, N. D. Inconsistency in conference peer review: revisiting the 2014 NeurIPS experiment. arXiv preprint arXiv:2109.09774, 2021.

Creswell, A. and Shanahan, M. Faithful reasoning using large language models. arXiv preprint arXiv:2208.14271, 2022.

Feng, G., Gu, Y., Zhang, B., Ye, H., He, D., and Wang, L. Towards revealing the mystery behind chain of thought: a theoretical perspective. arXiv preprint arXiv:2305.15408, 2023.

Gao, L., Madaan, A., Zhou, S., Alon, U., Liu, P., Yang, Y., Callan, J., and Neubig, G. PAL: Program-aided language models. In International Conference on Machine Learning, pp. 10764–10799. PMLR, 2023.

Garcia, J. A., Rodriguez-Sánchez, R., and Fdez-Valdivia, J. Quality censoring in peer review. Scientometrics, 126:825–830, 2021.

He, Y., Tao, S., Xu, J., Guo, J., Lan, Y., and Cheng, X. Text matching with Monte Carlo tree search. In Information Retrieval: 24th China Conference, CCIR 2018, Guilin, China, September 27–29, 2018, Proceedings 24, pp. 41–52. Springer, 2018.

Hendy, A., Abdelrehim, M., Sharaf, A., Raunak, V., Gabr, M., Matsushita, H., Kim, Y. J., Afify, M., and Awadalla, H. H. How good are GPT models at machine translation? A comprehensive evaluation. arXiv preprint arXiv:2302.09210, 2023.

Hu, B., Sheng, Q., Cao, J., Shi, Y., Li, Y., Wang, D., and Qi, P. Bad actor, good advisor: Exploring the role of large language models in fake news detection. arXiv preprint arXiv:2309.12247, 2023.

Jenner, B., Köbler, J., McKenzie, P., and Torán, J. Completeness results for graph isomorphism. Journal of Computer and System Sciences, 66(3):549–566, 2003.

Jung, J., Qin, L., Welleck, S., Brahman, F., Bhagavatula, C., Bras, R. L., and Choi, Y. Maieutic prompting: Logically consistent reasoning with recursive explanations. arXiv preprint arXiv:2205.11822, 2022.

Kahneman, D. Thinking, Fast and Slow. Macmillan, 2011.

Khot, T., Trivedi, H., Finlayson, M., Fu, Y., Richardson, K., Clark, P., and Sabharwal, A. Decomposed prompting: A modular approach for solving complex tasks. arXiv preprint arXiv:2210.02406, 2022.

Kilpeläinen, P. and Mannila, H. Grammatical tree matching. In Annual Symposium on Combinatorial Pattern Matching, pp. 162–174. Springer, 1992.

Li, J., Cheng, X., Zhao, W. X., Nie, J.-Y., and Wen, J.-R. HaluEval: A large-scale hallucination evaluation benchmark for large language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 6449–6464, 2023a.

Li, S., Yang, J., and Zhao, K. Are you in a masquerade? Exploring the behavior and impact of large language model driven social bots in online social networks. arXiv preprint arXiv:2307.10337, 2023b.
Liu, X. and Sajda, P. Roe: A computational-efficient anti-hallucination fine-tuning technology for large language model inspired by human learning process. In International Conference on Brain Informatics, pp. 456–463. Springer, 2023.

Liu, Y., Yao, Y., Ton, J.-F., Zhang, X., Cheng, R. G. H., Klochkov, Y., Taufiq, M. F., and Li, H. Trustworthy LLMs: a survey and guideline for evaluating large language models' alignment. arXiv preprint arXiv:2308.05374, 2023.

Manakul, P., Liusie, A., and Gales, M. J. SelfCheckGPT: Zero-resource black-box hallucination detection for generative large language models. arXiv preprint arXiv:2303.08896, 2023.

Mao, R., Chen, G., Zhang, X., Guerin, F., and Cambria, E. GPTEval: A survey on assessments of ChatGPT and GPT-4. arXiv preprint arXiv:2308.12488, 2023.

Marsi, E. and Krahmer, E. Automatic analysis of semantic similarity in comparable text through syntactic tree matching. In Proceedings of the 23rd International Conference on Computational Linguistics (COLING), pp. 752–760. Chinese Information Processing Society of China (CIPS), 2010.

Marzal, A. and Vidal, E. Computation of normalized edit distance and applications. IEEE Transactions on Pattern Analysis and Machine Intelligence, 15(9):926–932, 1993.

Merrill, W. and Sabharwal, A. The parallelism tradeoff: Limitations of log-precision transformers. Transactions of the Association for Computational Linguistics, 11:531–545, 2023a. doi: 10.1162/tacl_a_00562. URL https://aclanthology.org/2023.tacl-1.31.

Merrill, W. and Sabharwal, A. The expressive power of transformers with chain of thought. arXiv preprint arXiv:2310.07923, 2023b.

Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I., et al. Language models are unsupervised multitask learners. 2019.

Schaeffer, R., Miranda, B., and Koyejo, S. Are emergent abilities of large language models a mirage? arXiv preprint arXiv:2304.15004, 2023.

Sharma, K., Seo, S., Meng, C., Rambhatla, S., and Liu, Y. Coronavirus on social media: Analyzing misinformation in Twitter conversations. arXiv preprint arXiv:2003.12309, 2020.

Sharma, K., Zhang, Y., and Liu, Y. COVID-19 vaccine misinformation campaigns and social media narratives. In Proceedings of the International AAAI Conference on Web and Social Media, volume 16, pp. 920–931, 2022.

Tan, Y., Min, D., Li, Y., Li, W., Hu, N., Chen, Y., and Qi, G. Evaluation of ChatGPT as a question answering system for answering complex questions. arXiv preprint arXiv:2303.07992, 2023.

Tennant, J. P. and Ross-Hellauer, T. The limitations to our understanding of peer review. Research Integrity and Peer Review, 5(1):6, 2020.

Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.

Wadden, D., Lin, S., Lo, K., Wang, L. L., van Zuylen, M., Cohan, A., and Hajishirzi, H. Fact or fiction: Verifying scientific claims. In Webber, B., Cohn, T., He, Y., and Liu, Y. (eds.), Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 7534–7550, Online, November 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.emnlp-main.609. URL https://aclanthology.org/2020.emnlp-main.609.

Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., and Zhou, D. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171, 2022.

Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., Le, Q. V., Zhou, D., et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837, 2022.

Weidinger, L., Mellor, J., Rauh, M., Griffin, C., Uesato, J., Huang, P.-S., Cheng, M., Glaese, M., Balle, B., Kasirzadeh, A., et al. Ethical and social risks of harm from language models. arXiv preprint arXiv:2112.04359, 2021.

Whitehouse, C., Weyde, T., Madhyastha, P., and Komninos, N. Evaluation of fake news detection with knowledge-enhanced language models. In Proceedings of the International AAAI Conference on Web and Social Media, volume 16, pp. 1425–1429, 2022.

Wu, Y., Zhu, J., Xu, S., Shum, K., Niu, C., Zhong, R., Song, J., and Zhang, T. RAGTruth: A hallucination corpus for developing trustworthy retrieval-augmented language models. arXiv preprint arXiv:2401.00396, 2023.

Yang, Z., Ding, M., Lv, Q., Jiang, Z., He, Z., Guo, Y., Bai, J., and Tang, J. GPT can solve mathematical problems without a calculator. arXiv preprint arXiv:2309.03241, 2023.
Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T. L., Cao, Y., and Narasimhan, K. Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601, 2023.

Zhang, X. and Gao, W. Towards LLM-based fact verification on news claims with a hierarchical step-by-step prompting method. arXiv preprint arXiv:2310.00305, 2023.

Zhang, Y., Cao, D., and Liu, Y. Counterfactual neural temporal point process for estimating causal influence of misinformation on social media. Advances in Neural Information Processing Systems, 35:10643–10655, 2022.

Zhang, Y., Li, Y., Cui, L., Cai, D., Liu, L., Fu, T., Huang, X., Zhao, E., Zhang, Y., Chen, Y., et al. Siren's song in the AI ocean: A survey on hallucination in large language models. arXiv preprint arXiv:2309.01219, 2023.

Zhou, D., Schärli, N., Hou, L., Wei, J., Scales, N., Wang, X., Schuurmans, D., Cui, C., Bousquet, O., Le, Q., et al. Least-to-most prompting enables complex reasoning in large language models. arXiv preprint arXiv:2205.10625, 2022.

Zhu, X., Wang, J., Zhang, L., Zhang, Y., Gan, R., Zhang, J., and Yang, Y. Solving math word problem via cooperative reasoning induced language models. arXiv preprint arXiv:2210.16257, 2022.

Zong, M. and Krishnamachari, B. Solving math word problems concerning systems of equations with GPT-3. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pp. 15972–15979, 2023.

A. Appendix
A.1. Proof of Theorem 4.2
Before providing the proof, we first formally define how to organize the inputs (i.e., two 2-color trees) as a sequence. We assume that we are given two trees: tp of size n and tb of size m. They are organized as two sequences of nodes in a random order. Each node has three variables: its color, its left child index, and its right child index; if a child is null, the corresponding index is filled with 0. We can then organize the trees as two sequences Xp ∈ R^{n×3} and Xb ∈ R^{m×3}, where each item in a sequence is a 3-dimensional vector: the first dimension is the index of the left child, the second dimension is the index of the right child, and the third dimension is the color indicator (0 or 1). In addition, we have a root vector r with three dimensions. The first dimension of r is the index of the root node of tp (i.e., it points to the root of tp), and the second is the index of the root node of tb. The third dimension of r is filled with 0 so that it has the same dimension as the items in Xp and Xb. This representation of trees is also called pointer-list encoding (Jenner et al., 2003).
Note that in the following proof, we assume that all indices start from 1. Thus 0 is regarded as a NULL pointer.
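To make the encoding concrete, here is a small hand-built instance in Python-style notation (the trees and variable names are our own illustration):

```python
# Pointer-list encoding of two 2-color trees (indices are 1-based; 0 = NULL).
# Each row of X is one node: (left_child_index, right_child_index, color).

# Pattern tree t_p: a color-1 root whose two children are color-0 leaves.
X_p = [
    (2, 3, 1),  # node 1: root, left child = node 2, right child = node 3
    (0, 0, 0),  # node 2: leaf, no children
    (0, 0, 0),  # node 3: leaf, no children
]

# Base tree t_b: a color-1 root with a single color-0 left child.
X_b = [
    (2, 0, 1),  # node 1: root, only a left child
    (0, 0, 0),  # node 2: leaf
]

# Root vector r: (root index in X_p, root index in X_b, 0-padding).
r = (1, 1, 0)

# Dereferencing the pattern root and its left child via the pointers:
root_p = X_p[r[0] - 1]             # the row (2, 3, 1)
left_of_root = X_p[root_p[0] - 1]  # the row (0, 0, 0)
```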
Following the proof flow we provided in Sec. 4.2, we first provide the following divide-and-conquer algorithm that can
solve the above problem:

Algorithm 3 Recursive Divide-and-Conquer Algorithm for 2-BSI: BSI(r, Xp , Xb , m, t, d, f, w)


Require: Inputs r, Xp , Xb , problem size metric function f (·), hyper-parameter w, merge function m, sub-task tackling
function t, task decomposition function d
Ensure: A 0-1 indicator vector v: if there exists a subtree with node i as root that is isomorphic with pattern tree tp defined
with inputs r, Xp , Xb , then the v[i] is 1. Otherwise, v[i] is 0.
1: rl , rr ← d(r, Xp , Xb )
2: for i ∈ {l, r} do
3: if f (ri , Xp , Xb ) > w then
4: vi ← BSI(ri , Xp , Xb , m, t, d, f, w)
5: else
6: vi ← t(ri , Xp , Xb )
7: end if
8: end for
9: Return m(r, Xp , Xb , vl , vr )

Algorithm 4 Implementation of d(r, Xp , Xb )

Require: Inputs r ∈ R^3 , Xp ∈ R^{n×3} , Xb ∈ R^{m×3}
Ensure: Two root vectors rl , rr pointing to the left and right children of the pattern root indicated by r
1: rl ← ⟨Xp [r[1], 1], r[2], r[3]⟩
2: rr ← ⟨Xp [r[1], 2], r[2], r[3]⟩
3: Return rl , rr

The algorithm described above is a typical divide-and-conquer algorithm for solving rooted tree isomorphism. Its justification
can be found in many textbooks introducing algorithms, such as Introduction to Algorithms (Cormen et al., 2022). Here we
provide the detailed definitions and implementations of the problem size metric f (·), hyper-parameter w, merge function m(·),
sub-task tackling function t(·), task decomposition function d(·):

• w = 1, and f (r, Xp , Xb ) is defined as the depth of the pattern tree tp indicated by the root vector r. Although precisely calculating f (r, Xp , Xb ) takes O(n) time, judging whether f (r, Xp , Xb ) > 1 only requires checking whether the root node has a child; if not, the condition is false.
• d(r, Xp , Xb ) = rl , rr returns two new root vectors rl , rr . Both rl and rr have the same second and third dimensions as r. The first dimension of rl is updated to the index of the left child of the root node that r points to, and the first dimension of rr is updated to the index of the right child. The updating function is given in Alg. 4.

Algorithm 5 Implementation of t(r, Xp , Xb ) when the depth of the tree indicated by r is not longer than 2

Require: Inputs r ∈ R^3 , Xp ∈ R^{n×3} , Xb ∈ R^{m×3}
Ensure: A 0-1 indicator vector v: if there exists a subtree with node i as root that is isomorphic with pattern tree tp defined
with inputs r, Xp , Xb , then the v[i] is 1. Otherwise, v[i] is 0.
1: Initialize v as an all-1 vector with a length of m
2: if r[1] == 0 then
3: Return v
4: end if
5: for i ∈ {1, 2, ..., m} do
6: if Xb [i, 3] ≠ Xp [r[1], 3] then
7: v[i] ← 0
8: end if
9: end for
10: Return v

Algorithm 6 Implementation of m(r, Xp , Xb , vl , vr )



Require: Inputs r ∈ R^3 , Xp ∈ R^{n×3} , Xb ∈ R^{m×3} , vl ∈ R^m , vr ∈ R^m
Ensure: A 0-1 indicator vector v: if there exists a subtree with node i as root that is isomorphic with pattern tree tp defined
with inputs r, Xp , Xb , then the v[i] is 1. Otherwise, v[i] is 0.
1: Initialize v as an all-0 vector with a length of m
2: if r[1] == 0 then
3: Return v
4: end if
5: for i ∈ {1, 2, ..., m} do
6: if Xb [i, 3] == Xp [r[1], 3] then
7: if vl [Xb [i, 1]] == 1 and vr [Xb [i, 2]] == 1 then
8: v[i] ← 1
9: else if vl [Xb [i, 2]] == 1 and vr [Xb [i, 1]] == 1 then
10: v[i] ← 1
11: end if
12: end if
13: end for
14: Return v

• t(r, Xp , Xb ) = v returns a 0-1 indicator vector v ∈ R^m whose length equals the base tree size. If there exists a subtree with node i as root that is isomorphic to the pattern tree tp defined by the inputs r, Xp , Xb , then v[i] is 1; otherwise, v[i] is 0. When the pattern tree's depth is not higher than 1 (i.e., a 1-node tree), t(r, Xp , Xb ) is equivalent to outputting a 0-1 vector indicating the nodes in the base tree that have the same color as the root node of the pattern tree. The implementation is provided in Alg. 5.

• m(r, Xp , Xb , vl , vr ) = v merges the results vl , vr into a 0-1 indicator vector v ∈ R^m whose length equals the base tree size. If there exists a subtree with node i as root that is isomorphic to the pattern tree tp defined by the inputs r, Xp , Xb , then v[i] is 1; otherwise, v[i] is 0. This function can be implemented by checking whether the children of the pattern root have a perfect match with each node's children. Since each node has at most two children, checking the perfect match can be done in constant time. The implementation is provided in Alg. 6.
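For intuition, the recursion underlying Algs. 3–6 can be transcribed into plain Python. The sketch below is an illustrative reference only: it works directly on nested-tuple trees rather than the pointer-list encoding, and it treats children as unordered, mirroring the perfect-matching step in m(·).

```python
# A 2-color binary tree is either None or a tuple (color, left, right).

def iso(p, b):
    """True iff the rooted trees p and b are isomorphic; the two
    children of any node may be swapped (unordered matching)."""
    if p is None or b is None:
        return p is None and b is None
    if p[0] != b[0]:  # colors must agree
        return False
    return ((iso(p[1], b[1]) and iso(p[2], b[2])) or
            (iso(p[1], b[2]) and iso(p[2], b[1])))

def bsi(pattern, base):
    """0-1 indicator over the base tree in pre-order: entry k is 1 iff
    the subtree rooted at the k-th base node is isomorphic to pattern."""
    out = []
    def walk(node):
        if node is None:
            return
        out.append(1 if iso(pattern, node) else 0)
        walk(node[1])
        walk(node[2])
    walk(base)
    return out
```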

After providing the detailed implementations of the functions d(·), t(·), m(·), we now prove that there exists one unified Transformer that can handle all of these sub-tasks under the different prompts d, t, m. First, we provide the following lemma:
Lemma A.1. Any fixed-size logic circuit that only contains multi-fan-in AND gates, multi-fan-in OR gates, NOT gates and
has no recurrent structure can be precisely simulated by a multi-layer perceptron (MLP) with ReLU activation function
and a width of O(|Input| + |Circuit|) and a depth of O(|Circuit|), where |Input| denotes the size of input and |Circuit|
Guiding Large Language Models with Divide-and-Conquer Program for Discerning Problem Solving

denotes the number of gates in the circuit.

Proof. Assume that we are given a series of input pins with logic values 0 or 1, organized as a 0-1 vector x ∈ R^h. We first prove that every gate can be simulated by a two-layer perceptron. We can then serialize all gates in the circuit and stack their corresponding 2-layer simulators to acquire an MLP simulator. An AND gate that takes x as input can be simulated as:
AND(x) = σ(wA x − h + 1) (1)
where σ is the ReLU activation function and wA is a weight vector with all dimensions equal to 1. If some dimensions of x are not inputs of the gate, we can set the corresponding dimensions of the weight vector to 0 and replace h with the number of input pins. Similarly, an OR gate that takes x as input can be simulated as:

OR(x) = 1 − σ(wO x + 1) (2)

where σ is the ReLU activation function and wO is a weight vector with all dimensions equal to −1. A NOT gate is different, since it takes only one input pin. In this case, we denote the index of the input pin by i, and we can simulate a NOT gate as:
NOT(x) = σ(wN x + 1) (3)
where wN is a weight vector whose i-th dimension equals −1 and all other dimensions equal 0. Also, since x is a 0-1 vector, the activation function acts as the identity on x:

x = σ(x) (4)

To construct an MLP that simulates a fixed-size logic circuit without recurrent structure, we apply the circuit serialization of (Merrill & Sabharwal, 2023b), which orders the gates topologically. In this way, we can represent the circuit as a sequence GATE[1], GATE[2], GATE[3], ..., GATE[L], where each GATE[i]'s input contains only the outputs of previous gates and the original input x. Therefore, we can construct a 2L-layer MLP based on this serialization. Specifically, the 2i-th and (2i+1)-th layers of the MLP simulate GATE[i], copy all previous inputs through the activation function, and concatenate them together; this can be done by concatenating an identity matrix with the gate's weight vector (wA , wO or wN ). In this way, we construct an MLP that precisely simulates the circuit. Since the output of each gate is concatenated with its input, the input dimension of the final layer is bounded by O(|x| + L). In the worst case, for a circuit of size L, we need 2L layers to simulate it precisely. However, in many cases a lot of gates in the circuit can run in parallel. In such cases, the MLP can be much shallower.
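These gate constructions can be checked numerically. The snippet below follows the weight choices above (all-1 weights and bias −h+1 for AND; all-(−1) weights and bias 1 for OR and NOT) and verifies them exhaustively on 0-1 inputs; it is a sanity check of the lemma, not part of the construction itself.

```python
def relu(z):
    return max(z, 0.0)

def and_gate(x):
    # AND over h pins: sigma(sum(x) - h + 1), i.e. all-1 weights.
    return relu(sum(x) - len(x) + 1)

def or_gate(x):
    # OR over h pins: 1 - sigma(1 - sum(x)), i.e. all-(-1) weights, bias 1.
    return 1.0 - relu(1 - sum(x))

def not_gate(x, i):
    # NOT of pin i: sigma(1 - x[i]), i.e. weight -1 on pin i, bias 1.
    return relu(1 - x[i])
```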

Now, we can start to prove our main theorem:


Theorem A.2. There exists a log-precision transformer with fixed depth and hidden dimension that can solve the 2-BSI of
any size with fixed-length prompt m (for merge), t (for sub-task tackling) and d (for task decomposition).

Proof. We prove this theorem by constructing a Transformer that can tackle this problem. First, we define how to organize the input given r, Xp , Xb and the prompt. Specifically, we construct a feature sequence X ∈ R^{(3+n+m)×7}. Each item in this sequence is a 7-dimensional feature indicating a token. The first two dimensions indicate whether the token is a prompt ('00'), a root vector ('01'), a pattern tree node ('10'), or a base tree node ('11'). The third to fifth dimensions carry the information of the token: for a prompt token, '100' indicates the merge prompt m, '010' indicates the sub-task tackling prompt t, and '001' indicates the task decomposition prompt d; for the other cases, these three dimensions follow the same format as the three dimensions of r, Xp and Xb . The remaining two dimensions are allocated specifically for the merge function m(·) to store vl and vr : for the token of the i-th base tree node, the sixth dimension is vl [i] and the seventh dimension is vr [i]; for all other tokens, these two dimensions are filled with 0. In X[1] we store the prompt token. In X[2] and X[3] we store the input root vector r twice, so that we can tackle rl and rr separately. To distinguish these two tokens, we use the last dimension of r, which was padded with 0: X[2, 5] is set to 0 and X[3, 5] is set to 1. From X[4] to X[3 + n] we store Xp . From X[4 + n] to X[3 + n + m] we store Xb . All node indices of the pattern tree are increased by 3, and all node indices of the base tree are increased by 3 + n, so that the indices can be used directly to retrieve positional embeddings. After preparing the inputs, we start to construct our

Algorithm 7 Logic circuit for MLP of the second Transformer layer


Require: Input feature x′′ ∈ R42
Ensure: Output feature y ∈ R7
1: y ← x′′ [1 : 7] {Initialize y}
2: if x′′ [1 : 2] == 00 or x′′ [1 : 2] == 10{Prompt Token or Pattern Tree Node} then
3: Return y
4: else if x′′ [1 : 2] == 01 {Root Vector Token} then
5: if x′′ [24 : 26] == 001{Prompt is d} then
6: if x′′ [5] == 0 then
7: y[3] ← x′′ [10] {get rl , similar as line 1 in Alg. 4}
8: else if x′′ [5] == 1 then
9: y[3] ← x′′ [11] {get rr , similar as line 2 in Alg. 4}
10: end if
11: end if
12: else if x′′ [1 : 2] == 11 {Base Tree Node Token} then
13: if x′′ [24 : 26] == 010{Prompt is t} then
14: if x′′ [40] == x′′ [5]{Line 6 in Alg. 5} then
15: y[5] ← 1
16: else
17: y[5] ← 0
18: end if
19: else if x′′ [24 : 26] == 100{Prompt is m} then
20: if x′′ [13] == 1 and x′′ [21] == 1 {Line 7 in Alg. 6} then
21: y[5] ← 1
22: else if x′′ [14] == 1 and x′′ [20] == 1{Line 9 in Alg. 6} then
23: y[5] ← 1
24: else
25: y[5] ← 0
26: end if
27: end if
28: end if
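For readability, Alg. 7 can be transcribed directly into ordinary code. The sketch below mirrors its branches, using a hypothetical 1-indexed bit list x (with x[0] unused) in place of x′′:

```python
def mlp_circuit(x):
    # x: 1-indexed list of 42 bits (x[0] unused); returns 1-indexed y of 7 bits
    y = [None] + x[1:8]                           # line 1: y <- x''[1:7]
    token_type = (x[1], x[2])
    prompt = (x[24], x[25], x[26])
    if token_type in ((0, 0), (1, 0)):            # prompt token or pattern tree node
        return y
    if token_type == (0, 1):                      # root vector token
        if prompt == (0, 0, 1):                   # prompt is d
            y[3] = x[10] if x[5] == 0 else x[11]  # get r_l or r_r
    elif token_type == (1, 1):                    # base tree node token
        if prompt == (0, 1, 0):                   # prompt is t
            y[5] = 1 if x[40] == x[5] else 0      # line 6 in Alg. 5
        elif prompt == (1, 0, 0):                 # prompt is m
            y[5] = 1 if ((x[13] == 1 and x[21] == 1) or   # lines 7/9 in Alg. 6
                         (x[14] == 1 and x[20] == 1)) else 0
    return y
```

Every branch is a fixed composition of comparisons and assignments on bits, which is what makes the algorithm a static circuit.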

Transformer. The Transformer first attaches the position index to each token (positional embedding). After that, the inputs are
forwarded into a Transformer with a depth of 2. Each Transformer layer contains a multi-head attention layer followed by an
MLP. As proved by (Merrill & Sabharwal, 2023b; Feng et al., 2023), the attention layer of a Transformer can retrieve the
features of tokens whose positional embeddings satisfy specific conditions. For multi-head attention, different heads can
retrieve tokens under different conditions. In the following construction, we will use this conclusion to construct attention
heads with different functions.
In the first Transformer layer, the function of each attention head is defined as:

• Head 1 only attends to the token itself to store X[i] for token i.

• Head 2 attends to the token whose positional embedding matches X[i, 3] and copies this token's 5-dimension feature.
For tree node tokens, this head's job is to retrieve the feature of X[i]'s left child. For root vector tokens, this head's job
is to retrieve the feature of the pattern tree root node. For the first token (the prompt token), this head's retrieved feature is
not used in later layers and thus does not influence the correctness of the model.

• Similarly to Head 2, Head 3 attends to the token whose positional embedding matches X[i, 4] and copies this token's
5-dimension feature. This head's job is to retrieve the feature of X[i]'s right child. For root vector tokens, this head's
job is to retrieve the feature of the base tree root node.

• Head 4 attends to the first token (the prompt token) and copies this token's 7-dimension feature. This head's job is to retrieve
the prompt indicator.

• Head 5 attends to the second token (the root token) and copies this token's 7-dimension feature. This head's job is to retrieve
the root information.

With the above 5 heads, the attention layer outputs a 35-dimension feature for each token. We denote these features
as X′ ∈ R(3+n+n′)×35 . After that, these features are forwarded into an MLP fitting the identity mapping to acquire the input
features for the second Transformer layer.
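The five heads of the first layer act as hard-attention gathers. A simplified sketch (treating positional "matching" as direct indexing, and copying the full 7-dimension feature per head for simplicity):

```python
def layer1_attention(X):
    # X: list of 7-dim token features (0-indexed rows here). X[i][2] and
    # X[i][3] are assumed to already hold the (offset) positions of the
    # left / right children; % T guards tokens without children.
    T = len(X)
    out = []
    for i in range(T):
        h1 = X[i]                    # head 1: the token itself
        h2 = X[int(X[i][2]) % T]     # head 2: left child / pattern root
        h3 = X[int(X[i][3]) % T]     # head 3: right child / base root
        h4 = X[0]                    # head 4: the prompt token
        h5 = X[1]                    # head 5: the root token
        out.append(h1 + h2 + h3 + h4 + h5)   # concatenate -> 35 dims
    return out
```

Each head's output is position-independent of the others, so the concatenation above matches the multi-head attention output up to the linear projection.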
In the second Transformer layer, the function of each attention head is defined as:

• Head 1 only attends to the token itself to store X′ [i] for token i.

• Head 2 attends to the token whose positional embedding matches X′ [i, 31] and copies this token's 1-7 dimension
features (X′ [X′ [i, 31], 1 : 7]). This head's job is to broadcast the feature of the pattern tree root node to every token.

With the above 2 heads, the attention layer outputs a 42-dimension feature for each token. We denote these features as
X′′ ∈ R(3+n+n′)×42 . For root vector tokens, only the features from head 1 and head 4 are useful. For base tree node tokens,
all 42 dimensions are useful. Then each token's features are forwarded in parallel into an MLP. We use this MLP to fit the
logic circuit described in Alg. 7. The function of Alg. 7 is to aggregate the functions of m(·), t(·), d(·) together and assign
the correct value based on the prompt indicator. In Alg. 7, all operations are AND, OR, NOT, SELECTOR, and ASSIGN,
and there is no loop. Thus, it is a static logic circuit and can be implemented with multi-fan-in AND, OR, NOT gates.
Thus, it can be precisely simulated by an MLP according to our Lemma A.1.

After acquiring y ∈ R7 for each token, we can organize them as a feature sequence Y ∈ R(3+n+n′)×7 . When the prompt
is d, we return Y[2, 3 : 5] as rl and Y[3, 3 : 5] as rr . If the prompt is t or m, we output Y[3 + n + 1 : 3 + n + n′ , 5]
as the expected v.
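As a concrete illustration of the input organization described at the start of this proof, a hypothetical encoder might look as follows (0-indexed rows; r and the rows of Xp, Xb are assumed to be 3-dimensional):

```python
def encode_tokens(prompt_kind, r, Xp, Xb):
    # dims 0-1: token type (00 prompt, 01 root, 10 pattern node, 11 base node)
    # dims 2-4: token information; dims 5-6: reserved for v_l / v_r (zeros)
    prompt_bits = {"m": (1, 0, 0), "t": (0, 1, 0), "d": (0, 0, 1)}[prompt_kind]
    X = [[0.0] * 7 for _ in range(3 + len(Xp) + len(Xb))]
    X[0][2:5] = prompt_bits                   # prompt token (type 00)
    for i in (1, 2):                          # root vector stored twice
        X[i][1] = 1.0                         # type 01
        X[i][2:5] = list(r)
    X[2][4] = 1.0                             # distinguish the two root copies
    for j, feat in enumerate(Xp):             # pattern tree nodes (type 10)
        X[3 + j][0] = 1.0
        X[3 + j][2:5] = list(feat)
    for j, feat in enumerate(Xb):             # base tree nodes (type 11)
        X[3 + len(Xp) + j][0:2] = [1.0, 1.0]
        X[3 + len(Xp) + j][2:5] = list(feat)
    return X
```

This is only a sketch of the layout; the index offsets (+3 and +3+n) for child pointers are assumed to have been applied to Xp and Xb beforehand.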

A.2. Prompting Details of DaC


Multiplication of Long Integers: Suppose we have two 2n-digit numbers AB and CD, where A, B, C, D are all n-digit
numbers. Then we can break AB × CD into (A × C × 10^2n ) + (A × D × 10^n ) + (B × C × 10^n ) + (B × D), where the
calculation in each bracket pair is disjoint from the other bracket pairs. We only need to compute the multiplications in
each bracket pair in parallel and then merge all of them with addition:
Decomposer Prompt d: Please split the string a from the middle as two separated strings. The lengths of the two separated
strings should be as close as possible. Please only return the two strings separated by a comma and do not return anything
else.
Sub-task Tackling Prompt t: (1) Please compute a ∗ b. (2) Please only return the final results and do not return anything else
(ensuring the disentangled sub-process principle).
Merge Prompt m: Please compute x = a ∗ 10^2n + b ∗ 10^n and y = c ∗ 10^n + d. Based on the above calculation, please
compute x + y carefully step by step.
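The decomposition AB × CD = A·C·10^2n + A·D·10^n + B·C·10^n + B·D can be checked with a short program; the sketch below mirrors the divide-and-conquer recursion on digit strings:

```python
def dac_multiply(a: str, b: str) -> int:
    # divide-and-conquer multiplication of digit strings, mirroring
    # AB x CD = A*C*10^2n + (A*D + B*C)*10^n + B*D
    if len(a) <= 2 and len(b) <= 2:
        return int(a) * int(b)            # sub-task tackling t: small product
    n = max(len(a), len(b)) // 2          # split point, counted from the right
    A, B = a[:-n] or "0", a[-n:]          # decomposer d: split a as A|B
    C, D = b[:-n] or "0", b[-n:]          # decomposer d: split b as C|D
    # four independent sub-products (computable in parallel)
    ac, ad = dac_multiply(A, C), dac_multiply(A, D)
    bc, bd = dac_multiply(B, C), dac_multiply(B, D)
    # merge m: shift each partial product and add
    return ac * 10 ** (2 * n) + (ad + bc) * 10 ** n + bd
```

Here the recursive calls stand in for LLM sub-task invocations; the final line is the merge step performed by prompt m.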
Hallucination Detection in Long Context: We divide the summary into sentences. After that, we verify the
sentences in parallel. Finally, we merge the per-sentence verifications:
Decomposer Prompt d: Please help me segment the following paragraph as sentences. The separated sentences should be
output as: #Statement 1#: ... #Statement 2#: ... Do not say anything else. Just return the statements in the given format.
Sub-task Tackling Prompt t: I want you to act as a factual contradiction checker. You are given a set of statements and a
document. Among the statements, there might be one or more statements that contain contradictions with the document.
Please find the problematic statements if they exist by analyzing the statements one by one. For each statement, please make a
choice:

• A: The statement is totally aligned with the document for sure.

• B: The statement contradicts the document.



Merge Prompt m: Based on the above analysis, please tell me: does any statement above contain a contradiction with the
document?
Fact-Verification for Misinformation Detection: Similar to hallucination detection, we divide the summary into sentences.
After that, we verify the sentences in parallel. Finally, we merge the per-sentence verifications. Thus, our decomposer
prompt and sub-task tackling prompt are the same as in hallucination detection. The only difference is the merge prompt.
Merge Prompt m: If we connect the above statements to be a news article, based on the above analysis, please answer
me: Is there any contradiction between the document and the article?
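Putting the three prompts together, one divide-and-conquer round can be sketched as follows, assuming a hypothetical `llm(prompt, text) -> text` call interface (the actual prompts are the d, t, m strings above):

```python
def divide_and_conquer(task: str, llm, d: str, t: str, m: str) -> str:
    # llm(prompt, text) -> text is a hypothetical LLM-call wrapper
    subtasks = llm(d, task).splitlines()            # decomposer prompt d
    solutions = [llm(t, s) for s in subtasks]       # sub-task prompt t
    return llm(m, "\n".join(solutions))             # merge prompt m
```

The list comprehension over sub-tasks carries no cross-iteration state, so in practice the t-calls can be issued concurrently.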

A.3. Decomposed Prompting and Least to Most
Least-to-Most (LtM) Prompting (Zhou et al., 2022) and Decomposed Prompting (Khot et al., 2022) are two works similar to
ours. Both propose to explicitly prompt the LLM to decompose a task into a series of sub-tasks and tackle them sequentially.
In Fig. 2, we merged these two methods; here we provide a more detailed comparison of them, which is shown in Fig. 5.
Decomposed Prompting can be regarded as an upgraded version of LtM. It introduces special notations into the prompt to
represent program states so that, when sequentially tackling the sub-tasks, it can call heterogeneous modules to tackle them.
This design enables the LLM to call external programs (e.g., retrieving documents from Wikipedia or a program-based
calculator) and/or itself (i.e., recursion). It endows the method with stronger expressive power and increases the compositional
generalization ability of LLMs in different areas, such as symbolic manipulation and multi-hop QA (Khot et al., 2022).
It also endows the LLM with the ability to do open-domain QA by retrieving from an external knowledge base.

Figure 5. Comparison of Least-to-Most (LtM) Prompting and Decomposed Prompting (DeP).

A.4. More Discussions on Sequential Sub-task Tackling and Parallel Sub-task Tackling

Example of Sequential Sub-task Tackling:

Complete Task: Compute 12345*67890:
Sub-task 1: Compute x=45*90:
A: ......
Sub-task 2: Based on the above result, compute y=123*90*10^2+45*90:
A: ......
Sub-task 3: Based on the above result, compute z=45*678*10^2+123*90*10^2+45*90:
A: ......
Sub-task 4: Based on the above result, compute w=123*678*10^4+45*678*10^2+123*90*10^2+45*90:
A: ......

Example of Parallel Sub-task Tackling:

Complete Task: Compute 12345*67890:
Sub-task 1: Compute x=45*90:
A: ......
Sub-task 2: Compute y=123*90*10^2:
A: ......
Sub-task 3: Compute z=45*678*10^2:
A: ......
Sub-task 4: Compute w=123*678*10^4:
A: ......
Resolution Assembly: Based on the above computation, compute x+y+z+w
A: ......

Figure 6. Toy example of Sequential Sub-task Tackling and Parallel Sub-task Tackling in long integer multiplication

Example of Sequential Sub-task Tackling:

Complete Task: Verify the following summary:
#Summary#: A video showing the final moments of Germanwings Flight 9525 has been recovered by investigators from the wreckage site. Marseille prosecutor Brice Robin urged anyone who might have more footage to turn it over immediately. Andreas Lubitz, the co-pilot accused of deliberately crashing the plane, had a history of severe depression and suicidal tendencies.
Sub-task 1: Verify the following statement:
#Summary#: A video showing the final moments of Germanwings Flight 9525 has been recovered by investigators from the wreckage site.
A: ......
Sub-task 2: Based on the above analysis, verify the following statement:
#Summary#: A video showing the final moments of Germanwings Flight 9525 has been recovered by investigators from the wreckage site. Marseille prosecutor Brice Robin urged anyone who might have more footage to turn it over immediately.
A: ......
Sub-task 3: Based on the above analysis, verify the following statement:
#Summary#: A video showing the final moments of Germanwings Flight 9525 has been recovered by investigators from the wreckage site. Marseille prosecutor Brice Robin urged anyone who might have more footage to turn it over immediately. Andreas Lubitz, the co-pilot accused of deliberately crashing the plane, had a history of severe depression and suicidal tendencies.
A: ......

Example of Parallel Sub-task Tackling:

Complete Task: Verify the following summary:
#Summary#: A video showing the final moments of Germanwings Flight 9525 has been recovered by investigators from the wreckage site. Marseille prosecutor Brice Robin urged anyone who might have more footage to turn it over immediately. Andreas Lubitz, the co-pilot accused of deliberately crashing the plane, had a history of severe depression and suicidal tendencies.
Sub-task 1: Verify the following statement:
#Summary#: A video showing the final moments of Germanwings Flight 9525 has been recovered by investigators from the wreckage site.
A: ......
Sub-task 2: Verify the following statement:
#Summary#: Marseille prosecutor Brice Robin urged anyone who might have more footage to turn it over immediately.
A: ......
Sub-task 3: Verify the following statement:
#Summary#: Andreas Lubitz, the co-pilot accused of deliberately crashing the plane, had a history of severe depression and suicidal tendencies.
A: ......
Resolution Assembly: Given the above analysis, verify the summary that consists of the above three statements.
A: ......

Figure 7. Toy example of Sequential Sub-task Tackling and Parallel Sub-task Tackling in hallucination detection

Sequential Sub-task Tackling and Parallel Sub-task Tackling are two different paradigms for decomposing complex tasks into
sub-tasks. The first decomposes a complex task into a series of sub-tasks, in which each sub-task relies on
the previous one's output as input or context. The second decomposes a complex task into a set of sub-tasks, each of
which does not rely on the others. Two examples, for multiplication and hallucination detection, are provided in Fig. 6 and Fig. 7.
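The contrast between the two paradigms can be sketched as follows, with a hypothetical solve(subtask, context) function standing in for an LLM call; only the parallel paradigm lets sub-tasks run concurrently before a single merge step:

```python
from concurrent.futures import ThreadPoolExecutor

def sequential_tackle(subtasks, solve):
    # each sub-task consumes the previous sub-task's output as context
    ctx = 0
    for s in subtasks:
        ctx = solve(s, ctx)
    return ctx

def parallel_tackle(subtasks, solve, merge):
    # sub-tasks are independent, so they can be solved concurrently,
    # then assembled by a single merge step
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(lambda s: solve(s, 0), subtasks))
    return merge(partials)
```

With the partial products of Fig. 6, solve(s, ctx) = s + ctx, and merge = sum, both paradigms return 12345 × 67890; only the dependency structure differs.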
