Guiding Large Language Models With Divide-and-Conquer Program For Discerning Problem Solving
Abstract

Foundation models, such as Large Language Models,

(Mao et al., 2023; Zong & Krishnamachari, 2023). These developments pave the way towards general-purpose problem solvers (Bubeck et al., 2023).
arXiv:2402.05359v1 [cs.AI] 8 Feb 2024
Figure 1. An illustrative example of hallucination detection with entangled problem solving (i.e., directly forwarding all inputs to the LLM) and disentangled problem solving (i.e., dividing the problem inputs into parallel sub-tasks and tackling them in parallel). The sentence marked with a red background in the material is the evidence that contradicts the first claim in the summary (marked with red font).
quires a large quantity of repetitive tasks (e.g., large integer multiplication and article-level verification), the LLM is prone to intermediate errors, such as missing some inputs or generating wrong sub-tasks, leading to problematic answers (Amaro et al., 2023). This phenomenon is especially serious when the task input contains deceptive information or contents that could trigger hallucination (Chen & Shu, 2023; Li et al., 2023a). An illustrative example is presented in Fig. 1. To alleviate this issue, some program-guided prompting strategies such as Least-to-Most (LtM) (Zhou et al., 2022) and Decomposed Prompting (Khot et al., 2022) propose to disentangle sub-task generation and resolution. However, they implement the task decomposer through multi-round conversation (or question answering) with an LLM, which sequentially raises and solves the sub-problems in an alternating manner. When tackling deceptive text such as hallucinations and fake news, this intertwined process often guides the LLM to tackle the corpus sequentially and thus follow the context's flow, making the LLM prone to deception.

In fact, human brains also suffer from similar hallucination issues (Liu & Sajda, 2023), especially when the tasks are too hard or too complex. For example, when reviewing a long academic paper, some reviewers produce low-quality reviews (Garcia et al., 2021; Tennant & Ross-Hellauer, 2020; Cortes & Lawrence, 2021) containing hallucination-like intermediate errors, such as pointing out 'missing baselines' that have already been sufficiently discussed by the authors, or requiring the authors to conduct ablation studies for non-existent sub-modules. To avoid such mistakes, experienced reviewers usually think slowly (Kahneman, 2011) and follow a Divide-and-Conquer paradigm to handle this task. Specifically, they decompose the paper review into examinations of multiple central opinions. Then they retrieve supporting corpus to verify each of them.

Inspired by this human experience, in this paper we explore guiding the LLM with a Divide-and-Conquer program to unlock its ability to handle tasks with repetitive sub-tasks. This strategy breaks the whole task resolution process down into three distinct sub-processes: task decomposition, sub-task resolution, and solution merge. The task decomposition process prompts the model to separate the whole task into multiple sub-tasks that are solvable in parallel (i.e., solving one sub-task does not require the result of another) and to list them explicitly (recursively, if needed). After that, the sub-task resolution process prompts the LLM to output the answer for every sub-task. Finally, the solution-merge process assembles the solutions recursively along the decomposition path. These sub-processes follow a key principle: every upstream sub-process (e.g., sub-task resolution) forwards only its final answer to its downstream sub-processes (e.g., solution merge), not its input or intermediate steps. We denote this the disentangled-sub-process principle. An illustrative figure that explains the difference between our method and previous works is provided in Fig. 2.

To validate the expressive power of our proposed prompting strategy, we provide a theoretic analysis showing that the proposed strategy can expand the expressive power of fixed-depth log-precision Transformers. To further empirically validate the advantage of our proposed method, we evaluate our method and representative baselines on three tasks that are challenging for existing prompting strategies even on state-of-the-art LLMs: Large Integer Multiplication, Hallucination Detection, and Article-level Fact Verification (Cheng & Zhang, 2023; Li et al., 2023a; Wadden et al., 2020; Hu et al., 2023; Wu et al., 2023). These tasks either require very long reasoning paths (e.g., large integer multiplication) or contain deceptive contents (e.g., hallucination detection and fact verification), making existing methods like Chain-of-Thought
[Figure 2: schematic comparison of prompting strategies, with panels (A) CoT, (B) Least-to-Most, (C) CoT-SC, (D) Tree of Thoughts, and (E) Divide-and-Conquer (Ours).]
Figure 2. Comparison between our method and existing prompting methods. Ellipses represent sub-tasks, right-angled rectangles represent sub-task solutions, and rounded rectangles represent intermediate steps that entangle sub-tasks and sub-solutions. The different shades in Tree of Thoughts (subfigure D) indicate the ratings of different search directions. In CoT (Chain-of-Thought), CoT-SC, and ToT, the Large Language Model must simultaneously generate and resolve sub-tasks. Least-to-Most (as well as Decomposed Prompting) disentangles sub-task generation and resolution. However, as shown in the figure, its sub-task resolution process and resolution assembly process are intertwined, as it sequentially attaches new sub-tasks onto the previous resolution. Different from all of them, our method fully disentangles the sub-task generation, sub-task resolution, and resolution assembly processes.
Program-guided Prompting aims at controlling the LLM's generation process with symbolic programs or pre-defined procedures (Zhu et al., 2022; Jung et al., 2022; Zhou et al., 2022; Khot et al., 2022; Creswell & Shanahan, 2022; Gao et al., 2023). Among them, Least-to-Most (LtM) Prompting (Zhou et al., 2022) and Decomposed Prompting (Khot et al., 2022) are the closest to this work. They are the earliest attempts that explicitly prompt the LLM to decompose the task into a series of sub-tasks and tackle them sequentially. LtM prompts an LLM to iteratively raise sub-tasks and sequentially solve them to acquire the final resolution. Decomposed Prompting can be regarded as an upgraded version of LtM. It introduces special notations into the prompt to represent program states and thus can call itself (i.e., recursion) or other modules (i.e., hierarchical decomposition), endowing it with stronger expressive power. Such a design increases the compositional generalization ability of LLMs in different areas, such as symbolic manipulation and multi-hop QA (Khot et al., 2022).

The aforementioned CoT and EoT families give LLMs stronger expressive power than IO prompting. However, a critical issue is that they can miss or ignore important intermediate steps or contents (Liu et al., 2023). This problem is even worse when handling tasks with long inputs (e.g., long documents and large numbers). Typical examples include large-number arithmetic and fact verification in long documents. Compared to them, Least-to-Most prompting and Decomposed Prompting introduce explicit task decomposition to enumerate sub-tasks. However, their task decomposers are based on multi-round conversation or question answering, which often guides the LLM to tackle the sub-tasks sequentially. When tackling deceptive contents, e.g., in hallucination and fake news detection, such a design navigates the LLM through the deceptive content's flow and thus makes the LLM prone to deception.

3. Proposed Method

To keep task decomposition and task resolution from interweaving and interrupting each other, we propose to guide the LLM with a Divide-and-Conquer (DaC) program that consists of three distinct stages: a task decomposition stage, a sub-task resolution stage, and a solution merge stage. In the task decomposition stage, the LLM is prompted to explicitly decompose the task into a series of parallel, homogeneous sub-tasks with smaller problem sizes (e.g., dividing a long paragraph into sentences). This design avoids the multi-round conversation or question answering of LtM and Decomposed Prompting, making the model less prone to deception. After that, in the sub-task resolution stage, the LLM is prompted to provide a solution for every sub-task. Finally, in the solution merge stage, the LLM is prompted to assemble the solutions of all sub-tasks and acquire the final answer. In this process, all three stages are isolated to avoid interruption. They are all guided by a program rather than an LLM, to avoid hallucination or deception from the input context. To tackle tasks of different sizes, we propose two variants: the Single-Level DaC Solver and the Multi-Level DaC Solver.

Algorithm 1 Single-Level Divide-and-Conquer Solver T(S, m, t, d, L)
Require: Input Sequence S, Prompt m (for solution merge), Prompt t (for sub-task tackling), Prompt d (for task decomposition), LLM L
Ensure: Results of the task on input sequence S
1: {S1, S2, ..., Sk} ← L(d, S)
2: Result ← ∅
3: for i = 1, 2, ..., k do
4:   Result ← Result + [SEP] + L(t, Si)
5: end for
6: Return L(m, Result)

The Single-Level Divide-and-Conquer Solver decomposes the task in one call to the LLM, which expands the original task as a tree of one level. The algorithm is presented in Alg. 1. The advantage of this variant is its simplicity and efficiency. However, when the original input is too long, the single-level solver may produce sub-tasks whose problem sizes are still large enough to trigger intermediate errors. In such a case, following (Khot et al., 2022), we can recursively expand the task as a multi-level tree. More specifically, we repeat the aforementioned steps to further divide the sub-tasks hierarchically until they are easy enough to be handled by the LLM. This can be done through a recursive program, as presented in Alg. 2.

Algorithm 2 Multi-Level Divide-and-Conquer Solver Recursion T(S, m, t, d, f, w, L)
Require: Input Sequence S, Problem Size Metric Function f(·) (a function that measures the problem size), hyper-parameter w, Prompt m (for merge), Prompt t (for sub-task tackling), Prompt d (for task decomposition), Large Language Model L
Ensure: Results of the task on input sequence S
1: S1, S2, ..., Sk ← L(d, S)
2: Result ← ∅
3: for i = 1, 2, ..., k do
4:   if f(Si) > w then
5:     Result ← Result + [SEP] + T(Si, m, t, d, f, w, L)
6:   else
7:     Result ← Result + [SEP] + L(t, Si)
8:   end if
9: end for
10: Return L(m, Result)
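A compact Python sketch may help make the two algorithms concrete. The `llm` and `decompose` callables, the prompt strings, and the size metric below are placeholders for whatever model and prompts are actually used; this is an illustrative sketch of the control flow, not the paper's experimental harness.

```python
from typing import Callable, List

SEP = "\n[SEP]\n"

def dac_solve(
    s: str,                           # input sequence S
    merge_p: str,                     # prompt m (solution merge)
    tackle_p: str,                    # prompt t (sub-task tackling)
    decomp_p: str,                    # prompt d (task decomposition)
    size: Callable[[str], int],       # problem-size metric f(.)
    w: int,                           # size threshold hyper-parameter
    llm: Callable[[str, str], str],   # LLM L: (prompt, input) -> output
    decompose: Callable[[str, str], List[str]],  # L(d, S) -> sub-tasks
) -> str:
    """Multi-level Divide-and-Conquer solver (Alg. 2); with a size
    metric that never exceeds w it reduces to the single-level Alg. 1.

    Each stage forwards only its final answer downstream, mirroring
    the disentangled-sub-process principle."""
    subtasks = decompose(decomp_p, s)       # task decomposition
    results = []
    for sub in subtasks:                    # sub-task resolution
        if size(sub) > w:                   # still too large: recurse
            results.append(dac_solve(sub, merge_p, tackle_p, decomp_p,
                                     size, w, llm, decompose))
        else:
            results.append(llm(tackle_p, sub))
    return llm(merge_p, SEP.join(results))  # solution merge
```

Because the sub-tasks are independent by construction, the loop body could also be dispatched in parallel LLM calls.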
4. Theoretic Analysis on the Proposed Method

In this section, we provide a theoretic analysis to indicate that the proposed Divide-and-Conquer program-guided problem solving strategy helps expand the expressive capacity of Transformers. Specifically, we exploit the 2-color Binary Subtree Isomorphism (2-BSI) problem for the proof. The 2-BSI problem is an NC1-complete problem. Under a widely accepted assumption in parallel computing, we theoretically prove two facts: (1) log-precision Transformers with the standard IO-prompting strategy cannot solve the 2-BSI problem, and (2) there exists a log-precision Transformer with the Divide-and-Conquer program-guided problem solving strategy that can solve the 2-BSI problem. In this way, we prove that the Divide-and-Conquer strategy expands the expressive power of Large Language Models.

4.1. Fixed-Depth Transformers Cannot Solve 2-color Binary Subtree Isomorphism (2-BSI)

In this section, we present a problem that exceeds the expressive power of fixed-depth log-precision Transformers: 2-color Binary Subtree Isomorphism (also known as tree matching), an important algorithmic problem to which many tasks, such as semantic similarity analysis, text matching, and structured text database querying, can be reduced (Kilpeläinen & Mannila, 1992; Marsi & Krahmer, 2010; He et al., 2018). More specifically, we first give a theoretical proof that 2-color Binary Subtree Isomorphism is not solvable by fixed-depth log-precision Transformers.

Definition 1. The 2-color Binary Subtree Isomorphism problem is: given a pattern 2-color¹ binary tree t_p and a base 2-color binary tree t_b, a solver is required to judge whether the pattern tree is isomorphic to a sub-tree of t_b.

¹2-color means that each node in the tree can be one of two colors. We can understand it as each node having a 0-1 label.

In (Jenner et al., 2003), the authors pointed out that the encoding of the problem influences its hardness. In this paper, we focus on the pointer list encoding of 2-BSI. Detailed information about the pointer list encoding of 2-BSI can be found in the Appendix. For the pointer list encoding of 2-BSI, we have the following theorem:

Theorem 4.1. Assume that TC0 ≠ NC1. For any depth L and any polynomial Q, there exists a size n of the pattern tree such that there exists no log-precision Transformer with a depth of L, hidden dimension d < Q(n), and fixed prompt p that can directly output the solution (Yes or No) of the 2-color Binary Subtree Isomorphism problem (2-BSI).

Proof. The core point of the proof is that, based on (Merrill & Sabharwal, 2023a), we know that the expressive power of log-precision Transformers lies in TC0. As proved by (Jenner et al., 2003), any L problem (NC1 ⊆ L) can be converted to a 2-color binary tree isomorphism problem (2-BTI, comparing whether two trees are exactly isomorphic or not) under AC0 reduction. Therefore, 2-color binary tree isomorphism is NC1-hard. Because any solver of 2-BSI can solve 2-color binary tree isomorphism by simply adding a layer to check whether the two trees have the same size, 2-BSI is at least as hard as 2-BTI. Since TC0 ⊆ NC1 and we assume that TC0 ≠ NC1, for any fixed-depth polynomial-parameter log-precision Transformer, there exists a size of the 2-BSI problem that the Transformer cannot handle.

4.2. Fixed-Depth Transformers Guided by a Divide-and-Conquer Program Solve 2-BSI

Theorem 4.2. There exists a log-precision transformer with fixed depth L and hidden dimension d that can solve the 2-BSI of any size with fixed-length prompts m (for merge), t (for sub-task tackling), and d (for task decomposition).

Proof Sketch: The detailed proof of this theorem is provided in Appendix A.1. Here we give a brief flow of the proof. To prove this theorem, we first show that there exist a problem size function f(·), a hyper-parameter w, a merge function m(·), a sub-task tackling function t(·), and a decomposition function d(·) that can solve the problem with the divide-and-conquer strategy. Then we prove that there exists one log-precision transformer with fixed depth L and hidden dimension d that can express m(·), t(·), and d(·) respectively with different but fixed-length prompts. In this way, we prove the theorem.

5. Experiments

We evaluate the capacity of our prompting strategies on three kinds of tasks that are troubled by missing intermediate steps due to long inputs, even for state-of-the-art Language Models (e.g., GPT-3.5 and GPT-4): Multiplication and Addition of Large Numbers, Hallucination Detection in Long Context, and Fact-Checking for Misinformation Detection (Yang et al., 2023; Li et al., 2023a; Wadden et al., 2020). With these experiments, we present how Divide-and-Conquer (DaC) prompting handles these tasks better and suggest that DaC could be a promising new paradigm for handling long documents or large inputs.

5.1. Multiplication of Long Integers

General-purpose Large Language Models, e.g., ChatGPT-3.5 and ChatGPT-4, have been bothered by their poor performance on arithmetic with big numbers, even though their expressive capacity has been proved to be enough for accurately calculating large-number multiplication (Yang et al., 2023). In this experiment, we show that under zero-shot long integer multiplication, DaC can outperform other baseline prompting strategies.
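As a concrete instance of why this task decomposes well, long multiplication itself admits a divide-and-conquer formulation: split one operand's digits, solve the independent partial products, and merge them with place values. The sketch below is a plain illustrative algorithm, not the paper's prompts.

```python
def dac_multiply(a: str, b: str, w: int = 4) -> int:
    """Multiply two non-negative integers given as digit strings.

    Decompose: split the first operand's digit string in half.
    Resolve: the two partial products are independent sub-tasks
    (so they could be computed in parallel).
    Merge: recombine sub-solutions with the correct place value."""
    if len(a) <= w:               # small enough: solve directly
        return int(a) * int(b)
    mid = len(a) // 2
    high, low = a[:mid], a[mid:]  # decomposition step
    shift = len(low)              # place value of the high half
    return dac_multiply(high, b, w) * 10 ** shift + dac_multiply(low, b, w)
```

The threshold `w` plays the same role as the problem-size hyper-parameter in Algorithm 2: sub-tasks above it are split again, the rest are solved directly.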
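Returning briefly to the theory, Definition 1 of Section 4 can be made concrete with a brute-force checker. The tuple encoding below (rather than the pointer-list encoding the proof uses) and the reading of "sub-tree" as a node together with all its descendants, with unordered children, are our illustrative assumptions.

```python
# A 2-color binary tree is encoded as (color, left, right); None = no child.

def iso(t1, t2):
    """Exact isomorphism of two 2-color binary trees (2-BTI),
    treating the two children of a node as unordered."""
    if t1 is None or t2 is None:
        return t1 is None and t2 is None
    c1, l1, r1 = t1
    c2, l2, r2 = t2
    if c1 != c2:
        return False
    # match children directly or swapped
    return (iso(l1, l2) and iso(r1, r2)) or (iso(l1, r2) and iso(r1, l2))

def subtree_iso(tp, tb):
    """2-BSI: is pattern tree tp isomorphic to some sub-tree of tb,
    i.e. to a node of tb together with all of its descendants?"""
    if iso(tp, tb):
        return True
    if tb is None:
        return False
    return subtree_iso(tp, tb[1]) or subtree_iso(tp, tb[2])
```

This exponential-free but naive checker is only a reference semantics; the hardness results in Section 4 concern what a fixed-depth log-precision Transformer can express, not this program.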
Strategies                      GPT-3.5-Turbo                   GPT-4
                                F1     Acc    Prec   Recall     F1     Acc    Prec   Recall
IO-prompting 61.69 61.27 62.11 61.28 64.07 72.66 93.41 48.76
Chain-of-Thoughts 46.85 64.26 91.36 31.50 71.05 76.10 90.08 58.66
CoT-SC 47.70 64.25 88.83 32.60 71.39 76.36 90.41 58.98
Tree-of-Thoughts 70.40 59.91 55.83 95.34 69.41 71.73 75.53 64.28
Least-to-Most 56.43 64.91 74.42 45.44 72.51 77.11 90.74 60.38
Divide-and-Conquer w/o DP 67.66 67.51 67.32 68.03 73.95 76.63 83.54 66.33
Divide-and-Conquer 74.84 75.55 77.41 72.03 76.92 78.99 85.36 70.01
Table 1. Performance of different prompting methods on HaluEval dataset. We report the F1 score, Accuracy, Precision and Recall.
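The pipeline evaluated in Table 1 applies the single-level solver sentence by sentence: split the summary, verify each sentence against the document, and merge the verdicts. A minimal sketch follows; the `ask_llm` callable and the prompt wording are hypothetical stand-ins, not the paper's actual prompts (those are in its Appendix).

```python
import re
from typing import Callable

def detect_hallucination(document: str, summary: str,
                         ask_llm: Callable[[str], str]) -> bool:
    """Return True if any summary sentence conflicts with the document.

    The decomposition is done by a program (sentence splitting), so the
    LLM never has to decompose and verify the entangled task at once."""
    # task decomposition: one sub-task per summary sentence
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", summary.strip()) if s]
    verdicts = []
    for sent in sentences:  # sub-task resolution (parallelizable)
        prompt = (f"Document:\n{document}\n\nClaim: {sent}\n"
                  "Is the claim supported by the document? Answer yes or no.")
        verdicts.append(ask_llm(prompt).strip().lower())
    # solution merge: the summary hallucinates if any claim is unsupported
    return any(v.startswith("no") for v in verdicts)
```

Only the per-sentence verdicts reach the merge step, following the disentangled-sub-process principle.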
Setup of baselines, ablation variants, and our method: In this task, our baselines include IO prompting, Chain-of-Thought, CoT-SC, Tree-of-Thoughts, Least-to-Most, and Decomposed Prompting. We also include a variant of our method, DaC without the Disentangled-sub-process Principle (DaC w/o DP), for an ablation study. In this task, the sub-tasks are verifying fragments of the summary, which are homogeneous and do not require recursion. In such a setting, Decomposed Prompting is equivalent to LtM. For this task, we apply the single-level Divide-and-Conquer solver to decompose the summary into multiple sentences, handle them separately, and then merge the conclusions for all sentences. The detailed prompts are provided in the Appendix.

Results: Experimental results are shown in Tab. 1. As we can see, for all other baselines the performance with GPT-3.5 is substantially worse than with GPT-4, as the hallucinations are generated by GPT-3.5 and are thus more deceptive to GPT-3.5. However, for our method, when we replace GPT-4 with GPT-3.5, although the performance still decreases, the drop is significantly smaller than for all other baselines, indicating the robustness of our proposed method toward deceptive contents. Also, for both GPT-3.5 and GPT-4, our proposed prompting strategy outperforms the baselines, presenting the advantage of our method. More specifically, compared to IO-prompting, our method achieves better performance in general, indicating the advantage brought by stronger expressive power. Meanwhile, compared to CoT and CoT-SC, our method clearly achieves much better recall. Tree-of-Thoughts, benefiting from its searching ability, acquires a significantly better recall score compared to other baselines. However, its significantly lower precision substantially harms its overall performance and leads to accuracy that is even worse than standard IO-prompting. Least-to-Most, which explicitly decomposes the task and enumerates all sub-tasks, achieves better recall compared to CoT and CoT-SC. However, it follows the flow of the hallucination, leading to worse recall compared to Tree-of-Thoughts and our method. In contrast, our method carefully checks all sentences, locates the one containing a factual error, and merges the answers. Thus, our method balances the overall performance (F1 and Accuracy scores). In addition, as we can see, although DaC w/o DP outperforms the other baselines in general, it still performs worse than DaC. This phenomenon indicates the importance of the disentangled-sub-process principle. Meanwhile, it is noticeable that DaC w/o DP can be regarded as an upgraded version of LtM/DeP that replaces sequential sub-tasks with parallel sub-tasks. Thus, the experimental results suggest that the parallel sub-task strategy makes the LLM more discerning for hallucination detection.

5.3. Fact-Verification for Misinformation Detection

The increasing abuse of misinformation for manipulating public opinion on social media has been observed in different areas, such as healthcare (e.g., the recent COVID-19 pandemic) (Sharma et al., 2020; 2022). This threat is increasingly serious due to LLMs' capacity for content generation (Li et al., 2023b; Weidinger et al., 2021; Zhang et al., 2022). This challenge raises the importance of fact-verification, which aims at judging the authenticity of an article based on a collection of evidence from verified sources (Whitehouse et al., 2022; Zhang & Gao, 2023). In this experiment, we present that DaC can outperform other baselines in fact-verification involving news articles.

Task Setup: In this experiment, we mainly adopt the SciFact dataset (Wadden et al., 2020). In the SciFact dataset, each sample is a pair of news and evidence, where the evidence is the abstract of a peer-reviewed paper and the news is a claim summarized by human annotators from fake or true news. To better simulate the real-world scenario where news usually appears as an article, following Chen & Shu, we generate a dataset of article-level misinformation based on the SciFact dataset. Specifically, for a given claim, we apply a Large Language Model (i.e., ChatGPT-4) to extend the claim into an article based on the evidence. For this task, similar to hallucination detection, we apply the single-level Divide-and-Conquer solver to decompose the news article into multiple sentences, handle them separately, and then merge the conclusions for all sentences. Also, the baselines in this experiment are the same as for Hallucination Detection; for the same reason, Decomposed Prompting is equivalent to
Strategies                      GPT-3.5-Turbo                   GPT-4
                                F1     G-Mean  Prec   Recall    F1     G-Mean  Prec   Recall
IO-prompting                    72.12  72.77   83.22  63.64     69.15  71.77   94.44  54.55
Chain-of-Thoughts 56.09 60.64 90.48 40.64 74.03 75.79 94.21 60.96
CoT-SC 56.83 61.44 91.67 41.18 70.09 73.45 100.0 53.95
Tree-of-Thoughts 69.91 73.30 53.74 100.0 77.34 78.00 88.89 68.45
Least-to-Most 54.08 54.15 51.46 56.99 73.56 74.25 85.21 64.71
Divide-and-Conquer w/o DP 61.32 63.66 83.81 48.35 74.15 75.68 92.68 61.79
Divide-and-Conquer 76.88 77.13 83.65 71.12 81.11 81.24 76.67 86.10
Table 2. Performance of different prompting methods on SciFact dataset. We report the F1 score, G-Mean score, Precision and Recall.
LtM. The evaluation metrics include the F1 score, G-Mean score (the geometric mean of precision and recall), Precision, and Recall. We do not use accuracy, as the positive and negative classes are not balanced.

Results: Experimental results are shown in Tab. 2. Notably, GPT-3.5 incorporated with our proposed prompting strategy even outperforms GPT-4 incorporated with IO-prompting, Least-to-Most, CoT, and CoT-SC, which have significantly lower recall scores, indicating their proneness to deception. Only Tree-of-Thoughts, benefiting from its advantage in exploring various options, acquires the best results among all baselines, but it is still defeated by our proposed method. Moreover, as we can see, for GPT-4 the performance of CoT-SC is even worse than CoT, which is supposed to be a specific case of CoT-SC without exploration. These results suggest that, when facing deceptive contents generated on purpose, the improvements of existing incremental works may not be so robust.

6. Discussions and Limitations

In summary, the proposed method has the following advantages:

Comparison with IO-Prompting: Superiority in Expressive Power. As we proved in Sec. 4, compared to IO-prompting, our proposed method has stronger expressive power and thus can solve harder problems.

Comparison with CoT and EoT: Disentangling Task Decomposition and Task Resolution. Compared to the prompting family of CoT and EoT, our proposed method explicitly separates the task decomposition stage and the task resolution stage. Therefore, we acquire explicitly decomposed sub-tasks rather than intermediate thoughts proposed during decoding. Consequently, we can explicitly enumerate all sub-tasks output by the decomposition module and keep the model from missing important sub-tasks.

Comparison with LtM and Decomposed Prompting: Parallel versus Sequential Sub-task Handling. Similar to our proposed method, some program-guided prompting strategies like LtM and Decomposed Prompting also explicitly separate the task decomposition stage and the task resolution stage. However, they are mainly designed for multi-step reasoning over complex tasks. Thus, they sequentially tackle the sub-tasks and assemble the resolutions. As a result, they tend to follow the flow of deceptive contents, leading to proneness to deception. More details about this comparison can be found in Appendix A.4.

Although our proposed method DaC surpasses the baselines on the proposed tasks, it still has some limitations. The first issue is that the application scope of DaC is still limited. More specifically, CoT, EoT, LtM, and DaC are based on different algorithmic paradigms, leading to different application scopes. As pointed out by Feng et al., CoT and LtM can be considered as neural dynamic programming algorithms. Thus, CoT is more suitable for tasks that can be bridged to dynamic programming, such as multi-step question answering. Differently, EoT is based on exploration and search, and is thus more suitable for planning and search tasks, such as the Game of 24 (Yao et al., 2023). Our proposed method is based on the Divide-and-Conquer algorithm. Thus, it is more suitable for tasks that can be decomposed into a series of sub-tasks that are disjoint or only slightly overlapping. Our future work will focus on further expanding the application scope of DaC to more areas like question answering.

7. Conclusions

In this paper, we proposed a novel Divide-and-Conquer program-guided problem solving strategy. To guide large language models to tackle tasks requiring a large quantity of repetitive sub-tasks and/or containing deceptive contents, DaC disentangles the processes of task decomposition, sub-task resolution, and resolution assembly. In this way, we keep the intertwined processes from interrupting each other and alleviate the intermediate errors in generating the resolution path. The experimental results show that this improvement leads to better performance on a wide range of tasks such as large integer multiplication, hallucination detection, and misinformation detection. Moreover, theoretic analysis reveals that our proposed method helps expand the expressive power of LLMs beyond the original Transformers.
References

Amaro, I., Della Greca, A., Francese, R., Tortora, G., and Tucci, C. AI unreliable answers: A case study on ChatGPT. In International Conference on Human-Computer Interaction, pp. 23–40. Springer, 2023.

Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901, 2020.

Bubeck, S., Chandrasekaran, V., Eldan, R., Gehrke, J., Horvitz, E., Kamar, E., Lee, P., Lee, Y. T., Li, Y., Lundberg, S., et al. Sparks of artificial general intelligence: Early experiments with GPT-4. arXiv preprint arXiv:2303.12712, 2023.

Chen, C. and Shu, K. Can LLM-generated misinformation be detected? arXiv preprint arXiv:2309.13788, 2023.

Chen, Y., Fu, Q., Yuan, Y., Wen, Z., Fan, G., Liu, D., Zhang, D., Li, Z., and Xiao, Y. Hallucination detection: Robustly discerning reliable answers in large language models. In Proceedings of the 32nd ACM International Conference on Information and Knowledge Management, pp. 245–255, 2023a.

Chen, Z., Mao, H., Li, H., Jin, W., Wen, H., Wei, X., Wang, S., Yin, D., Fan, W., Liu, H., et al. Exploring the potential of large language models (LLMs) in learning on graphs. arXiv preprint arXiv:2307.03393, 2023b.

Cheng, V. and Zhang, Y. Analyzing ChatGPT's mathematical deficiencies: Insights and contributions. In Wu, J.-L. and Su, M.-H. (eds.), Proceedings of the 35th Conference on Computational Linguistics and Speech Processing (ROCLING 2023), pp. 188–193, Taipei City, Taiwan, October 2023. The Association for Computational Linguistics and Chinese Language Processing (ACLCLP). URL https://aclanthology.org/2023.rocling-1.22.

Cormen, T. H., Leiserson, C. E., Rivest, R. L., and Stein, C. Introduction to Algorithms. 2022.

Cortes, C. and Lawrence, N. D. Inconsistency in conference peer review: revisiting the 2014 NeurIPS experiment. arXiv preprint arXiv:2109.09774, 2021.

Creswell, A. and Shanahan, M. Faithful reasoning using large language models. arXiv preprint arXiv:2208.14271, 2022.

Feng, G., Gu, Y., Zhang, B., Ye, H., He, D., and Wang, L. Towards revealing the mystery behind chain of thought: a theoretical perspective. arXiv preprint arXiv:2305.15408, 2023.

Gao, L., Madaan, A., Zhou, S., Alon, U., Liu, P., Yang, Y., Callan, J., and Neubig, G. PAL: Program-aided language models. In International Conference on Machine Learning, pp. 10764–10799. PMLR, 2023.

Garcia, J. A., Rodriguez-Sánchez, R., and Fdez-Valdivia, J. Quality censoring in peer review. Scientometrics, 126:825–830, 2021.

He, Y., Tao, S., Xu, J., Guo, J., Lan, Y., and Cheng, X. Text matching with Monte Carlo tree search. In Information Retrieval: 24th China Conference, CCIR 2018, Guilin, China, September 27–29, 2018, Proceedings 24, pp. 41–52. Springer, 2018.

Hendy, A., Abdelrehim, M., Sharaf, A., Raunak, V., Gabr, M., Matsushita, H., Kim, Y. J., Afify, M., and Awadalla, H. H. How good are GPT models at machine translation? A comprehensive evaluation. arXiv preprint arXiv:2302.09210, 2023.

Hu, B., Sheng, Q., Cao, J., Shi, Y., Li, Y., Wang, D., and Qi, P. Bad actor, good advisor: Exploring the role of large language models in fake news detection. arXiv preprint arXiv:2309.12247, 2023.

Jenner, B., Köbler, J., McKenzie, P., and Torán, J. Completeness results for graph isomorphism. Journal of Computer and System Sciences, 66(3):549–566, 2003.

Jung, J., Qin, L., Welleck, S., Brahman, F., Bhagavatula, C., Bras, R. L., and Choi, Y. Maieutic prompting: Logically consistent reasoning with recursive explanations. arXiv preprint arXiv:2205.11822, 2022.

Kahneman, D. Thinking, Fast and Slow. Macmillan, 2011.

Khot, T., Trivedi, H., Finlayson, M., Fu, Y., Richardson, K., Clark, P., and Sabharwal, A. Decomposed prompting: A modular approach for solving complex tasks. arXiv preprint arXiv:2210.02406, 2022.

Kilpeläinen, P. and Mannila, H. Grammatical tree matching. In Annual Symposium on Combinatorial Pattern Matching, pp. 162–174. Springer, 1992.

Li, J., Cheng, X., Zhao, W. X., Nie, J.-Y., and Wen, J.-R. HaluEval: A large-scale hallucination evaluation benchmark for large language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 6449–6464, 2023a.

Li, S., Yang, J., and Zhao, K. Are you in a masquerade? Exploring the behavior and impact of large language model driven social bots in online social networks. arXiv preprint arXiv:2307.10337, 2023b.
Liu, X. and Sajda, P. Roe: A computational-efficient anti-hallucination fine-tuning technology for large language model inspired by human learning process. In International Conference on Brain Informatics, pp. 456–463. Springer, 2023.

Liu, Y., Yao, Y., Ton, J.-F., Zhang, X., Cheng, R. G. H., Klochkov, Y., Taufiq, M. F., and Li, H. Trustworthy LLMs: a survey and guideline for evaluating large language models' alignment. arXiv preprint arXiv:2308.05374, 2023.

Manakul, P., Liusie, A., and Gales, M. J. SelfCheckGPT: Zero-resource black-box hallucination detection for generative large language models. arXiv preprint arXiv:2303.08896, 2023.

Mao, R., Chen, G., Zhang, X., Guerin, F., and Cambria, E. GPTEval: A survey on assessments of ChatGPT and GPT-4. arXiv preprint arXiv:2308.12488, 2023.

Marsi, E. and Krahmer, E. Automatic analysis of semantic similarity in comparable text through syntactic tree matching. In Proceedings of the 23rd International Conference on Computational Linguistics (COLING), pp. 752–760. Chinese Information Processing Society of China (CIPS), 2010.

Marzal, A. and Vidal, E. Computation of normalized edit distance and applications. IEEE Transactions on Pattern Analysis and Machine Intelligence, 15(9):926–932, 1993.

Merrill, W. and Sabharwal, A. The parallelism tradeoff:

Tan, Y., Min, D., Li, Y., Li, W., Hu, N., Chen, Y., and Qi, G. Evaluation of ChatGPT as a question answering system for answering complex questions. arXiv preprint arXiv:2303.07992, 2023.

Tennant, J. P. and Ross-Hellauer, T. The limitations to our understanding of peer review. Research Integrity and Peer Review, 5(1):6, 2020.

Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.

Wadden, D., Lin, S., Lo, K., Wang, L. L., van Zuylen, M., Cohan, A., and Hajishirzi, H. Fact or fiction: Verifying scientific claims. In Webber, B., Cohn, T., He, Y., and Liu, Y. (eds.), Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 7534–7550, Online, November 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.emnlp-main.609. URL https://aclanthology.org/2020.emnlp-main.609.

Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., and Zhou, D. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171, 2022.

Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., Le, Q. V., Zhou, D., et al. Chain-of-thought
Limitations of log-precision transformers. Transactions prompting elicits reasoning in large language models.
of the Association for Computational Linguistics, 11:531– Advances in Neural Information Processing Systems, 35:
545, 2023a. doi: 10.1162/tacl a 00562. URL https: 24824–24837, 2022.
//aclanthology.org/2023.tacl-1.31.
Weidinger, L., Mellor, J., Rauh, M., Griffin, C., Uesato,
Merrill, W. and Sabharwal, A. The expresssive power J., Huang, P.-S., Cheng, M., Glaese, M., Balle, B.,
of transformers with chain of thought. arXiv preprint Kasirzadeh, A., et al. Ethical and social risks of harm
arXiv:2310.07923, 2023b. from language models. arXiv preprint arXiv:2112.04359,
2021.
Radford, A., Wu, J., Child, R., Luan, D., Amodei, D.,
Sutskever, I., et al. Language models are unsupervised Whitehouse, C., Weyde, T., Madhyastha, P., and Komninos,
multitask learners. N. Evaluation of fake news detection with knowledge-
enhanced language models. In Proceedings of the In-
Schaeffer, R., Miranda, B., and Koyejo, S. Are emergent
ternational AAAI Conference on Web and Social Media,
abilities of large language models a mirage? arXiv
volume 16, pp. 1425–1429, 2022.
preprint arXiv:2304.15004, 2023.
Sharma, K., Seo, S., Meng, C., Rambhatla, S., and Wu, Y., Zhu, J., Xu, S., Shum, K., Niu, C., Zhong, R.,
Liu, Y. Coronavirus on social media: Analyzing mis- Song, J., and Zhang, T. Ragtruth: A hallucination corpus
information in twitter conversations. arXiv preprint for developing trustworthy retrieval-augmented language
arXiv:2003.12309, 2020. models. arXiv preprint arXiv:2401.00396, 2023.
Sharma, K., Zhang, Y., and Liu, Y. Covid-19 vaccine mis- Yang, Z., Ding, M., Lv, Q., Jiang, Z., He, Z., Guo, Y., Bai,
information campaigns and social media narratives. In J., and Tang, J. Gpt can solve mathematical problems
Proceedings of the International AAAI Conference on without a calculator. arXiv preprint arXiv:2309.03241,
Web and Social Media, volume 16, pp. 920–931, 2022. 2023.
Guiding Large Language Models with Divide-and-Conquer Program for Discerning Problem Solving
Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T. L., Cao, Y.,
and Narasimhan, K. Tree of thoughts: Deliberate prob-
lem solving with large language models. arXiv preprint
arXiv:2305.10601, 2023.
Zhang, X. and Gao, W. Towards llm-based fact verification
on news claims with a hierarchical step-by-step prompt-
ing method. arXiv preprint arXiv:2310.00305, 2023.
Zhang, Y., Li, Y., Cui, L., Cai, D., Liu, L., Fu, T., Huang,
X., Zhao, E., Zhang, Y., Chen, Y., et al. Siren’s song in
the ai ocean: A survey on hallucination in large language
models. arXiv preprint arXiv:2309.01219, 2023.
Zhou, D., Schärli, N., Hou, L., Wei, J., Scales, N., Wang,
X., Schuurmans, D., Cui, C., Bousquet, O., Le, Q., et al.
Least-to-most prompting enables complex reasoning in
large language models. arXiv preprint arXiv:2205.10625,
2022.
Zhu, X., Wang, J., Zhang, L., Zhang, Y., Gan, R., Zhang, J.,
and Yang, Y. Solving math word problem via coopera-
tive reasoning induced language models. arXiv preprint
arXiv:2210.16257, 2022.
Zong, M. and Krishnamachari, B. Solving math word prob-
lems concerning systems of equations with gpt-3. In
Proceedings of the AAAI Conference on Artificial Intelli-
gence, volume 37, pp. 15972–15979, 2023.
A. Appendix
A.1. Proof of Theorem 4.2
Before providing the proof, we first formally define how to organize the inputs (i.e., two 2-color trees) as a sequence. We assume that we acquire two trees: tp of size n and tb of size n′. They are organized as two sequences of nodes in a random order. Each node has three variables: left child index, right child index, and color. If a child is null, the corresponding index is filled with 0. We can then organize the trees as two sequences Xp ∈ R^{n×3} and Xb ∈ R^{n′×3}, where each item in a sequence is a 3-dimensional vector: the first dimension is the index of the left child, the second dimension is the index of the right child, and the third dimension is the color indicator (0 or 1). In addition, we have a root vector r with three dimensions. The first dimension is the index of the root node of tp (i.e., it points to the root node of tp) and the second is the index of the root node of tb (i.e., it points to the root node of tb). The third dimension of r is filled with 0 so that it has the same dimension as the items in Xp and Xb. This representation of trees is also called pointer-list encoding (Jenner et al., 2003).
Note that in the following proof, we assume that all indices start from 1. Thus 0 is regarded as a NULL pointer.
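The pointer-list encoding above can be sketched in a few lines of Python. This is our own illustration (not the paper's code); the node layout (left, right, color) and the 1-based indexing with 0 as NULL follow the definitions just given.

```python
# Pointer-list encoding sketch: each node is a (left_child, right_child, color)
# triple with 1-based indices, 0 = NULL; r = (pattern_root, base_root, 0).

def encode_tree(nodes):
    """nodes: list of (left, right, color) triples, 1-based children, 0 = NULL."""
    return [list(t) for t in nodes]

# Pattern tree tp: a color-1 root whose left child is a color-0 leaf.
X_p = encode_tree([(2, 0, 1),   # node 1: left = node 2, no right child, color 1
                   (0, 0, 0)])  # node 2: leaf, color 0

# Base tree tb: a color-1 root with two leaves.
X_b = encode_tree([(2, 3, 1),
                   (0, 0, 0),
                   (0, 0, 1)])

r = [1, 1, 0]  # roots of tp and tb; third dimension padded with 0
```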
Following the proof flow we provided in Sec. 4.2, we first provide the following divide-and-conquer algorithm that can
solve the above problem:
Algorithm 4 Implementation of d(r, Xp, Xb) when the depth of the tree indicated by r is larger than 1
Require: Inputs r ∈ R^3, Xp ∈ R^{n×3}, Xb ∈ R^{n′×3}
Ensure: Two root vectors rl, rr that point to the left and right subtrees of the pattern tree root indicated by r.
1: rl ← < Xp[r[1], 1], r[2], r[3] >
2: rr ← < Xp[r[1], 2], r[2], r[3] >
3: Return rl, rr
The algorithm described above is a typical divide-and-conquer algorithm for solving rooted tree isomorphism. Its justification can be found in many algorithm textbooks, such as Introduction to Algorithms (Cormen et al., 2022). Here we provide the detailed definitions and implementations of the problem size metric f(·), the hyper-parameter w, the merge function m(·), the sub-task tackling function t(·), and the task decomposition function d(·):
• w = 1, and f(r, Xp, Xb) is defined as the depth of the pattern tree tp indicated by the root vector r. Although precisely calculating f(r, Xp, Xb) takes O(n) time, judging whether f(r, Xp, Xb) > 1 only requires checking whether the root node has a child; if not, we return False.
• d(r, Xp, Xb) = rl, rr returns two new root vectors rl, rr. Both rl and rr have the same second and third dimensions as r. The first dimension of rl is updated to the index of the left child of the root node that r points to, and the first dimension of rr is updated to the index of the right child of that node. The updating procedure is given in Alg. 4.
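The decomposition step in this bullet can be sketched as follows. This is our own Python rendering for illustration, assuming the encoding defined earlier (1-based indices; dimension 1 = left child, dimension 2 = right child).

```python
# Sketch of the decomposition function d: repoint r at the left / right
# subtree of the pattern tree root (1-based indices, 0 = NULL).

def d(r, X_p, X_b):
    node = X_p[r[0] - 1]           # the pattern-tree node that r points to
    r_l = [node[0], r[1], r[2]]    # first dimension -> left child index
    r_r = [node[1], r[1], r[2]]    # first dimension -> right child index
    return r_l, r_r
```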
Algorithm 5 Implementation of t(r, Xp, Xb) when the depth of the tree indicated by r is not larger than 1
Require: Inputs r ∈ R^3, Xp ∈ R^{n×3}, Xb ∈ R^{n′×3}
Ensure: A 0-1 indicator vector v: if there exists a subtree with node i as root that is isomorphic with the pattern tree tp defined with inputs r, Xp, Xb, then v[i] is 1; otherwise, v[i] is 0.
1: Initialize v as an all-1 vector of length n′
2: if r[1] == 0 then
3:   Return v
4: end if
5: for i ∈ {1, 2, ..., n′} do
6:   if Xb[i, 3] ≠ Xp[r[1], 3] then
7:     v[i] ← 0
8:   end if
9: end for
10: Return v
• t(r, Xp, Xb) = v returns a 0-1 indicator vector v ∈ R^{n′} whose length equals the base tree size. If there exists a subtree with node i as root that is isomorphic with the pattern tree tp defined with inputs r, Xp, Xb, then v[i] is 1; otherwise, v[i] is 0. When the pattern tree's depth is not higher than 1 (i.e., a 1-node tree), t(r, Xp, Xb) is equivalent to outputting a 0-1 vector indicating the nodes in the base tree that have the same color as the root node of the pattern tree. The implementation is provided in Alg. 5.
• m(r, Xp, Xb, vl, vr) = v merges the results vl, vr to acquire a 0-1 indicator vector v ∈ R^{n′} whose length equals the base tree size. If there exists a subtree with node i as root that is isomorphic with the pattern tree tp defined with inputs r, Xp, Xb, then v[i] is 1; otherwise, v[i] is 0. This function can be implemented by checking whether the pattern root's children have a perfect match with each node's children. Since each node has at most two children, checking the perfect match can be done in constant time. The implementation is provided in Alg. 6.
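The pieces f, d, t, and m defined above assemble into the full divide-and-conquer solver. The sketch below is our own self-contained rendering, not the paper's code; we read "match" in the top-down embedding sense that the color-only base case t implies (a NULL pattern child places no constraint, and the two pattern children may match either orientation of a base node's children).

```python
# Divide-and-conquer subtree matching over pointer-list-encoded 2-color trees.
# Node = (left, right, color), 1-based indices, 0 = NULL.

def f(r, X_p):
    """Problem size: depth of the pattern subtree that r points to."""
    if r[0] == 0:
        return 0
    left, right, _ = X_p[r[0] - 1]
    return 1 + max(f([left, r[1], r[2]], X_p), f([right, r[1], r[2]], X_p))

def d(r, X_p, X_b):
    """Decomposition: repoint r at the pattern root's left / right child."""
    left, right, _ = X_p[r[0] - 1]
    return [left, r[1], r[2]], [right, r[1], r[2]]

def t(r, X_p, X_b):
    """Base case (depth <= 1): mark base nodes whose color matches the
    pattern root; an empty pattern (r[0] == 0) matches everywhere."""
    v = [1] * len(X_b)
    if r[0] == 0:
        return v
    color = X_p[r[0] - 1][2]
    for i, node in enumerate(X_b):
        if node[2] != color:
            v[i] = 0
    return v

def _child_ok(p_child, b_child, v_sub):
    if p_child == 0:        # pattern places no constraint on this side
        return True
    if b_child == 0:        # pattern needs a child the base node lacks
        return False
    return v_sub[b_child - 1] == 1

def m(r, X_p, X_b, v_l, v_r):
    """Merge: node i matches iff its color matches the pattern root and its
    children cover the pattern root's children (in either orientation).
    Only called when r[0] != 0, since d is applied only at depth > 1."""
    left, right, color = X_p[r[0] - 1]
    v = [0] * len(X_b)
    for i, (b_left, b_right, b_color) in enumerate(X_b):
        if b_color != color:
            continue
        straight = _child_ok(left, b_left, v_l) and _child_ok(right, b_right, v_r)
        swapped = _child_ok(left, b_right, v_l) and _child_ok(right, b_left, v_r)
        if straight or swapped:
            v[i] = 1
    return v

def solve(r, X_p, X_b, w=1):
    if f(r, X_p) <= w:
        return t(r, X_p, X_b)
    r_l, r_r = d(r, X_p, X_b)
    return m(r, X_p, X_b, solve(r_l, X_p, X_b, w), solve(r_r, X_p, X_b, w))
```

For example, matching a color-1 root with a color-0 left child against a base tree whose root (color 1) has a color-0 and a color-1 leaf marks only the base root.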
After providing the detailed implementations of the functions d(·), t(·), m(·), we now prove that there exists one unified transformer that can handle all these tasks with the different prompts d, t, m. First, we provide the following Lemma:
Lemma A.1. Any fixed-size logic circuit that only contains multi-fan-in AND gates, multi-fan-in OR gates, and NOT gates and has no recurrent structure can be precisely simulated by a multi-layer perceptron (MLP) with ReLU activation function, a width of O(|Input| + |Circuit|), and a depth of O(|Circuit|), where |Input| denotes the size of the input and |Circuit| denotes the size of the circuit.
Proof. Assume that we are given a series of input pins with logic values of 0 or 1, organized as a 0-1 vector x ∈ R^h. We first prove that each gate can be simulated by a two-layer perceptron. We can then serialize all gates in the circuit and stack their corresponding 2-layer simulators accordingly to acquire an MLP simulator. An AND gate that takes x as input can be simulated as:
AND(x) = σ(wA x − h + 1) (1)
where σ is the ReLU activation function and wA is a weight vector with all dimensions equal to 1. If some dimensions of x are not inputs of the gate, we can set the corresponding dimensions of the weight vector to 0 and adjust h to be the number of input pins. Similarly, an OR gate that takes x as input can be simulated as:
OR(x) = 1 − σ(wO x + 1) (2)
where σ is the ReLU activation function and wO is a weight vector with all dimensions equal to -1. A NOT gate is different, since it only takes one input pin. In such a case, we denote the index of the input pin as i; then we can simulate a NOT gate as:
NOT(x) = σ(wN x + 1) (3)
where wN is a weight vector whose i-th dimension equals -1 and all other dimensions equal 0. Also, since x is a 0-1 vector, the activation function acts as an identity function on x:
x = σ(x) (4)
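These gate simulations are easy to check numerically. The sketch below is our own rendering in plain Python (σ is ReLU); note that Eq. (1) realizes AND in a single ReLU step, the OR gate with the all-(-1) weight vector wO needs the extra affine step 1 − σ(·) that a two-layer perceptron provides, and Eq. (3) realizes NOT on pin i.

```python
# ReLU simulations of Boolean gates on 0-1 inputs, as in Eqs. (1)-(3).

def relu(z):
    return max(0.0, z)

def AND(x):
    # all-1 weights, bias -h + 1: outputs 1 iff every pin is 1
    return relu(sum(x) - len(x) + 1)

def OR(x):
    # all-(-1) weights, bias 1, then an affine step: outputs 1 iff any pin is 1
    return 1 - relu(-sum(x) + 1)

def NOT(x, i):
    # weight -1 on pin i, bias 1
    return relu(-x[i] + 1)
```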
To construct an MLP that can simulate a fixed-size logic circuit without recurrent structure, we apply the circuit serialization in (Merrill & Sabharwal, 2023b), which orders the gates topologically. In this way, we can represent the circuit as a sequence GATE[1], GATE[2], GATE[3], ..., GATE[L], where each GATE[i]'s input only contains the outputs of previous gates and the original input x. Therefore, we can construct a 2L-layer MLP based on this serialization. Specifically, the 2i-th and (2i+1)-th layers of the MLP simulate GATE[i] and also copy all previous inputs through the activation function and concatenate them together. This can be done by concatenating an identity matrix onto the GATE's weight vector (wA, wO, or wN). In this way, we construct an MLP that precisely simulates the circuit. Since every time we concatenate the output of a gate with its input, the input dimension of the final layer can be bounded by O(|x| + L). In the worst case, for a circuit of size L, we need 2L layers to simulate it precisely. However, in many cases, many gates in the circuit can run in parallel, and the MLP can then be much shallower.
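The serialization argument above can be illustrated with a toy interpreter: gates are listed in topological order, each gate is simulated by one ReLU step, and its output is concatenated onto the running feature vector (the copy-forward step of the construction). This is our own sketch, not the paper's construction.

```python
# Toy simulation of a serialized circuit: the feature vector starts as the
# input x and grows by one entry per gate, mirroring the 2L-layer MLP.

def relu(z):
    return max(0.0, z)

def run_circuit(x, gates):
    """gates: list of ('AND'|'OR'|'NOT', [indices into current features])."""
    feats = list(x)
    for kind, idx in gates:
        vals = [feats[i] for i in idx]
        if kind == 'AND':
            out = relu(sum(vals) - len(vals) + 1)
        elif kind == 'OR':
            out = 1 - relu(-sum(vals) + 1)
        else:  # NOT, single input pin
            out = relu(-vals[0] + 1)
        feats.append(out)  # copy-forward + concatenate, as in the proof
    return feats[-1]

# XOR(a, b) = OR(a, b) AND NOT(AND(a, b)); inputs sit at indices 0 and 1,
# gate outputs are appended at indices 2, 3, 4, 5.
xor = [('AND', [0, 1]), ('OR', [0, 1]), ('NOT', [2]), ('AND', [3, 4])]
```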
Proof. We prove this theorem by constructing a Transformer that can tackle this problem. First, we define how to organize the input given r, Xp, Xb and the prompt. Specifically, we construct a feature sequence X ∈ R^{(3+n+n′)×7}. Each item in this sequence is a 7-dimensional feature indicating a token. The first two dimensions indicate whether the token is a prompt ('00'), a root vector ('01'), a pattern tree node ('10'), or a base tree node ('11'). The third to fifth dimensions carry the information about the token. For a prompt token, '100' indicates the merge prompt m, '010' indicates the sub-task tackling prompt t, and '001' indicates the task decomposition prompt d. For the other cases, these three dimensions follow the same format as the three dimensions in r, Xp, Xb. The remaining two dimensions are allocated specifically for the merge function m(·) to store vl and vr. More specifically, for the feature of the token indicating the i-th base tree node, the sixth dimension is vl[i] and the seventh dimension is vr[i]. For other tokens, these two dimensions are filled with 0. In X[1] we store the prompt token. In X[2] and X[3] we store the input root vector r in duplicate. We store the same token twice so that we can tackle rl and rr separately. To distinguish these two tokens, we use the last content dimension, which was padded with 0 in r: X[2, 5] is set to 0 and X[3, 5] is set to 1. From X[4] to X[3+n], we store Xp. From X[4+n] to X[3+n+n′], we store Xb. All node indices of the pattern tree are increased by 3, and all node indices of the base tree are increased by 3+n, so that the indices can be applied directly to retrieve the positional embeddings. After preparing the inputs, we start to construct our Transformer.
The Transformer first attaches the position index to each token (positional embedding). After that, the inputs are forwarded into a Transformer with a depth of 2. Each Transformer layer contains a multi-head attention layer followed by an MLP. As proved by (Merrill & Sabharwal, 2023b; Feng et al., 2023), the attention layer of a Transformer can retrieve the features of tokens whose positional embeddings satisfy specific conditions. With multi-head attention, different heads can retrieve tokens under different conditions. In the following construction, we use this conclusion to construct attention heads with different functions.
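The retrieval behavior these heads rely on can be sketched as an idealized "hard" attention step. The function below is our own illustration, not the paper's construction: a real head realizes this with (near-)saturated attention over positional embeddings, but its intended effect is to copy, for each token, the feature of the token whose position equals a pointer stored in that token.

```python
# Idealized position-matching attention head: for each token i, copy the
# feature of the token at (1-based) position X[i][pointer_dim]; a 0 pointer
# yields a zero vector.

def retrieve_by_index(X, pointer_dim):
    n, dim = len(X), len(X[0])
    out = []
    for row in X:
        j = row[pointer_dim]
        out.append(list(X[j - 1]) if j != 0 else [0] * dim)
    return out
```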
In the first Transformer layer, the function of each attention head is defined as:
• Head 1 only attends to the token itself to store X[i] for token i.
• Head 2 attends to the token whose positional embedding matches X[i, 3] and copies that token's feature. For tree node tokens, this head's job is to retrieve the feature of X[i]'s left child. For root vector tokens, this head's job is to retrieve the feature of the pattern tree root node. For the first token (the prompt token), this head's retrieved feature is not used in later layers and thus does not influence the correctness of the model.
• Similar to Head 2, Head 3 attends to the token whose positional embedding matches X[i, 4] and copies that token's feature. For tree node tokens, this head's job is to retrieve the feature of X[i]'s right child. For root vector tokens, this head's job is to retrieve the feature of the base tree root node.
• Head 4 attends to the first token (the prompt token) and copies that token's 7-dimension feature. This head's job is to retrieve the prompt indicator.
• Head 5 attends to the second token (the root token) and copies that token's 7-dimension feature. This head's job is to retrieve the root information.
With the above 5 heads, the attention layer outputs a 35-dimension feature for each token. We denote these features as X′ ∈ R^{(3+n+n′)×35}. After that, these features are forwarded into an MLP fitting the identity mapping to acquire the input features for the second Transformer layer.
In the second Transformer layer, the function of each attention head is defined as:
• Head 1 only attends to the token itself to store X′[i] for token i.
• Head 2 attends to the token whose positional embedding matches X′[i, 31] and copies dimensions 1-7 of that token's feature (X′[X′[i, 31], 1:7]). This head's job is to broadcast the feature of the pattern tree root node to every token.
With the above 2 heads, the attention layer outputs a 42-dimension feature for each token. We denote these features as X′′ ∈ R^{(3+n+n′)×42}. For root vector tokens, only the features originating from Head 1 and Head 4 of the first layer are useful. For base tree node tokens, all 42 dimensions are useful. Then each token's feature is forwarded in parallel into an MLP. We use this MLP to fit the logic circuit described in Alg. 7. The function of Alg. 7 is to aggregate the functions of m(·), t(·), d(·) together and assign the correct value based on the prompt indicator. In Alg. 7, all operations are AND, OR, NOT, SELECTOR, and ASSIGN, and there is no loop. Thus, it is a static logic circuit and can be implemented with multi-fan-in AND, OR, and NOT gates, so it can be precisely simulated by an MLP according to our Lemma A.1.
After acquiring y ∈ R^7 for each token, we can organize the outputs as a feature sequence Y ∈ R^{(3+n+n′)×7}. When the prompt is d, we return Y[2, 3:5] as rl and Y[3, 3:5] as rr. If the prompt is t or m, we output Y[3+n+1 : 3+n+n′, 5] as the expected v.
Merge Prompt m: Based on the above analysis, please tell me: does any statement above contain a contradiction with the document?
Fact-Verification for Misinformation Detection: Similar to hallucination detection, we divide the summary into sentences. After that, we verify the sentences in parallel. Finally, we merge the verification results of the sentences. Thus, our decomposer prompt and sub-task tackling prompt are the same as those for hallucination detection. The only difference is the merge prompt.
Merge Prompt m: If we connect the above statements into a news article, based on the above analysis, please answer: Is there any contradiction between the document and the article?
A.3. Decomposed Prompting and Least to Most
Least-to-Most (LtM) Prompting (Zhou et al., 2022) and Decomposed Prompting (Khot et al., 2022) are two works similar to ours. Both propose to explicitly prompt the LLM to decompose the task into a series of sub-tasks and tackle them sequentially. In Fig. 2, we merge these two methods. Here, we provide a more detailed comparison of them, shown in Fig. 5. Decomposed Prompting can be regarded as an upgraded version of LtM. It introduces special notations into the prompt to represent program states so that, when sequentially tackling the sub-tasks, it can call heterogeneous modules to handle them. Such a design enables the LLM to call external programs (e.g., retrieving documents from Wikipedia or a program-based calculator) and/or itself (i.e., recursion). This endows it with stronger expressive power and increases the compositional generalization ability of LLMs in areas such as symbolic manipulation and multi-hop QA (Khot et al., 2022). It also endows the LLM with the ability to do open-domain QA by retrieving from an external knowledge base.
Figure 5. Comparison of Least-to-Most (LtM) Prompting and Decomposed Prompting (DeP).
A.4. More Discussions on Sequential Sub-task Tackling and Parallel Sub-task Tackling
Figure 6. Toy example of Sequential Sub-task Tackling (left panel: each sub-task builds on the previous result) and Parallel Sub-task Tackling (right panel: independent sub-tasks) in long integer multiplication.
Figure 7. Toy example of Sequential Sub-task Tackling (left panel: each sub-task verifies a growing prefix of the summary based on the previous analysis) and Parallel Sub-task Tackling (right panel: each sentence is verified independently, followed by a Resolution Assembly step) in hallucination detection.
Sequential Sub-task Tackling and Parallel Sub-task Tackling are two different paradigms for decomposing a complex task into sub-tasks. The first decomposes a complex task into a series of sub-tasks, where each sub-task relies on the previous one's output as input or context. The second decomposes a complex task into a set of sub-tasks, none of which relies on the others. Examples for multiplication and hallucination detection are provided in Fig. 6 and 7