
Published as a conference paper at ICLR 2024

CHAIN-OF-EXPERTS: WHEN LLMS MEET COMPLEX OPERATIONS RESEARCH PROBLEMS

Ziyang Xiao1, Dongxiang Zhang1∗, Yangjun Wu1, Lilin Xu1, Yuan Wang3, Xiongwei Han2, Xiaojin Fu2, Tao Zhong2, Jia Zeng2, Mingli Song1, Gang Chen1
1 Zhejiang University   2 Huawei Noah's Ark Lab   3 School of Business, Singapore University of Social Sciences
{xiaoziyang, zhangdongxiang, brooksong, cg}@zju.edu.cn
[email protected], {hanxiongwei, Zeng.Jia}@huawei.com
[email protected], [email protected]
[email protected], [email protected]

ABSTRACT

Large language models (LLMs) have emerged as powerful techniques for various NLP tasks, such as mathematical reasoning and plan generation. In this paper, we study automatic modeling and programming for complex operations research (OR) problems, so as to alleviate the heavy dependence on domain experts and benefit a spectrum of industry sectors. We present the first LLM-based solution, namely Chain-of-Experts (CoE), a novel multi-agent cooperative framework to enhance reasoning capabilities. Specifically, each agent is assigned a specific role and endowed with domain knowledge related to OR. We also introduce a conductor that orchestrates these agents via a forward thought construction and backward reflection mechanism. Furthermore, we build a benchmark dataset (ComplexOR) of complex OR problems to facilitate OR research and community development. Experimental results show that CoE significantly outperforms state-of-the-art LLM-based approaches on both LPWP and ComplexOR.

1 INTRODUCTION

Operations research (OR) aims to mathematically model complex decision-making problems that arise from a wide spectrum of industry sectors. To automate the procedure and reduce the dependence on domain-specific modeling experts, NL4Opt (Natural Language for Optimization) (Ramamonjison et al., 2022a) has recently emerged as an attractive but challenging NLP task. Its objective is to translate the text description of an OR problem into math formulations for optimization solvers. To facilitate understanding of the task, an example from the current NL4Opt benchmark dataset is depicted in Figure 1. The prevailing NL4Opt models adopt a two-stage framework. Initially, they perform named entity recognition (NER) to identify variables, parameters, and constraints from the input text, which are subsequently converted into a mathematical optimization model. Despite their efficacy on elementary OR problems, these approaches fail to tackle complex real-world challenges.
In this paper, we study the automatic modeling and programming of complex OR problems derived from real-world industrial demands. As shown in Figure 1, their text descriptions often contain implicit constraints, posing a substantial interpretation challenge for existing NL4Opt solvers. For instance, the phrase "zero lead times", highlighted in green, conveys the absence of any time lag between production orders. Additionally, it is imperative to possess domain-specific knowledge to understand terminologies such as "backlogging", "carryover", and "lot-sizing". Finally, in contrast to the explicit input numbers in the simple example, complex OR problems exhibit an abundance of implicit variables that require specification from domain modeling experts. The magnitude of variables and constraints in these complex problems introduces formidable hurdles and results in a longer reasoning chain.


∗Corresponding author.


An example of the current NL4Opt dataset:
A theme park transports its visitors around the park either by scooter or rickshaw. A scooter can carry 2 people while a rickshaw can carry 3 people. To avoid excessive pollution, at most 40% of the vehicles used can be rickshaws. If the park needs to transport at least 300 visitors, minimize the total number of scooters used.
Modeling result: variables x (scooters) and y (rickshaws); constraints y ≤ 0.4(x + y) and 2x + 3y ≥ 300; objective: minimize x.
An example of our dataset

In the context of manufacturing planning, we tackle the Capacitated Multi-level Lot Sizing Problem with Backlogging.
We make the following assumptions in defining and formulating this problem. First, we assume that setup times and
costs are non-sequence dependent, setup carryover between periods is not permitted, and all initial inventories are
zero. Second, all production costs are assumed to be linear in production output and do not vary over time; hence,
they can be dropped from the model for simplicity. Setup and holding costs also are assumed not to vary over time.
Furthermore, end items are assumed to have no successors, and only end items have external demands and
backlogging costs. Finally, we assume zero lead times and no lost sales. It is important to note that all these
assumptions (except setup carryover) are made for ease of exposition only and without loss of generality, i.e., the
theoretical results remain valid even when they are removed. See Ozturk and Ornek (2010) for the lot-sizing problem
with setup carryover as well as with external demands for component items.

Figure 1: Comparison between elementary and complex NL4Opt problems. In the complex OR example, phrases in green indicate implicit constraints, and the domain-specific terminologies are highlighted in yellow. The model output is presented in Appendix A.1.

To resolve the above issues, we leverage the power of LLMs and present the first LLM-based solution. We propose a multi-agent reasoning framework, namely Chain-of-Experts (CoE), to orchestrate multiple LLM agents for complex OR problem solving. At the helm of this collaborative endeavor presides a central entity, designated the "Conductor", responsible for orchestrating the sequence of interactions among the agents. Each agent is assigned a specific role and is equipped with domain-specific expertise. We implement diversified agents with different skills, including but not limited to terminology interpretation, mathematical model construction, and programming. Furthermore, we incorporate a backward reflection mechanism. Through a systematic analysis of the output, the framework has the capacity to detect potential errors in the problem-solving process.
Comparison with other LLM-based reasoning. In recent years, extensive research efforts have been devoted to enhancing the reasoning capabilities of large language models (LLMs). Notable examples in this domain include Chain-of-Thought (Wei et al., 2022), Self-consistency (Wang et al., 2023a), Tree of Thoughts (Yao et al., 2023a), Graph of Thoughts (Besta et al., 2023), Progressive-Hint Prompting (Zheng et al., 2023), and ReAct (Yao et al., 2023b). These works have formulated distinct prompting schemes and approaches to thought transformation. Further elaboration on these methodologies is presented in the subsequent section. Unfortunately, these single-agent LLMs as well as multi-agent schemes like Solo-Performance Prompting (Wang et al., 2023b) exhibit conspicuous limitations when confronted with complex OR problems, because they cannot simultaneously tackle the challenges of implicit constraints, external knowledge prerequisites, and long reasoning chains. In our CoE, we address these challenges via multi-expert collaboration, and experimental results indicate that CoE significantly outperforms these LLM-based approaches.
Contributions. (1) We study NL4Opt at a more challenging level, which requires the model to discover implicit constraints, possess domain-specific knowledge, and perform complex reasoning. (2) We present the first LLM-based solution to complex OR problems. (3) We propose a novel multi-agent framework called Chain-of-Experts (CoE), enabling collaborative problem-solving and iterative modeling optimization based on a forward thought construction and backward reflection mechanism. (4) We also build a new dataset (ComplexOR), and the experimental results on it affirm the superior performance of CoE over 8 other LLM-based reasoning baselines.

2 RELATED WORK
NL4Opt Problems. NL4Opt aims to translate the descriptions of OR problems into mathematical
formulations. A benchmark dataset (https://github.com/nl4opt/nl4opt-competition) was curated by Ramamonjison et al. (2022a). To bridge the gap

between the natural language input p and context-free formulation r, they proposed a two-stage map-
ping p → r → f that first adopted the BART-base model (Lewis et al., 2020) with copy mechanism
to generate an intermediate representation r, which was then parsed into a canonical formulation.
Edit-based models (Malmi et al., 2022) can be applied as a post-processing step for error correction.
The two-stage framework was followed by subsequent studies. He et al. (2022) introduced an en-
semble text generator leveraging multitask learning techniques to enhance the quality of generated
formulations. In a similar vein, Ning et al. (2023) proposed a prompt-guided generation framework,
complemented by rule-based pre-processing and post-processing techniques, to enhance accuracy.
In a related research endeavor, Prasath & Karande (2023) investigated the synthesis of mathematical
programs. GPT-3 with back translation was utilized to synthesize the canonical forms as well as
Python code.

LLM-based Reasoning. Language models have shown substantial potential in solving complex reasoning tasks within specific domains, such as the TSP (Zhang et al., 2023), databases (Zhou et al., 2023), and knowledge systems (Zhu et al., 2023). Chain-of-Thought (CoT) (Wei et al., 2022) broke a complex reasoning task into a series of intermediate reasoning steps. Self-consistency (Wang et al., 2023a) replaced the greedy decoding in CoT by sampling a diverse set of reasoning paths and selecting the most consistent answer. Tree of Thoughts (ToT) (Yao et al., 2023a) and Graph of Thoughts (GoT) (Besta et al., 2023) further enhanced the reasoning capability by allowing LLMs to explore and combine thoughts in a structured manner. Progressive-Hint Prompting (PHP) (Zheng et al., 2023) progressively refined the answers by leveraging previously generated answers as hints. Subsequent works, such as ReAct (Yao et al., 2023b) and Reflexion (Shinn et al., 2023), allowed LLMs to interface with additional information or feedback from external sources. Recently, cooperation among multiple agents has also been explored. CAMEL (Li et al., 2023) introduced a novel communicative agent framework for autonomous cooperation. Solo Performance Prompting (SPP) (Wang et al., 2023b) transformed a single LLM into a cognitive synergist by simulating multiple personas and demonstrated the potential problem-solving abilities of multi-agent systems.

3 PROPOSED METHOD
3.1 EXPERT DESIGN

In our reasoning framework, an “expert” refers to a specialized agent based on a Large Language
Model (LLM) augmented with domain-specific knowledge and reasoning skills. Each expert is
assigned a specific role and undergoes four steps:
Step 1: In-context Learning. Each agent is allowed to access an external knowledge base and
perform top-k retrieval against the knowledge base. The retrieved information is then provided to
the LLM to facilitate in-context learning. For example, an expert responsible for generating Gurobi
programs can access the Gurobi official API documentation. This step is optional, depending on the
availability of the knowledge base.
Step 2: Reasoning. Each LLM-based expert utilizes existing prompting techniques, such as Chain-of-Thought or self-consistency, to perform its reasoning task according to its specific role. Our reasoning procedure consists of forward thinking and reflection modes, whose details will be presented in the subsequent section.
Step 3: Summarize. Due to the token limit of a single interaction with the LLM, an expert can choose to summarize its reasoning output. Since this step may result in significant information loss, it is optional for certain experts (e.g., modeling experts).
Step 4: Comment. This step is inspired by Solo Performance Prompting (Wang et al., 2023b), in
which the participants are allowed to give critical comments and detailed suggestions. The objective
is to make the communication between agents more constructive.
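To make this four-step design concrete, the following minimal Python sketch outlines one way such an expert could be wrapped around an LLM. It is an illustration only: the class structure and the llm_call and retrieve_top_k helpers are hypothetical placeholders, not the paper's released implementation.

# Minimal sketch of the four-step expert design (illustrative only).
# `llm_call` and `retrieve_top_k` are hypothetical helpers standing in for
# an LLM API call and a top-k retriever over an external knowledge base.

class Expert:
    def __init__(self, role, prompt_template, knowledge_base=None):
        self.role = role                      # e.g. "Programming Expert"
        self.prompt_template = prompt_template
        self.knowledge_base = knowledge_base  # optional (Step 1)

    def forward(self, problem, comments):
        # Step 1: in-context learning via top-k retrieval (optional).
        knowledge = ""
        if self.knowledge_base is not None:
            docs = retrieve_top_k(self.knowledge_base, problem, k=3)
            knowledge = "\n".join(docs)

        # Step 2: role-specific reasoning with an existing prompting scheme
        # (e.g. a Chain-of-Thought style instruction inside the template).
        prompt = self.prompt_template.format(
            problem=problem, comments="\n".join(comments), knowledge=knowledge)
        reasoning = llm_call(prompt)

        # Step 3: optional summarization to respect the token limit.
        summary = llm_call(f"Summarize the key conclusions:\n{reasoning}")

        # Step 4: turn the summary into a constructive comment for the others.
        comment = llm_call(
            f"As the {self.role}, give critical comments and concrete "
            f"suggestions based on:\n{summary}")
        return comment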

3.2 THE WORKFLOW OF CHAIN-OF-EXPERTS

The framework of our proposed Chain-of-Experts (CoE) is depicted in Figure 2. We initialize a collection of 11 experts, such as the terminology interpreter, modeling expert, programming expert, and code reviewer. Their detailed design specifications are available in Appendix A.2.1.


[Figure 2 (workflow diagram): the Conductor receives the problem input ("In the context of manufacturing planning, we tackle the Multi-level Lot Sizing Problem with Backlogging. We assume that...") and routes it through the Terminology Interpreter (which explains that "backlogging" refers to a situation where customer orders cannot be met on time), the Modeling Expert (variables, constraints, and objective), and the Programmer (gurobipy code) in the forward pass. The Evaluator's feedback ("Line 41: Variable 'Q' is not defined") triggers a backward pass in which the Programmer and the Modeling Expert reflect on their outputs, the Modeling Expert finds an error, the Programmer produces corrected code, and the final evaluation reports "Run successfully!".]

Figure 2: An example to illustrate the workflow of Chain-of-Experts. In this example, the Conductor receives the input problem and starts coordinating the experts. The exemplar workflow consists of ➀ terminology interpretation for the input problem; ➁ problem modeling; ➂ program generation; ➃ evaluation of correctness, identifying an issue; ➄ reflection on the program, confirming its correctness; ➅ reflection on the modeling, finding a mistake; ➆ program generation again; ➇ final evaluation, confirming correctness.

There is a Conductor to effectively coordinate these experts. It iteratively and dynamically selects
an expert to construct a forward thought chain. The candidate answer generated by the forward
reasoning chain will be passed to an external program execution environment, whose feedback signal
triggers a backward reflection procedure. The details of these two key components are elaborated in
the following:
Forward Thought Construction. During the forward thought construction phase, the experts are sequentially selected by the Conductor. We formulate forward thought construction as a sequential decision-making process, in which the set of experts is viewed as the action space. Let us define the input problem description as P and the set of pre-defined experts as E = {E_{φ_1}, E_{φ_2}, ..., E_{φ_n}}, where n is the total number of experts and φ_i represents the configuration of the i-th expert. Each expert is associated with an optional knowledge base and a prompt template. We denote the set of comments at the t-th reasoning step as C_t and define the state in Equation 1:

S_t = (P, C_t, t)    (1)
Unlike traditional reinforcement learning, which requires a large amount of training data, we utilize a training-free approach by leveraging the prompting techniques of large language models. This allows us to achieve the same functionality as a decision-making agent without any training. Consequently, we model the policy of the Conductor in Equation 2, where the Conductor acts as a policy function to select the experts, F represents the large language model, θ′ represents the parameters of the LLM, and PT_t denotes the prompt template at the t-th step:

Conductor_{F_{θ′}(PT_t)}(e | s) = Pr{E_{φ_t} = e | S_t = s}    (2)
Based on the above formulation, the expert selection policy can be translated into the design of a prompt template PT_t, which requires prompt engineering to achieve an optimal policy. The detailed design of the prompt template is presented in Appendix A.2.2. After designing the Conductor, we can update the comment set in each reasoning step as follows:

E_{φ_{i_t}} = Conductor(S_t)    (3)
c = E_{φ_{i_t}}(P, C_t)    (4)
C_{t+1} = C_t ∪ {c}    (5)
where E_{φ_{i_t}} represents the i_t-th expert selected at step t and c denotes the comment of the selected expert. We combine the previous comments C_t with c to obtain C_{t+1} as the new state. After a fixed number of steps T, the forward process of the Chain-of-Experts framework terminates. At this point, all the comments are summarized to form the final answer A.
Backward Reflection. The backward reflection mechanism in the Chain-of-Experts framework enables the system to leverage external feedback and adjust the collaboration among experts based on the evaluation of the problem-solving results.

Algorithm 1 Chain-of-Experts
Input: problem description p
Parameters: forward steps N, maximum forward-backward trials T
1: Initialize a set of comments C ← ∅
2: Initialize a stack of experts E ← ∅
3: for t = 1, ..., T do
4:     for i = 1, ..., N do
5:         expert_i ← Conductor(p, C)
6:         comment ← expert_i(p, C)
7:         C ← C ∪ {comment}
8:         E.push(expert_i)
9:     end for
10:    answer ← Reducer(p, C)
11:    feedback, passed ← Evaluator(answer)
12:    if passed then
13:        return answer
14:    end if
15:    stop_backward ← false
16:    while not stop_backward and not E.empty() do
17:        expert ← E.pop()
18:        feedback, stop_backward ← expert.reflect(p, C, feedback)
19:        C ← C ∪ {feedback}
20:    end while
21: end for
22: return answer

Let us define the trajectory of experts selected in order as τ = {E_{φ_{i_1}}, E_{φ_{i_2}}, ..., E_{φ_{i_T}}}, where i_t represents the index of the expert selected at step t. The backward reflection process starts with external feedback r_raw, which is typically provided by a program execution environment. This step can be denoted as r_raw ← execution(A). Then, the initial signals are derived from the evaluation of the raw external feedback: (r_0, sr_0) ← evaluate(r_raw), where r_0 is a boolean value indicating whether the backward process needs to continue and sr_0 represents the natural-language summary of the feedback, which is used to locate errors during the backward reflection process. If the answer A is deemed correct, r_0 is set to false and the whole reasoning procedure terminates. Otherwise, the Chain-of-Experts framework initiates a backward self-reflection process to update the answer. The process begins with the last expert E_{φ_{i_T}} and backpropagates in reverse order to iteratively update the feedback signal. At the t-th backward step, the update of the state is described by Equations 6 and 7, where reflect represents one of the reasoning abilities in the expert design. The reflect function also produces a tuple of r_t and sr_t, which aligns with the output of the evaluate function. In this case, r_t is set to true when the expert confirms the correctness of its previous comment.

(r_t, sr_t) ← reflect(E_{φ_{i_{T−t+1}}}, P, C_t, r_{t−1})    (6)
C_{t+1} = C_t ∪ {sr_t}    (7)

The backward reflection process continues until the feedback signal r_t indicates that the expert E_{φ_{i_{T−t+1}}} is the one who made the mistake, or until all experts have been reflected upon. Subsequently, a forward process is performed again.
It is worth noting that our reflection method differs from Reflexion (Shinn et al., 2023): in CoE, reflection is performed at the system level with interaction among multiple experts, and the feedback is recursively backpropagated, whereas Reflexion involves only a single LLM.

3.3 IMPLEMENTATION DETAILS

Algorithm 1 provides the implementation pseudo-code of the Chain-of-Experts framework, which consists of four main stages:
Initialization (lines 1-2): The process begins by initializing the set of comments C. Additionally, a stack E is used to store the selected experts, ensuring a first-in-last-out order for forward thought construction and backward reflection.


Forward Thought Construction (lines 4-9): Experts are selected sequentially by the Conductor, with each expert contributing its comments to the global comment set C. The forward construction process continues for a fixed number of steps N. As depicted in line 10, once the forward construction is completed, a Reducer is employed to summarize all the comments and generate a final answer. Since the comment set contains numerous comments after the forward construction, the Reducer plays a crucial role in summarizing and reconciling them. For the detailed prompt template design of the Reducer, please refer to Appendix A.2.3.
Backward Reflection (lines 11-20): In line 11, once a solution is obtained, an Evaluator gathers external feedback and converts it into natural language to assess the solution's correctness. If the solution is deemed incorrect, the system enters a reflection phase. In this phase, experts are consulted iteratively in reverse order by popping them from the stack. They are prompted to reflect on their solution and provide additional comments if necessary. As indicated in line 16, the backward process continues until a mistake is found by self-reflection or the first expert is reached.
Iterative Improvement (loop in line 3): The steps of forward thought construction and backward
reflection are repeated iteratively until a satisfactory solution is achieved or a maximum number of
trials T is reached.
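As a rough illustration, Algorithm 1 could be transcribed into Python along the following lines. This is a sketch under the assumption that the Conductor, Reducer, Evaluator, and expert objects expose the call signatures used in the pseudo-code; it is not the authors' released implementation.

# Illustrative transcription of Algorithm 1 (not the official implementation).
def chain_of_experts(problem, conductor, reducer, evaluator,
                     forward_steps=5, max_trials=3):
    comments = []                                # global comment set C
    answer = None
    for _ in range(max_trials):                  # iterative improvement (line 3)
        expert_stack = []                        # first-in-last-out order
        # Forward thought construction (lines 4-9)
        for _ in range(forward_steps):
            expert = conductor.select(problem, comments)
            comments.append(expert.forward(problem, comments))
            expert_stack.append(expert)
        # Summarize all comments into a candidate answer (line 10)
        answer = reducer.summarize(problem, comments)
        # External evaluation, e.g. running the generated program (line 11)
        feedback, passed = evaluator.evaluate(answer)
        if passed:
            return answer
        # Backward reflection in reverse order of selection (lines 15-20)
        stop_backward = False
        while not stop_backward and expert_stack:
            expert = expert_stack.pop()
            feedback, stop_backward = expert.reflect(problem, comments, feedback)
            comments.append(feedback)
    return answer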

4 EXPERIMENTS
4.1 DATASETS

LPWP. The LPWP dataset (Ramamonjison et al., 2022b) is collected from the NL4Opt competition at NeurIPS 2022. It comprises 1101 elementary-level linear programming (LP) problems. Each problem consists of a text description with IR annotations including parameters, variables, linear constraints, and the objective function. The dataset is partitioned into 713 training samples, 99 validation samples, and 289 test samples for performance evaluation.
ComplexOR. With the assistance of three specialists with expertise in operations research, we constructed and released the first dataset of complex OR problems. We selected 37 problems from diversified sources, including academic papers, textbooks, and real-world industry scenarios. These problems cover a wide range of subjects, spanning from supply chain optimization and scheduling problems to warehousing logistics. It took the experts nearly a month to annotate each problem with its model formulation and a minimum of five test cases to verify the correctness of generated code.

4.2 MODEL SETUP AND PERFORMANCE METRICS

In our experimental setup, we use GPT-3.5-turbo as the default large language model. We set the temperature parameter to 0.7 and conduct five runs to average the metrics. The number of iterations is set to 3, with each iteration consisting of 5 forward steps by default.
Since it is infeasible for domain experts to manually evaluate the output of the LLM-based solutions, we employ an automated code evaluation process. Specifically, we require each solution to generate the programming code for each OR problem. If the code passes the associated test cases annotated by the OR specialists, we consider the problem successfully solved, and we use Accuracy to indicate the success rate. Besides this rigorous metric, we also adopt the compile error rate (CE rate) to capture the percentage of generated programs that fail to compile, possibly caused by issues in the automatic modeling process. Alternatively, the runtime error rate (RE rate) measures the percentage of generated programs that encounter errors during execution, which are caused by internal logic errors such as unsolvable models or non-linear constraints. The experimental code is at https://github.com/xzymustbexzy/Chain-of-Experts.
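The automated evaluation described above could be realized roughly as follows. The sketch assumes each solver emits a Python script and that the annotated test cases are stored as input/expected-output pairs; this reflects our reading of the setup rather than the released evaluation code, and run_case is a hypothetical helper.

# Hedged sketch of the automated metric computation (Accuracy, CE rate, RE rate).
import subprocess

def evaluate_programs(programs, test_cases, timeout=60):
    solved = compile_errors = runtime_errors = 0
    for prog, cases in zip(programs, test_cases):
        try:
            compile(prog, "<generated>", "exec")     # syntax check -> CE rate
        except SyntaxError:
            compile_errors += 1
            continue
        try:
            ok = all(run_case(prog, case, timeout) for case in cases)
        except RuntimeError:                         # execution failure -> RE rate
            runtime_errors += 1
            continue
        solved += ok                                 # all test cases pass -> solved
    n = len(programs)
    return {"Accuracy": solved / n,
            "CE rate": compile_errors / n,
            "RE rate": runtime_errors / n}

def run_case(prog, case, timeout):
    # Hypothetical helper: executes the generated script on one annotated
    # test case and compares its output with the expected optimum.
    result = subprocess.run(["python", "-c", prog], input=case["input"],
                            capture_output=True, text=True, timeout=timeout)
    if result.returncode != 0:
        raise RuntimeError(result.stderr)
    return result.stdout.strip() == case["expected"].strip()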

4.3 BASELINES

We compare CoE with 9 baselines. As to traditional approaches for NL4Opt, we consider tag-BART (Gangwar & Kani, 2022) as the SOTA model, which won 1st place in the NeurIPS competition (Ramamonjison et al., 2022b). We also compare CoE with prevailing LLM-based methods, including Chain-of-Thought, Progressive Hint, Tree-of-Thought, Graph-of-Thought, ReAct, Reflexion, and Solo Performance Prompting. The default GPT without any optimization of the reasoning chain is named Standard, which is expected to achieve inferior performance.


Table 1: Comparison with baselines on LPWP and ComplexOR

                        LPWP                                  ComplexOR
Method                  Accuracy↑   CE rate↓   RE rate↓       Accuracy↑   CE rate↓   RE rate↓
tag-BART                47.9%       -          -              0%          -          -
Standard                42.4%       18.1%      13.2%          0.5%        36.8%      8.6%
Chain-of-Thought        45.8%       20.5%      9.4%           0.5%        35.3%      8.6%
Progressive Hint        42.1%       19.4%      10.3%          2.2%        35.1%      13.5%
Tree-of-Thought         47.3%       17.4%      9.7%           4.9%        31.4%      7.6%
Graph-of-Thought        48.0%       16.9%      9.1%           4.3%        32.4%      8.1%
ReAct                   48.5%       15.5%      11.2%          14.6%       31.9%      10.8%
Reflexion               50.7%       7.3%       9.0%           13.5%       12.9%      10.1%
Solo Performance        46.8%       17.9%      13.6%          7.0%        46.5%      13.5%
CoE without expert      55.1%       4.0%       11.9%          18.8%       7.9%       15.0%
Chain-of-Experts        58.9%       3.8%       7.7%           25.9%       7.6%       6.4%

We also implement a variant that uses the same model with a uniform system prompt, "You are a helpful assistant," across all roles and without any additional knowledge bases (CoE without expert in Table 1). The detailed implementation of these algorithms is described in Appendix A.3.

4.4 OVERALL PERFORMANCE ON LPWP AND COMPLEXOR

The results in terms of accuracy, CE rate, and RE rate on the two benchmark datasets are reported in Table 1. Since the traditional method tag-BART is not capable of generating code, we measure its accuracy by requiring the constraints and objective in its math model to be correct. Note that generating a valid linear programming model is a prerequisite step for correct code generation. Even under such a loose evaluation metric, tag-BART is still inferior to certain LLM-based baselines, verifying the feasibility of applying LLMs to solve OR problems. We also observe that tag-BART fails on all problem instances of ComplexOR.
Among the LLM-based baselines, Reflexion stands out as the most promising OR problem solver. Due to its self-reflection mechanism, it achieves the lowest CE rate and RE rate on both datasets. When confronted with complex OR problems, its overall accuracy is slightly inferior to ReAct. The reason is that in complex OR problems, the ability to access external knowledge bases becomes more crucial, which is a strength of ReAct. Even though Solo Performance also adopts a multi-agent reasoning framework, its performance is not satisfactory. Unlike our collaborative reasoning framework, its agents are simply initialized by a leader LLM and lack effective cooperation to solve the challenging OR problems.
Our proposed CoE establishes clear superiority across all performance metrics on both datasets. On LPWP, the accuracy is 58.9%, surpassing the state-of-the-art agent Reflexion by 8.2%. In its best run, CoE also manages to solve 10 out of 37 complex problem instances in ComplexOR. The outstanding performance stems from the effective design of the Chain-of-Experts reasoning framework, including the expert design methodology, the roles of the Conductor and Reducer, and the reflection mechanism. We also find that removing the experts' specialized features leads to a decrease in accuracy, which suggests that CoE benefits from using specialized experts. In the next experiment, we investigate the effect of these ingredients through an ablation study.

4.5 ABLATION STUDY

Regarding single-expert design, Table 2 highlights the positive roles played by both the knowledge base and the reasoning ability. Removing these components results in a slight drop in accuracy. Interestingly, we find that summarization is the most crucial design aspect. The reasons are two-fold. First, the length of the comments may exceed the token limit of GPT-3.5-turbo, and the overflowing tokens will be discarded. Second, a compact and meaningful summary is more friendly for decision making by the downstream experts.


Table 2: Ablation study of Chain-of-Experts

                                    LPWP                            ComplexOR
Method                              Accuracy  CE rate  RE rate     Accuracy  CE rate  RE rate
CoE (Full)                          58.9%     3.8%     7.7%        25.9%     7.6%     6.4%
inner-agent: w/o knowledge base     58.0%     4.0%     8.5%        23.3%     8.4%     7.9%
inner-agent: w/o CoT reasoning      58.2%     3.7%     7.9%        24.3%     8.1%     6.4%
inner-agent: w/o summarize          56.3%     3.8%     9.4%        20%       7.6%     10.3%
inter-agent: w/o Reflection         55.6%     4.2%     12.2%       22.7%     7.8%     10.6%
inter-agent: w/o Conductor          54.2%     6.5%     8.2%        21.1%     8.1%     8.6%
inter-agent: w/o Reducer            56.5%     5.5%     8.8%        23.0%     9.2%     8.1%

For inter-expert collaboration, we evaluate the effect of the backward reflection, the forward thought construction by the Conductor, and the Reducer. As shown in Table 2, after removing the backward reflection component, the accuracy drops significantly from 58.9% to 55.6%, and the RE rate increases noticeably from 7.7% to 12.2%. These results imply that without the reflection mechanism, the system is prone to mistakes in logical reasoning and lacks the ability of self-correction. To evaluate the effect of the Conductor, we replace it with random selection of subsequent experts during the construction of the forward chain of thought. The performance also degrades significantly because the experts are no longer well coordinated, and the random selection of experts may even be detrimental to the reasoning process. It is surprising to find that the Reducer component also contributes remarkably. This module summarizes the collective insights from multiple preceding experts. If we remove it, the answer is extracted from the concatenation of the experts' raw comments, which may lead to incorrect conclusions, as the comments can be fragmented and even conflict with each other.

4.6 PARAMETER SENSITIVITY ANALYSIS

We evaluate two critical parameters related to reasoning capabilities: the number of steps in the forward thought construction and the temperature of the large language model. From the results shown in Figure 3a, we can draw two conclusions. First, a lower value of the temperature parameter tends to lead to better performance. This suggests that, for knowledge-intensive problems, the experts benefit from providing more deterministic and consistent thoughts, rather than creative or diverse ones. Second, a longer reasoning chain that involves more experts in the forward thought construction generally improves accuracy. However, this comes at the cost of higher reasoning time and more API requests. That is why we select temperature = 0 and forward_steps = 5 as the default parameter configuration.

[Figure 3a plots CoE accuracy on LPWP against the number of forward steps (2 to 8) under temperature settings 0.0, 0.3, 0.6, and 0.9. Figure 3b plots the selection frequency of each individual expert (Terminology Interpreter, Parameter Extraction, Variable Extraction, Constraint Extraction, Objective Extraction, Modeling Knowledge, Modeling, LP File Generation, Feasibility Check, Code Example Provider, Programming, Code Reviewer).]

(a) CoE performance under different parameter settings. (b) Selection frequency of individual experts.

Figure 3: Parameter sensitivity analysis and selection frequency analysis on LPWP


Table 3: Robustness of Chain-of-Experts under different large language models

                        GPT-3.5-turbo              GPT-4                      Claude2
Method                  LPWP      ComplexOR        LPWP      ComplexOR        LPWP      ComplexOR
Standard                42.4%     0.5%             47.3%     4.9%             44.9%     0.0%
Reflexion               50.7%     13.5%            53.0%     16.8%            51.4%     12.4%
Chain-of-Experts        58.9%     25.9%            64.2%     31.4%            62.0%     27.0%

4.7 OTHER LLMS AS BASE REASONING MODELS

We also conduct an investigation into the impact of using different LLMs within Chain-of-Experts. We consider GPT-4 and Claude2 as two alternative LLMs and select Standard and Reflexion as two baselines. As shown in Table 3, all methods benefit from the upgrade to more advanced LLMs. However, our Chain-of-Experts approach exhibits the most substantial improvement. For instance, when GPT-4 is used, we observe an accuracy boost of 8.3% on LPWP and 5.5% on ComplexOR, the highest among the three methods.

4.8 EXPERIMENTAL ANALYSIS OF EXPERT SELECTION FREQUENCY

In the final experiment, we aim to gain a deeper understanding of the Conductor's behavior and examine the rationality behind its selection of experts.
First, we conduct experiments on the LPWP dataset and analyze the selection frequency of each expert. In Figure 3b, we observe that the Programming Expert and the Modeling Expert are the two most frequently selected experts. This finding is consistent with the expectation that modeling and programming are crucial to solving OR problems. Additionally, we notice that the extraction of parameters, variables, constraints, and objective functions is rarely selected. This can be attributed to advancements in the language comprehension capabilities of LLMs, which can now understand problem statements directly, without the need for the step-by-step NER used in traditional methods.
Moreover, we study the most frequently sampled collaboration paths involving multiple experts. Each problem-solving process yields a path that represents the order of the experts involved. In Table 4, we observe that when the parameter forward_steps is set to 2, so that only two experts collaborate to solve a problem, the most frequent path is from the Modeling Expert to the Programming Expert. This finding aligns with the importance of these two roles in problem-solving. Additionally, when the number of steps is set to 6, the collaboration path becomes more complex and resembles real-world workflows.

Table 4: The most frequent collaboration paths of experts under different forward-step settings.

Forward steps   Most frequent path
2               Modeling → Programming
3               Knowledge → Modeling → Programming
4               Terminology Interpreter → Knowledge → Modeling → Programming
5               Terminology Interpreter → Modeling → LP File Generator → Programming → Code Reviewer
6               Terminology Interpreter → Modeling → LP File Generator → Programming Example Provider → Programming → Code Reviewer

5 CONCLUSION
In this paper, we presented the first LLM-based solution to complex OR problems. To enhance reasoning capabilities, we devised Chain-of-Experts (CoE), a novel multi-agent cooperative framework. The core of CoE is a conductor orchestrating a group of LLM-based experts via a forward thought construction and backward reflection mechanism. We built a new dataset, ComplexOR, to facilitate OR research and community development. Experimental results indicate that our CoE significantly outperforms the state-of-the-art reasoning methods on both LPWP and ComplexOR.


6 ACKNOWLEDGEMENTS

The work is supported by the National Key Research and Development Project of China
(2022YFF0902000).

REFERENCES

Maciej Besta, Nils Blach, Ales Kubicek, Robert Gerstenberger, Lukas Gianinazzi, Joanna Gajda, Tomasz Lehmann, Michal Podstawski, Hubert Niewiadomski, Piotr Nyczyk, and Torsten Hoefler. Graph of thoughts: Solving elaborate problems with large language models. CoRR, abs/2308.09687, 2023. doi: 10.48550/arXiv.2308.09687. URL https://doi.org/10.48550/arXiv.2308.09687.

JiangLong He, Mamatha N, Shiv Vignesh, Deepak Kumar, and Akshay Uppal. Linear programming word problems formulation using EnsembleCRF NER labeler and T5 text generator with data augmentations, 2022.

Dave Hulbert. Tree of knowledge: ToK aka tree of knowledge dataset for large language models (LLM). https://github.com/dave1010/tree-of-thought-prompting, 2023.

Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 7871–7880, Online, July 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-main.703. URL https://aclanthology.org/2020.acl-main.703.

Guohao Li, Hasan Abed Al Kader Hammoud, Hani Itani, Dmitrii Khizbullin, and Bernard Ghanem. CAMEL: Communicative agents for "mind" exploration of large scale language model society. CoRR, abs/2303.17760, 2023. doi: 10.48550/arXiv.2303.17760. URL https://doi.org/10.48550/arXiv.2303.17760.

Eric Malmi, Yue Dong, Jonathan Mallinson, Aleksandr Chuklin, Jakub Adámek, Daniil Mirylenka, Felix Stahlberg, Sebastian Krause, Shankar Kumar, and Aliaksei Severyn. Text generation with text-editing models. CoRR, abs/2206.07043, 2022. doi: 10.48550/arXiv.2206.07043. URL https://doi.org/10.48550/arXiv.2206.07043.

Neeraj Gangwar and Nickvash Kani. Tagged input and decode all-at-once strategy. https://github.com/MLPgroup/nl4opt-generation, 2022.

Yuting Ning, Jiayu Liu, Longhu Qin, Tong Xiao, Shangzi Xue, Zhenya Huang, Qi Liu, Enhong Chen, and Jinze Wu. A novel approach for auto-formulation of optimization problems, 2023.

Ganesh Prasath and Shirish Karande. Synthesis of mathematical programs from natural language specifications. DL4C Workshop, ICLR, 2023.

Rindra Ramamonjison, Haley Li, Timothy Yu, Shiqi He, Vishnu Rengan, Amin Banitalebi-Dehkordi, Zirui Zhou, and Yong Zhang. Augmenting operations research with auto-formulation of optimization models from problem descriptions. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing: Industry Track, pp. 29–62, Abu Dhabi, UAE, December 2022a. Association for Computational Linguistics. doi: 10.18653/v1/2022.emnlp-industry.4. URL https://aclanthology.org/2022.emnlp-industry.4.

Rindra Ramamonjison, Haley Li, Timothy T. L. Yu, Shiqi He, Vishnu Rengan, Amin Banitalebi-Dehkordi, Zirui Zhou, and Yong Zhang. Augmenting operations research with auto-formulation of optimization models from problem descriptions. In Yunyao Li and Angeliki Lazaridou (eds.), Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing: EMNLP 2022 - Industry Track, Abu Dhabi, UAE, December 7-11, 2022, pp. 29–62. Association for Computational Linguistics, 2022b. doi: 10.18653/v1/2022.emnlp-industry.4. URL https://doi.org/10.18653/v1/2022.emnlp-industry.4.

Noah Shinn, Beck Labash, and Ashwin Gopinath. Reflexion: An autonomous agent with dynamic memory and self-reflection. CoRR, abs/2303.11366, 2023. doi: 10.48550/arXiv.2303.11366. URL https://doi.org/10.48550/arXiv.2303.11366.

Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V. Le, Ed H. Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net, 2023a. URL https://openreview.net/pdf?id=1PL1NIMMrw.

Zhenhailong Wang, Shaoguang Mao, Wenshan Wu, Tao Ge, Furu Wei, and Heng Ji. Unleashing cognitive synergy in large language models: A task-solving agent through multi-persona self-collaboration. CoRR, abs/2307.05300, 2023b. doi: 10.48550/arXiv.2307.05300. URL https://doi.org/10.48550/arXiv.2307.05300.

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V. Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models. In NeurIPS, 2022. URL http://papers.nips.cc/paper_files/paper/2022/hash/9d5609613524ecf4f15af0f7b31abca4-Abstract-Conference.html.

Xuanhe Zhou, Zhaoyan Sun, and Guoliang Li. DB-GPT: Large language model meets database, 2023.

Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L. Griffiths, Yuan Cao, and Karthik Narasimhan. Tree of thoughts: Deliberate problem solving with large language models. CoRR, abs/2305.10601, 2023a. doi: 10.48550/arXiv.2305.10601. URL https://doi.org/10.48550/arXiv.2305.10601.

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R. Narasimhan, and Yuan Cao. ReAct: Synergizing reasoning and acting in language models. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net, 2023b. URL https://openreview.net/pdf?id=WE_vluYUL-X.

Dongxiang Zhang, Ziyang Xiao, Yuan Wang, Mingli Song, and Gang Chen. Neural TSP solver with progressive distillation. In Proceedings of the Thirty-Seventh AAAI Conference on Artificial Intelligence and Thirty-Fifth Conference on Innovative Applications of Artificial Intelligence and Thirteenth Symposium on Educational Advances in Artificial Intelligence, AAAI'23/IAAI'23/EAAI'23. AAAI Press, 2023. ISBN 978-1-57735-880-0. doi: 10.1609/aaai.v37i10.26432. URL https://doi.org/10.1609/aaai.v37i10.26432.

Chuanyang Zheng, Zhengying Liu, Enze Xie, Zhenguo Li, and Yu Li. Progressive-hint prompting improves reasoning in large language models. CoRR, abs/2304.09797, 2023. doi: 10.48550/arXiv.2304.09797. URL https://doi.org/10.48550/arXiv.2304.09797.

Di Zhu, Hailian Yin, Yidan Xu, Jiaqi Wu, Bowen Zhang, Yaqi Cheng, Zhanzuo Yin, Ziqiang Yu, Hao Wen, and Bohan Li. A survey of advanced information fusion system: From model-driven to knowledge-enabled. Data Science and Engineering, 8:1–13, 2023. doi: 10.1007/s41019-023-00209-8.

A APPENDIX

A.1 AN EXAMPLE OF THE COMPLEXOR DATASET

The example shown in Figure 4 is a typical instance of a complex operations research problem. Specifically, it illustrates a Capacitated Lot Sizing Problem, which is much more challenging. This problem involves a wide range of constraints, such as summations, equations, and inequalities. Unlike simpler problems, the objective function in this case is not a straightforward linear expression but rather a summation across multiple set variables. These combined characteristics categorize this problem as a complex optimization challenge.


An example of our dataset


In the context of manufacturing planning, we tackle the Capacitated Multi-level Lot Sizing Problem with Backlogging.
We make the following assumptions in defining and formulating this problem. First, we assume that setup times and
costs are non-sequence dependent, setup carryover between periods is not permitted, and all initial inventories are
zero. Second, all production costs are assumed to be linear in production output and do not vary over time; hence,
they can be dropped from the model for simplicity. Setup and holding costs also are assumed not to vary over time.
Furthermore, end items are assumed to have no successors, and only end items have external demands and
backlogging costs. Finally, we assume zero lead times and no lost sales. It is important to note that all these
assumptions (except setup carryover) are made for ease of exposition only and without loss of generality, i.e., the
theoretical results remain valid even when they are removed. See Ozturk and Ornek (2010) for the lot-sizing
problem with setup carryover as well as with external demands for component items.

Modeling Result

Sets: Periods, M, I, end, eta
Parameters: sc, bc, hc, st, a, gd, Mn, r, C
Variables: x, s, b, y
Constraints:
invBalance1: x_{i,t} + s_{i,t-1} + b_{i,t} - b_{i,t-1} = gd_{i,t} + s_{i,t}   for i ∈ end, t ∈ Periods
invBalance2: x_{i,t} + s_{i,t-1} = gd_{i,t} + Σ_{j ∈ eta_i} r_{i,j} · x_{j,t} + s_{i,t}   for i ∈ I \ end, t ∈ Periods
capConstraints: Σ_{i ∈ I} a_{i,m} · x_{i,t} + Σ_{i ∈ I} st_{i,m} · y_{i,t} ≤ C_{m,t}   for m ∈ M, t ∈ Periods
setupConstraints: x_{i,t} ≤ Mn · y_{i,t}   for i ∈ I, t ∈ Periods
initStatus: s_{i,0} = 0   for i ∈ I
endStatus: b_{i,T} = 0   for i ∈ I
Objective: minimize Σ_{i ∈ I, t ∈ Periods} (sc_i · y_{i,t} + hc_i · s_{i,t}) + Σ_{i ∈ end, t ∈ Periods} bc_i · b_{i,t}

Figure 4: An example from the ComplexOR dataset.
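For reference, a gurobipy program implementing the modeling result above might look roughly as follows. The data values (items, bill-of-materials coefficients, demands, costs, and capacities) are invented toy values used purely to make the sketch runnable; only the constraint and objective structure mirrors the formulation in Figure 4, and the sketch is not output produced by the system.

# Hedged sketch of a gurobipy implementation of the CMLSP with backlogging.
import gurobipy as gp
from gurobipy import GRB

# Toy data (hypothetical, for illustration only).
periods = list(range(1, 5))            # 4 planning periods
items = ["A", "B"]                     # "A" is an end item, "B" is its component
end_items = ["A"]
machines = ["M1"]
succ = {"A": [], "B": ["A"]}           # eta_i: successors of item i
r = {("B", "A"): 2}                    # units of B required per unit of A
gd = {(i, t): (50 if i == "A" else 0) for i in items for t in periods}
sc = {"A": 100, "B": 80}               # setup costs
hc = {"A": 2, "B": 1}                  # holding costs
bc = {"A": 10}                         # backlogging costs (end items only)
a = {(i, "M1"): 1 for i in items}      # unit production times
st = {(i, "M1"): 5 for i in items}     # setup times
C = {("M1", t): 300 for t in periods}  # machine capacities
Mn = 10000                             # big-M constant
T = max(periods)

m = gp.Model("CMLSP_backlogging")
x = m.addVars(items, periods, name="x")                    # production quantity
s = m.addVars(items, [0] + periods, name="s")              # inventory
b = m.addVars(end_items, [0] + periods, name="b")          # backlog
y = m.addVars(items, periods, vtype=GRB.BINARY, name="y")  # setup indicator

# Inventory balance for end items (with backlogging).
m.addConstrs((x[i, t] + s[i, t-1] + b[i, t] - b[i, t-1] == gd[i, t] + s[i, t]
              for i in end_items for t in periods), name="invBalance1")
# Inventory balance for component items (demand driven by successors).
m.addConstrs((x[i, t] + s[i, t-1] ==
              gd[i, t] + gp.quicksum(r[i, j] * x[j, t] for j in succ[i]) + s[i, t]
              for i in items if i not in end_items for t in periods),
             name="invBalance2")
# Machine capacity.
m.addConstrs((gp.quicksum(a[i, mc] * x[i, t] for i in items) +
              gp.quicksum(st[i, mc] * y[i, t] for i in items) <= C[mc, t]
              for mc in machines for t in periods), name="capConstraints")
# Setup forcing.
m.addConstrs((x[i, t] <= Mn * y[i, t] for i in items for t in periods),
             name="setupConstraints")
# Boundary conditions (zero initial inventories; initial backlog assumed zero).
m.addConstrs((s[i, 0] == 0 for i in items), name="initStatus")
m.addConstrs((b[i, 0] == 0 for i in end_items), name="initBacklog")
m.addConstrs((b[i, T] == 0 for i in end_items), name="endStatus")

m.setObjective(
    gp.quicksum(sc[i] * y[i, t] + hc[i] * s[i, t] for i in items for t in periods) +
    gp.quicksum(bc[i] * b[i, t] for i in end_items for t in periods),
    GRB.MINIMIZE)
m.optimize()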

A.2 MORE IMPLEMENTATION DETAILS OF CHAIN-OF-EXPERTS

A.2.1 EXPERTS DESIGN

In this section, we provide an in-depth overview of the individual experts participating in the Chain-
of-Experts framework. Table 5 offers a comprehensive list of these experts, each assigned a specific
role and domain knowledge relevant to OR problem-solving.

Table 5: All experts involved in Chain-of-Experts

Expert name                               Knowledge base
Terminology Interpreter                   Supply Chain Optimization & Scenario Modeling
Parameter Extraction Expert               -
Variable Extraction Expert                -
Constraint Extraction Expert              -
Objective Function Extraction Expert      -
Modeling Knowledge Supplement Expert      GAMS-Cutting Edge Modeling
Modeling Expert                           -
LP File Generation Expert                 LP format Documentation
Constraint Feasibility Check Expert       -
Programming Example Provider              Gurobi Example Tour
Programming Expert                        Gurobi Reference Manual
Code Reviewer                             -

Below, we present the detailed descriptions and prompt template implementations for each expert.
Please note that text enclosed within curly braces signifies placeholders that will be dynamically
populated during runtime based on the specific problem description, comments provided by experts,
and retrieved knowledge.

Terminology Interpreter:


Role description: Provides additional domain-specific knowledge to enhance problem understanding and formulation.
Prompt template: As a domain knowledge terminology interpreter, your role is to provide additional information and insights related to the problem domain. Here are some relevant background knowledge about this problem: {knowledge}. You can contribute by sharing your expertise, explaining relevant concepts, and offering suggestions to improve the problem understanding and formulation. Please provide your input based on the given problem description: {problem}.

Parameter Extraction Expert:
Role description: Proficient in identifying and extracting the relevant parameters from the problem statement.
Prompt template: As a parameter extraction expert, your role is to identify and extract the relevant parameters from the problem statement. Parameters represent the known quantities or input data of the optimization problem. Your expertise in the problem domain will help in accurately identifying and describing these parameters. Please review the problem description and provide the extracted parameters along with their definitions: {problem}.

Variable Extraction Expert:
Role description: Proficient in identifying and extracting relevant variables from the problem statement.
Prompt template: As a variable extraction expert, your role is to identify and extract the relevant variables from the problem statement. Variables represent the unknowns or decision variables in the optimization problem. Your expertise in the problem domain will help in accurately identifying and describing these variables. Please review the problem description and provide the extracted variables along with their definitions: {problem}.

Constraint Extraction Expert:
Role description: Skilled in extracting constraints from the problem description.
Prompt template: As a constraint extraction expert, your role is to identify and extract the constraints from the problem description. Constraints represent the limitations or conditions that need to be satisfied in the optimization problem. Your expertise in the problem domain will help in accurately identifying and formulating these constraints. Please review the problem description and provide the extracted constraints: {problem}. The comments given by your colleagues are as follows: {comments}, please refer to them carefully.

Objective Function Extraction Expert:
Role description: Capable of identifying and extracting the objective function from the problem statement.
Prompt template: You are an expert specialized in Operations Research and Optimization and responsible for objective function extraction. Your role is to identify and extract the objective function from the problem statement. The objective function represents the goal of the optimization problem. Now, the problem description is as follows: {problem}.

Modeling Knowledge Supplement Expert:
Role description: Offers supplementary knowledge related to modeling techniques and best practices.
Prompt template: As a modeling knowledge supplement expert, your role is to provide additional knowledge and insights related to modeling techniques and best practices in the field of Operations Research and Optimization. Here are some relevant background knowledge about modeling technique: {knowledge}. You can contribute by explaining different modeling approaches, suggesting improvements, or sharing relevant tips and tricks. Please provide your input based on the given problem description and the modeling efforts so far: {problem}.


Modeling Expert:
Role description: Proficient in constructing mathematical optimization models based on the extracted information.
Prompt template: You are a modeling expert specialized in the field of Operations Research and Optimization. Your expertise lies in Mixed-Integer Programming (MIP) models, and you possess an in-depth understanding of various modeling techniques within the realm of operations research. At present, you are given an Operations Research problem, alongside additional insights provided by other experts. The goal is to holistically incorporate these inputs and devise a comprehensive model that addresses the given production challenge. Now the original problem is as follows: {problem}. And the modeling so far is as follows: {comments}. Give your model of this problem.

LP File Generation Expert:
Role description: Expertise in generating LP (Linear Programming) files that can be used by optimization solvers.
Prompt template: As an LP file generation expert, your role is to generate LP (Linear Programming) files based on the formulated optimization problem. LP files are commonly used by optimization solvers to find the optimal solution. Here is the important part sourced from the LP file format documentation: {knowledge}. Your expertise in generating these files will help ensure compatibility and efficiency. Please review the problem description and the extracted information and provide the generated LP file: {problem}. The comments given by your colleagues are as follows: {comments}, please refer to them carefully.

Programming Example Provider:
Role description: Provides programming examples and templates to assist in implementing the optimization solution.
Prompt template: As a programming expert in the field of operations research and optimization, you offer programming examples and templates according to the background knowledge: {knowledge}. Now, consider the problem description: {problem}. Could you please comprehend the code snippets extracted in the background knowledge, understand their function, and then give your code example to assist with addressing this problem. The comments given by your colleagues are as follows: {comments}, please refer to them carefully.

Programming Expert:
Role description: Skilled in programming and coding, capable of implementing the optimization solution in a programming language.
Prompt template: You are a Python programmer in the field of operations research and optimization. Your proficiency in utilizing third-party libraries such as Gurobi is essential. In addition to your expertise in Gurobi, it would be great if you could also provide some background in related libraries or tools, like NumPy, SciPy, or PuLP. You are given a specific problem and comments by other experts. You aim to develop an efficient Python program that addresses the given problem. Now the original problem is as follows: {problem}. And the experts along with their comments are as follows: {comments}. Give your Python code directly.
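To illustrate the kind of output expected from this expert, here is a small gurobipy program for the elementary theme-park example in Figure 1. It is our own illustrative solution, not output produced by the system.

# Illustrative gurobipy solution for the elementary NL4Opt example in Figure 1.
import gurobipy as gp
from gurobipy import GRB

m = gp.Model("theme_park")
x = m.addVar(vtype=GRB.INTEGER, name="scooters")    # each carries 2 people
y = m.addVar(vtype=GRB.INTEGER, name="rickshaws")   # each carries 3 people

m.addConstr(y <= 0.4 * (x + y), name="rickshaw_share")  # at most 40% rickshaws
m.addConstr(2 * x + 3 * y >= 300, name="capacity")      # transport 300 visitors
m.setObjective(x, GRB.MINIMIZE)                          # minimize scooters used

m.optimize()
print(f"scooters = {x.X:.0f}, rickshaws = {y.X:.0f}")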

Code Reviewer:
Role description: Conducts thorough reviews of the implemented code to identify any errors, inefficiencies, or areas for improvement.
Prompt template: As a Code Reviewer, your responsibility is to conduct thorough reviews of implemented code related to optimization problems. You will identify possible errors, inefficiencies, or areas for improvement in the code, ensuring that it adheres to best practices and delivers optimal results. Now, here is the problem: {problem}. You are supposed to refer to the comments given by your colleagues from other aspects: {comments}.


A.2.2 THE CONDUCTOR
The role of the Conductor is highly specialized and significant, which necessitates a more intricate prompt design compared to the other experts. The following is the prompt template for the Conductor:

You are a leader of an expert team in the field of operations research. Now, you need to coordinate all the experts you manage so that they can work together to solve a problem.
Next, you will be given a specific OR problem, and your goal is to select the expert you think is the most suitable to ask for insights and suggestions.
Generally speaking, the solution of a complex OR problem requires analysis, information extraction, modeling, and programming to solve the problem. The description of the problem is presented as follows: {problem}
Remember, based on the capabilities of different experts and the current status of the problem-solving process, you need to decide which expert to consult next. The experts' capabilities are described as follows: {experts_info}
Experts that have already been commented include: {commented_experts}
REMEMBER, THE EXPERT MUST BE CHOSEN FROM THE EXISTING LIST ABOVE.
Note that you need to complete the entire workflow within the remaining {remaining_steps} steps.
Now, think carefully about your choice and give your reasons.
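In practice, this template only needs to be filled in and sent to the LLM at every forward step. The small sketch below shows one plausible way to do so and to map the reply back to an expert; the llm_call helper, the expert attributes, and the parsing heuristic are assumptions made for illustration rather than the paper's implementation.

# Illustrative Conductor step: fill the prompt template and parse the choice.
def conductor_select(problem, experts, commented, remaining_steps,
                     template, llm_call):
    experts_info = "\n".join(f"- {e.role}: {e.description}" for e in experts)
    prompt = template.format(
        problem=problem,
        experts_info=experts_info,
        commented_experts=", ".join(commented) or "none",
        remaining_steps=remaining_steps)
    reply = llm_call(prompt)
    # Naive parsing heuristic: pick the first expert whose name appears in the
    # reply; a real implementation would enforce a stricter output format.
    for expert in experts:
        if expert.role.lower() in reply.lower():
            return expert
    return experts[0]   # fallback if nothing matched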

A.2.3 THE REDUCER
The Reducer's role is to serve as a summarizer of all the comments provided by the selected experts. It must meticulously analyze the comments and generate the final answer, which can take various forms, such as a model or a program. The Reducer's prompt template may vary based on the specific type of final answer required. Here is an example of the Reducer's prompt template when the goal is to obtain a program:

You have been assigned the critical task of generating a program to solve the complex operations research problem presented. This program should incorporate the insights and suggestions provided by the selected experts. Your role is to synthesize the information effectively to create a functional program.
The problem is described as follows: {problem}
The comments from other experts are as follows: {comments}
Could you please write Python GUROBI code according to the comments.

A.3 BASELINES' IMPLEMENTATION

To ensure a fair comparison, we have implemented the baseline algorithms following the guidelines in their original papers.
The traditional model, tag-BART, typically requires a training process. If we were to directly use a tag-BART model pretrained on the LPWP dataset to test on the ComplexOR dataset, there would likely be a domain shift. To mitigate this issue, we adopt a two-step approach. First, we pretrain the tag-BART model on the LPWP dataset. This initial pretraining enables the model to acquire basic NER abilities. Next, we fine-tune the pretrained model on an additional set of 30 problems that are similar to the ComplexOR problems. These problems have the same annotation format as the LPWP dataset. By fine-tuning the model on this specific set of problems, we aim to maximize the performance and adapt the model to the requirements of the ComplexOR domain.
For the Standard prompting technique, we leverage the in-context learning ability of the language
model. Following the recommended approach outlined in the OpenAI official documentation, we
design the following prompt template:

You are a Python programmer in the field of operations research and optimization.
Your proficiency in utilizing third-party libraries such as Gurobi is essential. In
addition to your expertise in Gurobi, it would be great if you could also provide
some background in related libraries or tools, like NumPy, SciPy, or PuLP. You
are given a specific problem. You aim to develop an efficient Python program that
addresses the given problem. Now the original problem is as follows: {problem}.
Give your Python code directly.
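
For reference, a minimal sketch of how this Standard prompt could be issued through the OpenAI Python client (v1 API) is shown below. The model name, temperature, and surrounding glue code are assumptions for illustration, not the paper's exact setup; the template body is abbreviated with an ellipsis.

from openai import OpenAI

STANDARD_TEMPLATE = (
    "You are a Python programmer in the field of operations research and optimization. "
    "Your proficiency in utilizing third-party libraries such as Gurobi is essential. ... "
    "Now the original problem is as follows: {problem}. Give your Python code directly."
)

def standard_prompting(problem: str, model: str = "gpt-3.5-turbo") -> str:
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    response = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[{"role": "user", "content": STANDARD_TEMPLATE.format(problem=problem)}],
    )
    return response.choices[0].message.content  # the generated Python program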

Chain-of-Thought is a technique similar to standard prompting but with some additional steps. It
begins with the sentence "Let's think step by step" to guide the model's thought process. After that,
a further summarization step is added, because the output generated by Chain-of-Thought can be
lengthy and fragmented.
For Tree-of-Thoughts and Graph-of-Thoughts, we set the parameters based on the experiments
conducted in the respective papers. The width of exploration is set to 3, and the maximum explo-
ration step is set to 5. We adopt the prompt paradigm proposed by Tree-of-Thoughts
Prompting (Hulbert, 2023). The prompt is designed as follows, where {exploration_step_prompt}
represents the original prompt used in each exploration step:

Imagine three different experts in the field of operations research and optimization
are modeling and writing a program for a hard problem.
All experts will write down 1 step of their thinking, then share it with the group.
Then all experts will go on to the next step, etc.
If any expert realises they’re wrong at any point then they leave.
The problem description is: {problem}
{exploration_step_prompt}
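
To make the width-3, depth-5 setting concrete, the following is a minimal sketch of one way such a breadth-limited exploration loop could be organized. The `call_llm` stub and the scoring heuristic are assumptions for exposition only; they are not the implementation from the Tree-of-Thoughts or Graph-of-Thoughts papers.

WIDTH, MAX_STEPS = 3, 5

def call_llm(prompt: str) -> str:
    return "...next reasoning step..."  # stand-in for an actual LLM call

def score_state(state: str) -> float:
    # Stand-in heuristic; a real system would ask the LLM to rate each partial solution.
    return float(len(state))

def explore(problem: str, exploration_step_prompt: str) -> str:
    frontier = [""]  # each state accumulates the reasoning produced so far
    for _ in range(MAX_STEPS):
        candidates = []
        for state in frontier:
            prompt = (
                "Imagine three different experts in the field of operations research ...\n"
                f"The problem description is: {problem}\n{state}\n{exploration_step_prompt}"
            )
            for _ in range(WIDTH):  # branch each state into WIDTH continuations
                candidates.append(state + "\n" + call_llm(prompt))
        frontier = sorted(candidates, key=score_state, reverse=True)[:WIDTH]  # prune to the best WIDTH
    return frontier[0]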

For Progressive-Hint Prompting, the original implementation is not directly suitable for complex
OR problems. In the original paper, the answer is a single numeric value, which makes it easy
to check consistency across multiple interactions. However, in complex OR problems, the
answer can be a model or a program, neither of which is directly comparable. To address this,
we follow the underlying idea of Progressive-Hint Prompting and make some modifications. We
generate an initial answer and then use an additional interaction with the language model to ask
whether the current answer is the same as the previous one. In this way, the PHP algorithm is
implemented in a more appropriate manner.
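
A minimal sketch of this adapted loop is given below; the prompt wording, the retry budget, and the YES/NO consistency check are our assumptions for illustration.

def call_llm(prompt: str) -> str:
    return "YES"  # stand-in for an actual LLM call

def progressive_hint(problem: str, max_rounds: int = 5) -> str:
    previous = call_llm(f"Model and solve the following OR problem:\n{problem}")
    for _ in range(max_rounds):
        current = call_llm(
            f"Model and solve the following OR problem:\n{problem}\n"
            f"Hint: a previous attempt produced this answer:\n{previous}"
        )
        verdict = call_llm(
            "Do the following two answers describe essentially the same model or program? "
            f"Reply YES or NO.\nAnswer A:\n{previous}\nAnswer B:\n{current}"
        )
        if verdict.strip().upper().startswith("YES"):
            return current  # answers are consistent; stop hinting
        previous = current
    return previous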

In the ReAct approach, there are two main steps: reasoning and acting. In the reasoning step,
we use the same prompt as in CoT to guide the model’s thought process. In the acting step, we
limit the actions to retrieving knowledge from a knowledge base. This is because, in complex OR
problems, accessing and utilizing external knowledge is crucial for making informed decisions. To
ensure a fair comparison, we allow the ReAct agent to access all the knowledge bases mentioned in
Chain-of-Experts.
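
For illustration, a minimal sketch of this restricted ReAct loop might look as follows. The "Action:"/"Observation:" conventions, the retrieval function, and the stub LLM call are assumptions; a real run would query the knowledge bases described in Chain-of-Experts.

def call_llm(prompt: str) -> str:
    return "Thought: I know enough to write the model now."  # stand-in for an actual LLM call

def retrieve_knowledge(query: str, knowledge_base: dict) -> str:
    # Stand-in retrieval by exact key; a real system would use embedding-based search.
    return knowledge_base.get(query, "No relevant knowledge found.")

def react(problem: str, knowledge_base: dict, max_turns: int = 5) -> str:
    trace = f"Problem: {problem}\nLet's think step by step."
    for _ in range(max_turns):
        reply = call_llm(trace)        # the model emits a Thought and, optionally, "Action: <query>"
        trace += "\n" + reply
        if "Action:" in reply:
            query = reply.split("Action:", 1)[1].strip()
            trace += "\nObservation: " + retrieve_knowledge(query, knowledge_base)
        else:
            break                      # no further action requested; the final reply is the answer
    return trace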
The design of Reflexion aligns with the backward reflection process described in Chain-of-Experts.
In Reflexion, feedback is obtained from the compilation and runtime of the modeling program,
allowing for iterative refinement of the previous steps until the agent is satisfied with the answer. It is
worth noting that in our experimental setting, we do not generate unit tests.
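
A sketch of this feedback loop, using the interpreter's error message as the reflection signal, might look as follows; the file handling, retry budget, and stub LLM call are assumptions, not the exact Reflexion implementation.

import subprocess
import tempfile

def call_llm(prompt: str) -> str:
    return "print('placeholder model')"  # stand-in for an actual LLM call

def run_program(code: str) -> str:
    """Execute the candidate program; return an empty string on success, else the error text."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    result = subprocess.run(["python", path], capture_output=True, text=True, timeout=60)
    return "" if result.returncode == 0 else (result.stderr or "non-zero exit status")

def reflexion(problem: str, max_attempts: int = 3) -> str:
    code = call_llm(f"Write a Python Gurobi program for this problem:\n{problem}")
    for _ in range(max_attempts):
        error = run_program(code)
        if not error:
            return code  # the program compiled and ran without errors
        code = call_llm(
            f"Problem:\n{problem}\nPrevious program:\n{code}\n"
            f"The previous program failed with:\n{error}\nPlease reflect on the error and fix the program."
        )
    return code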

A.4 More Experimental Results

A.4.1 Detailed Experiment Results on ComplexOR


Table 6 presents a detailed overview of the performance of the baseline algorithms and the Chain-of-
Experts approach on the ComplexOR dataset. In the interest of brevity, we abbreviate the traditional
tag-BART algorithm as "BART" and use shorthand labels for the other algorithms. The results
highlight the difficulty of the dataset: traditional models such as BART and prompting techniques
such as CoT failed to solve any of the problems. Notably, GoT succeeded on the relatively
straightforward "Blending" problem. Among the methods employing LLM agents, such as ReAct
and Reflexion, several of the less complex problems were solved, but overall performance remained
suboptimal.

Table 6: Detailed experiment results for different methods on ComplexOR

problem BART Standard CoT PHP ToT GoT ReAct Reflexion SPP CoE
Blending ✗ ✗ ✗ ✗ ✗ ✓ ✓ ✓ ✗ ✓
Car Selection ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✓ ✗ ✓
Capacitated Warehouse Location ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗
Employee Assignment ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗
Aircraft Landing ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗
VRPTW Routing ✗ ✗ ✗ ✗ ✗ ✗ ✓ ✗ ✗ ✓
Flowshop ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗
Distribution Center Allocation ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗
Aircraft Assignment ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗
Traffic Equilibrium ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✓
Robot Arm ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗
Largest Small Polygon ✗ ✗ ✗ ✗ ✗ ✗ ✓ ✗ ✗ ✓
CFLP ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗
Cut Problem ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✓
Diet Problem ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗
Dietu Problem ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗
Knapsack ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗
Multi-commodity Transportation ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗
PROD ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗
Single Level Big Bucket ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✗ ✓
Overall 0/20 0/20 0/20 0/20 0/20 1/20 3/20 2/20 0/20 7/20

The Chain-of-Experts (CoE) approach demonstrated
the highest success rate, solving 7 out of 20 problems, making it the most effective algorithm.
We also found that certain algorithms, particularly PHP and SPP, struggled with token limitations
when confronted with complex problems. This token limitation issue not only hindered their perfor-
mance but also incurred increased computational costs and inefficiencies in their execution. In con-
trast, the Chain-of-Experts approach incorporates three key strategies to mitigate token limitation-
related errors. Firstly, the utilization of expert summaries significantly reduces memory stress by
nearly 50%. Secondly, the adoption of a Conductor, as opposed to a round-robin approach, further
reduces the maximum context, effectively halving the token usage. Lastly, the incorporation of a
visible map led to a substantial reduction of approximately 70% in token consumption.

A.4.2 Ablation Experiment on Individual Experts

[Figure 5 shows two bar charts. In each, the x-axis lists the expert removed from CoE (Terminology Interpreter, Parameter Extraction, Variable Extraction, Constraint Extraction, Objective Function Extraction, Modeling Knowledge Supplement, Modeling, Constraint Feasibility Check, LP File Generation, Programming Example Provider, Programming, Code Reviewer), and the y-axis shows the resulting accuracy.]

(a) Expert ablation experiment on LPWP dataset    (b) Expert ablation experiment on ComplexOR dataset

Figure 5: Impact on accuracy when removing individual experts from CoE

Figure 5 presents the results of the ablation experiment conducted on each expert in the Chain-of-
Experts framework. The x-axis labels represent the removal of specific experts from CoE, while
the y-axis represents the accuracy achieved after removing each expert. The blue horizontal line
represents the accuracy when all experts are integrated. Based on the results, we can observe the
following insights regarding the importance of each expert for both datasets.
In Subfigure 5a, which corresponds to the LPWP dataset consisting of easy problems, removing
a single expert does not lead to a significant performance drop. However, the most crucial expert is
the Programming Expert. This finding aligns with the nature of the LPWP dataset, where the final
evaluation is based on the correctness of the program. Therefore, having an expert who can provide
insights into programming is essential. The second most important expert is the Modeling Expert, as
mathematical modeling plays a crucial role in problem-solving.

Example Problem
In the context of manufacturing planning, we tackle the Capacitated Multi-level Lot Sizing Problem with
Backlogging. We make the following assumptions in defining and formulating this problem. First, we assume that
setup times and costs are non-sequence dependent, setup carryover between periods is not permitted, and all
initial inventories are zero. Second, all production costs are assumed to be linear in production output and do not
vary over time; hence, they can be dropped from the model for simplicity. Setup and holding costs also are
assumed not to vary over time. Furthermore, end items are assumed to have no successors, and only end items
have external demands and backlogging costs. Finally, we assume zero lead times and no lost sales. It is important
to note that all these assumptions (except setup carryover) are made for ease of exposition only and without loss
of generality, i.e., the theoretical results remain valid even when they are removed. See Ozturk and Ornek (2010)
for the lot-sizing problem with setup carryover as well as with external demands for component items.

Standard Prompting Result


Sets: $Periods$, $M$, $I$, $end$, $eta$
Parameters: $sc$, $bc$, $hc$, $st$, $a$, $gd$, $Mn$, $r$, $C$
Variables: $x$, $s$, $b$, $y$   [Missing variable]
Constraints:
invBalance1: $x_{i,t} + s_{i,t-1} + b_{i,t} - b_{i,t-1} = gd_{i,t} + s_{i,t}$ for $i \in end$, $t \in Periods$
invBalance2: $x_{i,t} + s_{i,t-1} = gd_{i,t} + \sum_{j \in succ(i)} r_{i,j} \cdot x_{j,t} + s_{i,t}$ for $i \in I \setminus end$, $t \in Periods$
capConstraints: $\sum_{i \in I} a_{i,m} \cdot x_{i,t} + \sum_{i \in I} st_{i,m} \cdot y_{i,t} \le C_{m,t}$ for $m \in M$, $t \in Periods$
setupConstraints: $x_{i,t} \le Mn \cdot y_{i,t}$ for $i \in I$, $t \in Periods$   [Missing constraints]
Objective: minimize $\sum_{i \in I,\, t \in Periods} (sc_i \cdot y_{i,t} + hc_i \cdot s_{i,t})$   [Missing item]

SPP Result
Personas identification: Manufacturing Planner, Demand Planner, Data Analysts, Communication Specialists

Manufacturing Planner: To address this problem effectively, it's crucial to carefully consider production
quantities and schedules while adhering to the defined assumptions, such as zero lead times and no lost sales.

Demand Planner: In this context, the role of the Demand Planner is critical for forecasting external…

Data Analysts: My recommendations are instrumental in helping the Manufacturing Planner make informed…

Communication Specialists: Effective communication is key to solving the Capacitated Multi-level Lot
Sizing Problem successfully. Communication specialists facilitate collaboration…

Result: Variables: $x$, $s$, $b$, $y$; Constraints: … $s_{i,0} = 0$, $b_{i,T} = 0$; Objective: minimize $\sum (sc_i \cdot y_{i,t} + hc_i \cdot s_{i,t})$

CoE Result (orchestrated by the Conductor)

Terminology Interpreter: In this context, … 1. No Setup Carryover: Setup carryover would mean that …;
2. Initial Inventories Zero means …; 3. Zero Lead Times and No Lost Sales: Lead time is the delay…

Variable Extractor: Sets: $Periods$, $M$, $I$, $end$, $eta$; Parameters: $sc$, $bc$, $hc$, $st$, $a$, …; Variables: $x$, $s$, $b$, $y$, $bc$

Modeling Expert: Variables: $x$, $s$, $b$, $y$, $bc$; Constraints: … $s_{i,0} = 0$, $b_{i,T} = 0$; Objective: minimize
$\sum (sc_i \cdot y_{i,t} + hc_i \cdot s_{i,t})$

Programmer: LP file of this modeling: Minimize: obj…; Subject to: invBalance1[i,t]: x[i,t] + s[i,t-1] + b[i,t]
- b[i,t-1] = gd[i,t] + s[i,t] for all i in end, t in Periods…
Run the file and get a model-unsolvable error. Start backward reflection to adjust the modeling…

Result: Variables: $x$, $s$, $b$, $y$, $bc$; Constraints: … $s_{i,0} = 0$, $b_{i,T} = 0$; Objective: minimize
$\sum (sc_i \cdot y_{i,t} + hc_i \cdot s_{i,t}) + \sum_{i \in end,\, t \in Periods} bc_i \cdot b_{i,t}$

Figure 6: Case study


Subfigure 5b shows that individual experts have a much more significant impact on more challenging
problems. Apart from the Programming Expert and Modeling Expert, the removal of the Terminol-
ogy Interpreter leads to a significant drop of approximately 20% in accuracy. This result highlights
the knowledge-intensive nature of the ComplexOR dataset, which heavily relies on the supplementa-
tion of external knowledge. Interestingly, the LP File Generator Expert also proves to be important.
This finding suggests that for harder problems, utilizing LP files as an efficient intermediate struc-
tured representation of the model is a good approach, as it yields better results than writing Python
Gurobi programs directly.

A.4.3 Case Study


In this experiment, we conducted a detailed case study to gain insight into the effectiveness of
our approach, as depicted in Figure 6. To reduce the uncertainty of sampling, we ran each
method five times and based our findings on the majority response. In this case, the standard
prompting approach failed to correctly identify the variable bc as the backlogging cost, due to
insufficient knowledge about the Multi-level Lot Sizing Problem. It also lacked the constraints on
initial and end statuses, which are essential in the context of a real manufacturing process.
Moreover, the objective function was missing critical terms, which made the resulting model
unsolvable. SPP provided some basic background knowledge through a Manufacturing Planner
created by the leader persona. However, this method had limitations: three out of four personas
offered negligible assistance in solving the problem, which is a critical issue in domain-specific
problems like OR. In CoE, the Conductor effectively selected appropriate experts for different
stages of the problem-solving process. Initially, a Terminology Interpreter was chosen to provide
essential background knowledge. Although the Modeling Expert initially repeated the same
mistake regarding the objective function, this error was rectified in the backward reflection process.
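
To make the corrected formulation in Figure 6 concrete, the following is a minimal gurobipy sketch of the lot-sizing model with the backlogging term restored in the objective. The toy data, the single-machine simplification, and all names are illustrative assumptions; this is not the exact program produced by CoE.

import gurobipy as gp
from gurobipy import GRB

items, end_items, periods = ["A", "B"], ["A"], [1, 2, 3]
sc = {"A": 50.0, "B": 30.0}                      # setup costs
hc = {"A": 2.0, "B": 1.0}                        # holding costs
bc = {"A": 10.0}                                 # backlogging costs (end items only)
gd = {(i, t): (20.0 if i == "A" else 0.0) for i in items for t in periods}  # external demand
r = {("B", "A"): 2.0}                            # units of component B per unit of end item A
a = {i: 1.0 for i in items}                      # capacity usage per unit produced
st = {i: 5.0 for i in items}                     # setup time
cap, Mbig = 100.0, 1000.0

m = gp.Model("clsp_backlogging")
x = m.addVars(items, periods, name="x")                      # production quantity
s = m.addVars(items, [0] + periods, name="s")                # inventory
b = m.addVars(end_items, [0] + periods, name="b")            # backlog (end items only)
y = m.addVars(items, periods, vtype=GRB.BINARY, name="y")    # setup indicator

m.addConstrs((s[i, 0] == 0 for i in items), name="initInv")
m.addConstrs((b[i, 0] == 0 for i in end_items), name="initBacklog")
m.addConstrs((b[i, periods[-1]] == 0 for i in end_items), name="endBacklog")

# Inventory balance: end items carry backlog; components face internal demand from their successors.
m.addConstrs((x[i, t] + s[i, t - 1] + b[i, t] - b[i, t - 1] == gd[i, t] + s[i, t]
              for i in end_items for t in periods), name="invBalanceEnd")
m.addConstrs((x[i, t] + s[i, t - 1] ==
              gd[i, t] + gp.quicksum(r[i, j] * x[j, t] for j in items if (i, j) in r) + s[i, t]
              for i in items if i not in end_items for t in periods), name="invBalanceComp")

# Single-machine capacity and setup-forcing constraints.
m.addConstrs((gp.quicksum(a[i] * x[i, t] + st[i] * y[i, t] for i in items) <= cap
              for t in periods), name="capacity")
m.addConstrs((x[i, t] <= Mbig * y[i, t] for i in items for t in periods), name="setup")

# Objective: setup and holding costs, plus the backlogging term that standard prompting missed.
m.setObjective(gp.quicksum(sc[i] * y[i, t] + hc[i] * s[i, t] for i in items for t in periods)
               + gp.quicksum(bc[i] * b[i, t] for i in end_items for t in periods), GRB.MINIMIZE)
m.optimize()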

