Multi-Modal and Multi-Agent Systems Meet Rationality: A Survey

Bowen Jiang, Yangxinyu Xie, Xiaomeng Wang, Weijie J. Su, Camillo J. Taylor, Tanwi Mallick
Figure 1. The evolutionary tree of multi-agent and/or multi-modal systems related to the four axioms of rationality. Many proposed approaches strive to address multiple axioms simultaneously. Bold fonts mark works that involve multi-modalities. This tree also includes foundational works to provide a clearer temporal reference.
perspective, simulating the dynamics of discussion in human societies. Multi-agent systems can also incorporate multi-modal agents and agents specialized in querying external knowledge sources or tools (Lewis et al., 2020; Schick et al., 2024; Tang et al., 2023; Pan et al., 2024) to overcome hallucinations, ensuring that their results are more robust, deterministic, and trustworthy, thus significantly improving the quality of the generated responses towards rationality.
This survey provides a unique lens to interpret the underlying motivations behind current multi-modal and/or multi-agent systems. Drawing from cognitive science, we first delineate four fundamental requirements for rational thinking. We then discuss how research fields within the multi-modality and multi-agent literature are progressing towards rationality by inherently improving these criteria. We posit that such advancements are bridging the gap between the performance of these systems and the expectations for a rational thinker, in contrast to traditional single-agent language-only models. We hope this survey can inspire further research at the intersection between agent systems and cognitive science.
2. Defining Rationality

A rational agent should avoid reaching contradictory conclusions in its decision-making processes, respecting the physical and factual reality of the world in which it operates. Therefore, drawing on foundational works in rational decision-making (Tversky & Kahneman, 1988; Hastie & Dawes, 2009; Eisenführ et al., 2010), this section adopts an axiomatic approach to define rationality, presenting four substantive axioms that we expect a rational agent or agent system to fulfill:

Grounding The decision of a rational agent is grounded in the physical and factual reality. In order to make a sound decision, the agent must be able to integrate sufficient and accurate information from different sources and modalities grounded in reality without hallucination. While this requirement is generally not explicitly stated in the cognitive science literature when defining rationality, it is implicitly assumed, as most humans have access to physical reality through multiple sensory signals.

Orderability of Preferences When comparing alternatives in a decision scenario, a rational agent can rank the options based on the current state and ultimately select the most preferred one based on the expected outcomes. This orderability consists of several key principles, including comparability, transitivity, closure, solvability, etc., with details in Appendix A. The orderability of preferences ensures that the agent can make consistent and logical choices when faced with multiple alternatives. LLM-based evaluations heavily rely on this property, as discussed in Appendix B.
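To make this axiom concrete, the following is a minimal sketch of a transitivity audit over pairwise choices; prefer is a hypothetical stand-in for an LLM pairwise judgment, stubbed here so the snippet runs deterministically.

```python
from itertools import permutations

def prefer(a: str, b: str) -> str:
    """Hypothetical stand-in for an LLM pairwise judgment; replace the
    body with a model call. This stub is transitive by construction."""
    return min(a, b)

def transitivity_violations(options: list[str]) -> list[tuple]:
    """Collect triples (a, b, c) where a is preferred to b and b to c,
    yet c is preferred to a, violating the transitivity principle."""
    return [
        (a, b, c)
        for a, b, c in permutations(options, 3)
        if prefer(a, b) == a and prefer(b, c) == b and prefer(a, c) == c
    ]

print(transitivity_violations(["plan A", "plan B", "plan C"]))  # [] if consistent
```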
Independence from irrelevant context The agent's preference should not be influenced by information irrelevant to the decision problem at hand. LLMs have been shown to exhibit irrational behavior when presented with irrelevant context (Shi et al., 2023; Wu et al., 2024; Liu et al., 2024c), leading to confusion and suboptimal decisions. To ensure rationality, an agent must be able to identify and disregard irrelevant information, focusing solely on the factors that directly impact the decision-making process.

Invariance The preference of a rational agent remains invariant across equivalent representations of the decision problem, regardless of specific wordings or modalities.
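A simple way to probe this axiom is to pose the same decision problem under equivalent surface forms and check that the answer set collapses to a single decision. The sketch below is illustrative only; decide is a hypothetical stand-in for an agent call, and the paraphrases are invented.

```python
def decide(prompt: str) -> str:
    """Hypothetical agent call; replace with a real LLM/agent query.
    The stub keys only on the decision-relevant fact, mimicking invariance."""
    return "umbrella" if "rain" in prompt.lower() else "sunscreen"

# Equivalent re-wordings of one decision problem.
paraphrases = [
    "It will rain today. Should I bring an umbrella or sunscreen?",
    "Umbrella or sunscreen, given that rain is forecast today?",
    "Today's forecast says rain. Which should I pack: sunscreen or an umbrella?",
]

answers = {decide(p) for p in paraphrases}
print("invariant" if len(answers) == 1 else f"violation: {answers}")
```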
3. Scope

Unlike existing surveys (Han et al., 2024; Guo et al., 2024; Xie et al., 2024a; Durante et al., 2024; Cui et al., 2024; Xu et al., 2024; Zhang et al., 2024a; Cheng et al., 2024; Li et al., 2024a) that focus on the components, structures, agent profiling, planning, communications, memories, and applications of multi-modal and/or multi-agent systems, this survey is the first to specifically examine the increasingly important relationship between rationality and these multi-modal and multi-agent systems, exploring how they contribute to enhancing rationality in decision making. We emphasize that rationality, by definition, is not equivalent to reasoning or Theory of Mind, although they are deeply intertwined. We leave explanations to Appendix C.
4. Towards Rationality through Multi-Modal and Multi-Agent Systems

This section surveys recent advancements in multi-modal and multi-agent systems, categorized by their fields as depicted in Figure 1. Each category of research, such as knowledge retrieval or neuro-symbolic reasoning, addresses one or more fundamental requirements for rational thinking. These rationality requirements are typically intertwined; therefore, an approach that enhances one aspect of rationality often inherently improves others simultaneously. Meanwhile, the overall goal of current multi-agent systems in achieving rationality can usually be distilled into two key concepts: deliberation and abstraction. Deliberation encourages slower reasoning processes such as brainstorming and reflection, while abstraction refers to boiling down the problem to its logical essence, such as calling tool APIs or incorporating neuro-symbolic reasoning agents.

Most existing studies do not explicitly base their frameworks on rationality in their original writings. Our analysis aims to reinterpret these works through the lens of our four axioms of rationality, offering a novel perspective that bridges existing methodologies with rational principles.

4.1. Towards Grounding through Multi-Modal Models

Multi-modal approaches aim to improve information grounding across various channels, such as language and vision. By incorporating multi-modal models (Radford et al., 2021; Alayrac et al., 2022; Awadalla et al., 2023; Liu et al., 2024a; 2023a; Wang et al., 2023c; Zhu et al., 2023a; OpenAI, 2023; 2024; Reid et al., 2024), multi-agent systems can greatly expand their capabilities, enabling a richer, more accurate, and contextually aware interpretation of the environment. For example, Chain-of-Action (Pan et al., 2024) advances the single-modal Search-in-the-Chain (Xu et al., 2023) by supporting multi-modal data retrieval for faithful question answering. We leave more discussions to Appendix D.
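As a sketch of this pattern (not the pipeline of any specific surveyed system), a system can route each input modality to a dedicated model and let a language agent reason only over the grounded evidence; caption_image and answer below are hypothetical stand-ins for a vision-language model and an LLM call.

```python
from dataclasses import dataclass

@dataclass
class Observation:
    text: str | None = None
    image_path: str | None = None

def caption_image(path: str) -> str:
    """Hypothetical vision-language model call that grounds the visual channel."""
    return f"[description of {path}]"

def answer(question: str, evidence: list[str]) -> str:
    """Hypothetical LLM call that reasons only over the grounded evidence."""
    return f"Answer to {question!r} based on {len(evidence)} grounded fact(s)."

def grounded_answer(question: str, obs: Observation) -> str:
    evidence = []
    if obs.text:
        evidence.append(obs.text)                        # textual channel
    if obs.image_path:
        evidence.append(caption_image(obs.image_path))   # visual channel
    return answer(question, evidence)

print(grounded_answer("What is on the table?", Observation(image_path="scene.jpg")))
```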
4.2. Towards Grounding through Knowledge Retrieval

The existing transformer architecture (Vaswani et al., 2017) fundamentally limits how much information LLMs can hold. As a result, in the face of uncertainty, LLMs often hallucinate (Bang et al., 2023; Guerreiro et al., 2023; Huang et al., 2023), generating outputs that are not supported by the factual reality of the environment. Retrieval-Augmented Generation (RAG) (Lewis et al., 2020) marks a significant milestone in addressing this inherent limitation of LLMs. A multi-agent system can include planning agents in its framework, which determine how and where to retrieve external knowledge, and what specific information to acquire. External knowledge sources could include a knowledge graph (Gardères et al., 2020; Hogan et al., 2021), a database (Lu et al., 2024; Xie et al., 2024b), and more. Additionally, the system can have summarizing agents that utilize retrieved knowledge to enrich the system's language outputs with better factuality. For example, thanks to the external knowledge base, ReAct (Yao et al., 2022b) reduces the false positive rate from hallucination by 8.0% compared to CoT (Wei et al., 2022). We provide a detailed survey of how multi-agent systems surpass single-agent baselines in Appendix E.
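A minimal sketch of this planner/retriever/summarizer division of labor follows; the components are hypothetical stand-ins rather than the API of any surveyed framework, and the summarizer refuses to answer when retrieval returns nothing, trading coverage for grounding.

```python
# Toy stand-in for an external knowledge source (knowledge graph, database, ...).
KNOWLEDGE_BASE = {
    "capital of france": "Paris is the capital of France.",
}

def plan(query: str) -> str:
    """Planning agent: decide what to retrieve (here, a normalized query)."""
    return query.lower().strip("?").strip()

def retrieve(key: str) -> list[str]:
    """Retriever: fetch supporting passages; empty if nothing matches."""
    return [v for k, v in KNOWLEDGE_BASE.items() if k in key]

def summarize(query: str, passages: list[str]) -> str:
    """Summarizing agent: answer strictly from retrieved evidence."""
    if not passages:
        return "I don't know."  # refuse rather than hallucinate
    return f"{passages[0]} (grounded in {len(passages)} retrieved passage(s))"

query = "What is the capital of France?"
print(summarize(query, retrieve(plan(query))))
```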
4.3. Towards Grounding & Invariance & Independence from Irrelevant Contexts through Tool Utilization

Similar to knowledge retrieval, Toolformer (Schick et al., 2024) opens a new era that allows LLMs to use external tools via API calls following predefined syntax, effectively extending their capabilities beyond their intrinsic limitations and enforcing consistent and predictable outputs. A multi-agent system can understand when and which tool to use, which modality of information the tool should expect, how to call the corresponding API, and how to incorporate outputs from the API calls, which anchors subsequent reasoning processes with more accurate information beyond their parametric memory. For example, VisProg (Gupta & Kembhavi, 2023) generates Python programs to reliably execute subroutines. We provide more examples in Appendix F.

In most cases, utilizing tools requires translating natural language queries into API calls with predefined syntax. Once the planning agent has determined the APIs and their input arguments, the original queries that may contain irrelevant context become invisible to the tools, and the tools will ignore any variance in the original queries as long as they share the equivalent underlying logic. This improves invariance to noisy queries and independence from irrelevant context. Examples are shown in Appendix F.
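The following sketch illustrates why this abstraction confers invariance and independence: once a (hypothetical) parser maps a query onto a fixed tool schema, two differently worded queries with the same underlying logic produce identical API calls.

```python
import re

def parse_call(query: str) -> dict:
    """Hypothetical planner step: extract only the arguments the tool's
    schema needs; everything else in the query becomes invisible to it."""
    numbers = [float(x) for x in re.findall(r"-?\d+\.?\d*", query)]
    return {"tool": "add", "args": numbers[:2]}

def add_tool(args: list[float]) -> float:
    """Deterministic tool: same arguments, same output, regardless of wording."""
    return sum(args)

q1 = "My lucky color is blue. What is 17 plus 25?"   # irrelevant context
q2 = "Please compute 17 + 25 for me, thanks!"        # different wording
assert parse_call(q1)["args"] == parse_call(q2)["args"] == [17.0, 25.0]
print(add_tool(parse_call(q1)["args"]))  # 42.0 for both phrasings
```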
4.4. Towards Orderability of Preferences & Invariance & Independence from Irrelevant Context through Neuro-Symbolic Reasoning

A multi-agent system incorporating symbolic modules can not only understand language queries but also solve them with a level of consistency, providing a faithful and transparent reasoning process based on well-defined rules and logical principles, which is unachievable by LLMs alone. Logic-LM (Pan et al., 2023), for example, combines problem formulating, symbolic reasoning, and summarizing agents, where the symbolic reasoner empowers LLMs with deterministic symbolic solvers to perform inference, ensuring a correct answer is consistently chosen. These modules typically expect standardized input formats, enhancing invariance and independence similar to API calls of tool usage. More examples are included in Appendix G.
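A minimal sketch of the formulate-then-solve pattern follows, inspired by but not reproducing Logic-LM: formulate is a hypothetical stand-in for the LLM's translation into symbolic form, while the solver is a small deterministic forward-chaining loop.

```python
def formulate(query: str):
    """Hypothetical LLM step: parse natural language into facts and rules."""
    facts = {("human", "socrates")}
    rules = [(("human", "X"), ("mortal", "X"))]  # human(X) -> mortal(X)
    return facts, rules

def solve(facts: set, rules: list) -> set:
    """Deterministic forward chaining: apply rules until a fixed point,
    so the same symbolic input always yields the same conclusions."""
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        for (premise_pred, _), (concl_pred, _) in rules:
            for pred, arg in list(derived):
                if pred == premise_pred and (concl_pred, arg) not in derived:
                    derived.add((concl_pred, arg))
                    changed = True
    return derived

facts, rules = formulate("Is Socrates mortal?")
print(("mortal", "socrates") in solve(facts, rules))  # True, on every run
```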
4.5. Towards Orderability of Preferences & Invariance through Reflection, Debate, and Prompt Strategies

Single agents with self-reflection prompting (Shinn et al., 2023) and multi-agent systems that promote debate and consensus can help align outputs more closely with deliberate and logical decision-making, thus enhancing rational reasoning. For instance, Corex (Sun et al., 2023) finds that orchestrating multiple agents to work together yields better complex reasoning results, exceeding strong single-agent baselines (Wang et al., 2022b) by an average of 1.1-10.6%. More similar results are discussed in Appendix H. These collaborative approaches, in summary, allow each agent in a system to compare and rank its preference on choices from its own or from other agents through critical judgments. This helps the system discern and output the most dominant decision as a consensus, thereby improving the orderability of preferences. At the same time, through such a slow and critical thinking process, errors in initial responses or input prompts are more likely to be detected and corrected. Accumulated experience from past planning errors contributes to a self-evolving process within the multi-agent system (Zhang et al., 2024b), resulting in a final response or a consensus that is less sensitive to specific wording or token bias, moving the response towards better invariance.
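Distilled to its simplest form, the consensus step can be read as a vote over independently produced answers; real debate frameworks interleave rounds of critique before voting. agent_answer below is a hypothetical stub so the sketch runs.

```python
from collections import Counter

def agent_answer(agent_id: int, question: str) -> str:
    """Hypothetical per-agent LLM call; stubbed with one dissenting agent."""
    return "42" if agent_id != 1 else "41"

def consensus(question: str, n_agents: int = 5) -> str:
    """Collect independent answers, then keep the dominant one as consensus."""
    votes = Counter(agent_answer(i, question) for i in range(n_agents))
    answer, count = votes.most_common(1)[0]
    return f"{answer} ({count}/{n_agents} agents agree)"

print(consensus("What is 6 x 7?"))  # 42 (4/5 agents agree)
```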
5. Open Problems and Future Directions

This survey builds connections between multi-modal and multi-agent systems and rationality, guided by the four axioms we expect a rational agent or agent system to satisfy: information grounding, orderability of preferences, independence from irrelevant context, and invariance across equivalent representations. Our findings suggest that grounding can usually be enhanced by multi-modalities, world models, knowledge retrieval, and tool utilization. The remaining three axioms are typically intertwined and could be improved by achievements in multi-modalities, tool utilization, neuro-symbolic reasoning, self-reflection, and multi-agent collaborations.

Inherent Rationality It is important to understand that integrating most of these agents or modules with LLMs still does not inherently make LLMs more rational. Current methods are neither sufficient nor necessary; rather, they serve as instrumental tools that bridge the gap between an LLM's response and rationality. These approaches enable multi-agent systems, which are black boxes from the user's perspective, to more closely mimic rational thinking in their output responses. However, despite these more rational responses elicited from multi-modal and multi-agent systems, the challenge of how to effectively close the loop and bake these enhanced outputs back into the LLMs (Zhao et al., 2024), beyond mere fine-tuning, remains an open topic. In other words, can we leverage these more rational outputs to inherently enhance a single foundation model's rationality in its initial responses in future applications?

Encouraging More Multi-Modal Agents in Multi-Agent Systems Research into the integration of multi-modality within multi-agent systems would be promising. Fields such as multi-agent debate and neuro-symbolic reasoning, as shown in Figure 1, currently under-utilize the potential of multi-modal sensory inputs. We believe that expanding the role of multi-modalities, including but not limited to vision, sound, and structured data, could significantly enhance the capabilities and rationality of multi-agent systems.

Evaluation on Rationality Benchmarks on rationality are scarce. Future research should prioritize the development of benchmarks specifically tailored to assess rationality, going beyond existing ones on accuracy. These new benchmarks should avoid data contamination and emphasize tasks that demand consistent reasoning across diverse representations.
References

Apperly, I. A. and Butterfill, S. A. Do humans have two systems to track beliefs and belief-like states? Psychological Review, 116(4):953, 2009.

Awadalla, A., Gao, I., Gardner, J., Hessel, J., Hanafy, Y., Zhu, W., Marathe, K., Bitton, Y., Gadre, S., Sagawa, S., et al. Openflamingo: An open-source framework for training large autoregressive vision-language models. arXiv preprint arXiv:2308.01390, 2023.

Bai, Y., Ying, J., Cao, Y., Lv, X., He, Y., Wang, X., Yu, J., Zeng, K., Xiao, Y., Lyu, H., et al. Benchmarking foundation models with language-model-as-an-examiner. Advances in Neural Information Processing Systems, 36, 2024.

Bang, Y., Cahyawijaya, S., Lee, N., Dai, W., Su, D., Wilie, B., Lovenia, H., Ji, Z., Yu, T., Chung, W., et al. A multitask, multilingual, multimodal evaluation of chatgpt on reasoning, hallucination, and interactivity. arXiv preprint arXiv:2302.04023, 2023.

Bao, H., Wang, W., Dong, L., Liu, Q., Mohammed, O. K., Aggarwal, K., Som, S., Piao, S., and Wei, F. Vlmo: Unified vision-language pre-training with mixture-of-modality-experts. Advances in Neural Information Processing Systems, 35:32897–32912, 2022.

Besta, M., Blach, N., Kubicek, A., Gerstenberger, R., Podstawski, M., Gianinazzi, L., Gajda, J., Lehmann, T., et al. Graph of thoughts: Solving elaborate problems with large language models. In Proceedings of the AAAI Conference on Artificial Intelligence, 2024.

Chen, X., Wang, X., Changpinyo, S., Piergiovanni, A., Padlewski, P., Salz, D., Goodman, S., Grycner, A., Mustafa, B., Beyer, L., et al. Pali: A jointly-scaled multilingual language-image model. arXiv preprint arXiv:2209.06794, 2022.

Chen, Y., Wang, R., Jiang, H., Shi, S., and Xu, R. Exploring the use of large language models for reference-free text quality evaluation: A preliminary empirical study. arXiv preprint arXiv:2304.00723, 2023.

Cheng, Y., Zhang, C., Zhang, Z., Meng, X., Hong, S., Li, W., Wang, Z., Wang, Z., Yin, F., Zhao, J., et al. Exploring large language model based intelligent agents: Definitions, methods, and prospects. arXiv preprint arXiv:2401.03428, 2024.

Cheng, Z., Xie, T., Shi, P., Li, C., Nadkarni, R., Hu, Y., Xiong, C., Radev, D., Ostendorf, M., Zettlemoyer, L., et al. Binding language models in symbolic languages. arXiv preprint arXiv:2210.02875, 2022.

Cheong, I., Xia, K., Feng, K., Chen, Q. Z., and Zhang, A. X. (a) i am not a lawyer, but...: Engaging legal experts towards responsible llm policies for legal advice. arXiv preprint arXiv:2402.01864, 2024.

Chern, S., Fan, Z., and Liu, A. Combating adversarial attacks with multi-agent debate. arXiv preprint arXiv:2401.05998, 2024.
Chiang, C.-H. and Lee, H.-y. A closer look into automatic evaluation using large language models. arXiv preprint arXiv:2310.05657, 2023.

Cohen, R., Hamri, M., Geva, M., and Globerson, A. Lm vs lm: Detecting factual errors via cross examination. arXiv preprint arXiv:2305.13281, 2023.

Cui, C., Ma, Y., Cao, X., Ye, W., Zhou, Y., Liang, K., Chen, J., Lu, J., Yang, Z., Liao, K.-D., et al. A survey on multimodal large language models for autonomous driving. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 958–979, 2024.

Deng, X., Gu, Y., Zheng, B., Chen, S., Stevens, S., Wang, B., Sun, H., and Su, Y. Mind2web: Towards a generalist agent for the web. Advances in Neural Information Processing Systems, 36, 2024.

Du, Y., Li, S., Torralba, A., Tenenbaum, J. B., and Mordatch, I. Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325, 2023.

Durante, Z., Huang, Q., Wake, N., Gong, R., Park, J. S., Sarkar, B., Taori, R., Noda, Y., Terzopoulos, D., Choi, Y., et al. Agent ai: Surveying the horizons of multimodal interaction. arXiv preprint arXiv:2401.03568, 2024.

Echterhoff, J., Liu, Y., Alessa, A., McAuley, J., and He, Z. Cognitive bias in high-stakes decision-making with llms. arXiv preprint arXiv:2403.00811, 2024.

Eisenführ, F., Weber, M., and Langer, T. Rational decision making. Springer, 2010.

Fan, L., Wang, G., Jiang, Y., Mandlekar, A., Yang, Y., Zhu, H., Tang, A., Huang, D.-A., Zhu, Y., and Anandkumar, A. Minedojo: Building open-ended embodied agents with internet-scale knowledge. Advances in Neural Information Processing Systems, 35:18343–18362, 2022.

Fang, M., Deng, S., Zhang, Y., Shi, Z., Chen, L., Pechenizkiy, M., and Wang, J. Large language models are neurosymbolic reasoners. arXiv preprint arXiv:2401.09334, 2024.

Fu, J., Ng, S.-K., Jiang, Z., and Liu, P. Gptscore: Evaluate as you desire. arXiv preprint arXiv:2302.04166, 2023.

Furuta, H., Nachum, O., Lee, K.-H., Matsuo, Y., Gu, S. S., and Gur, I. Multimodal web navigation with instruction-finetuned foundation models. arXiv preprint arXiv:2305.11854, 2023.

Gao, C., Lan, X., Lu, Z., Mao, J., Piao, J., Wang, H., Jin, D., and Li, Y. S^3: Social-network simulation system with large language model-empowered agents. arXiv preprint arXiv:2307.14984, 2023a.

Gao, D., Ji, L., Zhou, L., Lin, K. Q., Chen, J., Fan, Z., and Shou, M. Z. Assistgpt: A general multi-modal assistant that can plan, execute, inspect, and learn. arXiv preprint arXiv:2306.08640, 2023b.

Gao, M., Ruan, J., Sun, R., Yin, X., Yang, S., and Wan, X. Human-like summarization evaluation with chatgpt. arXiv preprint arXiv:2304.02554, 2023c.

Gardères, F., Ziaeefard, M., Abeloos, B., and Lecue, F. Conceptbert: Concept-aware representation for visual question answering. In Findings of the Association for Computational Linguistics: EMNLP 2020, pp. 489–498, 2020.

Gravitas, S. Autogpt. Python. https://github.com/Significant-Gravitas/Auto-GPT, 2023.

Guerreiro, N. M., Alves, D. M., Waldendorf, J., Haddow, B., Birch, A., Colombo, P., and Martins, A. F. Hallucinations in large multilingual translation models. Transactions of the Association for Computational Linguistics, 11:1500–1517, 2023.

Guo, T., Chen, X., Wang, Y., Chang, R., Pei, S., Chawla, N. V., Wiest, O., and Zhang, X. Large language model based multi-agents: A survey of progress and challenges. arXiv preprint arXiv:2402.01680, 2024.

Gupta, T. and Kembhavi, A. Visual programming: Compositional visual reasoning without training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14953–14962, 2023.

Gur, I., Furuta, H., Huang, A., Safdari, M., Matsuo, Y., Eck, D., and Faust, A. A real-world webagent with planning, long context understanding, and program synthesis. arXiv preprint arXiv:2307.12856, 2023.

Hagendorff, T. Machine psychology: Investigating emergent capabilities and behavior in large language models using psychological methods. arXiv preprint arXiv:2303.13988, 2023.

Han, S., Zhang, Q., Yao, Y., Jin, W., Xu, Z., and He, C. Llm multi-agent systems: Challenges and open problems. arXiv preprint arXiv:2402.03578, 2024.

Hastie, R. and Dawes, R. M. Rational choice in an uncertain world: The psychology of judgment and decision making. Sage Publications, 2009.

He, K., Mao, R., Lin, Q., Ruan, Y., Lan, X., Feng, M., and Cambria, E. A survey of large language models for healthcare: from data, technology, and applications to accountability and ethics. arXiv preprint arXiv:2310.05694, 2023.
Hogan, A., Blomqvist, E., Cochez, M., d'Amato, C., Melo, G. D., Gutierrez, C., Kirrane, S., Gayo, J. E. L., Navigli, R., Neumaier, S., et al. Knowledge graphs. ACM Computing Surveys (CSUR), 54(4):1–37, 2021.

Hong, S., Zheng, X., Chen, J., Cheng, Y., Wang, J., Zhang, C., Wang, Z., Yau, S. K. S., Lin, Z., Zhou, L., et al. Metagpt: Meta programming for multi-agent collaborative framework. arXiv preprint arXiv:2308.00352, 2023a.

Hong, W., Wang, W., Lv, Q., Xu, J., Yu, W., Ji, J., Wang, Y., Wang, Z., Dong, Y., Ding, M., et al. Cogagent: A visual language model for gui agents. arXiv preprint arXiv:2312.08914, 2023b.

Hsu, J., Mao, J., Tenenbaum, J., and Wu, J. What's left? concept grounding with logic-enhanced foundation models. Advances in Neural Information Processing Systems, 36, 2024.

Hu, Z., Iscen, A., Sun, C., Chang, K.-W., Sun, Y., Ross, D., Schmid, C., and Fathi, A. Avis: Autonomous visual information seeking with large language model agent. Advances in Neural Information Processing Systems, 36, 2024.

Huang, J. and Chang, K. C.-C. Towards reasoning in large language models: A survey. arXiv preprint arXiv:2212.10403, 2022.

Huang, L., Yu, W., Ma, W., Zhong, W., Feng, Z., Wang, H., Chen, Q., Peng, W., Feng, X., Qin, B., et al. A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions. arXiv preprint arXiv:2311.05232, 2023.

Jiang, B., Zhuang, Z., Shivakumar, S. S., Roth, D., and Taylor, C. J. Multi-agent vqa: Exploring multi-agent foundation models in zero-shot visual question answering. arXiv preprint arXiv:2403.14783, 2024.

Kang, H. and Liu, X.-Y. Deficiency of large language models in finance: An empirical examination of hallucination. arXiv preprint arXiv:2311.15548, 2023.

Ke, Y. H., Yang, R., Lie, S. A., Lim, T. X. Y., Abdullah, H. R., Ting, D. S. W., and Liu, N. Enhancing diagnostic accuracy through multi-agent conversations: Using large language models to mitigate cognitive bias. arXiv preprint arXiv:2401.14589, 2024.

Khan, A., Hughes, J., Valentine, D., Ruis, L., Sachan, K., Radhakrishnan, A., Grefenstette, E., Bowman, S. R., Rocktäschel, T., and Perez, E. Debating with more persuasive llms leads to more truthful answers. arXiv preprint arXiv:2402.06782, 2024.

Khardon, R. and Roth, D. Learning to reason. Journal of the ACM (JACM), 44(5):697–725, 1997.

Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A. C., Lo, W.-Y., Dollár, P., and Girshick, R. Segment anything. arXiv preprint arXiv:2304.02643, 2023.

Koo, R., Lee, M., Raheja, V., Park, J. I., Kim, Z. M., and Kang, D. Benchmarking cognitive biases in large language models as evaluators. arXiv preprint arXiv:2309.17012, 2023.

Kosinski, M. Evaluating large language models in theory of mind tasks. arXiv e-prints, pp. arXiv–2302, 2023.

Langley, P. Crafting papers on machine learning. In Langley, P. (ed.), Proceedings of the 17th International Conference on Machine Learning (ICML 2000), pp. 1207–1216, Stanford, CA, 2000. Morgan Kaufmann.

LeCun, Y. A path towards autonomous machine intelligence version 0.9.2, 2022-06-27. Open Review, 62(1), 2022.

LeCun, Y. Objective-driven ai: Towards ai systems that can learn, remember, reason, plan, have common sense, yet are steerable and safe. University of Washington, Department of Electrical & Computer Engineering, January 2024. URL https://www.ece.uw.edu/wp-content/uploads/2024/01/lecun-20240124-uw-lyttle.pdf. Slide presentation retrieved from University of Washington.

Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., Küttler, H., Lewis, M., Yih, W.-t., Rocktäschel, T., et al. Retrieval-augmented generation for knowledge-intensive nlp tasks. Advances in Neural Information Processing Systems, 33:9459–9474, 2020.

Li, C., Gan, Z., Yang, Z., Yang, J., Li, L., Wang, L., Gao, J., et al. Multimodal foundation models: From specialists to general-purpose assistants. Foundations and Trends® in Computer Graphics and Vision, 16(1-2):1–214, 2024a.

Li, H., Chong, Y. Q., Stepputtis, S., Campbell, J., Hughes, D., Lewis, M., and Sycara, K. Theory of mind for multi-agent collaboration via large language models. arXiv preprint arXiv:2310.10701, 2023a.

Li, J., Li, D., Savarese, S., and Hoi, S. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International Conference on Machine Learning, pp. 19730–19742. PMLR, 2023b.

Li, J., Wang, S., Zhang, M., Li, W., Lai, Y., Kang, X., Ma, W., and Liu, Y. Agent hospital: A simulacrum of hospital with evolvable medical agents. arXiv preprint arXiv:2405.02957, 2024b.
Li, X., Zhao, R., Chia, Y. K., Ding, B., Bing, L., Joty, S., and Poria, S. Chain of knowledge: A framework for grounding large language models with structured knowledge bases. arXiv preprint arXiv:2305.13269, 2023c.

Li, Y., Wang, S., Ding, H., and Chen, H. Large language models in finance: A survey. In Proceedings of the Fourth ACM International Conference on AI in Finance, pp. 374–382, 2023d.

Li, Y., Zhang, Y., and Sun, L. Metaagents: Simulating interactions of human behaviors for llm-based task-oriented coordination via collaborative generative agents. arXiv preprint arXiv:2310.06500, 2023e.

Liang, T., He, Z., Jiao, W., Wang, X., Wang, Y., Wang, R., Yang, Y., Tu, Z., and Shi, S. Encouraging divergent thinking in large language models through multi-agent debate. arXiv preprint arXiv:2305.19118, 2023.

Liu, E. Z., Guu, K., Pasupat, P., Shi, T., and Liang, P. Reinforcement learning on web interfaces using workflow-guided exploration. arXiv preprint arXiv:1802.08802, 2018.

Liu, H., Li, C., Li, Y., and Lee, Y. J. Improved baselines with visual instruction tuning. arXiv preprint arXiv:2310.03744, 2023a.

Liu, H., Li, C., Wu, Q., and Lee, Y. J. Visual instruction tuning. Advances in Neural Information Processing Systems, 36, 2024a.

Liu, H., Yan, W., Zaharia, M., and Abbeel, P. World model on million-length video and language with ringattention. arXiv preprint arXiv:2402.08268, 2024b.

Liu, N. F., Lin, K., Hewitt, J., Paranjape, A., Bevilacqua, M., Petroni, F., and Liang, P. Lost in the middle: How language models use long contexts. Transactions of the Association for Computational Linguistics, 12:157–173, 2024c.

Liu, Y., Iter, D., Xu, Y., Wang, S., Xu, R., and Zhu, C. Gpteval: Nlg evaluation using gpt-4 with better human alignment. arXiv preprint arXiv:2303.16634, 2023b.

Liu, Z., Zhang, Y., Li, P., Liu, Y., and Yang, D. Dynamic llm-agent network: An llm-agent collaboration framework with agent team optimization. arXiv preprint arXiv:2310.02170, 2023c.

Lu, J., Batra, D., Parikh, D., and Lee, S. Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Advances in Neural Information Processing Systems, 32, 2019.

Lu, P., Peng, B., Cheng, H., Galley, M., Chang, K.-W., Wu, Y. N., Zhu, S.-C., and Gao, J. Chameleon: Plug-and-play compositional reasoning with large language models. Advances in Neural Information Processing Systems, 36, 2024.

Luo, Z., Xie, Q., and Ananiadou, S. Chatgpt as a factual inconsistency evaluator for abstractive text summarization. arXiv preprint arXiv:2303.15621, 2023.

Macmillan-Scott, O. and Musolesi, M. (ir)rationality and cognitive biases in large language models. arXiv preprint arXiv:2402.09193, 2024.

Madaan, A., Tandon, N., Gupta, P., Hallinan, S., Gao, L., Wiegreffe, S., Alon, U., Dziri, N., Prabhumoye, S., Yang, Y., et al. Self-refine: Iterative refinement with self-feedback. Advances in Neural Information Processing Systems, 36, 2024.

Marino, K., Chen, X., Parikh, D., Gupta, A., and Rohrbach, M. Krisp: Integrating implicit and symbolic knowledge for open-domain knowledge-based vqa. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14111–14121, 2021.

Mohtashami, A., Hartmann, F., Gooding, S., Zilka, L., Sharifi, M., et al. Social learning: Towards collaborative learning with large language models. arXiv preprint arXiv:2312.11441, 2023.

Mukherjee, A. and Chang, H. H. Heuristic reasoning in ai: Instrumental use and mimetic absorption. arXiv preprint arXiv:2403.09404, 2024.

Nakajima, Y. Babyagi. Python. https://github.com/yoheinakajima/babyagi, 2023.

Nye, M., Tessler, M., Tenenbaum, J., and Lake, B. M. Improving coherence and consistency in neural sequence models with dual-system, neuro-symbolic reasoning. Advances in Neural Information Processing Systems, 34:25192–25204, 2021.

Oguntola, I., Hughes, D., and Sycara, K. Deep interpretable models of theory of mind. In 2021 30th IEEE International Conference on Robot & Human Interactive Communication (RO-MAN), pp. 657–664. IEEE, 2021.

OpenAI. Gpt-4v(ision) system card. 2023. URL https://api.semanticscholar.org/CorpusID:263218031.

OpenAI. Gpt-4o. Software available from OpenAI, 2024. URL https://openai.com/index/hello-gpt-4o/. Accessed: 2024-05-20.
Pan, L., Albalak, A., Wang, X., and Wang, W. Y. Logic-lm: Empowering large language models with symbolic solvers for faithful logical reasoning. arXiv preprint arXiv:2305.12295, 2023.

Pan, Z., Luo, H., Li, M., and Liu, H. Chain-of-action: Faithful and multimodal question answering through large language models. arXiv preprint arXiv:2403.17359, 2024.

Prasad, A., Koller, A., Hartmann, M., Clark, P., Sabharwal, A., Bansal, M., and Khot, T. Adapt: As-needed decomposition and planning with language models. arXiv preprint arXiv:2311.05772, 2023.

Qian, C., Cong, X., Yang, C., Chen, W., Su, Y., Xu, J., Liu, Z., and Sun, M. Communicative agents for software development. arXiv preprint arXiv:2307.07924, 2023.

Qiao, S., Ou, Y., Zhang, N., Chen, X., Yao, Y., Deng, S., Tan, C., Huang, F., and Chen, H. Reasoning with language model prompting: A survey. arXiv preprint arXiv:2212.09597, 2022.

Qiao, S., Zhang, N., Fang, R., Luo, Y., Zhou, W., Jiang, Y. E., Lv, C., and Chen, H. Autoact: Automatic agent learning from scratch via self-planning. arXiv preprint arXiv:2401.05268, 2024.

Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pp. 8748–8763. PMLR, 2021.

Reid, M., Savinov, N., Teplyashin, D., Lepikhin, D., Lillicrap, T., Alayrac, J.-b., Soricut, R., Lazaridou, A., Firat, O., Schrittwieser, J., et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530, 2024.

Ren, T., Liu, S., Zeng, A., Lin, J., Li, K., Cao, H., Chen, J., Huang, X., Chen, Y., Yan, F., et al. Grounded sam: Assembling open-world models for diverse visual tasks. arXiv preprint arXiv:2401.14159, 2024.

Rombach, R., Blattmann, A., Lorenz, D., Esser, P., and Ommer, B. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10684–10695, 2022.

Schick, T., Dwivedi-Yu, J., Dessì, R., Raileanu, R., Lomeli, M., Hambro, E., Zettlemoyer, L., Cancedda, N., and Scialom, T. Toolformer: Language models can teach themselves to use tools. Advances in Neural Information Processing Systems, 36, 2024.

Sclar, M., Kumar, S., West, P., Suhr, A., Choi, Y., and Tsvetkov, Y. Minding language models' (lack of) theory of mind: A plug-and-play multi-character belief tracker. arXiv preprint arXiv:2306.00924, 2023.

Shaw, P., Joshi, M., Cohan, J., Berant, J., Pasupat, P., Hu, H., Khandelwal, U., Lee, K., and Toutanova, K. N. From pixels to ui actions: Learning to follow instructions via graphical user interfaces. Advances in Neural Information Processing Systems, 36, 2024.

Shen, C., Cheng, L., Nguyen, X.-P., You, Y., and Bing, L. Large language models are not yet human-level evaluators for abstractive summarization. In Findings of the Association for Computational Linguistics: EMNLP 2023, pp. 4215–4233, 2023.

Shen, W., Li, C., Chen, H., Yan, M., Quan, X., Chen, H., Zhang, J., and Huang, F. Small llms are weak tool learners: A multi-llm agent. arXiv preprint arXiv:2401.07324, 2024a.

Shen, Y., Song, K., Tan, X., Li, D., Lu, W., and Zhuang, Y. Hugginggpt: Solving ai tasks with chatgpt and its friends in hugging face. Advances in Neural Information Processing Systems, 36, 2024b.

Shi, F., Chen, X., Misra, K., Scales, N., Dohan, D., Chi, E. H., Schärli, N., and Zhou, D. Large language models can be easily distracted by irrelevant context. In International Conference on Machine Learning, pp. 31210–31227. PMLR, 2023.

Shi, T., Karpathy, A., Fan, L., Hernandez, J., and Liang, P. World of bits: An open-domain platform for web-based agents. In International Conference on Machine Learning, pp. 3135–3144. PMLR, 2017.

Shi, Z., Gao, S., Chen, X., Yan, L., Shi, H., Yin, D., Chen, Z., Ren, P., Verberne, S., and Ren, Z. Learning to use tools via cooperative and interactive agents. arXiv preprint arXiv:2403.03031, 2024.

Shinn, N., Labash, B., and Gopinath, A. Reflexion: an autonomous agent with dynamic memory and self-reflection. arXiv preprint arXiv:2303.11366, 2023.

Speer, R., Chin, J., and Havasi, C. Conceptnet 5.5: An open multilingual graph of general knowledge. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 31, 2017.

Stureborg, R., Alikaniotis, D., and Suhara, Y. Large language models are inconsistent and biased evaluators. arXiv preprint arXiv:2405.01724, 2024.

Su, W., Zhu, X., Cao, Y., Li, B., Lu, L., Wei, F., and Dai, J. Vl-bert: Pre-training of generic visual-linguistic representations. arXiv preprint arXiv:1908.08530, 2019.
Sun, Q., Yin, Z., Li, X., Wu, Z., Qiu, X., and Kong, L. Corex: Pushing the boundaries of complex reasoning through multi-model collaboration. arXiv preprint arXiv:2310.00280, 2023.

Sun, R. Can a cognitive architecture fundamentally enhance llms? or vice versa? arXiv preprint arXiv:2401.10444, 2024.

Suri, G., Slater, L. R., Ziaee, A., and Nguyen, M. Do large language models show decision heuristics similar to humans? a case study using gpt-3.5. Journal of Experimental Psychology: General, 2024.

Surís, D., Menon, S., and Vondrick, C. Vipergpt: Visual inference via python execution for reasoning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 11888–11898, 2023.

Talebirad, Y. and Nadiri, A. Multi-agent collaboration: Harnessing the power of intelligent llm agents. arXiv preprint arXiv:2306.03314, 2023.

Tang, Q., Deng, Z., Lin, H., Han, X., Liang, Q., and Sun, L. Toolalpaca: Generalized tool learning for language models with 3000 simulated cases. arXiv preprint arXiv:2306.05301, 2023.

Tversky, A. and Kahneman, D. Rational choice and the framing of decisions. Decision Making: Descriptive, Normative, and Prescriptive Interactions, pp. 167–192, 1988.

Valmeekam, K., Marquez, M., Sreedharan, S., and Kambhampati, S. On the planning abilities of large language models - a critical investigation. Advances in Neural Information Processing Systems, 36:75993–76005, 2023.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.

Wang, G., Xie, Y., Jiang, Y., Mandlekar, A., Xiao, C., Zhu, Y., Fan, L., and Anandkumar, A. Voyager: An open-ended embodied agent with large language models. arXiv preprint arXiv:2305.16291, 2023a.

Wang, J., Liang, Y., Meng, F., Sun, Z., Shi, H., Li, Z., Xu, J., Qu, J., and Zhou, J. Is chatgpt a good nlg evaluator? a preliminary study. arXiv preprint arXiv:2303.04048, 2023b.

Wang, P., Xiao, Z., Chen, H., and Oswald, F. L. Will the real linda please stand up... to large language models? examining the representativeness heuristic in llms. arXiv preprint arXiv:2404.01461, 2024.

Wang, W., Bao, H., Dong, L., Bjorck, J., Peng, Z., Liu, Q., Aggarwal, K., Mohammed, O. K., Singhal, S., Som, S., et al. Image as a foreign language: Beit pretraining for all vision and vision-language tasks. arXiv preprint arXiv:2208.10442, 2022a.

Wang, W., Lv, Q., Yu, W., Hong, W., Qi, J., Wang, Y., Ji, J., Yang, Z., Zhao, L., Song, X., et al. Cogvlm: Visual expert for pretrained language models. arXiv preprint arXiv:2311.03079, 2023c.

Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., and Zhou, D. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171, 2022b.

Wang, Z., Wan, W., Chen, R., Lao, Q., Lang, M., and Wang, K. Towards top-down reasoning: An explainable multi-agent approach for visual question answering. arXiv preprint arXiv:2311.17331, 2023d.

Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., Le, Q. V., Zhou, D., et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837, 2022.

Wikipedia contributors. Plagiarism — Wikipedia, the free encyclopedia, 2004. URL https://en.wikipedia.org/w/index.php?title=Plagiarism&oldid=5139350. [Online; accessed 22-July-2004].

Wong, L., Mao, J., Sharma, P., Siegel, Z. S., Feng, J., Korneev, N., Tenenbaum, J. B., and Andreas, J. Learning adaptive planning representations with natural language guidance. arXiv preprint arXiv:2312.08566, 2023.

Wu, J., Lu, J., Sabharwal, A., and Mottaghi, R. Multi-modal answer validation for knowledge-based vqa. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pp. 2712–2721, 2022.

Wu, Q., Bansal, G., Zhang, J., Wu, Y., Zhang, S., Zhu, E., Li, B., Jiang, L., Zhang, X., and Wang, C. Autogen: Enabling next-gen llm applications via multi-agent conversation framework. arXiv preprint arXiv:2308.08155, 2023.

Wu, S., Xie, J., Chen, J., Zhu, T., Zhang, K., and Xiao, Y. How easily do irrelevant inputs skew the responses of large language models? arXiv preprint arXiv:2404.03302, 2024.

Xie, J., Chen, Z., Zhang, R., Wan, X., and Li, G. Large multimodal agents: A survey. arXiv preprint arXiv:2402.15116, 2024a.
Xie, Y., Mallick, T., Bergerson, J. D., Hutchison, J. K., Verner, D. R., Branham, J., Alexander, M. R., Ross, R. B., Feng, Y., Levy, L.-A., et al. Wildfiregpt: Tailored large language model for wildfire analysis. arXiv preprint arXiv:2402.07877, 2024b.

Xiong, K., Ding, X., Cao, Y., Liu, T., and Qin, B. Diving into the inter-consistency of large language models: An insightful analysis through debate. arXiv preprint arXiv:2305.11595, 2023.

Xu, S., Pang, L., Shen, H., Cheng, X., and Chua, T.-s. Search-in-the-chain: Towards the accurate, credible and traceable content generation for complex knowledge-intensive tasks. arXiv preprint arXiv:2304.14732, 2023.

Xu, X., Wang, Y., Xu, C., Ding, Z., Jiang, J., Ding, Z., and Karlsson, B. F. A survey on game playing agents and large models: Methods, applications, and challenges. arXiv preprint arXiv:2403.10249, 2024.

Yang, Z. and Zhu, Z. Curiousllm: Elevating multi-document qa with reasoning-infused knowledge graph prompting. arXiv preprint arXiv:2404.09077, 2024.

Yang, Z., Chen, G., Li, X., Wang, W., and Yang, Y. Doraemongpt: Toward understanding dynamic scenes with large language models. arXiv preprint arXiv:2401.08392, 2024.

Yao, S., Chen, H., Yang, J., and Narasimhan, K. Webshop: Towards scalable real-world web interaction with grounded language agents. Advances in Neural Information Processing Systems, 35:20744–20757, 2022a.

Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., and Cao, Y. React: Synergizing reasoning and acting in language models. arXiv preprint arXiv:2210.03629, 2022b.

Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T., Cao, Y., and Narasimhan, K. Tree of thoughts: Deliberate problem solving with large language models. Advances in Neural Information Processing Systems, 36, 2024.

Yao, W., Heinecke, S., Niebles, J. C., Liu, Z., Feng, Y., Xue, L., Murthy, R., Chen, Z., Zhang, J., Arpit, D., et al. Retroformer: Retrospective large language agents with policy gradient optimization. arXiv preprint arXiv:2308.02151, 2023.

Yasunaga, M., Aghajanyan, A., Shi, W., James, R., Leskovec, J., Liang, P., Lewis, M., Zettlemoyer, L., and Yih, W.-t. Retrieval-augmented multimodal language modeling. arXiv preprint arXiv:2211.12561, 2022.

Ye, H., Gui, H., Zhang, A., Liu, T., Hua, W., and Jia, W. Beyond isolation: Multi-agent synergy for improving knowledge graph construction. arXiv preprint arXiv:2312.03022, 2023.

Yi, K., Wu, J., Gan, C., Torralba, A., Kohli, P., and Tenenbaum, J. Neural-symbolic vqa: Disentangling reasoning from vision and language understanding. Advances in Neural Information Processing Systems, 31, 2018.

Yin, D., Brahman, F., Ravichander, A., Chandu, K., Chang, K.-W., Choi, Y., and Lin, B. Y. Lumos: Learning agents with unified data, modular design, and open-source llms. arXiv preprint arXiv:2311.05657, 2023.

Yoshikawa, H. and Okazaki, N. Selective-lama: Selective prediction for confidence-aware evaluation of language models. In Findings of the Association for Computational Linguistics: EACL 2023, pp. 2017–2028, 2023.

Zelikman, E., Huang, Q., Poesia, G., Goodman, N., and Haber, N. Parsel: Algorithmic reasoning with language models by composing decompositions. Advances in Neural Information Processing Systems, 36:31466–31523, 2023.

Zhang, J., Hou, Y., Xie, R., Sun, W., McAuley, J., Zhao, W. X., Lin, L., and Wen, J.-R. Agentcf: Collaborative learning with autonomous language agents for recommender systems. arXiv preprint arXiv:2310.09233, 2023.

Zhang, Y., Mao, S., Ge, T., Wang, X., de Wynter, A., Xia, Y., Wu, W., Song, T., Lan, M., and Wei, F. Llm as a mastermind: A survey of strategic reasoning with large language models. arXiv preprint arXiv:2404.01230, 2024a.

Zhang, Z., Bo, X., Ma, C., Li, R., Chen, X., Dai, Q., Zhu, J., Dong, Z., and Wen, J.-R. A survey on the memory mechanism of large language model based agents. arXiv preprint arXiv:2404.13501, 2024b.

Zhao, S. and Xu, H. Less is more: Toward zero-shot local scene graph generation via foundation models. arXiv preprint arXiv:2310.01356, 2023.

Zhao, Y., Lin, Z., Zhou, D., Huang, Z., Feng, J., and Kang, B. Bubogpt: Enabling visual grounding in multi-modal llms. arXiv preprint arXiv:2307.08581, 2023.

Zhao, Z., Ma, K., Chai, W., Wang, X., Chen, K., Guo, D., Zhang, Y., Wang, H., and Wang, G. Do we really need a complex agent system? distill embodied agent into a single model. arXiv preprint arXiv:2404.04619, 2024.

Zheng, B., Gou, B., Kil, J., Sun, H., and Su, Y. Gpt-4v(ision) is a generalist web agent, if grounded. arXiv preprint arXiv:2401.01614, 2024a.
Zheng, L., Chiang, W.-L., Sheng, Y., Zhuang, S., Wu, Z., Zhuang, Y., Lin, Z., Li, Z., Li, D., Xing, E., et al. Judging llm-as-a-judge with mt-bench and chatbot arena. Advances in Neural Information Processing Systems, 36, 2024b.

Zhong, M., Liu, Y., Yin, D., Mao, Y., Jiao, Y., Liu, P., Zhu, C., Ji, H., and Han, J. Towards a unified multi-dimensional evaluator for text generation. arXiv preprint arXiv:2210.07197, 2022.

Zhu, D., Chen, J., Shen, X., Li, X., and Elhoseiny, M. Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592, 2023a.

Zhu, X., Chen, Y., Tian, H., Tao, C., Su, W., Yang, C., Huang, G., Li, B., Lu, L., Wang, X., et al. Ghost in the minecraft: Generally capable agents for open-world environments via large language models with text-based knowledge and memory. arXiv preprint arXiv:2305.17144, 2023b.
A. Orderability of Preferences
Comparability When faced with any two alternatives A and B, the agent should have at least a weak preference, i.e.,
A ⪰ B or B ⪰ A. This means that the agent can compare any pair of alternatives and determine which one is preferred or if
they are equally preferred.
Transitivity If the agent prefers A to B and B to C, then the agent must prefer A to C. This ensures that the agent’s
preferences are consistent and logical across multiple comparisons.
Closure If A and B are in the alternative set S, then any probabilistic combination of A and B (denoted as ApB) should
also be in S. This principle ensures that the set of alternatives is closed under probability mixtures.
Distribution of probabilities across alternatives If A and B are in S, then the agent should be indifferent between the probability mixture of (ApB) and B, denoted [(ApB)qB], and the probability mixture of A and B with combined probability pq, denoted (ApqB). This principle ensures consistency in the agent's preferences when dealing with probability mixtures of alternatives.
Solvability When faced with three alternatives A, B, and C, with the preference order A ⪰ B ⪰ C, there should be some
probabilistic way of combining A and C such that the agent is indifferent between choosing B or this combination. In other
words, the agent should be able to find a solution to the decision problem by making trade-offs between alternatives.
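Under an expected-utility reading, which we add here for illustration (u denotes a utility function consistent with the preference order), solvability reduces to solving for the mixing probability:

```latex
% Solvability under an expected-utility reading: with A \succeq B \succeq C
% and u(A) > u(C), the indifference point B \sim (ApC) has a closed form.
u(B) = p\,u(A) + (1 - p)\,u(C)
\quad\Longrightarrow\quad
p = \frac{u(B) - u(C)}{u(A) - u(C)} \in [0, 1].
```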
B. LLM-based Evaluations
Recent research underscores a critical need for more rational LLM-based evaluation methods, particularly for assessing
open-ended language responses. CoBBLEr (Koo et al., 2023) provides a cognitive bias benchmark for evaluating LLMs as
evaluators, revealing a preference for their own outputs over those from other LLMs. Stureborg et al. (2024) argue that LLMs are biased evaluators, favoring more familiar tokens and previous predictions, and exhibit strong self-inconsistency in the score distribution. Luo et al. (2023); Shen et al. (2023); Gao et al. (2023c); Wang et al. (2023b); Chen et al. (2023); Chiang & Lee (2023); Zheng et al. (2024b); Fu et al. (2023); Liu et al. (2023b) also point out the problems of relying on a single LLM as the evaluator, with concerns over factual and rating inconsistencies, a high dependency on prompt design, a low correlation with human evaluations, and struggles with comparison, i.e., the orderability of preferences.
Multi-agent systems might be a possible remedy. By involving multiple evaluative agents from diverse perspectives, it
becomes possible to achieve a more balanced and consistent orderability of preferences. For instance, ChatEval (Chan et al.,
2023) posits that a multi-agent debate evaluation usually offers judgments that are better aligned with human annotators
compared to single-agent ones. Bai et al. (2024) also finds decentralized methods yield fairer evaluation results. Multi-Agent
VQA (Jiang et al., 2024) relies on a group of LLM-based graders for evaluating zero-shot, open-world visual question
answering, where exact answer matches are no longer feasible.
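A common mitigation for the position bias of a single evaluator, sketched below under assumed names (judge is a hypothetical single-LLM judge, stubbed with a length heuristic), is to query the judge twice with the candidate order swapped and accept only verdicts that survive the swap.

```python
def judge(response_a: str, response_b: str) -> str:
    """Hypothetical single-LLM judge returning 'A' or 'B'; stubbed here
    with a length heuristic so the sketch is runnable."""
    return "A" if len(response_a) >= len(response_b) else "B"

def order_invariant_verdict(r1: str, r2: str) -> str:
    """Query the judge twice with positions swapped; only a verdict that
    survives the swap counts, otherwise report a tie (position bias)."""
    first = judge(r1, r2)    # r1 shown in position 'A'
    second = judge(r2, r1)   # r1 shown in position 'B'
    if first == "A" and second == "B":
        return "r1 wins"
    if first == "B" and second == "A":
        return "r2 wins"
    return "tie / position-biased"

print(order_invariant_verdict("a detailed answer", "short"))  # r1 wins
```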
C. Rationality is Not Equivalent to Reasoning or Theory of Mind

Consider an environment where the input space and the output decision space are finite. A lookup table with a consistent mapping from input to output is inherently rational, while no reasoning is necessarily present in the mapping.
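A runnable rendering of this claim (a toy sketch; the policy table is invented for illustration): the agent below is consistent, grounded in its table, and immune to irrelevant context, yet it performs no reasoning.

```python
# Finite input space -> finite decision space, fixed in advance.
POLICY = {
    ("rainy", "short trip"): "take umbrella",
    ("rainy", "long trip"): "take raincoat",
    ("sunny", "short trip"): "no gear",
    ("sunny", "long trip"): "take sunscreen",
}

def lookup_agent(weather: str, trip: str) -> str:
    """Consistent mapping: identical states always yield identical decisions."""
    return POLICY[(weather, trip)]

assert lookup_agent("rainy", "short trip") == lookup_agent("rainy", "short trip")
print(lookup_agent("sunny", "long trip"))  # take sunscreen
```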
Despite this example, it is still crucial to acknowledge that reasoning typically plays a vital role in ensuring rationality, especially in complex and dynamic real-world scenarios where a simple lookup table is insufficient. Agents must possess the ability to reason through novel situations, adapt to changing circumstances, make plans, and achieve rational decisions based on incomplete or uncertain information. Furthermore, reasoning is crucial when faced with conflicting data or competing objectives: it helps systems weigh the evidence, consider alternative perspectives, and make trade-offs between different courses of action, all of which are fundamental steps in making rational decisions. This process allows for more nuanced and context-dependent decision-making while navigating the intricacies of the real world.
Rationality is also different from Theory of Mind (ToM) (Apperly & Butterfill, 2009; Nye et al., 2021; Oguntola et al., 2021;
Hagendorff, 2023; Li et al., 2023a; Sclar et al., 2023; Kosinski, 2023) in machine psychology. ToM refers to the model’s
ability to understand that others’ mental states, beliefs, desires, emotions, and intentions may be different from its own.
Abstraction that Boils Down to Logical Essence Beyond detailed symbolic reasoning steps, these modules typically expect a standardized input format, similar to API calls in tool usage. This layer of abstraction enhances independence from irrelevant contexts and maintains the invariance of LLMs when handling natural language queries: the only relevant factor is the input parsed into the predetermined neuro-symbolic programs. For instance, Ada (Wong et al., 2023) introduces symbolic operators to abstract actions, ensuring that lower-level planning models are not compromised by irrelevant information in the queries and observations. Without the symbolic action library, a single LLM would frequently fail at grounding objects or obeying environmental conditions, resulting in a significant accuracy gap of approximately 59.0-89.0%.