Caution for the Environment: Multimodal Agents are Susceptible to Environmental Distractions

Ma, Xinbei; Wang, Yiting; Yao, Yao; Yuan, Tongxin; Zhang, Aston; Zhang, Zhuosheng; Zhao, Hai

arXiv:2408.02544 (cs)

[Submitted on 5 Aug 2024 (v1), last revised 2 Jul 2025 (this version, v2)]

Abstract: This paper investigates the faithfulness of multimodal large language model (MLLM) agents in a graphical user interface (GUI) environment, aiming to address the research question of whether multimodal GUI agents can be distracted by environmental context. A general scenario is proposed where both the user and the agent are benign, and the environment, while not malicious, contains unrelated content. A wide range of MLLMs are evaluated as GUI agents using a simulated dataset, following three working patterns with different levels of perception. Experimental results reveal that even the most powerful models, whether generalist agents or specialist GUI agents, are susceptible to distractions. While recent studies predominantly focus on the helpfulness of agents, our findings first indicate that these agents are prone to environmental distractions. Furthermore, we implement an adversarial environment injection and analyze the approach to improve faithfulness, calling for a collective focus on this important topic.

Comments:	ACL 2025
Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2408.02544 [cs.CL]
	(or arXiv:2408.02544v2 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2408.02544

Submission history

From: Xinbei Ma
[v1] Mon, 5 Aug 2024 15:16:22 UTC (9,407 KB)
[v2] Wed, 2 Jul 2025 12:23:53 UTC (7,217 KB)
