MULTISCRIPT: Multimodal Script Learning for Supporting Open Domain Everyday Tasks

Qi, Jingyuan; Liu, Minqian; Shen, Ying; Xu, Zhiyang; Huang, Lifu

Computer Science > Computation and Language

arXiv:2310.04965 (cs)

[Submitted on 8 Oct 2023 (v1), last revised 18 Jan 2024 (this version, v2)]

Abstract: Automatically generating scripts (i.e., sequences of key steps described in text) from video demonstrations and reasoning about the subsequent steps are crucial for modern AI virtual assistants that guide humans through everyday tasks, especially unfamiliar ones. However, current methods for generative script learning rely heavily on well-structured preceding steps described in text and/or images, or are limited to a certain domain, resulting in a disparity with real-world user scenarios. To address these limitations, we present a new benchmark challenge, MultiScript, with two new tasks on task-oriented multimodal script learning: (1) multimodal script generation, and (2) subsequent step prediction. For both tasks, the input consists of a target task name and a video illustrating what has been done to complete the target task; the expected output is (1) a sequence of structured step descriptions in text based on the demonstration video, and (2) a single text description for the subsequent step, respectively. Built from WikiHow, MultiScript covers multimodal scripts in videos and text descriptions for over 6,655 human everyday tasks across 19 diverse domains. To establish baseline performance on MultiScript, we propose two knowledge-guided multimodal generative frameworks that incorporate task-related knowledge prompted from large language models such as Vicuna. Experimental results show that our proposed approaches significantly improve over the competitive baselines.
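
The input/output contract of the two tasks described in the abstract can be sketched roughly as follows. This is only an illustration, not the benchmark's actual data format: the class name `ScriptExample`, all field names, the helper functions, the file path, and the sample record are hypothetical and do not come from the paper or the dataset.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ScriptExample:
    """One hypothetical record; every field name here is an assumption."""
    task_name: str             # target task name given as input
    video_path: str            # demonstration video of what has been done so far
    observed_steps: List[str]  # reference step descriptions for the video
    next_step: str             # reference description of the subsequent step


def script_generation_target(ex: ScriptExample) -> List[str]:
    """Task 1 (multimodal script generation): a sequence of step descriptions."""
    return list(ex.observed_steps)


def next_step_target(ex: ScriptExample) -> str:
    """Task 2 (subsequent step prediction): a single step description."""
    return ex.next_step


# A made-up record for illustration only; it is not taken from MultiScript.
example = ScriptExample(
    task_name="Make iced coffee",
    video_path="videos/make_iced_coffee.mp4",  # placeholder path
    observed_steps=["Brew the coffee.", "Let the coffee cool."],
    next_step="Pour the coffee over ice.",
)

print(script_generation_target(example))
print(next_step_target(example))
```

In both tasks a model would see only `task_name` and the video; the two helper functions simply return the reference text that a prediction would be compared against.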

Comments:	Accepted by AAAI 2024. 11 pages, 9 figures, 4 tables
Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Cite as:	arXiv:2310.04965 [cs.CL]
	(or arXiv:2310.04965v2 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2310.04965

Submission history

From: Minqian Liu
[v1] Sun, 8 Oct 2023 01:51:17 UTC (6,734 KB)
[v2] Thu, 18 Jan 2024 21:17:04 UTC (6,736 KB)
