
DocPilot: Copilot for Automating PDF Edit Workflows in Documents

Puneet Mathur, Alexa Siu, Varun Manjunatha, Tong Sun


Adobe Research
{puneetm, asiu, manjunatha, tsun}@adobe.com
Demo Video: https://github.com/docpilot-ai/demo

Abstract

Digital documents, such as PDFs, are vital in business workflows, enabling communication, documentation, and collaboration. Handling PDFs can involve navigating complex workflows and numerous tools (e.g., comprehension, annotation, editing), which can be tedious and time-consuming for users. We introduce DocPilot, an AI-assisted document workflow Copilot system capable of understanding user intent and executing tasks accordingly to help users streamline their workflows. DocPilot undertakes intelligent orchestration of various tools through LLM prompting in four steps: (1) Task plan generation, (2) Task plan verification and self-correction, (3) Multi-turn User Feedback, and (4) Task Plan Execution via Code Generation and Error log-based Code Self-Revision. Our goal is to enhance user efficiency and productivity by simplifying and automating their document workflows with task delegation to DocPilot.

Figure 1: DocPilot is an LLM-assisted document workflow Copilot system capable of understanding user intent and executing PDF actions to help users achieve their editing needs.

1 Introduction

Digital documents, particularly PDFs, play a crucial role in business workflows, facilitating communication, documentation, and collaboration. Handling PDF documents involves a wide array of functionalities. These include tasks such as understanding content, annotating, editing content (e.g., comments, redaction, highlights), organizing pages (e.g., crop, rotate, extract), adding signatures or watermarks, and form-filling.

Several document processing applications provide standalone tools and APIs to help users complete these tasks. However, accomplishing complex workflows involving numerous tools can be tedious and time-consuming. Additionally, unfamiliar users may face challenges in understanding and navigating the various tools available. Hence, there is a need for an AI-assisted copilot system that can comprehend the user's intent, clarify unspecified details to eliminate ambiguity in requirements, and incorporate user feedback by interacting with the user. Further, it is desired that such a system should be able to sample from a large diversity of tools and resolve interdependencies between selected sub-tasks to generate coherent task plans. The copilot must then produce executable programs consistent with the initial intent while being extensible to accommodate the addition of new tools in the future (Kudashkina et al., 2020).

To address these issues, we present DocPilot (Fig. 1), an LLM-based framework for automating editing workflows in PDF documents. Inspired by recent work like HuggingGPT (Shen et al., 2024) and ControlLLM (Liu et al., 2023), DocPilot takes the user's requests along with the PDF documents as inputs and leverages LLMs to infer the user's intent and transforms it into a task plan consisting of a sequence of PDF action tools. The task plan undergoes thorough verification checks to ensure accuracy and reliability. Any errors in the task plan are self-corrected by the LLM, and the final task plan is then presented to the user in easy-to-understand language, inviting feedback through conversation. Once the plan is approved by the user, DocPilot converts the task plan into a software program that can orchestrate external tool API calls using the LLM's code generation capabilities.
Figure 2: DocPilot: (1) Task Plan Generation decomposes user requests into a task plan using Tool Documentation
prompting of Retrieval-augmented selection of PDF tools. (2) Task Plan Verification applies a series of syntax and
dependency checks, and error descriptions are passed as feedback for LLM-based self-correction. (3) Multi-turn
User Feedback allows users to critique the verbose task plan via the chat interface. (4) Task Plan Execution converts
the approved task plan into Python code via API Encapsulation-based few-shot prompting with guardrails. Error
log-based Code Self-Revision repairs code errors; the compiler executes code solution to generate output files.

The generated program is simulated using a code interpreter, and detected error logs are passed as feedback to the LLM for code revision. The resultant error-free code solution executes seamless cooperation between diverse tools and provides users with a modified document that meets their expectations. To assess DocPilot's performance in supporting users, we collected user feedback on diverse workflows completed with the help of DocPilot. We find that DocPilot is effective in improving user productivity by automating repetitive tasks and simplifying complex processes.

The main contributions of DocPilot are:
(1) Accessibility: By employing LLMs as task planners, DocPilot engages users in multi-turn interactions to disambiguate complex requests. This eliminates the need to master the skillful use of document processing software, making it accessible to a broader audience.
(2) Modularity: DocPilot is designed to be highly extensible, allowing users to expand its functionality by adding more PDF tools and APIs. To achieve this, we introduce Tool Documentation-based prompting for generating task plans grounded in real-world tool usage, Retrieval-Augmented Tool Selection to tailor few-shot tool usage examples suitable for input queries, and API Encapsulation prompting for generating modularized code.
(3) Reliability: DocPilot promotes reliable workflow automation by mitigating task hallucinations, handling complex interdependencies between sub-tasks via dependency verification, and iterative self-correction to generate an executable program.

2 Related Work

Recent research informs us how LLMs can act as autonomous agents for task automation in various application domains (Xi et al., 2023; Wang et al., 2023a). AI-powered LLM Agents: Frameworks like AgentGPT and HuggingGPT (Shen et al., 2024) leverage LLMs as a controller to analyze user requests and invoke relevant tools for solving the task. AudioGPT (Huang et al., 2023) solves numerous audio understanding and generation tasks by connecting LLMs with input/output interfaces (ASR, TTS) for speech conversations. TPTU (Ruan et al., 2023) proposes a structured framework tailored for LLM-based AI Agents for task planning and execution. (Zhu et al., 2023) introduced Ghost in the Minecraft (GITM), a framework of Generally Capable Agents (GCAs) that can skillfully navigate complex, sparse-reward environments with text-based interactions and develop a set of structured actions executed via LLMs. AssistGPT (Gao et al., 2023) proposed an interleaved code and language reasoning approach called Plan, Execute, Inspect, and Learn (PEIL) for processing complex images and long-form videos. RecMind (Wang et al., 2023b) designed an LLM-powered autonomous recommender agent capable of leveraging external knowledge and utilizing tools with careful planning to provide zero-shot personalized recommendations. Frameworks like AutoDroid (Wen et al., 2023) and AppAgent (Zhang et al., 2023a) presented smartphone task automation systems that can automate arbitrary tasks on any mobile application by mimicking human-like interactions such as tapping and swiping leveraged through LLMs like GPT-3.5/GPT-4. AdaPlanner (Sun et al., 2024) allows LLM agents to refine their self-generated plan adaptively in response to environmental feedback using few-shot demonstrations. (Chen et al., 2023) proposed a tool-augmented chain-of-thought reasoning framework that allows chat-based LLMs (e.g., ChatGPT) to engage in multi-turn conversations to utilize tools in a more natural conversational manner. CREATOR (Qian et al., 2023) built a novel framework that enables LLMs to create their own tools using documentation and code realization. ControlLLM (Liu et al., 2023) proposed a Thoughts-on-Graph (ToG) paradigm that searches the optimal solution path on a pre-built tool graph to resolve parameter and dependency relations among different tools for image, audio, and video processing. LUMOS (Yin et al., 2023) trained open-source LLMs with unified data to represent complex interactive tasks. DataCopilot (Zhang et al., 2023b) built an LLM-based system to autonomously transform raw data into visualization results that best match the user's intent by designing versatile interfaces for data management, processing, and visualization. (Song et al., 2023) connects LLMs with REST software architectural style (RESTful) APIs, conducts coarse-to-fine online planning, and executes the APIs by meticulously formulating API parameters and parsing responses. Gorilla (Patil et al., 2023) explores the use of self-instruct fine-tuning and retrieval to enable LLMs to accurately select from a large, overlapping, and changing set of APIs. LLM-Grounder (Yang et al., 2023) created a novel open-vocabulary LLM-based 3D visual grounding pipeline to decompose complex natural language queries into semantic constituents for spatial object identification in 3D scenes. (Qiao et al., 2024) put forth the AUTOACT framework that automatically synthesizes planning trajectories from experience to alleviate the reliance of copilot systems on large-scale annotated data. ToolkenGPT (Hao et al., 2024) addresses the inherent problems of context length constraints and adaptability to a new set of tools by proposing LLM tool embeddings. Recent work has shown that descriptive tool documentation can be more beneficial than simple few-shot demonstrations for tool-augmented LLM automation (Hsieh et al., 2023).

3 DocPilot

Fig. 2 shows DocPilot, a chat-based AI assistant framework that uses an LLM as a controller to translate a user's PDF editing request into an actionable task plan and orchestrates numerous software tools to realize the document editing tasks into modified PDF outputs. DocPilot undertakes intelligent orchestration of various LLM capabilities into an executable workflow, which includes four steps: (1) Task plan generation, (2) Task plan verification and self-correction, (3) Multi-turn User Feedback, and (4) Task Plan Execution via Code Generation and Error log-based Code Self-Revision.

3.1 Task Plan Generation

User requests involve several intricate intentions that need to be decomposed into a sequence of sub-tasks to be solved to achieve the final output. The task planning stage utilizes an LLM to analyze the user request and determine the execution orders of the PDF Tool API calls based on their resource dependencies. We represent the LLM-generated task plan in the JSON format to parse the sub-tasks through slot filling. Each sub-task is composed of five slots - "task", "id", "dep", "args", and "return" - to represent the PDF tool function name, unique identifier, dependencies, arguments, and returned values, respectively. To better understand the intention and criteria for task planning, we utilize Tool Documentation-based prompting. The task planning prompt contains documentation of the PDF tool APIs (see Table 1 for the API list), briefly mentioning each function's utilities, arguments, and return values. Without explicitly exposing the API implementation, this novel prompting technique ensures that our methodology embraces API-level abstraction and encapsulation by restricting access to proprietary data and internal functions, for enhanced user privacy with black-box LLM models.
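For concreteness, a plan for "search for a company name, then redact it" might be serialized as below. This is a minimal illustrative sketch: the tool names follow Table 1, but the exact slot values and the "<resource>-id" convention for referencing a prior sub-task's output are assumptions, not DocPilot's exact schema.

    # Hypothetical task plan: search for a string, then redact every match.
    task_plan = [
        {"task": "SEARCH", "id": 1, "dep": [],
         "args": {"input_file": "input.pdf", "text": "Philly Co."},
         "return": {"matches": "list"}},
        {"task": "REDACT", "id": 2, "dep": [1],  # consumes the output of task 1
         "args": {"input_file": "input.pdf", "text": "<resource>-1"},
         "return": {"output_file": "output.pdf"}},
    ]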
Retrieval-Augmented Tool Selection: The task planning stage may involve a large number of tools. Many of these tools might not be relevant to the user request, and including all of them in the LLM prompt may lead to reduced context length for subsequent chat prompting. Hence, based on the incoming user request, we utilized a retrieval-augmented selection approach to include only the most relevant few-shot examples in the task plan prompt.

Let q denote the user request and Z = {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)} represent the set of few-shot examples curated for the task plan prompt. Each example consists of a sample request (x_i) paired with the corresponding ground truth task plan (y_i). We use a text embedding model E to encode the sample user requests from the few-shot examples into vector representations {E(x_1), E(x_2), ..., E(x_n)}, respectively. We construct a datastore of few-shot examples with keys as vectorized sample requests and values as ground truth task plans. At inference, we encode the incoming user request via the embedding model as E(q). Next, we use the k-nearest neighbor technique with the Euclidean distance metric to query the top-k sample requests from the datastore which are semantically most similar to the encoded user query. The selected pairs of user requests and their task plans, similar to the example shown in Fig. ??, are utilized in prompting the LLM to generate the task plan for the current user query.
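A minimal sketch of this retrieval step, assuming the sentence-transformers and scikit-learn packages as stand-ins for the embedding and k-NN components named in Sec. 4; the datastore contents and model checkpoint are illustrative.

    from sentence_transformers import SentenceTransformer
    from sklearn.neighbors import NearestNeighbors

    # Illustrative few-shot datastore: (sample request x_i, ground-truth task plan y_i).
    few_shot_examples = [
        ("Redact all names in report.pdf", "<task plan 1>"),
        ("Count the pages of invoice.pdf", "<task plan 2>"),
        ("Highlight every date in memo.pdf", "<task plan 3>"),
    ]
    requests, plans = zip(*few_shot_examples)

    encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed checkpoint
    keys = encoder.encode(list(requests))              # E(x_1), ..., E(x_n)

    # Query the top-k nearest sample requests under Euclidean distance.
    index = NearestNeighbors(n_neighbors=2, metric="euclidean").fit(keys)
    query = encoder.encode(["Black out the client names in contract.pdf"])  # E(q)
    _, idx = index.kneighbors(query)
    selected = [(requests[i], plans[i]) for i in idx[0]]  # few-shot pairs for the prompt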
3.2 Task Plan Verification and Self-Correction

LLM-generated task plans involve a risk of hallucinations when selecting unspecified functions, connecting invalid dependencies, or parsing arguments incorrectly, which may lead to undesired outputs. We introduce two novel modules to ensure robustness in the generated task plans against logical inconsistencies: "Task Plan Verification" and "LLM-based Self-Correction".

First, the "Task Plan Verification" consists of three static composition verification and two inter-task dependency verification checks on the generated task plan JSON (Appendix Figure 6 shows an illustrative example). Static composition verification checks the individual constituents of the task plan for hallucinations on syntax, tool name and API calls, and function arguments (Appendix A.4).

Second, the inter-task dependency verification checks the validity of dependency relations between the various function calls in the task plan as follows:
(1) Dependency hallucination verification – Each function call depends on arguments provided by the user request or outputs of preceding functions in the task plan. We add checks to ensure the LLM does not hallucinate dependencies referencing non-existent or future function calls in the task plan.
(2) Dependency consistency verification – Each function call in the task plan sequence may depend on one or more prior function calls. These functional dependencies need not be linear and can be better represented as a graph of connected components (also known as a dependency graph). A function call may often try to access resources from another function call. However, in some cases, these interdependencies may be cyclic or unreachable. Hence, subsequent function calls cannot proceed without resolving the prior ones. This may give rise to deadlock conditions during task execution. To avoid deadlocks and resource conflicts, it is important to ensure that there are no cyclic dependencies between the intermediate function calls. To solve this problem, we create a dependency graph G from the task plan T, where all function calls denote the set of nodes V and their interdependencies represent the set of edges E of the graph. To check for the presence of cyclic dependencies, it is sufficient to check whether the dependency graph is a directed acyclic graph (DAG). We utilize Kahn's algorithm (Kahn, 1962) to evaluate this condition, which involves performing a topological sort of the dependency graph followed by a depth-first traversal to evaluate if all nodes have been visited exactly once without repetition. Violation of this condition indicates a lack of the DAG property. The dependency error is then attributed to the API function corresponding to the failure node in the graph.

LLM-based Self-Correction: The verification module generates error log descriptions based on the nature of the fault and the responsible API functions. The error logs and the original task plan sequence are passed as feedback to the LLM as a chat completion prompt to rework the solution. This process recursively improves the task plan solution until no further errors are encountered.
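As a sketch, the dependency checks above reduce to a cycle test over the plan's "dep" fields. The version below uses the standard queue-based formulation of Kahn's algorithm on the illustrative plan format from Sec. 3.1; any task ids left with unresolved dependencies are reported as the failing nodes.

    from collections import deque

    def find_cycle_nodes(task_plan):
        """Return ids of tasks stuck in a dependency cycle (empty list if the plan is a DAG)."""
        ids = {t["id"] for t in task_plan}
        indegree = {i: 0 for i in ids}
        children = {i: [] for i in ids}
        for t in task_plan:
            for d in t["dep"]:
                if d not in ids:  # dependency hallucination: reference to a nonexistent task
                    raise ValueError(f"task {t['id']} depends on unknown task {d}")
                children[d].append(t["id"])
                indegree[t["id"]] += 1
        ready = deque(i for i, deg in indegree.items() if deg == 0)
        while ready:
            node = ready.popleft()
            for c in children[node]:
                indegree[c] -= 1
                if indegree[c] == 0:
                    ready.append(c)
        # Nodes whose indegree never reached zero are on (or behind) a cycle.
        return [i for i, deg in indegree.items() if deg > 0]

Any ids returned here would be attributed as the failing API functions in the error log description.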
3.3 Multi-turn User Feedback

User consent is a prerequisite for executing actions that could potentially alter a user's proprietary PDF files. Adhering to this principle, the meticulously verified, error-free task plan is transformed into a clear and comprehensible layman explanation through LLM prompting. This elucidation is then presented to the user through the chat interface. Subsequently, the user can engage in a multi-turn chat conversation with the LLM to challenge the proposed task plan and provide additional feedback. The user's input is integrated to iteratively refine the task plan by recursively following the task planning and verification stages. This iterative process of modifying the task plan through multi-turn chat conversations continues until the user is content with the solution or decides to abort the request.

3.4 Task Plan Execution

The task plan obtained in the last step lists tool APIs with corresponding arguments and return values. However, the sequence of function calls that need to be executed is not linear due to interdependencies between the API calls. Hence, there is a need to convert the task plan into a software program with a logical flow of information. We introduce the Task Plan Program Execution step, where the LLM converts the task plan into a software program that can be executed to give the desired output PDF file to the user. This stage is divided into two modules - "Task Plan Code Generation" and "Error log-based Code Self-Revision".

Task Code Generation: We utilize the LLM's code generation abilities to transform the task plan sequence into executable Python code. However, unrestricted LLM-generated code may hallucinate functions that do not exist, use incompatible libraries, be unable to navigate file handling on the user's end, or perform flawed executions that may harm user data, leading to deteriorated user trust. To safeguard against such detrimental cases, we incorporate a novel API Encapsulation-based few-shot prompting with strong guardrails. The prompt consists of the code documentation of the PDFTools() class, which encapsulates publicly accessible tool API function methods and exposes limited information regarding the function name, input arguments, and returned values. The LLM can utilize this abstracted view of tool APIs for program synthesis without knowing or modifying their internal code implementation. In this manner, we alleviate the problem of function hallucinations while ensuring that only well-trusted and rigorously tested API functions are used for user data modifications. Additionally, we augment the prompt with few-shot examples of task plans (y_i) retrieved during the task plan generation step, paired with their corresponding ground truth Python code solutions (c_i), to guide the code generation process to remain faithful to the task plan logic. Further, we designed stringent guard rails to safeguard program execution by ensuring consistency in code generation syntax, avoiding the lazy code generation phenomenon of LLMs, machine compatibility of software imports, safe-listing of approved Python packages, secure access of file addresses, and cautious file handling. More details are in Appendix Sec. A.5.
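A sketch of what this encapsulated prompt view might look like: only signatures and docstrings are extracted from the tool class and surfaced to the LLM, never the implementations. The class and method names here are illustrative, not the actual PDFTools() API.

    import inspect

    class PDFTools:
        """Encapsulated PDF tool API surface exposed to the LLM."""

        def search(self, input_file: str, text: str) -> list:
            """Return all matches of `text` found in `input_file`."""
            ...  # proprietary implementation hidden

        def redact_text(self, input_file: str, texts: list, output_file: str) -> str:
            """Redact every mention of the strings in `texts`; return the output file name."""
            ...  # proprietary implementation hidden

    def api_documentation(cls) -> str:
        """Build the prompt-side view: method names, signatures, and docstrings only."""
        lines = []
        for name, fn in inspect.getmembers(cls, inspect.isfunction):
            if not name.startswith("_"):
                lines.append(f"{name}{inspect.signature(fn)}: {inspect.getdoc(fn)}")
        return "\n".join(lines)

    print(api_documentation(PDFTools))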
Error log-based Code Self-Revision: Despite carefully crafted prompts and strong guard rails, the generated program solution may give errors upon code execution. To screen for errors in advance and recover from a failed execution state, we propose Error log-based Self-Revision prompting. In particular, we build a Python code interpreter to simulate code execution in a sandboxed environment to mimic the actual PDF file editing. Compilation errors from the code interpreter are captured as error logs and combined with the original code solution to be passed as feedback to the LLM to rework the code solution. The code interpreter again tests the reworked code solution to check for errors, and the process continues recursively until the code solution is improved and no further errors are encountered. Fig. 7 in the Appendix shows an example code solution. Finally, we execute the resultant error-free code solution to produce the PDF document modifications requested by the user.
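The revise-until-clean loop can be sketched as below, with `llm` and `run_in_sandbox` as assumed stand-ins for the chat-completion API and the sandboxed code interpreter; the prompt wording and retry budget are illustrative.

    def self_revise(code, llm, run_in_sandbox, max_rounds=5):
        """Repair `code` iteratively from the interpreter's error logs."""
        for _ in range(max_rounds):
            error_log = run_in_sandbox(code)  # assumed to return "" on a clean run
            if not error_log:
                return code                   # error-free solution found
            code = llm(
                "The following program failed. Fix it and return only code.\n"
                f"--- code ---\n{code}\n--- error log ---\n{error_log}"
            )
        raise RuntimeError("no error-free code solution within the retry budget")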
4 Implementation Details

Backbone LLM: We use the GPT-4 API through the Microsoft Azure platform for all our experiments. We also tried the GPT-3.5 model, but it performed consistently worse than GPT-4 owing to its limited context length and weak code generation abilities.
RAG architecture: We utilized FAISS to construct the data stores for the Retrieval-augmented tool selection module. We used SentenceBert (Reimers and Gurevych, 2019) as the embedding model. We used Scikit-Learn's KNN library to get top-k request-task plan pairs. We used Gradio for the demo UI hosted on the AWS cloud platform.

5 User Evaluation

We conducted a user evaluation to assess the efficacy of DocPilot in supporting users' PDF workflows.
Figure 3: a) Self-reported user satisfaction scores from using DocPilot to complete 80 workflow requests. b)
Simple requests (<=5 actions) had higher satisfaction scores compared to c) complex requests (>5 actions).

The research goals were as follows:

R1: Measure DocPilot's performance in suggesting a reasonable plan in response to a user-provided multi-step workflow. Relatedly, we wanted to understand how well our system handled ambiguity in user requests.

R2: Understand when/how breakdowns happen and whether users are able to refine their plan through conversation with DocPilot.

Our data collection focused on a case study with one expert PDF user who works on PDF processing tasks daily as part of his professional work. Our evaluator was hired through UpWork with expertise in PDF workflows. The evaluator interacted with the DocPilot app (Fig. 5) to complete several workflows and provided feedback through a survey form (methodology details in Appendix A.7.1).

5.1 Results

5.1.1 User Workflow Requests

We collected data from 80 workflow requests. 16 workflows were user-provided based on the users' real-world PDF workflows, and 64 were workflows suggested to the user. We provided the suggested workflows to ensure that the workflows evaluated included a variety of the types of actions used and the number of actions requested. Appendix A.7.3 has examples of user-provided and suggested workflows. Fig. 4a shows the frequency of different actions referenced as part of the users' requests. The most common actions included duplicating a file (77), renaming a file (74), searching content (65), redacting content (30), and counting pages (20). Fig. 4b shows the distribution of actions executed per request, with a median of 5 (IQR 4-7).

Figure 4: (a) Frequency of different actions referenced in task plans; (b) Distribution of actions executed per request during user evaluation.

5.1.2 Self-Reported Satisfaction Ratings

To understand DocPilot's performance in suggesting a satisfactory plan in response to a user's request (R1), we collected self-reported measures of user satisfaction after each step of the DocPilot pipeline (Fig. 1). Fig. 3a shows user satisfaction aggregated over all 80 workflow requests. DocPilot performs extremely well in suggesting a reasonable initial plan, with 88.75% (71/80) of requests receiving positive ratings of Extremely satisfied and the majority of requests not requiring plan revisions from the user. The main source of dissatisfaction with DocPilot was the task execution step, which received Extremely satisfied ratings in only 36.25% (29/80) of requests, similarly reflected in the Overall satisfaction ratings.

To understand whether workflow complexity impacts the system's efficacy in planning, we further analyze satisfaction ratings by complexity. Fig. 3(b-c) show satisfaction ratings for simple (n=45) and complex (n=33) requests.
We consider simple requests as those requiring 5 actions or fewer to fulfill the user's request. We observe that the positive satisfaction ratings (Extremely satisfied + Somewhat satisfied) are higher for simple requests (25/45, or 55.55%) compared to complex requests (11/33, or 33.33%). Simple requests also resulted in much higher satisfaction with task execution (27/45, or 60%) compared to complex requests (10/33, or 30.3%).

5.1.3 Qualitative Feedback

To further understand breakdowns in DocPilot from the users' perspective (R2), we conducted a thematic analysis of users' requests that resulted in failure as well as open-ended user feedback. For the Task Planning step, the user mostly provided positive comments: "The plan is concise, to the point, and explained well. I like that the assistant understands the request completely".

However, we also report a small number of negative comments that were primarily centered on the Task Execution step, where DocPilot either missed a step or a detail in the resulting files. The user had certain expectations of the results based on the plan suggested by DocPilot, which were unmet. We observed instances where DocPilot executed the action incorrectly: "Instead of deleting pages 1, 2, and 5, the assistant deleted pages 1 to 4". Users also reported a few cases where DocPilot simply missed an action: "...the assistant successfully converted pages but was unable to add digital signatures.". We also noticed some cases where DocPilot did not understand the multimodal content in the document properly, which in turn affected performance for actions that required searching for content in the document. For example, "My request is to redact the numerical values in the 'Annual Energy Use' and 'Water' columns of the table. However, the assistant does not understand and redacts incorrect words." Lastly, we also recorded a handful of cases where the user had high expectations that were beyond DocPilot's current tooling capabilities (e.g., replacing text and images). In the future, we aim to handle such discrepancies by improving prompt engineering, extending the PDF tool APIs available to DocPilot, and integrating Large Multimodal Models such as GPT-4V for multimodal document search/QA tasks.

5.1.4 LLM Iterations & Self-Correction

To quantify breakdowns due to program execution (R2), we analyzed the code interpreter error logs for output code. A small number of workflow requests (8/80) required more than one LLM self-correction step (Sec. 3.2) to reach a desirable action plan. In contrast, the majority of requests (69/80) required at least one LLM self-correction (Sec. 3.4) to produce an executable program that passed all checks. More details are in Appendix Table 2.

6 Discussion, Limitations & Future Work

Our evaluation results show that DocPilot's Task Planning step is effective for most workflows, as the proposed task plan captures the user's intent well and requires few clarifications by the user. Only 10% of the evaluated workflows required more than one LLM iteration to self-correct the generated task plans. The majority of breakdowns we observed occurred due to a mismatch in the user's expectations between the plan suggested by DocPilot and how it was executed. Our current interface with DocPilot primarily uses a conversational UI. Leveraging interactions from graphical UIs can help lessen this gap by providing the user affordances for direct manipulation in content selection when executing a workflow (Ma et al., 2023). Additionally, DocPilot may allow users to edit action parameters (e.g., page number, password) directly rather than requiring the user to type a new request. Both of these future directions could increase user control and understanding of the system plan (Amershi et al., 2019). Quantitatively, we observe that most workflows (69/80) required at least one LLM-based code revision to produce an error-free program, thus introducing latency and impacting the utility of the tool. Self-reported ratings indicate more failures in the task execution step for complex workflows. Hence, our future work will focus on instruction-tuning LLMs on pairs of ground truth task plans and Python code.

7 Conclusion

We present DocPilot, an LLM-powered copilot for automating document workflows. Our copilot helps novices plan document workflows by selecting the appropriate tools and executing the task plan autonomously. DocPilot benefits users by enhancing their accessibility, being extensible to include more tools, and being consistently reliable.

8 Ethics Statement

Our experiments used publicly available, API-accessible LLMs - GPT-3.5 and GPT-4 (March 2024
version). For our user evaluation, participants' personal information is maintained confidential and private. Participants were trained and informed about the task before participating. Participants were also compensated fairly, with each annotator paid equal to or more than 15 USD/hr.

References

Saleema Amershi, Dan Weld, Mihaela Vorvoreanu, Adam Fourney, Besmira Nushi, Penny Collisson, Jina Suh, Shamsi Iqbal, Paul Bennett, Kori Inkpen, Jaime Teevan, Ruth Kikin-Gil, and Eric Horvitz. 2019. Guidelines for human-AI interaction. In CHI 2019. ACM. CHI 2019 Honorable Mention Award.

Kranti Chalamalasetti, Jana Götze, Sherzod Hakimov, Brielen Madureira, Philipp Sadler, and David Schlangen. 2023. clembench: Using game play to evaluate chat-optimized language models as conversational agents. arXiv preprint arXiv:2305.13455.

Zhipeng Chen, Kun Zhou, Beichen Zhang, Zheng Gong, Xin Zhao, and Ji-Rong Wen. 2023. ChatCoT: Tool-augmented chain-of-thought reasoning on chat-based large language models. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 14777-14790, Singapore. Association for Computational Linguistics.

Difei Gao, Lei Ji, Luowei Zhou, Kevin Qinghong Lin, Joya Chen, Zihan Fan, and Mike Zheng Shou. 2023. AssistGPT: A general multi-modal assistant that can plan, execute, inspect, and learn. arXiv preprint arXiv:2306.08640.

Shibo Hao, Tianyang Liu, Zhen Wang, and Zhiting Hu. 2024. ToolkenGPT: Augmenting frozen language models with massive tools via tool embeddings. Advances in Neural Information Processing Systems, 36.

Cheng-Yu Hsieh, Si-An Chen, Chun-Liang Li, Yasuhisa Fujii, Alexander Ratner, Chen-Yu Lee, Ranjay Krishna, and Tomas Pfister. 2023. Tool documentation enables zero-shot tool-usage with large language models. arXiv preprint arXiv:2308.00675.

Rongjie Huang, Mingze Li, Dongchao Yang, Jiatong Shi, Xuankai Chang, Zhenhui Ye, Yuning Wu, Zhiqing Hong, Jiawei Huang, Jinglin Liu, et al. 2023. AudioGPT: Understanding and generating speech, music, sound, and talking head. arXiv preprint arXiv:2304.12995.

Arthur B. Kahn. 1962. Topological sorting of large networks. Communications of the ACM, 5(11):558-562.

Katya Kudashkina, Patrick M. Pilarski, and Richard S. Sutton. 2020. Document-editing assistants and model-based reinforcement learning as a path to conversational AI. ArXiv, abs/2008.12095.

Jiaju Lin, Haoran Zhao, Aochi Zhang, Yiting Wu, Huqiuyue Ping, and Qin Chen. 2023. AgentSims: An open-source sandbox for large language model evaluation.

Zhaoyang Liu, Zeqiang Lai, Zhangwei Gao, Erfei Cui, Xizhou Zhu, Lewei Lu, Qifeng Chen, Yu Qiao, Jifeng Dai, and Wenhai Wang. 2023. ControlLLM: Augment language models with tools by searching on graphs. ArXiv, abs/2310.17796.

Xiao Ma, Swaroop Mishra, Ariel Liu, Sophie Su, Jilin Chen, Chinmay Kulkarni, Heng-Tze Cheng, Quoc Le, and Ed Chi. 2023. Beyond chatbots: ExploreLLM for structured thoughts and personalized model responses. arXiv preprint arXiv:2312.00763.

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730-27744.

Shishir G. Patil, Tianjun Zhang, Xin Wang, and Joseph E. Gonzalez. 2023. Gorilla: Large language model connected with massive APIs. arXiv preprint arXiv:2305.15334.

Cheng Qian, Chi Han, Yi Ren Fung, Yujia Qin, Zhiyuan Liu, and Heng Ji. 2023. CREATOR: Tool creation for disentangling abstract and concrete reasoning of large language models. In Conference on Empirical Methods in Natural Language Processing.

Shuofei Qiao, Ningyu Zhang, Runnan Fang, Yujie Luo, Wangchunshu Zhou, Yuchen Eleanor Jiang, Chengfei Lv, and Huajun Chen. 2024. AutoAct: Automatic agent learning from scratch via self-planning. ArXiv, abs/2401.05268.

Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Conference on Empirical Methods in Natural Language Processing.

Jingqing Ruan, Yihong Chen, Bin Zhang, Zhiwei Xu, Tianpeng Bao, Guoqing Du, Shiwei Shi, Hangyu Mao, Xingyu Zeng, and Rui Zhao. 2023. TPTU: Task planning and tool usage of large language model-based AI agents. arXiv preprint arXiv:2308.03427.

Yongliang Shen, Kaitao Song, Xu Tan, Dongsheng Li, Weiming Lu, and Yueting Zhuang. 2024. HuggingGPT: Solving AI tasks with ChatGPT and its friends in Hugging Face. Advances in Neural Information Processing Systems, 36.

Yifan Song, Weimin Xiong, Dawei Zhu, Wenhao Wu, Han Qian, Mingbo Song, Hailiang Huang, Cheng Li, Ke Wang, Rong Yao, Ye Tian, and Sujian Li. 2023. RestGPT: Connecting large language models with real-world RESTful APIs.
Haotian Sun, Yuchen Zhuang, Lingkai Kong, Bo Dai, and Chao Zhang. 2024. AdaPlanner: Adaptive planning from feedback with language models. Advances in Neural Information Processing Systems, 36.

Lei Wang, Chengbang Ma, Xueyang Feng, Zeyu Zhang, Haoran Yang, Jingsen Zhang, Zhi-Yang Chen, Jiakai Tang, Xu Chen, Yankai Lin, Wayne Xin Zhao, Zhewei Wei, and Ji-Rong Wen. 2023a. A survey on large language model based autonomous agents. ArXiv, abs/2308.11432.

Yancheng Wang, Ziyan Jiang, Zheng Chen, Fan Yang, Yingxue Zhou, Eunah Cho, Xing Fan, Xiaojiang Huang, Yanbin Lu, and Yingzhen Yang. 2023b. RecMind: Large language model powered agent for recommendation. arXiv preprint arXiv:2308.14296.

Hao Wen, Yuanchun Li, Guohong Liu, Shanhui Zhao, Tao Yu, Toby Jia-Jun Li, Shiqi Jiang, Yunhao Liu, Yaqin Zhang, and Yunxin Liu. 2023. Empowering LLM to use smartphone for intelligent task automation. arXiv preprint arXiv:2308.15272.

Zhiheng Xi, Wenxiang Chen, Xin Guo, Wei He, Yiwen Ding, Boyang Hong, Ming Zhang, Junzhe Wang, Senjie Jin, Enyu Zhou, Rui Zheng, Xiaoran Fan, Xiao Wang, Limao Xiong, Qin Liu, Yuhao Zhou, Weiran Wang, Changhao Jiang, Yicheng Zou, Xiangyang Liu, Zhangyue Yin, Shihan Dou, Rongxiang Weng, Wensen Cheng, Qi Zhang, Wenjuan Qin, Yongyan Zheng, Xipeng Qiu, Xuanjing Huang, and Tao Gui. 2023. The rise and potential of large language model based agents: A survey. ArXiv, abs/2309.07864.

Binfeng Xu, Xukun Liu, Hua Shen, Zeyu Han, Yuhan Li, Murong Yue, Zhiyuan Peng, Yuchen Liu, Ziyu Yao, and Dongkuan Xu. 2023a. Gentopia.AI: A collaborative platform for tool-augmented LLMs. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 237-245, Singapore. Association for Computational Linguistics.

Qiantong Xu, Fenglu Hong, Bo Li, Changran Hu, Zhengyu Chen, and Jian Zhang. 2023b. On the tool manipulation capability of open-source large language models. arXiv preprint arXiv:2305.16504.

Jianing Yang, Xuweiyi Chen, Shengyi Qian, Nikhil Madaan, Madhavan Iyengar, David F. Fouhey, and Joyce Chai. 2023. LLM-Grounder: Open-vocabulary 3D visual grounding with large language model as an agent. ArXiv, abs/2309.12311.

Shunyu Yao, Howard Chen, John Yang, and Karthik Narasimhan. 2022. WebShop: Towards scalable real-world web interaction with grounded language agents. Advances in Neural Information Processing Systems, 35:20744-20757.

Da Yin, Faeze Brahman, Abhilasha Ravichander, Khyathi Raghavi Chandu, Kai-Wei Chang, Yejin Choi, and Bill Yuchen Lin. 2023. Lumos: Learning agents with unified data, modular design, and open-source LLMs. ArXiv, abs/2311.05657.

Chi Zhang, Zhao Yang, Jiaxuan Liu, Yucheng Han, Xin Chen, Zebiao Huang, Bin Fu, and Gang Yu. 2023a. AppAgent: Multimodal agents as smartphone users. ArXiv, abs/2312.13771.

Wenqi Zhang, Yongliang Shen, Weiming Lu, and Yueting Zhuang. 2023b. Data-Copilot: Bridging billions of data and humans with autonomous workflow. ArXiv, abs/2306.07209.

Shuyan Zhou, Frank F. Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Yonatan Bisk, Daniel Fried, Uri Alon, et al. 2023. WebArena: A realistic web environment for building autonomous agents. arXiv preprint arXiv:2307.13854.

Xizhou Zhu, Yuntao Chen, Hao Tian, Chenxin Tao, Weijie Su, Chenyu Yang, Gao Huang, Bin Li, Lewei Lu, Xiaogang Wang, et al. 2023. Ghost in the Minecraft: Generally capable agents for open-world environments via large language models with text-based knowledge and memory. arXiv preprint arXiv:2305.17144.

A Appendix

A.1 DocPilot Demo App

Figure 5 shows the DocPilot demo app. This app was also used by our evaluator to complete all workflow requests. The app was built using Gradio (https://www.gradio.app). The app requires an OpenAI token to access the GPT-4 model. The interface includes a PDF upload panel, a PDF viewer, and a chat panel. Users can directly upload their input PDF file and type in their request in the chat panel. The chat panel facilitates multi-turn chat and shows all the intermediate interactions and results generated by the system. Once a workflow is executed, the user can download the resulting files for inspection. The user also has the ability to reset their chat history to start a new workflow conversation.

A.2 Implementation Details

Backbone LLM: We use the GPT-4 API through the Microsoft Azure platform for all our experiments. We also tried the GPT-3.5 model, but it performed consistently worse than GPT-4 owing to its limited context length and weak code generation abilities.
RAG architecture: We utilized FAISS to construct the data stores for the Retrieval-augmented tool selection module. We used SentenceBert (Reimers and Gurevych, 2019) as the embedding model. We used Scikit-Learn's KNN library to get top-k request-task plan pairs. We used Gradio for the demo UI hosted on the AWS cloud platform.
Figure 5: UI for DocPilot

LLM Agent Evaluations: ToolBench (Xu et al., 2023b) released a tool manipulation benchmark consisting of diverse software tools for real-world tasks to evaluate LLM capabilities for tool manipulation. AgentSims (Lin et al., 2023) created an interactive infrastructure for researchers to evaluate the task completion abilities of LLM agents in a simulated environment. WebArena (Zhou et al., 2023) introduces a benchmark on interpreting high-level realistic natural language commands to concrete web-based interactions. ClemBench (Chalamalasetti et al., 2023) provides a systematic evaluation of LLMs' capability to follow game-play instructions. (Xu et al., 2023a) created the GentPool platform that registers and shares user-customized, composable, and collaborative agents. WebShop (Yao et al., 2022) is another challenging benchmark that tests LLM agents' capabilities to navigate multiple types of webpages and find, customize, and purchase a product given a text instruction in an e-commerce website simulation with 1.18 million real-world products. This is the first work to provide a novel benchmark for evaluating LLM agent workflows in a document editing software environment.

A.3 DocPilot PDF Tool APIs

Table 1 shows the set of PDF tool APIs and their descriptions available during task plan generation in DocPilot.

A.4 Task Plan Verification Module

Figure 6 shows a qualitative example of task verification checks - Syntax Hallucination, Tool Hallucination, Argument Validity, Dependency Hallucination, and Dependency Consistency.

Static composition verification checks the individual constituents of the task plan for hallucinations on syntax, tool name and API calls, and function arguments (a sketch follows after this list):

1. Syntax hallucination verification – Incorrect JSON formatting of the task plan may cause downstream JSON parsing errors. This verification step ensures the task plan returned is a list of Python maps with key-value pairs denoting function names, dependencies, input arguments, and returned values.

2. Tool hallucination verification – Despite prompting for a syntactically correct task plan, LLMs may hallucinate invalid tool names and API calls. This step ensures that all PDF tool APIs are valid and present in the documentation.

3. Argument validity verification – Each function in the task plan has a pre-defined number and type of arguments and return values. Any hallucinations in this regard may cause errors during program execution. Hence, we check for any extra, missing, or incorrect arguments in each task plan sequence function call.
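A sketch of these three static checks, validating each plan entry against a registry assumed to be derived from the tool documentation (registry contents are illustrative):

    TOOL_REGISTRY = {  # assumed: tool name -> expected argument names
        "SEARCH": {"input_file", "text"},
        "REDACT": {"input_file", "text"},
        "COUNT": {"input_file"},
    }

    def verify_static(task_plan):
        """Return error descriptions for syntax, tool, and argument hallucinations."""
        errors = []
        slots = {"task", "id", "dep", "args", "return"}
        for entry in task_plan:
            if not isinstance(entry, dict) or not slots <= entry.keys():
                errors.append(f"syntax hallucination: malformed entry {entry!r}")
                continue
            tool = entry["task"]
            if tool not in TOOL_REGISTRY:
                errors.append(f"tool hallucination: unknown tool {tool!r} in task {entry['id']}")
            elif set(entry["args"]) != TOOL_REGISTRY[tool]:
                errors.append(f"argument validity: task {entry['id']} ({tool}) expects "
                              f"{sorted(TOOL_REGISTRY[tool])}, got {sorted(entry['args'])}")
        return errors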
Tool Description
Duplicate Initializes a duplicate of the input file and saves it as "input.pdf"
Rename Renames the input file to the output file name with a default value "output.pdf"
Search Returns a list of text strings matching the query in the input document denoted as filename; otherwise, returns an empty list.
QnA Answers a question in the form of a text string from the LLM query result.
Count Pages Counts the number of pages in the PDF file and returns it as an integer
Compress Reduce the PDF file size given as the input filename and save the new file as the output filename.
Convert to PPT Convert the input PDF file into a PowerPoint presentation (ppt) file and save the converted file as output filename
Convert to Word Convert the input PDF file into a Word (docx) file and save the converted file as output filename
Convert to PNG Convert the input PDF file into a PNG image file and save the converted file as output filename
Convert to JPEG Convert the input PDF file into a JPEG image file and save the converted file as output filename
Convert to TIFF Convert the input PDF file into a TIFF image file and save the converted file as output filename
Convert to Excel Convert the input PDF file into an Excel (.xlsx) file and save the converted file as output filename
Add Password Add the input passcode text string as password protection to the input PDF file.
Check Password Check if the input PDF file has password protection
Combine Files Combine all files given in the list of input files into a single file and save the output file as output_filename
Redact Pages Redacts all pages of the input PDF file in the range starting from start_page till end_page. Start and end pages are 1-indexed
Redact Text Redacts all mentions of strings in the list denoted by "object_name" from the input PDF file within the range starting from the start page to the end page. Start and end pages are 1-indexed
Highlight Text Highlight all instances in the input PDF file matched by input string
Underline Text Underline all instances in the input PDF file matched by input string
Extract Pages Extracts pages from the input PDF in the range from the start page to the end page. Start and end pages are 1-indexed
Delete Page Deletes page denoted by integer "page_number_to_delete" from the input PDF file; the page number to be deleted is 1-indexed
Delete Page Range Deletes pages from the input PDF in the range from the start page to the end page. Start and end pages are 1-indexed
Add Signature Add an image of the signature on the page denoted by "page_number" in the input PDF file; input page number is 1-indexed
Add Watermark Fix the watermark image on the input PDF file pages in the range from the start page to the end page. Start and end pages are 1-indexed
Add Comment Add input text comment in the input PDF file at the input page number or by default at the last page
Add Page Text Add a new page to the input PDF file at the page number specified by "page number". The new page has the text string "content" added to it. Page numbers are 1-indexed

Table 1: Overview of tasks and associated tools in DocPilot

Figure 6: Task Verification example for syntax hallucinations, tool hallucinations, argument validity, dependency hallucinations,
and dependency consistency.

A.5 Guard Rails for Task Plan Code Generation

1. Code Generation Syntax: Most state-of-the-art LLM architectures are geared towards a conversational chat interface trained via human chat feedback (Ouyang et al., 2022). Consequently, LLMs may occasionally interleave conversational text with code syntax during generation. Moreover, some LLMs may even provide pseudo-code instead of independently executable Python code. In order to avoid such pitfalls, we add explicit instructions in the prompt to force the LLM to follow a pre-defined Python syntax with all other extraneous text formatted as comments in the code block. Moreover, it has been recently reported that SOTA LLMs like ChatGPT-3.5 and GPT-4 tend to show signs of "lazy assistance" wherein they refuse to generate fully executable code, instead explaining how the user could answer the question. We carefully designed the LLM prompt with explicit instructions to satisfy our need for independently executable Python code as the output of this step.

2. Software Import Compatibility: Allowing unrestricted permission to import any software library or package specified in the LLM-generated code may potentially harm user privacy and security. Some of these may not be compatible with the user hardware, conflict with existing software versions, or no longer be supported by programming languages. Hence, appropriate guard rails are needed to regulate what software libraries can be imported during task plan execution. Towards this, we maintain a software safe-list of approved Python packages, libraries, and executable files in the tool API documentation that are permitted to be invoked by LLM-generated code. We add explicit instructions to the prompts to forbid the LLMs from generating any overhead software libraries and packages for code execution. Instead, we pre-append the safe-listed software imports to the generated code (see the sketch after this list).

3. File Handling: An essential aspect of copilot-driven external file modifications is safeguarding data privacy by not exposing to the LLM the input file names and types that need to be modified and the resultant output files generated by the copilot. We achieve this by strongly type-casting all references to input and output file names and addresses in the generated code to their actual values at the code execution step. Further, we impose strict directory access restrictions on the copilot system, preventing accessing, reading, or saving files without explicit user permissions. The code execution step involves creating a copy of all files required as inputs to a temporary directory and saving all intermediate files and the final output PDF there to avoid overwriting or modifying non-permitted files.
ware library or package specified in the LLM- As an introduction, the user was provided with a
generated code may potentially harm user pri- guidelines document that detailed the PDF capa-
vacy and security. Some of these may not bilities of DocPilot. The user was also provided
be compatible with the user hardware, con- access to the DocPilot app (Figure 5) and given a
flict with existing software versions, or be no short tutorial on its usage. For data collection, the
longer supported by programming languages. user was provided with a repository of PDF docu-
Hence, appropriate guard rails are needed to ments (n=61), a suggested prompt library (n=151),
regulate what software libraries can be im- and a link to a survey form for data collection. The
ported during task plan execution. Towards user was instructed their overall evaluation goal
this, we maintain a software safe-list of ap- was to complete several PDF workflows as best as
proved Python packages, libraries, and exe- possible with the help of DocPilot.
cutable files in the tool API documentation For the first task, the user was asked to select a
that are permitted to be invoked by LLM- PDF document and either craft a prompt based on
generated code. We add explicit instructions their own usage or select one from the suggested
to the prompts to forbid the LLMs from gen- list. For the second task, the user was asked to
erating any overhead software libraries and prompt DocPilot with their workflow request and
packages for code execution. Instead, we pre- to carefully review DocPilot’s responses. The user
append the safe-listed software imports to the was encouraged to request changes as needed to the
generated code. suggested plan until satisfied that it met their work-
flow goals. For the third task, the user was asked to
3. File Handling: An essential aspect of copilot- review the actions as executed by DocPilot in the
driven external file modifications is safeguard- resulting files. Last, after completing all interac-
ing data privacy by not exposing the input tions with DocPilot for one workflow, regardless
file names and types that need to be modi- of success or failure, the user was instructed to
243
Figure 7: An example of task plan code solution generated for the query - "Hey, can you please blacken out any sensitive client
names from my ’VoltGaurd Electric.pdf’ file and convert it into a PowerPoint presentation"

Figure 8: Example of a visa document being edited using DocPilot. The user asks to "redact all mentions of
Personally Identifiable Information in the document". DocPilot removes names, passport numbers, date of birth,
sex, nationality, and dates in the input document.

A.7.2 Measures

Our data collection included: 1) interaction logs during DocPilot app usage, 2) self-reported feedback after each workflow request, and 3) open-ended user feedback.
Figure 9: Example of a legal court document being edited using DocPilot. The user asks, " Hey, can you underline
all dates and redact any names of people mentioned in this file?". DocPilot covers names ("Saiprasad Kalyankar",
"Mohd Naushad") and underlines dates ("4th Feb 2015", "2014", "January 13 and 21, 2015") in the input document.

The interaction logs included the chat history, program execution actions, and the resulting files after execution. The self-reported measures included overall workflow satisfaction, satisfaction with the initial DocPilot-suggested plan, satisfaction with DocPilot incorporating user feedback into the plan, and satisfaction with how well actions were executed in the resulting files.

A.7.3 Example Workflows

In total, we collected data from 80 workflow requests. 16 workflows were user-provided based on the users' real-world PDF workflows, and 64 were workflows suggested to the user. We provided the suggested workflows to ensure the workflows evaluated included variety in the types of actions used and in the number of actions requested. The user was encouraged to make small adjustments to the suggested workflows (as needed). For example, if the workflow requested to delete page 5 of a document but the document only had 3 pages, then the user modified the workflow prompt accordingly.

Examples of user-provided workflows:

1. Add a watermark with the text "DRAFT" on every page, underline the test cycle types in the table, and extract the cleaning index values into a separate list

2. Highlight the text "ENERGY STAR Test Method for Determining Residential Dishwasher Cleaning Performance" in the document, convert the page into an image file, and add a header with the text "Energy Star Most Efficient 2016"

3. Underline all section headings, redact the company's physical address, and add a watermark with the text "Evaluation Copy" on each page
4. Extract pages 1-2 as a separate file with password "BUDGET2013", summarize the key issues discussed and action points, then add this summary to a new first page.

5. Extract all key terms and concepts, and create a glossary or index at the end of the document

Examples of suggested workflows:

1. Summarize all mentions of product launch dates and marketing strategies from the document, and add a new page in front with this summary. Finally, convert it to a Word file for later reference.

2. Redact all salary figures from the document, then add a line at the end stating the average salary of the listed positions. Underline the final mean salary figure for emphasis.

3. Search for any mentions of project deadlines and add them as a new page at the end, then compress the file size to optimize storage space.

4. Search for any occurrences of the term 'Confidential' and redact them, after deleting pages 1-2. And add a watermark "Top Secret" to each remaining page.

5. Identify and highlight any technical or specialized terminology used within the document, add a signature to page 1, and protect the document with encryption.

A.7.4 Results: Code Iterations

Table 2 illustrates the number of code iterations for each workflow (n=80). The majority of workflows (48/80) required one code iteration, and most workflows were successful within a maximum of two LLM-based code revision cycles.

# of iterations    Count
0                  11
1                  48
2                  13
3                  1
5                  2
6                  3

Table 2: The number of code iterations for each workflow (n=80). The majority of workflows (48/80) required one code iteration.
