Co-Creating Visual Stories With Generative AI
Storytelling is an integral part of human culture and significantly impacts cognitive and socio-emotional
development and connection. Despite the importance of interactive visual storytelling, the process of creating
such content requires specialized skills and is labor-intensive. This article introduces ID.8, an open-source
system designed for the co-creation of visual stories with generative AI. We focus on enabling an inclusive
storytelling experience by simplifying the content creation process and allowing for customization. Our user
evaluation confirms a generally positive user experience in domains such as enjoyment and exploration while
highlighting areas for improvement, particularly in immersiveness, alignment, and partnership between the
user and the AI system. Overall, our findings indicate promising possibilities for empowering people to create
visual stories with generative AI. This work contributes a novel content authoring system, ID.8, and insights
into the challenges and potential of using generative AI for multimedia content creation.
CCS Concepts: • Human-centered computing → Interaction design; Interactive systems and tools;
1 Introduction
Storytelling is a defining aspect of the human experience that has been practiced through diverse
media, such as written text, oral traditions, cave paintings, and more [55]. Visual stories are
narratives that are augmented by various forms of media, such as drawings, illustrations, animations,
and videos, serving to enhance the overall storytelling experience [10]. Visual stories tend to increase
interest in and emotional engagement with the narrative even as they improve understanding
and retention of the story’s content [38]; they can be an ideal medium for psycho-educational
interventions [6], health communications [60], language learning [54], intergenerational bonding
[34], and creative self-expression [52].
Despite the benefits and use cases of visual stories, their creation process remains a challenging,
multifaceted task that unfolds via a sequence of essential steps—such as brainstorming to cultivate
ideas, scripting to develop the narrative, storyboarding for visual planning, amassing the necessary
This work was supported by the Malone Center for Engineering in Healthcare at the Johns Hopkins University.
Authors’ Contact Information: Victor Nikhil Antony (Corresponding author), Johns Hopkins University, Balti-
more, MD, USA; e-mail: [email protected]; Chien-Ming Huang, Johns Hopkins University, Baltimore, MD, USA;
e-mail: [email protected].
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee
provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the
full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored.
Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires
prior specific permission and/or a fee. Request permissions from [email protected].
© 2024 Copyright held by the owner/author(s). Publication rights licensed to ACM.
ACM 2160-6463/2024/8-ART20
https://doi.org/10.1145/3672277
Fig. 1. ID.8 features a multi-stage visual story authoring workflow facilitated by generative AI. (1) The story
creation begins with users collaborating with ChatGPT to create a storyline and (2) manually editing the
story content for finer edits. (3) Then, ID.8 automatically parses the story co-created with ChatGPT into a
scene-by-scene script to be edited further by the user. (4) The story scenes from the script are automatically pre-
populated and organized in the Storyboard and (5) edited in the Scene Editor, where users use StableDiffusion,
AudioGen, and MusicGen to generate story elements and synchronize story elements on the canvas and the
timeline.
media assets, piecing together the elements, and refining the content through editing—culminating
in the distribution of the finalized story [47]. The nature of multi-modal asset creation and the
technical demands of specialized software (e.g., Adobe Creative Studio, Final Cut Pro) present a skill
barrier that hampers both expert and novice creators from fully tapping into visual stories’ creative
potential. Lowering barriers to visual story authoring can enable the production of individualized
and customized content that may lead to improved outcomes in varying use cases and amongst
diverse populations.
Recent advances in generative Artificial Intelligence (AI) have enabled the production of text
[8], images [41], audio [16], and videos [35] from user instructions; generative AI models hold
the potential to help democratize the visual story authoring landscape. Human-AI co-creation is a
paradigm wherein human users collaborate with AI with varying degrees of partnership to create
a product [49]. Several co-creative systems have explored how generative models may help users
create storylines [15, 44], draw visuals [31, 70] and compose music [37]. However, to the best of
our knowledge, no systems exist, co-creative or otherwise, that enable the end-to-end authoring
of visual stories. Leveraging generative AI to simplify the authoring process may help enable the
quick and expressive creation of visual stories. Thus, we aim to explore the following research
questions: RQ1: How can we integrate different generative AI models in an end-to-end visual story
authoring system to support the visual story creation process? RQ2: How do people co-create visual
stories with a generative AI system?
Toward empowering users to effectively explore creative possibilities and to quickly iterate
on and generate visual stories, we built ID.8 (ideate), an open-source, end-to-end authoring system
for visual story creation with generative AI integrated into its workflow (see Figure 1). Our system
enables users to collaborate with ChatGPT (a large language model (LLM)) [8] to co-write a
script for the story, generate visual assets via Stable Diffusion (a text-to-image model) [51], and
generate audio assets with AudioGen (a text-to-audio model) [30] and MusicGen (a text-to-music
model) [16]. Although the various content generation paradigms used in different stages of ID.8
may have been individually assessed in a co-creative setting, to the best of our knowledge, no
studies thus far have evaluated a multi-modal co-creative experience that occurs during visual
story authoring. Moreover, no open-source system exists that enables end-to-end generation of
visual stories with multi-modal content (e.g., text content, visuals, audio effects). ID.8 embodies a
“human-in-control, AI-in-the-loop” design framework to balance user autonomy and AI assistance.
Our goal is to harmonize the control, agency, content safety, and human touch inherent to the
manual story creation process with the creative variability and production efficiency of generative
AI. Studying the co-creative process in a multi-modal domain such as visual story authoring can
yield novel insights grounded in more realistic scenarios where generative AI is poised to be used
in the real world.
We conducted a two-phased evaluation of the ID.8 system to assess its usability and creative
breadth. We found that ID.8 provided an enjoyable user experience and that users greatly appreciated
the value of integrating generative AI in the visual story authoring workflow; moreover, users
generated a wide variety of stories via our system, demonstrating its creative capabilities. Through
this evaluation, we also gained a deeper understanding of the challenges faced by users while using
a multi-modal co-creative system, offering insights and design implications for future human-AI
co-creative systems.
This work makes three key contributions:
(1) We design, develop, and release ID.8: a novel, open-source, end-to-end system that enables
visual story authoring via a unified interface and a human-AI co-creative workflow, aiming
to lower the skill barrier to visual storytelling and to enable agile iteration and broader creative
expression.
(2) Insights from two user evaluations highlight the current opportunities and challenges of
multi-modal content creation via state-of-the-art generative AI.
(3) We put forward a set of design guidelines for human-AI co-creative systems based on our
experience and empirical evidence from evaluating ID.8.
2 Related Work
Here, we explore the three areas of work related to the motivation and development of ID.8.
Section 2.1 highlights how conventional story authoring tools help users craft visual
stories but often place the burden of asset production on users. Section 2.2 focuses on state-of-the-art
generative AI models, exploring how advancements in this area hold the potential for democratizing
content creation across modalities. Section 2.3 discusses the potential of human-AI co-creative
systems, which synergize the complementary capabilities of humans and AI in a collaborative
creative process. Through ID.8, we seek to address the gap at the intersection of these areas by
building a platform that leverages generative AI to enhance the visual story authoring experience
in a co-creative setting and to help better understand the challenges of co-creation in a multi-modal
setting.
for the integration of common sources such as images, text, and video; it supports multiple layers
and dynamic narratives; however, it is limited to creating interactive articles. Katika [24] is an
end-to-end system that simplifies the process of creating explainer motion graphics videos for
amateurs; it provides a graphical user interface that allows users to create shots based on a script,
add artworks and animation from a crowdsourced library, and edit the video using semi-automated
transitions. These systems demonstrate the productivity improvements that story authoring tools
provide to users; however, they remain limited because the burden of asset production is still placed on the
user.
To bridge the gap created by the burden of asset generation placed on end users and by limited
interactivity, we designed ID.8 to leverage generative AI in support of each stage of the visual
story authoring process, empowering users to quickly explore the creative landscape and materialize
their vision. Past work has evaluated asset generation with generative AI, but it has been limited
to a single modality (e.g., text, audio, images). With ID.8, we aim to enable end-to-end authoring
of visual stories by integrating various generative models into a workflow that unifies text, audio,
and video content, thus allowing users to collaborate with generative AI in a more complex workflow.
We open-source our system to enable the study of human-AI co-creation in this multi-modal domain.
The complexity in visual story authoring arises from juggling diverse elements like
narrative arcs and spatio-temporal asset pacing, making it challenging to maintain a
coherent view of the interactions of different narrative elements. ID.8 addresses this by
helping users establish robust mental models of the narrative structure through clear
interaction cues and visual markers among various story elements.
Design Criteria 3: Amplify Creative Exploration
Generative models hold the potential for rapid exploration of a vast creative landscape,
thereby amplifying individual creativity and imagination [65]. ID.8 is designed to stream-
line the exploration and curation of these AI-generated artifacts, allowing users to seam-
lessly integrate these outputs into their creative vision.
Design Criteria 4: Safeguard Creative Autonomy
Critical issues like the loss of user autonomy and the generation of potentially dangerous
or unsafe creative outputs can manifest in co-creative systems, often due to AI decisions
taking precedence over human input [9]. ID.8 is designed to prioritize user control,
empowering them to evaluate, select, and integrate AI-generated assets, thus effectively
mitigating the risk of AI-driven decisions overriding user choices [65].
Design Criteria 5: Support Modularity and Extensions
combine the control, agency, safety, and human touch intrinsic to manual story creation with the
creative accessibility and efficiency gains offered by generative AI. To align this philosophy with
our goal of simplifying visual story creation, we outline our core design criteria (DC) in Table 1.
Fig. 2. ID.8 enables generation of a story (1) by collaborating with ChatGPT and also allows the user (2) to
manually edit the story and then (3) generates—using ChatGPT—a structured script and pre-populates the
storyboard with scenes from the script.
organized as individual nodes in the Storyboard module. Users then edit each scene in the Scene
Editor, where they use StableDiffusion, AudioGen, and MusicGen to generate story elements and
synchronize various narrative elements on the canvas and the timeline. Users can watch the story
as a whole or preview a single scene to experience the story as the viewer and adjust accordingly.
3.3 Storyboard
As a visual canvas (see Figure 3), the Storyboard module supports an interactive, node-based
visualization of the narrative (DC2). Users create and link individual nodes, each representing
a specific scene, to create a cohesive story structure and flow. Metadata associated with each
scene—such as titles, background settings, interactive components, and multimedia elements—
are compactly displayed, aiding quick comprehension and navigation (DC2). The interface is
designed to be intuitive, providing features like scene addition, deletion, and replication, as well as
specifying a narrative starting point. To enrich the storytelling experience, the Storyboard module
supports conditional narrative branching through its interaction components. This functionality
aims not only to streamline the user’s creative workflow but also to provide granular control over
the narrative trajectory, ensuring that the Storyboard functions as both a planning tool and an
interactive blueprint for the story (DC4).
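To make the node-and-branch structure described above concrete, the sketch below shows one way a storyboard with conditional branching could be represented; all class and field names are illustrative assumptions rather than ID.8's actual data model.

```python
# Minimal sketch of a node-based storyboard with conditional branching.
# Class and field names are illustrative assumptions, not ID.8's actual schema.
from dataclasses import dataclass, field


@dataclass
class SceneNode:
    scene_id: str
    title: str
    background: str                                   # short description of the scene's setting
    assets: list[str] = field(default_factory=list)   # ids of attached multimedia elements
    # Maps a viewer response (or "default") to the id of the next scene,
    # which is what enables conditional narrative branching.
    branches: dict[str, str] = field(default_factory=dict)


@dataclass
class Storyboard:
    start_scene: str
    scenes: dict[str, SceneNode] = field(default_factory=dict)

    def next_scene(self, current: str, viewer_response: str | None = None) -> SceneNode | None:
        """Follow the branch chosen by the viewer, falling back to the default link."""
        node = self.scenes[current]
        target = node.branches.get(viewer_response or "default", node.branches.get("default"))
        return self.scenes.get(target) if target else None
```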
Fig. 3. ID.8 Storyboard allows for organization of the story flow by linking scenes and specifying how story
viewer inputs should impact the flow of the story. Users access the Scene Editor module by double-clicking a
scene node. Users can also preview their story.
contextual menu appears when an element is selected, offering a range of customization options
(e.g., dimensions, start and end locations for an animation path). Dynamic background effects,
mimicking natural phenomena like rain or snow, can also be set using the toolbar (see Figure 3) to
enrich the visual storytelling experience.
Timeline: Orchestrating Timing and Sequencing. Situated below the canvas, the timeline offers a
visual platform for coordinating the sequence and timing of scene components (e.g., characters)
(DC2). It aids in crafting a cohesive narrative flow by allowing users to chronologically synchronize
and quickly adjust these elements, ensuring that the visual and auditory assets sync well with each
other and the script.
Asset Creator. Anchored within the Scene Editor, the Asset Creator (see Figure 4) serves as a
platform for generating, selecting, and adapting both visual and audio assets. This tool, featuring a
dual-tabbed interface comprising a “Visuals Generator” and an “Audio Generator,” allows users to
explore creative possibilities and materialize their storytelling vision.
— Visuals Generator. Using Stable Diffusion [51], a text-to-image model, the Visuals Generator
supports image creation based on user-provided prompts. Users may save generated image(s)
as background or extract parts to be saved as characters using Meta’s “Segment Anything”
model [28] (see Figure 4(2)). Advanced controls for fine-tuning the model’s output include
negative prompts, a variable range for the number of generated images (1–4), denoising steps,
and various modes like panorama and self-attention.
— Audio Generator. Focused on sound crafting, the Audio Generator is powered by Meta’s
AudioGen [30] and MusicGen [16] models, allowing for the creation of sound effects (e.g.,
applause, wolf howls) and musical pieces (e.g., lo-fi background tracks, classical carnatic
compositions) based on user descriptions. Controls like duration (1–10 seconds), top-p and
guidance scale afford more granular manipulation of audio generation.
The Asset Creator also integrates “Leela” (ChatGPT) to facilitate collaborative prompt authoring
with the user (see Figure 4(1)) toward more effective visual and audio generation, thus striving to
Fig. 4. (1) The ID.8 Scene Editor enables creation of prompts for text-to-image/audio models in collaboration
with ChatGPT; (2) For character generation, ID.8 empowers users to select parts of the generated output to
be used in the story; (3) ID.8 provides a simple interface for adding interaction with the viewer.
help users realize their creative vision (see Appendix B for prompt). Moreover, it offers users preview
functionalities and multiple output options for greater creative control (DC3). This setup ensures
that the AI components play a supportive yet non-dominant role in the creative process (DC4).
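As a rough illustration of how the Asset Creator's controls (negative prompts, number of images, denoising steps, audio duration, top-p, guidance scale) could map onto the underlying models, the sketch below uses the diffusers and audiocraft libraries; the checkpoints, defaults, and wiring are assumptions and may differ from ID.8's actual implementation.

```python
# Hedged sketch of mapping Asset Creator controls onto Stable Diffusion and MusicGen
# via the diffusers and audiocraft libraries; checkpoints and defaults are assumptions.
import torch
from diffusers import StableDiffusionPipeline
from audiocraft.models import MusicGen

# Visuals Generator: prompt, negative prompt, 1-4 images, denoising steps.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
images = pipe(
    prompt="a knight facing a dragon, storybook illustration",
    negative_prompt="blurry, low quality",  # the "negative prompt" control
    num_images_per_prompt=4,                # the 1-4 image range
    num_inference_steps=30,                 # the "denoising steps" control
).images

# Audio Generator: duration (1-10 s), top-p, and guidance scale.
music = MusicGen.get_pretrained("facebook/musicgen-small")
music.set_generation_params(duration=8, top_p=0.9, cfg_coef=3.0)
clips = music.generate(["calm lo-fi background track"])  # tensor of waveforms
```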
Speech Generator. To enable narration and dialogue, the Scene Editor includes a Speech Generation
module powered by Google Cloud Text-to-Speech.5 This feature allows users to select, customize
and preview voice outputs as well as adjust parameters such as pitch and speed for creating more
tailored voices. Users have the option to save their customized speech profiles to ensure consistent
auditory experiences across multiple scenes (DC1).
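A minimal sketch of the kind of Google Cloud Text-to-Speech call that could back this module, with the pitch and speed controls surfaced as parameters; the voice name and values are placeholders, not ID.8's saved speech profiles.

```python
# Sketch of synthesizing narration with Google Cloud Text-to-Speech; the voice
# name and parameter values are placeholders, not ID.8's stored speech profiles.
from google.cloud import texttospeech

client = texttospeech.TextToSpeechClient()
response = client.synthesize_speech(
    input=texttospeech.SynthesisInput(text="Once upon a time, a penguin learned to share."),
    voice=texttospeech.VoiceSelectionParams(language_code="en-US", name="en-US-Neural2-F"),
    audio_config=texttospeech.AudioConfig(
        audio_encoding=texttospeech.AudioEncoding.MP3,
        pitch=2.0,          # the "pitch" control
        speaking_rate=0.9,  # the "speed" control
    ),
)
with open("narration.mp3", "wb") as out:
    out.write(response.audio_content)
```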
Viewer Interaction. ID.8 aims to elevate viewer engagement by incorporating interactive elements
that invite viewers’ participation within the story (DC1). Toward accomplishing this, the Scene
Editor provides an option to append interactive questions at the conclusion of individual scenes,
aimed at capturing viewer input. Questions are both displayed on-screen and vocalized through
Google Cloud Text-to-Speech, with a set of selectable responses also presented to the viewer.
Depending on the selected response, specific auditory feedback is triggered, and the storyline
may diverge accordingly. For scenes featuring interactive components, ID.8 enables the creation
of conditional narrative branches. Users have the flexibility to dictate how viewer responses can
influence subsequent scenes, thereby introducing a level of interactivity that has the potential to
impact the story’s direction (DC4). ID.8 provides a stand-alone story viewer platform that is able
to play the generated stories. This viewer platform can be used by embodied agents (e.g., virtual
characters, social robots) to enrich the story viewing experience; for instance, a social robot can
“tell” the story along with expressive movements to engage the viewer in domains such as education
and therapy.
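The sketch below illustrates, under assumed names, how a scene-ending question and its selectable responses could route the viewer to the next scene in the stand-alone viewer.

```python
# Illustrative sketch (names are assumptions) of routing viewer responses to
# conditional narrative branches at the end of an interactive scene.
from dataclasses import dataclass, field


@dataclass
class SceneQuestion:
    prompt: str                                              # shown on screen and spoken via TTS
    options: dict[str, str] = field(default_factory=dict)    # answer text -> next scene id
    feedback: dict[str, str] = field(default_factory=dict)   # answer text -> audio clip to play


def route(question: SceneQuestion, viewer_answer: str, default_scene: str) -> str:
    """Return the id of the scene to play next, falling back to the default link."""
    return question.options.get(viewer_answer, default_scene)
```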
4 Evaluation
We conducted two studies to evaluate ID.8 and to understand how end-users interact and collaborate
with generative AI to create visual stories. Study 1 sought to understand the usability of ID.8, the
creative support offered by generative AI in the visual story authoring workflow, and whether users can
effectively generate different story elements such as plot, background, characters, audio effects, and
so on. Study 2 aimed to gain a deeper understanding of ID.8's creative breadth and the co-creative
user experience through an open-ended story generation task that participants engaged in over
a longer period of time (i.e., 1 week) outside of a controlled lab environment. We report details of
the two study designs and study-specific findings in this section and describe the general lessons
learned from the studies and design guidelines for human-AI co-creative systems in Section 5.
sub-scales signifies a positive user experience [32]. It is worth noting that the interpretation of
the Agency sub-scale is different from that of the rest of the sub-scales; lower scores suggest the system is
perceived primarily as a tool, while higher scores imply it is seen more as a collaborative partner
[32]. We also slightly adjusted the accompanying set of exploratory questions in MICSI, which focus
on Contribution, Satisfaction, Surprise, and Novelty in relation to the final outcome; specifically,
we replaced the term “sketch” with “story” to better align with the context of ID.8.
Fig. 5. Results from Study 1: (a) SUS scores. (b) MICSI sub-scale scores. (c) Exploratory question responses.
Fig. 6. Scenes from stories generated by participants using ID.8 in Study 1 and Study 2.
P8: “It was really helpful for quickly iterating on new ideas and exploring potential broad
strokes of the story and for thinking about different ways you could represent a character or
different events that could happen or the scenery or music. Relative to me trying to sketch
those out myself, it was really efficient.”
The MICSI Enjoyment sub-scale scores (𝑀 = 5.85, 𝑆𝐷 = 0.91) and the feedback from the inter-
views show how ID.8 facilitated an enjoyable authoring experience.
P4: “I feel like it’s really cool. It’s a really cool tool. Like I feel like I would use it again, if I had
more time to play around with it because it’s really fun.”
The overall positive user experience suggested by the Enjoyment, Exploration, Expressiveness,
Worth, and Alignment sub-scales (Figure 5(b)) indicates that ID.8 reasonably realizes our
design objectives of streamlining visual story authoring (DC1), enabling creative exploration (DC3),
and maintaining user control (DC4).
Participant responses to the exploratory question of who (System or I) made the story suggest
that users perceived their interaction with the system to be a balanced creative endeavor in terms
of contribution. Yet, the Partnership score (𝑀 = 4.40, 𝑆𝐷 = 1.65) indicates a need for enhancing
the sense of collaboration; this is further supported by the low Agency score (𝑀 = 3.50, 𝑆𝐷 = 1.90)
indicating that the system was generally perceived not as a collaborative partner but rather as a tool.
Similarly, the scores for Immersiveness (𝑀 = 3.55, 𝑆𝐷 = 1.59) and Communication (𝑀 = 4.40, 𝑆𝐷 =
1.90) lagged behind, signaling potential areas for improvement in the co-creative experience design.
Fig. 7. Results from Study 2: (a) SUS scores. (b) MICSI sub-scale scores. (c) MICSI exploratory scores.
These lower scores suggest the need to refine the co-creative process to facilitate a more immersive
user experience and to enhance mechanisms for effectively capturing user intentions.
Fig. 8. Scenes depicting the story titled “The Knight and The Dragon” created by P13 using ID.8.
of initiating Docker containers via command line interfaces impacting usability, or a decrease in
the novelty effect. Given the small sample size, it is difficult to draw robust conclusions about
these outcomes. Further research is necessary to explore the factors influencing these results.
Still, participants reported enjoying using the system in the interviews and were satisfied by the
stories they were able to author using ID.8 as indicated by the exploratory question regarding story
satisfaction (see Figure 7(c)).
P14: “Thank you for letting me use the tool. It was cool and now I have it on my laptop. So I’ll
keep using it.”
P15: “I just think that it was really cool and I would actually like to just use it on my own. I
thought it was like, I feel like now that we did round one, if we did a round two would be way
better, you know, like with what we could create.”
This study resulted in a collection of artistically diverse stories, each about 5 minutes in length
(see Figure 6), spotlighting the creative breadth of ID.8.6 For example, P13 used ID.8 to create a visually
engaging story (see Figure 8) with captivating audio elements, dramatic narration, and a dark, grunge
atmosphere with a fantastical theme. The creative diversity of the authored stories and the
Exploration (𝑀 = 5.17, 𝑆𝐷 = 0.98) score, see Figure 7(b), further establish that ID.8 supports creative
exploration reasonably well (DC3).
The Alignment (𝑀 = 4.17, 𝑆𝐷 = 1.94) and Communication (𝑀 = 4.83, 𝑆𝐷 = 0.98) sub-scale scores
indicate a need to enable users to effectively communicate their intent with the co-creative system.
This evaluation supports the findings of Study 1 regarding the need for an enhanced sense of
collaboration between the user and ID.8; participants again felt that their contributions were equally
matched by the system (see Figure 7(c)), but the Partnership (𝑀 = 4.50, 𝑆𝐷 = 1.87) and Agency
(𝑀 = 3.50, 𝑆𝐷 = 1.38) scores spotlight the need to improve the collaborative experience so that
ID.8 is perceived as a collaborator rather than just a tool. The low Immersiveness (𝑀 = 3.67, 𝑆𝐷 = 1.51)
score further underscores the need to facilitate a more immersive co-creative experience.
Users reported spending 7–8 hours exploring the system and producing their stories. It was
consistently reported that creating the first few scenes took most of the time; however, as they got
used to the system, the creation process became faster. The manifestation of ID.8's learning
curve and discovery of shortcomings in the co-creative process in Study 2’s less constrained creative
6 Link to Google Drive folder with sample stories from Study 2: https://tinyurl.com/y3hjswhc
approach toward building stronger user mental models of how to communicate intent and creative
vision to generative AI models. While research exists on prompting techniques for text-to-image
models and LLMs, there is a significant gap in how to prompt generative models for other modalities (e.g.,
audio, video, speech); therefore, future work should focus on creating modality-specific prompt
templates to fully leverage the co-creative potential of generative AI in multi-modal settings.
Design Guideline: Offer pre-designed prompt templates with defined fields (e.g., [Medium]
[Subject] [Artist(s)] [Details]) to streamline the input process and facilitate user intent communication
with generative models.
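A minimal sketch of such a template, following the [Medium] [Subject] [Artist(s)] [Details] fields named above; the field labels and example values are illustrative.

```python
# Sketch of a pre-designed prompt template with defined fields; labels and
# example values are illustrative.
TEMPLATE = "{medium} of {subject}, in the style of {artists}, {details}"


def build_prompt(medium: str, subject: str, artists: str, details: str) -> str:
    return TEMPLATE.format(medium=medium, subject=subject, artists=artists, details=details)


prompt = build_prompt(
    medium="watercolor illustration",
    subject="a penguin learning to share",
    artists="a classic children's book illustrator",
    details="soft pastel palette, gentle lighting",
)
# -> "watercolor illustration of a penguin learning to share, in the style of
#     a classic children's book illustrator, soft pastel palette, gentle lighting"
```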
5.1.2 Building Intuition on Model Capabilities Using a Prompt-Output Pair Library. Generative
models have very impressive capabilities, but their outputs are greatly dependent on the prompts
that they are provided with. We noticed that the difficulties participants experienced while prompt-
ing the models resulted in misunderstanding what the models are capable of generating. For
instance, participants reported having trouble generating specific animals (e.g., narwhals, drag-
ons) and maintaining stylistic consistency across generations; similarly, even though participants
were generally successful in prompting Leela (ChatGPT), some desired more emphasis on char-
acter development, the inclusion of thematic elements such as a “moral of the story” and the use
of simpler language for younger target audiences. Interestingly, ChatGPT is capable of making
these changes when asked further, spotlighting a lack of coherent understanding of the model’s
capabilities.
P12: “For me, it was really hard making consistent images. And I felt like it was really hard
creating like a cohesive image group. like the cave scene literally looked like somebody took a
picture of a cave and the next scene has some anime girlies fighting a thing. It just feels super
shocking to someone who like works with visual stuff. When you want everything to feel like
it’s a one story, you know, not like different parts of some weird thing.”
P13: “You don’t see a lot of dragon in my story because [the visuals generator] just couldn’t
do dragon. Its closest estimation was like a horse.”
In contrast, expert users of generative AI are able to guide generative models to produce outputs
that are consistent in style7 and spatial location8 —and they can even produce dragons.9 Our
observations highlight the gaps in participants’ understanding of what “black-box” generative
models like ChatGPT are capable of achieving, underscoring a need to build better user intuition as
to these models’ capabilities and how to guide the models to materialize a creative vision [68].
Providing a diverse library of generated outputs coupled with the prompts used to generate them,
as suggested by some of our study participants, may allow for enhancing users’ mental models
of how to guide these AI models to generate the desired content [64]. This may be particularly
effective in the case of models that are able to accept inputs beyond just text.
Design Guideline: Provide a library of example outputs alongside the prompts that generated
them to enhance users’ understanding of how to effectively guide generative models.
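One possible shape for such a library is sketched below; the record fields are assumptions meant only to show how prompts, outputs, and annotations could be stored together.

```python
# Illustrative sketch of prompt-output pair library entries; the schema is an
# assumption, not part of ID.8.
from dataclasses import dataclass


@dataclass
class PromptExample:
    modality: str        # "image", "audio", "music", or "speech"
    prompt: str          # the exact prompt used to generate the artifact
    output_path: str     # path or URL of the generated artifact shown to the user
    notes: str = ""      # e.g., which terms drove the style or kept it consistent


LIBRARY = [
    PromptExample(
        modality="image",
        prompt="a friendly dragon, children's storybook illustration, flat colors",
        output_path="examples/dragon_storybook.png",
        notes="adding 'storybook illustration' keeps the style consistent across scenes",
    ),
]
```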
5.1.3 Leveraging LLMs to Enable Effective Prompt Engineering. Recent studies have demon-
strated LLMs’ abilities in engineering effective prompts for generative models [11, 18, 20, 25,
67, 75]. We observed that while attempting to engineer effective prompts to generate artifacts
that matched their creative vision, some participants turned to Leela (ChatGPT) for assistance in
crafting more detailed prompts. They found that Leela was useful in helping them get closer to
their intended output by providing helpful terms that could guide the diffusion models to pro-
duce more desired outputs, although the participants also mentioned that its prompts were overly
verbose.
P4: “I guess also being familiar with the types of styles or types of animation and drawing,
like the names of certain styles. Like I’m not super familiar with the names of certain styles.
So I wouldn’t know how to describe that besides, like, I don’t know, make it look like my fairy
godparents.”
7 https://twitter.com/chaseleantj/status/1700499968690426006
8 https://twitter.com/Salmaaboukarr/status/1701215610963665067
9 https://twitter.com/art_hiss/status/1701623410848096551
P11: “It took a while for the generation, I guess, to get what I really wanted until I used Leela.
She described, used more descriptive words, and then copying it from there actually helped
a lot. So, like, I didn’t use that feature at the beginning as much…copying from there, even
though the prompts were really long from Leela, picking up some descriptions was really
helpful.”
Our observations suggest there is an opportunity for further research into how best to design
effective collaborations between LLMs and users for the purpose of jointly crafting prompts. Our
prompt for Leela—instructing it to help provide the user with descriptive prompts—was relatively
simple and yet yielded helpful outcomes; future work should focus on how to facilitate constructive
conversations that ensure the LLM gains a clearer understanding of the user’s creative vision [2,
63]. Moreover, exploring this collaborative interaction could be a fruitful avenue for enhancing
creative exploration and imagination, as LLMs may promote new stylistic or functional directions
to explore via their suggested prompts.
Design Guideline: Leverage LLMs to assist users in crafting prompts that effectively communicate
their creative vision to generative models.
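As a rough sketch of this guideline, the snippet below asks a chat model to expand a terse idea into a more descriptive text-to-image prompt; the system message paraphrases the role Leela plays and is not ID.8's exact prompt.

```python
# Hedged sketch of LLM-assisted prompt crafting with the OpenAI chat API; the
# system message paraphrases Leela's role and is not ID.8's exact prompt.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def expand_prompt(user_idea: str) -> str:
    """Ask the LLM to turn a terse idea into a descriptive text-to-image prompt."""
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "You help users write short, vivid, descriptive "
                                          "prompts for a text-to-image model. Reply with the prompt only."},
            {"role": "user", "content": user_idea},
        ],
    )
    return resp.choices[0].message.content.strip()


print(expand_prompt("a cave scene for a fantasy story"))
```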
5.1.4 Simplifying Model Parameter Terminology. Although generative models are able to generate
an increasingly impressive and practically infinite range of content, and their capabilities may
continue to span many more domains, a key limitation is the lack of easily understandable controls
for the novice user [36]. Consequently, we observed that none of our study participants utilized
the advanced model options, such as self-attention mechanisms, denoising steps, or random seeds;
using these advanced controls could help achieve better control over the models’ outputs. The
absence of engagement with these advanced controls was attributed to a lack of familiarity with
these technical terms, suggesting a need for more user-friendly terminology to effectively explain
the use and impact of these controllable parameters. To further elaborate, the term “denoising
steps” could be simplified to “boost clarity” along with a tooltip or brief description that clarifies
its purpose: “The ‘boost clarity’ option helps eliminate random noise from the model’s output,
making it cleaner and more focused, but requiring longer generation time.” The key hurdle lies in
simplifying the language without sacrificing an accurate representation of how these parameters
can impact the model’s behavior. Providing semantically equivalent proxies for advanced model
parameters in the form of sliders or gestural inputs may also be a viable avenue to democratize
advanced model control parameters [15, 37].
Design Guideline: Use intuitive and semantically accurate descriptors or proxies for advanced
controls to improve user understanding and increase usage.
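The sketch below shows one way plain-language proxies could be mapped onto advanced parameters, following the "boost clarity" example above; the proxy names, tooltips, and slider ranges are illustrative.

```python
# Sketch of mapping plain-language proxies onto advanced model parameters;
# proxy names, tooltips, and slider ranges are illustrative assumptions.
PARAMETER_PROXIES = {
    "boost_clarity": {
        "maps_to": "num_inference_steps",  # diffusers' denoising-steps argument
        "tooltip": ("Eliminates random noise from the output, making it cleaner and "
                    "more focused, but requires longer generation time."),
        "slider_range": (10, 50),
    },
    "follow_description_strictly": {
        "maps_to": "guidance_scale",
        "tooltip": "Higher values stick closer to your description but can look less natural.",
        "slider_range": (3, 12),
    },
}


def to_model_kwargs(ui_values: dict[str, float]) -> dict[str, float]:
    """Translate slider values keyed by proxy name into model keyword arguments."""
    return {PARAMETER_PROXIES[name]["maps_to"]: value for name, value in ui_values.items()}
```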
In summary, facilitating more intuitive prompt engineering that empowers users to produce
outputs consistent with their creative vision represents a significant and intricate challenge. Ad-
dressing this issue necessitates the integration of various methods and mediums of interaction to
provide comprehensive support to users [5, 14], particularly in multi-modal domains like visual
storytelling.
Underscoring the concern about inappropriate and harmful outputs, one participant reported a
specific experience that was problematic in terms of both sexualization and racial stereotyping:
P12: “The main frustration I had was working with a story about ninja princesses, which
generated either generic Caucasian princesses or like super fetishized Asian women who had
no clothing on.”
Bias was not confined to sexual or violent content. A participant reported an instance of racial
bias:
P9 : “I asked it to generate, like, a city after an earthquake and all the cities that are generated
were out of almost like a third world country. Then, and then, when I asked to put people in it,
it was really like all the people were people of color.”
The ethical considerations in this context extend beyond mere content moderation or output
filtering; they also concern user safety and emotional well-being. Exposure to such biased or unsafe
content can be especially distressing when the user belongs to the stereotyped or marginalized
group in question. One participant elaborated on this emotional toll:
P12: “I was frustrated at the AI, I was mostly frustrated at like knowing that what makes the
AI work is like the bank of information on the internet and, like, what is mostly available on
the internet. And because it was producing such, like, fetishized, stereotyped images, I knew
that it was because there’s such a large amount of that, like in the world, on the internet. So
that was making me most frustrated. Because it kind of felt like I was being confronted with it
in like a weird way.”
This reiterates the need for maintaining the human-in-control, AI-in-the-loop design philosophy
of our system; however, it also highlights a shortcoming of our current design and emphasizes
the need to safeguard human creators without hindering their creative autonomy. The emotional
stress cited by participants, particularly when their own identity was implicated, underscores
the urgency of resolving these ethical issues. This experience elucidates not only the importance
of implementing robust filtering mechanisms but also points to a crucial need for more ethical
considerations in the design and training of generative models. The impact of these unsafe and
biased outputs on users’ emotional states also highlights the importance of integrating emotional
safety measures, perhaps through the use of trigger warnings or other alert systems, as part of a
holistic approach to system design.
Design Guideline: Integrate safeguard measures such as trigger warnings and automatic content
filters against harmful generative outputs to ensure the emotional well-being of users.
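One way to approximate this guideline for image generation is sketched below: keep the pipeline's built-in safety checker enabled and surface a warning instead of silently showing flagged output. The wording and handling are illustrative, not ID.8's implementation.

```python
# Hedged sketch of an automatic content filter plus user-facing warning around
# image generation; wording and handling are illustrative, not ID.8's design.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")  # leaving safety_checker at its default keeps NSFW filtering on


def generate_with_warning(prompt: str):
    out = pipe(prompt)
    flagged = out.nsfw_content_detected or [False] * len(out.images)
    safe_images = [img for img, bad in zip(out.images, flagged) if not bad]
    if len(safe_images) < len(out.images):
        # Trigger-warning-style message rather than silently dropping content.
        print("Warning: some outputs were filtered as potentially unsafe; "
              "consider rephrasing the prompt.")
    return safe_images
```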
P5: “If you find a similar kind of image, there should be an option for us to tell the AI…yeah,
this is kind of something similar, I want more image like this coming up.”
Recent human-AI co-creative systems (e.g., Talebrush [15], BrainFax [61]) have demonstrated
how more intuitive controls (e.g., sketch-based inputs) can facilitate more natural human-AI
communication, empowering users to materialize their creative vision more easily [46, 69]. Co-
creative systems should either provide users access to a variety of models (e.g., Uni-ControlNet
[74], SketchyGAN [12]) with a range of input modalities or integrate methods to convert the user’s
multi-modal input to a form accepted by the generative model, thereby enabling more intuitive
communication of the user’s creative intent.
Design Guideline: Accept multiple input modalities to enable intuitive communication of intent.
5.3.2 Support Iterative Collaboration to Improve Human-AI Alignment. A noticeable difference
between participants’ perceptions of Leela (ChatGPT) and of the asset generation models
stemmed from the inability to iterate on the generated assets.
P15: “So I typed in to Leela that I wanted a story about a penguin who learns how to share
and then she spit out a story to me that I liked, but I wanted it to be a little bit more complex.
And I didn’t like the name she chose. So I asked for a different name and then I asked for her
to add a couple different elements into it and then she spat out a story that I really liked.”
P12: “And I feel like once images were created, if I kept adding to the prompt and regenerating,
it felt as if it was already on one path. So in order to do something new, I had to like, change
the prompt like entirely and shift things up. And because of that, I feel like I had a hard time
creating an image that I was really satisfied with. I think a lot of the times it was like, OK, I
think this is gonna be the best that is gonna come out. So I was like, just whatever, that’s fine.”
These observations suggest that users may prefer co-creative models that are able to iterate
on their outputs based on user feedback until they produce an acceptable output—as opposed to
models that only produce sequentially independent outputs. The lack of such iterative co-creation
may also lead to a less immersive collaboration experience; the suboptimal Immersiveness and
Partnership MICSI subscale scores from both our studies further indicate a need to improve the
co-creative process through an enhanced sense of collaboration. Incorporating the ability to iterate
on outputs sequentially may require further research on architectural changes to generative models
but could significantly improve user satisfaction and creative outcomes. Co-creative systems should
be designed to allow for iterative output based on real-time user feedback; this could involve
mechanisms that let users adjust the parameters of generated artifacts without starting from
scratch or review and amend intermediate outputs before the final artifact is generated.
Moving beyond iterating on singular outputs, co-creative systems that enable a more fluid,
bi-directional creative process could allow for easier iteration of general creative decisions across
scenes, thereby enhancing the sense of collaboration and immersion while facilitating more creative
experimentation, all while keeping a consistent artistic style. This would require the implementation of
a more unified AI identity and an adaptive creative workflow, as we discuss later (see Section 5.4).
Design Guideline: Support iterative co-creation through dynamic feedback loops.
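One concrete mechanism consistent with this guideline, image-to-image refinement, is sketched below: the previous output seeds the next generation and the user's feedback is appended to the prompt, so iteration does not restart from scratch. This is an assumption-laden sketch of the guideline, not a feature of ID.8.

```python
# Sketch of iterating on a generated image via image-to-image refinement rather
# than regenerating from scratch; parameters and prompt handling are illustrative.
import torch
from diffusers import StableDiffusionImg2ImgPipeline

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")


def refine(previous_image, base_prompt: str, user_feedback: str):
    """Nudge the last output toward the user's feedback instead of starting over."""
    return pipe(
        prompt=f"{base_prompt}, {user_feedback}",
        image=previous_image,
        strength=0.4,  # lower values stay closer to the previous image
    ).images[0]
```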
5.3.3 Parallel Processing Leads to a Sense of Active Collaboration. State-of-the-art generative
models, while powerful, are resource-intensive and often require lengthy generation times on
commercial hardware; this poses a significant challenge for designing immersive, effective human-
AI co-creative systems. We observed that participants were particularly frustrated with the long
wait times for asset generation; they expressed a desire for features that would allow them to
queue or minimize the generation process, enabling them to work on other aspects of their projects
in parallel. Participants emphasized that these wait times significantly affected their co-creative
experience, making it feel less like an active collaboration and more like a turn-taking exercise.
This is reflected in the low Partnership MICSI subscale scores from Study 2.
P8: “I wasn’t sure what to do when I was waiting for the content to be generated. It would be
nice if there would be a way to add a generation to a queue. And then you could still be doing
something on the screen while you’re waiting for it to be generated. I think that would help
with feeling like it was more of an active collaboration. I mean, right now it did feel like there
was a feeling of collaboration to it. But it was sort of like we trade off who’s the one working
on it rather than maybe working on something more actively together.”
Design Guideline: Provide parallel processing capabilities to facilitate a more dynamic, active
co-creation experience.
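A minimal sketch of the queue-and-keep-working pattern participants asked for: generation jobs run in a background executor while the interface stays responsive. The job shape and polling scheme are assumptions, not ID.8's architecture.

```python
# Sketch of queuing generation jobs so users can keep editing while assets are
# produced in the background; the executor and job shape are assumptions.
from concurrent.futures import Future, ThreadPoolExecutor

executor = ThreadPoolExecutor(max_workers=1)  # e.g., one GPU-bound job at a time
pending: dict[str, Future] = {}


def enqueue(job_id: str, generate_fn, *args) -> None:
    """Submit a generation job and return immediately so the UI stays responsive."""
    pending[job_id] = executor.submit(generate_fn, *args)


def collect_finished() -> dict[str, object]:
    """Poll from the UI loop; returns finished assets keyed by job id."""
    done = {job_id: fut.result() for job_id, fut in pending.items() if fut.done()}
    for job_id in done:
        del pending[job_id]
    return done
```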
to generate the desired type of output, by showing new ways of envisioning characters, and so on).
The positive MICSI scores for the Exploration subscale from both our studies further highlight the
utility of generative models in encouraging creative exploration in one way or another.
P7 : “[While creating the storyline] at the very first I thought, oh, this magic boy might be a
naughty boy. He might play some tricks, you know, do some bad things. But when I type it,
the chatbot told me perhaps he could do some good things. So, yeah, [it] totally changed my
mind”
P15: “I think like, once you see the script and then you get inspiration of what you envision
and you want it to match exactly that. You have to kind of like play around a bit. But I found
that a few times I did find like, oh, I was like, okay, this is perfect. This is exactly exactly what
I was thinking. And then other times when I couldn’t quite get what I wanted, it kind of just
took me in a new direction. I was like, oh, okay, I didn’t envision it like this, but this is really
cool.”
P14: “[While] creating images, sometimes whatever you describe is not what it spits out. But
at the same time, it could end up leading you to create new ideas or, like, give you more
inspiration that oh, Okay, the character could look like this and then you try to describe it in a
better way.”
P16, who did not think the system helped boost creativity, suggested that perhaps if the system
suggested a variety of styles to choose from during a “sandbox”-based asset creation process prior
to scene authoring, it would help users be more creative. This indicates that to be an effective co-
creative partner and optimally encourage creative exploration, co-creative agents must dynamically
adapt to individual differences in the creative process; for example, they could support a non-linear
workflow of ideation and creation rather than simply following a linear pipeline (e.g., from storyline
crafting to asset creation to visual story construction). However, exactly how AI should support
exploration and shape the creative process is unclear. Balancing user control and AI input
is a complex human–AI interaction problem that requires further research [40].
Design Guideline: Support individual creative workflows to encourage the exploration of a variety
of creative directions.
7 Conclusion
Generative AI has the potential to lower barriers to creative expression and visual story generation.
Our work contributes ID.8, an open-source, end-to-end visual story authoring system that integrates
state-of-the-art generative models into the creative workflow to lower these barriers. Our evaluation
demonstrates the potential of human-AI co-creative systems such as ID.8 and elucidates areas for
improvement along with challenges users face when collaborating with generative AI in creative
work. Our findings inform design guidelines for the future human-AI co-creative systems.
Appendices
A Prompt for Storyline Creator
The GPT-3.5 model that powers the Storyline Creator’s chat module (i.e., Leela) was initialized with
the following system prompt: “Speak as if you are collaboratively creating a story with the user.
Try to iteratively and collaboratively create the story with the user by asking the user questions
that determine story content and progression; feel free to suggest your own thoughts on what
would be good to add”
To generate the screenplay, another GPT-3.5 model is initialized with the following system
prompt: “you are creative, imaginative screen writer”. Then, the co-created storyline is passed to
this model with the following prompt: “for the storyline provided, provide a screenplay in JSON
format as a list of scenes each in the following format: {‘sceneName’: ‘’,‘backgroundDescription’: ‘’,
‘narration’: ‘’,‘characters’:[‘’],‘dialogue’:[{‘speaker’:‘’,‘speech’:‘’}]}—no extra commentary, balance
narration 60% and dialogue: 40%, provide each scene a descriptive name. backgroundDescription
should have a short, simple description of the background setting of the scene. do not use double
quotes: [storyline appended here]”
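For illustration, the sketch below parses a screenplay in the scene format specified above into typed records; the class names mirror the JSON keys, the parsing code itself is an assumption, and it presumes the model's reply is valid JSON.

```python
# Sketch of parsing the screenplay returned by the prompt above; class names
# mirror the JSON keys, and the reply is assumed to be valid JSON.
import json
from dataclasses import dataclass, field


@dataclass
class DialogueLine:
    speaker: str
    speech: str


@dataclass
class Scene:
    sceneName: str
    backgroundDescription: str
    narration: str
    characters: list[str] = field(default_factory=list)
    dialogue: list[DialogueLine] = field(default_factory=list)


def parse_screenplay(raw: str) -> list[Scene]:
    return [
        Scene(
            sceneName=s["sceneName"],
            backgroundDescription=s["backgroundDescription"],
            narration=s["narration"],
            characters=s.get("characters", []),
            dialogue=[DialogueLine(**d) for d in s.get("dialogue", [])],
        )
        for s in json.loads(raw)
    ]
```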
C Pre-Study Survey
We collected age, gender, and educational background/field of work in the pre-study survey. Moreover,
we collected responses to the following questions using a 5-point Likert scale.
(1) How would you rate your overall familiarity with large language models? (e.g., chatGPT,
bard, llama)
(2) How would you rate your overall familiarity with diffusion models? (e.g., DallE, stable
diffusion)
(3) How familiar are you with creating visual stories or any other form of visual content?
(4) To what extent do you agree with this statement: I am a creative person
E MICSI Questionnaire
Note: we replaced the term “sketch” in original MICSI questions 15 through 18 with the term “story”
to better match our study’s context.
# Name/Subscale Question(s)
1 Enjoyment “I would be happy to use this system or tool on a regular basis.”
2 “I enjoyed using the system or tool.”
3 Exploration “It was easy for me to explore many different ideas, options, designs, or
outcomes, using this system or tool.”
4 “The system or tool was helpful in allowing me to track different ideas,
outcomes, or possibilities”
5 Expressiveness “I was able to be very creative while doing the activity inside this system or
tool.”
6 “The system or tool allowed me to be very expressive.”
7 Immersiveness “My attention was fully tuned to the activity, and I forgot about the system
or tool that I was using.”
8 “I became so absorbed in the activity that I forgot about the system or tool
that I was using.”
9 Worth “I was satisfied with what I got out of the system or tool.”
10 “What I was able to produce was worth the effort I had to exert to produce
it.”
11 Communication “I was able to effectively communicate what I wanted to the system.”
12 Alignment “I was able to steer the system toward output that was aligned with my goals.”
13 Agency “At times, I felt that the system was steering me toward its own goals.”
14 Partnership “At times, it felt like the system and I were collaborating as equals.”
15 Contribution “I made the story” vs “The system made the story.”
16 Satisfaction “I’m very unsatisfied with the story” vs “I’m very satisfied with the story.”
17 Surprise “The story was what I was aiming for” vs “The story outcome was unexpected.”
18 Novelty “The story is very typical” vs “The story is very novel.”
References
[1] Prithviraj Ammanabrolu, Ethan Tien, Wesley Cheung, Zhaochen Luo, William Ma, Lara J. Martin, and Mark O Riedl.
2020. Story realization: Expanding plot events into sentences. In Proceedings of the AAAI Conference on Artificial
Intelligence, Vol. 34. 7375–7382.
[2] Seungho Baek, Hyerin Im, Jiseung Ryu, Juhyeong Park, and Takyeon Lee. 2023. PromptCrafter: Crafting text-to-image
prompt through mixed-initiative dialogue with LLM. arXiv:2307.08985. Retrieved from https://arxiv.org/abs/2307.08985
[3] Aaron Bangor, Philip Kortum, and James Miller. 2009. Determining what individual SUS scores mean: Adding an
adjective rating scale. Journal of Usability Studies 4, 3 (2009), 114–123.
[4] Weizhen Bian, Yijin Song, Nianzhen Gu, Tin Yan Chan, Tsz To Lo, Tsun Sun Li, King Chak Wong, Wei Xue, and
Roberto Alonso Trillo. 2023. MoMusic: A motion-driven human-AI collaborative music composition and performing
system. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 37. 16057–16062.
[5] Stephen Brade, Bryan Wang, Mauricio Sousa, Sageev Oore, and Tovi Grossman. 2023. Promptify: Text-to-image
generation through interactive prompt exploration with large language models. In Proceedings of the 36th Annual
ACM Symposium on User Interface Software and Technology. 1–14.
[6] Andrew S. Bradlyn, Ivan L. Beale, and Pamela M. Kato. 2003. Psychoeducational interventions with pediatric cancer
patients: Part I. Patient information and knowledge. Journal of Child and Family Studies 12 (2003), 257–277.
[7] John Brooke. 1996. SUS: A ‘quick and dirty’ usability scale. Usability Evaluation in Industry 189, 3 (1996), 189–194.
[8] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D. Kaplan, Prafulla Dhariwal, Arvind Neelakantan,
Pranav Shyam, Girish Sastry, Amanda Askell. 2020. Language models are few-shot learners. Advances in Neural
Information Processing Systems 33 (2020), 1877–1901.
[9] Daniel Buschek, Lukas Mecke, Florian Lehmann, and Hai Dang. 2021. Nine potential pitfalls when designing human-AI
co-creative systems. arXiv:2104.00358. Retrieved from https://arxiv.org/abs/2104.00358
[10] Tony C. Caputo. 2003. Visual Storytelling: The Art and Technique. Watson-Guptill Publications.
[11] Tuhin Chakrabarty, Arkadiy Saakyan, Olivia Winn, Artemis Panagopoulou, Yue Yang, Marianna Apidianaki, and
Smaranda Muresan. 2023. I spy a metaphor: Large language models and diffusion models co-create visual metaphors.
arXiv:2305.14724. Retrieved from https://arxiv.org/abs/2305.14724
[12] Wengling Chen and James Hays. 2018. Sketchygan: Towards diverse and realistic sketch to image synthesis. In
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 9416–9425.
[13] Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham,
Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, Parker Schuh, Kensen Shi, Sasha Tsvyashchenko, Joshua
Maynez, Abhishek Rao, Parker Barnes, Yi Tay, Noam Shazeer, Vinodkumar Prabhakaran, Emily Reif, Nan Du, Ben
Hutchinson, Reiner Pope, James Bradbury, Jacob Austin, Michael Isard, Guy Gur-Ari, Pengcheng Yin, Toju Duke,
Anselm Levskaya, Sanjay Ghemawat, Sunipa Dev, Henryk Michalewski, Xavier Garcia, Vedant Misra, Kevin Robinson,
Liam Fedus, Denny Zhou, Daphne Ippolito, David Luan, Hyeontaek Lim, Barret Zoph, Alexander Spiridonov, Ryan
Sepassi, David Dohan, Shivani Agrawal, Mark Omernick, Andrew M. Dai, Thanumalayan Sankaranarayana Pillai,
Marie Pellat, Aitor Lewkowycz, Erica Moreira, Rewon Child, Oleksandr Polozov, Katherine Lee, Zongwei Zhou,
Xuezhi Wang, Brennan Saeta, Mark Diaz, Orhan Firat, Michele Catasta, Jason Wei, Kathy Meier-Hellstern, Douglas
Eck, Jeff Dean, Slav Petrov, and Noah Fiedel. 2022. Palm: Scaling language modeling with pathways. arXiv:2204.02311.
Retrieved from https://arxiv.org/abs/2204.02311
[14] John Joon Young Chung and Eytan Adar. 2023. PromptPaint: Steering text-to-image generation through paint medium-
like interactions. In Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology.
1–17.
[15] John Joon Young Chung, Wooseok Kim, Kang Min Yoo, Hwaran Lee, Eytan Adar, and Minsuk Chang. 2022. TaleBrush:
Visual sketching of story generation with pretrained language models. In Proceedings of the Extended Abstracts of the
2022 CHI Conference on Human Factors in Computing Systems (CHI EA ’22). 1–4.
[16] Jade Copet, Felix Kreuk, Itai Gat, Tal Remez, David Kant, Gabriel Synnaeve, Yossi Adi, and Alexandre Défossez. 2023.
Simple and controllable music generation. arXiv:2306.05284. Retrieved from https://arxiv.org/abs/2306.05284
[17] Prafulla Dhariwal, Heewoo Jun, Christine Payne, Jong Wook Kim, Alec Radford, and Ilya Sutskever. 2020. Jukebox: A
generative model for music. arXiv:2005.00341. Retrieved from https://arxiv.org/abs/2005.00341
[18] Wala Elsharif, James She, Preslav Nakov, and Simon Wong. 2023. Enhancing Arabic content generation with prompt
augmentation using integrated GPT and text-to-image models. In Proceedings of the 2023 ACM International Conference
on Interactive Media Experiences. 276–288.
[19] Deep Ganguli, Danny Hernandez, Liane Lovitt, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, Nova Dassarma,
Dawn Drain, Nelson Elhage, Sheer El Showk, Stanislav Fort, Zac Hatfield-Dodds, Scott Johnston, Shauna Kravec,
Neel Nanda, Kamal Ndousse, Catherine Olsson, Daniela Amodei, Dario Amodei, Tom Brown, Jared Kaplan, Sam
McCandlish, Chris Olah, and Jack Clark. 2022. Predictability and surprise in large generative models. In Proceedings of
the 2022 ACM Conference on Fairness, Accountability, and Transparency. 1747–1764.
[20] Deepanway Ghosal, Navonil Majumder, Ambuj Mehrish, and Soujanya Poria. 2023. Text-to-audio generation using
instruction-tuned LLM and latent diffusion model. arXiv:2304.13731. Retrieved from https://arxiv.org/abs/2304.13731
[21] Roberto Gozalo-Brizuela and Eduardo C. Garrido-Merchán. 2023. A survey of generative AI applications.
arXiv:2306.02781. Retrieved from https://arxiv.org/abs/2306.02781
[22] Ariel Han and Zhenyao Cai. 2023. Design implications of generative AI systems for visual storytelling for young
learners. In Proceedings of the 22nd Annual ACM Interaction Design and Children Conference. 470–474.
[23] Daphne Ippolito, Ann Yuan, Andy Coenen, and Sehmon Burnam. 2022. Creative writing with an AI-powered writing
assistant: Perspectives from professional writers. arXiv:2211.05030. Retrieved from https://arxiv.org/abs/2211.05030
[24] Amir Jahanlou and Parmit K. Chilana. 2022. Katika: An end-to-end system for authoring amateur explainer motion
graphics videos. In Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems. 1–14.
[25] Hyeonho Jeong, Gihyun Kwon, and Jong Chul Ye. 2023. Zero-shot generation of coherent storybook from plain text
story using diffusion models. arXiv:2302.03900. Retrieved from https://arxiv.org/abs/2302.03900
[26] Pegah Karimi, Jeba Rezwana, Safat Siddiqui, Mary Lou Maher, and Nasrin Dehbozorgi. 2020. Creative sketching
partner: An analysis of human-AI co-creativity. In Proceedings of the 25th International Conference on Intelligent User
Interfaces. 221–230.
[27] Nam Wook Kim, Nathalie Henry Riche, Benjamin Bach, Guanpeng Xu, Matthew Brehmer, Ken Hinckley, Michel
Pahud, Haijun Xia, Michael J. McGuffin, and Hanspeter Pfister. 2019. DataToon: Drawing dynamic network comics
with pen + touch interaction. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems.
1–12.
[28] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer
Whitehead, Alexander C. Berg, Wan-Yen Lo, Piotr Dollár, and Ross Girshick. 2023. Segment anything. arXiv:2304.02643.
Retrieved from https://arxiv.org/abs/2304.02643
[29] Tim Knapp. 2023. Situating large language models within the landscape of digital storytelling. In Proceedings of the
MEi: CogSci Conference, Vol. 17.
[30] Felix Kreuk, Gabriel Synnaeve, Adam Polyak, Uriel Singer, Alexandre Défossez, Jade Copet, Devi Parikh, Yaniv
Taigman, and Yossi Adi. 2022. AudioGen: Textually guided audio generation. arXiv:2209.15352. Retrieved from
https://arxiv.org/abs/2209.15352
[31] Tomas Lawton, Kazjon Grace, and Francisco J. Ibarrola. 2023. When is a tool a tool? User perceptions of system
agency in human–AI co-creative drawing. In Proceedings of the 2023 ACM Designing Interactive Systems Conference.
1978–1996.
[32] Tomas Lawton, Francisco J. Ibarrola, Dan Ventura, and Kazjon Grace. 2023. Drawing with reframer: Emergence
and control in co-creative AI. In Proceedings of the 28th International Conference on Intelligent User Interfaces.
264–277.
[33] Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov,
and Luke Zettlemoyer. 2019. BART: Denoising sequence-to-sequence pre-training for natural language generation,
translation, and comprehension. arXiv:1910.13461. Retrieved from https://arxiv.org/abs/1910.13461
[34] Cun Li, Jun Hu, Bart Hengeveld, and Caroline Hummels. 2019. Story-Me: Design of a system to support intergenerational
storytelling and preservation for older adults. In Companion Publication of the 2019 Designing Interactive Systems
Conference (DIS ’19 Companion). 245–250.
[35] Xin Li, Wenqing Chu, Ye Wu, Weihang Yuan, Fanglong Liu, Qi Zhang, Fu Li, Haocheng Feng, Errui Ding, and Jingdong
Wang. 2023. VideoGen: A reference-guided latent diffusion approach for high definition text-to-video generation.
arXiv:2309.00398. Retrieved from https://arxiv.org/abs/2309.00398
[36] Vivian Liu and Lydia B. Chilton. 2022. Design guidelines for prompt engineering text-to-image generative models. In
Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems. 1–23.
[37] Ryan Louie, Andy Coenen, Cheng Zhi Huang, Michael Terry, and Carrie J. Cai. 2020. Novice-AI music co-creation
via AI-steering tools for deep generative models. In Proceedings of the 2020 CHI Conference on Human Factors in
Computing Systems. 1–13.
[38] Kristijan Mirkovski, James E. Gaskin, David M. Hull, and Paul Benjamin Lowry. 2019. Visual storytelling for improving
the comprehension and utility in disseminating information systems research: Evidence from a quasi-experiment.
Information Systems Journal 29, 6 (2019), 1153–1177.
[39] Eric Mörth, Stefan Bruckner, and Noeska N. Smit. 2022. ScrollyVis: Interactive visual authoring of guided dynamic
narratives for scientific scrollytelling. IEEE Transactions on Visualization and Computer Graphics (2022).
[40] Changhoon Oh, Jungwoo Song, Jinhan Choi, Seonghyeon Kim, Sungwoo Lee, and Bongwon Suh. 2018. I lead, you help
but only with enough details: Understanding user experience of co-creation with artificial intelligence. In Proceedings
of the 2018 CHI Conference on Human Factors in Computing Systems. 1–13.
[41] Jonas Oppenlaender. 2022. The creativity of text-to-image generation. In Proceedings of the 25th International Academic
Mindtrek Conference. 192–202.
[42] Jonas Oppenlaender. 2022. A taxonomy of prompt modifiers for text-to-image generation. arXiv:2204.13988. Retrieved
from https://arxiv.org/abs/2204.13988
[43] Jonas Oppenlaender, Rhema Linder, and Johanna Silvennoinen. 2023. Prompting AI art: An investigation into the
creative skill of prompt engineering. arXiv:2303.13534. Retrieved from https://arxiv.org/abs/2303.13534
[44] Hiroyuki Osone, Jun-Li Lu, and Yoichi Ochiai. 2021. BunCho: AI supported story co-creation via unsupervised
multitask learning to increase writers’ creativity in Japanese. In Proceedings of the Extended Abstracts of the 2021 CHI
Conference on Human Factors in Computing Systems. 1–10.
[45] Nikita Pavlichenko and Dmitry Ustalov. 2023. Best prompts for text-to-image models and how to find them. In
Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval.
2067–2071.
[46] Han Qiao, Vivian Liu, and Lydia Chilton. 2022. Initial images: Using image prompts to improve subject representation
in multimodal AI generated art. In Proceedings of the 14th Conference on Creativity and Cognition. 15–28.
[47] Chia Yi Quah and Kher Hui Ng. 2022. A systematic literature review on digital storytelling authoring tool in education:
January 2010 to January 2020. International Journal of Human–Computer Interaction 38, 9 (2022), 851–867.
[48] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. 2022. Hierarchical text-conditional image
generation with clip latents. arXiv:2204.06125. Retrieved from https://arxiv.org/abs/2204.06125
[49] Jeba Rezwana and Mary Lou Maher. 2022. Identifying ethical issues in AI partners in human-AI co-creation.
arXiv:2204.07644. Retrieved from https://arxiv.org/abs/2204.07644
[50] Jeba Rezwana, Mary Lou Maher, and Nicholas Davis. 2021. Creative PenPal: A virtual embodied conversational AI
agent to improve user engagement and collaborative experience in human-AI co-creative design ideation. In Joint
Proceedings of the ACM IUI 2021 Workshops.
[51] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. 2022. High-resolution image
synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
Recognition. 10684–10695.
[52] Carolina Beniamina Rutta, Gianluca Schiavo, and Massimo Zancanaro. 2019. Comic-based digital storytelling for
self-expression: An exploratory case-study with migrants. In Proceedings of the 9th International Conference on
Communities & Technologies - Transforming Communities. 9–13.
[53] Nisha Simon and Christian Muise. 2022. TattleTale: Storytelling with planning and large language models. In Proceed-
ings of the ICAPS Workshop on Scheduling and Planning Applications.
[54] Helena Romano Snyder and Israel Colon. 1988. Foreign language acquisition and audio-visual aids. Foreign Language
Annals 21, 4 (1988), 343–348.
[55] Michelle Scalise Sugiyama. 2001. Narrative theory and function: Why evolution matters. Philosophy and Literature 25,
2 (2001), 233–250.
[56] Sangho Suh, Sydney Lamorea, Edith Law, and Leah Zhang-Kennedy. 2022. PrivacyToon: Concept-driven storytelling
with creativity support for privacy concepts. In Proceedings of the Designing Interactive Systems Conference.
41–57.
[57] Lingyun Sun, Pei Chen, Wei Xiang, Peng Chen, Wei-yue Gao, and Ke-jun Zhang. 2019. SmartPaint: A co-creative
drawing system based on generative adversarial networks. Frontiers of Information Technology & Electronic Engineering
20, 12 (2019), 1644–1656.
[58] Ben Swanson, Kory Mathewson, Ben Pietrzak, Sherol Chen, and Monica Dinalescu. 2021. Story Centaur: Large language
model few shot learning as a creative writing tool. In Proceedings of the 16th Conference of the European Chapter of the
Association for Computational Linguistics: System Demonstrations. 244–256.
[59] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov,
Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen,
Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami,
Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian
Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana
Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie,
Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith,
Ranjan Subramanian, Xiaoqing Ellen Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng
Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurelien Rodriguez, Robert Stojnic,
Sergey Edunov, and Thomas Scialom. 2023. Llama 2: Open foundation and fine-tuned chat models. arXiv:2307.09288.
Retrieved from https://arxiv.org/abs/2307.09288
[60] Kelly L. A. van Bindsbergen, Hinke van der Hoek, Marloes van Gorp, Mike E. U. Ligthart, Koen V. Hindriks, Mark A.
Neerincx, Tanja Alderliesten, Peter A. N. Bosman, Johannes H. M. Merks, Martha A. Grootenhuis, and Raphaële R. L.
van Litsenburg. 2022. Interactive education on sleep hygiene with a social robot at a pediatric oncology outpatient
clinic: Feasibility, experiences, and preliminary effectiveness. Cancers 14, 15 (2022), 3792.
[61] Mathias Peter Verheijden and Mathias Funk. 2023. Collaborative diffusion: Boosting designerly co-creation with
generative AI. In Proceedings of the Extended Abstracts of the 2023 CHI Conference on Human Factors in Computing
Systems. 1–8.
[62] Andrey Voynov, Kfir Aberman, and Daniel Cohen-Or. 2023. Sketch-guided text-to-image diffusion models. In Proceed-
ings of the ACM SIGGRAPH 2023 Conference (SIGGRAPH ’23). 1–11.
[63] Yunlong Wang, Shuyuan Shen, and Brian Y. Lim. 2023. RePrompt: Automatic prompt editing to refine AI-generative
art towards precise expressions. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems.
1–29.
[64] Zijie J. Wang, Evan Montoya, David Munechika, Haoyang Yang, Benjamin Hoover, and Duen Horng Chau. 2022.
DiffusionDB: A large-scale prompt gallery dataset for text-to-image generative models. arXiv:2210.14896. Retrieved
from https://arxiv.org/abs/2210.14896
[65] Justin D. Weisz, Michael Muller, Jessica He, and Stephanie Houde. 2023. Toward general design principles for generative
AI applications. arXiv:2301.05578. Retrieved from https://arxiv.org/abs/2301.05578
[66] Qiang Wu, Baixue Zhu, Binbin Yong, Yongqiang Wei, Xuetao Jiang, Rui Zhou, and Qingguo Zhou. 2021. ClothGAN:
Generation of fashionable Dunhuang clothes using generative adversarial networks. Connection Science 33, 2 (2021),
341–358.
[67] Chengrun Yang, Xuezhi Wang, Yifeng Lu, Hanxiao Liu, Quoc V. Le, Denny Zhou, and Xinyun Chen. 2023. Large
language models as optimizers. arXiv:2309.03409. Retrieved from https://arxiv.org/abs/2309.03409
[68] J. D. Zamfirescu-Pereira, Richmond Y. Wong, Bjoern Hartmann, and Qian Yang. 2023. Why Johnny can’t prompt:
How non-AI experts try (and fail) to design LLM prompts. In Proceedings of the 2023 CHI Conference on Human Factors
in Computing Systems. 1–21.
[69] Chengzhi Zhang, Weijie Wang, Paul Pangaro, Nikolas Martelaro, and Daragh Byrne. 2023. Generative image AI
using design sketches as input: Opportunities and challenges. In Proceedings of the 15th Conference on Creativity and
Cognition. 254–261.
[70] Chao Zhang, Cheng Yao, Jiayi Wu, Weijia Lin, Lijuan Liu, Ge Yan, and Fangtian Ying. 2022. StoryDrawer: A child–AI
collaborative drawing system to support children’s creative visual storytelling. In Proceedings of the 2022 CHI
Conference on Human Factors in Computing Systems. 1–15.
[71] Chenshuang Zhang, Chaoning Zhang, Mengchun Zhang, and In So Kweon. 2023. Text-to-image diffusion model in
generative AI: A survey. arXiv:2303.07909. Retrieved from https://arxiv.org/abs/2303.07909
[72] Lvmin Zhang and Maneesh Agrawala. 2023. Adding conditional control to text-to-image diffusion models.
arXiv:2302.05543. Retrieved from https://arxiv.org/abs/2302.05543
[73] Yixiao Zhang, Gus Xia, Mark Levy, and Simon Dixon. 2021. COSMIC: A conversational interface for human-AI music
co-creation. In Proceedings of the New Interfaces for Musical Expression (NIME ’21).
[74] Shihao Zhao, Dongdong Chen, Yen-Chun Chen, Jianmin Bao, Shaozhe Hao, Lu Yuan, and Kwan-Yee K. Wong.
2023. Uni-ControlNet: All-in-one control to text-to-image diffusion models. arXiv:2305.16322. Retrieved from
https://arxiv.org/abs/2305.16322
[75] Yongchao Zhou, Andrei Ioan Muresanu, Ziwen Han, Keiran Paster, Silviu Pitis, Harris Chan, and Jimmy Ba. 2022. Large
language models are human-level prompt engineers. arXiv:2211.01910. Retrieved from https://arxiv.org/abs/2211.01910