Design Guidelines for Prompt Engineering Text-to-Image Generative Models
Vivian Liu, Columbia University, New York, New York, USA ([email protected])
Lydia B. Chilton, Columbia University, New York, New York, USA ([email protected])
Figure 1: An example grid of text-to-image generations generated from the following prompt template: "SUBJECT in the style of STYLE". We analyze over 5000 generations in a series of five experiments involving 51 subjects and 51 styles to study what prompt parameters and hyperparameters can help people produce better outcomes from text-to-image generative models.
ABSTRACT
Text-to-image generative models are a new and powerful way to generate visual artwork. However, the open-ended nature of text as interaction is double-edged; while users can input anything and have access to an infinite range of generations, they also must engage in brute-force trial and error with the text prompt when the result quality is poor. We conduct a study exploring what prompt keywords and model hyperparameters can help produce coherent outputs. In particular, we study prompts structured to include subject and style keywords and investigate success and failure modes of these prompts. Our evaluation of 5493 generations over the course of five experiments spans 51 abstract and concrete subjects as well as 51 abstract and figurative styles. From this evaluation, we present design guidelines that can help people produce better outcomes from text-to-image generative models.

CHI '22, April 29-May 5, 2022, New Orleans, LA, USA
© 2022 Copyright held by the owner/author(s). Publication rights licensed to ACM.
ACM ISBN 978-1-4503-9157-3/22/04 . . . $15.00
https://doi.org/10.1145/3491102.3501825
CCS CONCEPTS
• Human-centered computing → Empirical studies in HCI; • Computing methodologies → Neural networks; • Applied computing → Media arts.

KEYWORDS
design guidelines, AI co-creation, computational creativity, multimodal generative models, text-to-image, prompt engineering

ACM Reference Format:
Vivian Liu and Lydia B. Chilton. 2022. Design Guidelines for Prompt Engineering Text-to-Image Generative Models. In CHI Conference on Human Factors in Computing Systems (CHI '22), April 29-May 5, 2022, New Orleans, LA, USA. ACM, New York, NY, USA, 23 pages. https://doi.org/10.1145/3491102.3501825

1 INTRODUCTION
Recently, advances in computer vision have introduced methods that are remarkable at generating images based upon text prompts [37, 38]. For example, OpenAI introduced DALL-E, one such text-to-image model, in 2020 and demonstrated that from running a text prompt such as "a radish dancing in a tutu", the model could generate many images matching the prompt. Based on this progress, artists, programmers, and researchers have come together in communities within Reddit [39] and Twitter [2] and developed different open-source models for text-to-image generation. Tutorials [1, 13] and interactive notebooks [46] maintained by community members such as @RiverHaveWings, @advadnoun, and @somewheresy on Twitter have made these tools broadly accessible [3, 12, 24].

Text is free-form and open-ended, so the possibilities for image generation from text prompts are endless. However, this also means that the design process for generating an image can easily become brute-force trial and error. People must search for a new text prompt each time they want to iterate upon their generation, a process that can feel random and unprincipled. In the field of natural language processing, this problem is known as prompt engineering [40]. Prompt engineering is the formal search for prompts that retrieve desired outcomes from language models, where what is desirable is dependent upon the end task and end user [42]. There are a number of open questions within prompt engineering to explore for text-to-image models. Some questions relate to hyperparameters: how do variables influencing the length of optimization and random initializations affect model outcomes? Other questions involve the prompt: are there certain classes of words or sentence phrasings that yield better outcomes? These questions are necessary for the HCI community to answer so technical advancements in machine learning such as prompt engineering and multimodal models can be translated into usable interaction paradigms.

To explore the generative possibilities of this system, we systematically approach prompt engineering for a family of prompts that have found traction among practitioners working with text-to-image systems: "SUBJECT in the style of STYLE" prompts. In this paper, we address key questions around prompt engineering in a series of five experiments:
• Experiment 1. We test different phrasings of the prompt to see how modulating the language of the prompt with different orderings, function words, and filler words affects generation quality.
• Experiment 2. We test different random initializations to find an optimal range of generations to produce for each prompt, accounting for the probabilistic behavior of text-to-image frameworks.
• Experiment 3. We vary and study the number of iterations to find an optimal range for the length of optimization.
• Experiment 4. We explore styles as a parameter of the prompt to understand the breadth of styles the system can reproduce. Specifically, we explore 51 styles spanning different time periods (modern vs. premodern vs. digital), schools of culture (Western vs. non-Western), and levels of abstraction (abstract vs. figurative). Additionally, we look for biases across these different partitions of styles.
• Experiment 5. We explore subjects as a parameter of the prompt to understand how subject and styles interact with each other. We tested 51 subjects across 31 styles to explore whether the system is better at producing abstract subjects or concrete subjects given an abstract or a figurative style.

In Experiments 4 and 5, we provide qualitative analysis of observed success and failure modes one might encounter while working with text-to-image generation. We conclude with design guidelines to help end users prompt text-to-image models for observed success modes and steer away from observed failure modes.

2 RELATED WORK

2.1 Generative Methods as Creativity Support Tools
Artist and programmer communities have consistently shown interest in the potential of generative AI as an art medium. Communities conversing about artistic AI have developed for a long time around networks such as DeepDream [32], neural style transfer networks [16], and generative adversarial networks [6, 21, 26]. Likewise, HCI researchers have sought to understand how generative AI can become a creativity support tool for artists. In recent years, systems embedding generative models have been successfully applied to domains such as image generation, poetry, and music [19, 20, 22, 23].

Often, researchers try to leverage the large space of design solutions that generative methods provide to assist users during ideation and iteration [30]. However, to do so, researchers have to understand how users can explore these design solution spaces efficiently and effectively. This is an open question that has been investigated through a number of research approaches. For example, Matejka et al. [31] introduced Dream Lens, a system implementing design galleries and interactive data visualization to visualize diverse design solutions from a 3D modeling generative program. Yang et al. [29] proposed latent space cartography, which used dimensionality reduction to explore the latent design space of generative AI models. Shimizu et al. [43] proposed Design Adjectives, a system that helped users parameterize the design space by first giving examples of what attributes they liked and disliked for the design of fonts, materials, and motion graphics.

One drawback of many of these AI-based approaches is that while they can create an inexhaustible number of generations, they lack meaningful and interpretable controls for users.
This problem has given rise to an area of research at the intersection of HCI and AI focused on semantically meaningful exploration. One of the earliest works in this direction was a seminal creativity support system called AttribIt, which allowed users to assemble 3D models given data-driven suggestions. These suggestions supported semantic goals users crafted for their creations (i.e. creating a "cute" or "dangerous" 3D animal) [10]. More recent work leveraging deep learning has also tried to produce semantically meaningful editing operations. For example, Louie et al. [30] introduced CoCoCo, an AI music creation system which lets users move sliders to tune their generated music to be "happier" or "sadder". Geppetto was another analogous mixed-initiative, co-creative system that generated robot animation according to mood-related semantic goals [14, 30]. Systems that can directly support user goals in a semantically meaningful way are both more interpretable and more usable.

2.2 Text-to-Image Generation
This interest in involving semantics and natural language as a form of interaction with generative models has recently found success in machine learning. Recent work within representation learning has focused on learning text and image understanding together by coupling the two modalities through a contrastive objective during optimization. In 2021, Radford et al. [37] from OpenAI introduced CLIP, a method for learning multimodal image representations. CLIP was trained on an Internet-scale dataset of 400 million image and text pairs to learn a multimodal embedding space that incorporated both text and image understanding. CLIP demonstrated that the model was able to learn "visual concepts...enabling zero-shot transfer of the model" on various tasks such as OCR, geolocalization, and others. CLIP was used in DALL-E [38], one of the state-of-the-art models for text-to-image generation. DALL-E learned a transformer that autoregressively predicted text and image tokens together in one sequence. The authors of DALL-E demonstrated how the model could handle image operations, perform style transfer, and produce novel combinations of elements. An outcropping of text-to-image architectures achieving similar functions followed: DM-GAN [49], VQGAN+CLIP [35], BigSleep (BigGAN+CLIP) [34], DeepDaze (SIREN+CLIP) [33], and CLIP-guided diffusion [11]. Many models were open sourced and advanced within the creative technologist community.

2.3 Prompt Engineering
Researchers and practitioners alike now tackle the open problem of prompt engineering for large pretrained models. Most work in prompt engineering has concentrated within the text generation problem from natural language processing. The term prompt engineering originally came from a popular post online about GPT-3 (a large language model) and its capabilities for writing creative fiction. The author, Gwern Branwen [5], suggested that prompt engineering could become a new paradigm for interaction; users need only figure out how to prompt a model to elicit the specific knowledge and abstractions necessary for completing downstream tasks. Follow-up work from practitioners has disseminated prompt engineering methods and tricks such as prefix-tuning and using few-shot examples [9].

This paradigm was formalized by Liu et al. [28], who referred to this emerging paradigm as "pretrain, predict, and prompt". They further enumerated a schema for prompt templates, categorizing prompts based on prompt shape (cloze and prefix prompts), answer engineering (answered and filled prompts), and task-specific prompts (i.e. prompt templates tailored for tasks like summarization or translation). Additionally, they expanded on alternative approaches to prompt engineering such as automated template learning and multi-prompt engineering.

While momentum has started to build in prompt engineering for text generation purposes, less work has been done to rigorously examine how users can prompt generative frameworks with natural language for visual generation purposes, which is the focus of this paper. To our knowledge, one of the few works close to ours is by Ge and Parikh [17], who utilize BigSleep (BigGAN+CLIP) and DeepDaze (SIREN+CLIP) for visual conceptual blending. Their approach used BERT [15] to generate prompts and help users make visual blends, using shape keywords to prime the generation.

So far, progress on prompt engineering for visual tasks and end user usability has been made informally and in an ad hoc fashion. Creative technologists have discussed tricks and keywords that help tune models towards their aesthetic goals. For example, Aran Komatsuzaki, a prominent artist and research programmer, noted that using 'unreal engine' as a prompt helped them add a hyperrealistic, 3D render quality to their image generation [27]. This tweet and many others in the same vein established a growing trend within the artistic community to structure prompts with the template "X in the style of Y", where Y would be an artist or art movement that CLIP would ideally have knowledge of. In the experiments in this paper, we evaluate this family of prompts to systematically conduct prompt engineering.

2.4 Probing through Prompt Engineering
Literature has shown that evaluating a constrained set of keywords and prompts can help better explain and interpret learned models. For example, influential work by Caliskan et al. [8] used sets of words to quantifiably demonstrate bias within word embeddings. Specifically, they studied the GloVe word embedding and showed that small sets of gendered words significantly correlated with attribute words, identifying associations such as female-gendered words with family-oriented words and male-gendered words with career-oriented words. These experiments helped formulate a global understanding of a computational model and the biases embedded within it.

A significant amount of work has also gone into probing and interpreting what large pretrained models learn and utilize at inference time. Work on BERT such as "A Primer on BERTology" [41] and "BERT Rediscovers the Classical NLP Pipeline" [44] has probed what BERT learns across its layers and what world knowledge it holds within. For example, [41] states that BERT "struggles with abstract attributes of objects as well as visual and perceptual properties are assumed rather than mentioned."

It is important to apply this direction of research to multimodal models such as CLIP (which is a key component within VQGAN+CLIP and multiple other text-to-image generation frameworks) and to understand what CLIP holds within its knowledge distribution.
Understanding the local behavior and global knowledge distributions of AI models can help users develop better mental models of them as agents [18]. Using prompts to generate image evidence of AI knowledge is also a way of reducing uncertainty with AI [48]. Prompt engineering thus is both a human-computer interaction paradigm to support as well as a valuable method of probing deep models.

3 EXPERIMENT 1. PROMPT PERMUTATIONS
In language, there are many ways to say the same thing in different words. We wanted to understand the effect of this in the context of text-to-image generation. Would users need to try many different permutations of the same prompt to get a sense of what a prompt would return, or would just one suffice? Additionally, would there be certain permutations of the prompt keywords that would lead to better generations and be the best way to word a prompt? For example, would prompting the model with "a woman in a Futurist style" lead to a significantly different generation than "a woman painted in a Futurist style", "woman with a Futurist style", or "a woman. Futurism style"? In this experiment, we wanted to rigorously examine the following question: do different rephrasings of a prompt using the same keywords yield significantly different generations?

Our original hypothesis about this question was that there would be no prompt permutation that would do significantly better or worse than the rest, because none of the rephrasings seemed to have significantly more meaning than the next.

3.1 Methodology
To study different permutations of prompts, we first had to generate a large number of images. To do this, we used the checkpoint and configuration of VQGAN+CLIP pretrained on ImageNet with the 16384 codebook size [35]. Each image was generated to be 256x256 pixels and iterated on for 300 steps on a local NVIDIA GeForce RTX 3080 GPU.

Each image was generated according to a prompt involving a subject and style. We chose the following subjects: love, hate, happiness, sadness, man, woman, tree, river, dog, cat, ocean, and forest. These subjects were chosen for their universality across media and across cultures. These subjects additionally were balanced for how abstract or concrete they were as a concept as well as for positive and negative sentiment. We decided on whether a subject fell into the abstract or concrete category based upon ratings taken from a dataset of concreteness values [7]. Our set of abstract subjects averaged 2.12 on a scale from one to five (one being most abstract), and our set of concrete subjects averaged 4.80.

Similarly, we chose 12 styles spanning different time periods, cultural traditions, and aesthetics: Cubist, Islamic geometric art, Surrealism, action painting, Ukiyo-e, ancient Egyptian art, High Renaissance, Impressionism, cyberpunk, unreal engine, Disney, and VSCO. These styles likewise varied in whether they represented the world in an abstract or figurative manner. Specifically, we chose four abstract styles, four figurative styles, and four aesthetics related to the digital age. We balanced for time periods (with 6 styles predating the 20th century, and 6 styles from the 20th and 21st century).

We used these 12x12 subject and style combinations to study the effect of prompt permutations: how different rephrasings of the same keywords affect the image generation. For each of these combinations, we tested 9 permutations derived from the CLIP code repository and discussion within the online community, generating 1296 images in total. The nine permutations are as follows, the specific rationale for each permutation is listed in the Appendix, and a code sketch that assembles these prompts follows the list:
• A MEDIUM of SUBJECT in the STYLE style (Example: a painting of love in the abstract style)
• A STYLE MEDIUM of a SUBJECT (Example: an abstract painting of love)
• SUBJECT STYLE (Example: love abstract art)
• SUBJECT. STYLE (Example: love. abstract art)
• SUBJECT in the style of STYLE (Example: love in the style of abstract art)
• SUBJECT in the STYLE style (Example: love in the abstract art style)
• SUBJECT VERB in the STYLE style (Example: love painted in the abstract style)
• SUBJECT made/done/verb in the STYLE art style (Example: love done in the abstract art style)
• SUBJECT with a STYLE style. (Example: love with an abstract art style)
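For illustration, these rephrasings can be enumerated programmatically. The following is a minimal sketch, not the code used in our experiments; the template strings paraphrase the list above, and the MEDIUM and VERB fillers shown are illustrative assumptions:

```python
# Sketch: enumerate the nine prompt permutations for one subject/style pair.
# The template set mirrors the list above; the medium and verb defaults are
# illustrative assumptions, not the exact words used in the experiments.
TEMPLATES = [
    "a {medium} of {subject} in the {style} style",
    "a {style} {medium} of a {subject}",
    "{subject} {style}",
    "{subject}. {style}",
    "{subject} in the style of {style}",
    "{subject} in the {style} style",
    "{subject} {verb} in the {style} style",
    "{subject} {verb} in the {style} art style",
    "{subject} with a {style} style.",
]

def prompt_permutations(subject: str, style: str,
                        medium: str = "painting", verb: str = "painted"):
    """Return the nine rephrasings of a single subject/style combination."""
    return [t.format(subject=subject, style=style, medium=medium, verb=verb)
            for t in TEMPLATES]

# 12 subjects x 12 styles x 9 templates = 1296 prompts in total.
for prompt in prompt_permutations("love", "abstract"):
    print(prompt)
```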
3.2 Annotation Methodology
We had each combination of subject and style rated by two people who had backgrounds in media arts and art practice, respectively. The 144 subject and style combinations were presented in 3x3 grids, where the prompt permutations were randomly arranged to prevent any effect from ordering. One combination was taken out owing to inappropriate content.

Annotators were asked to note which images in the grid were either significantly better generations or significantly worse generations. We explained that they did not have to judge whether or not the generation represented the subject or the style; they just had to report whether there were generations that were significantly different from the rest. For example, a different element emerging in one generation, or a shift in color palette compared to the rest, constituted an outlier. All annotators were compensated $20/hour for however long it took them to complete the task. This rate of compensation was the same for the rest of our experiments.

3.3 Results
From the annotations we collected, we binned the generations based upon whether they were annotated as the same as the rest of the group or marked as an outlier. Outliers were generations that were either "significantly better" or "significantly worse". After aggregating across these two categories, we checked agreement between our two annotators. We observed high agreement, at 71.3% across 1296 generations. We then calculated interrater reliability, where we observed a Cohen's kappa of 0.0013. This value is low, but we believe this comes from the subjective nature of the task. We can see this in the example grids of Figure 2, which are composed of slightly varying generations. While we provided examples of what might constitute a significantly different generation and modeled the task for annotators as best we could, picking outliers is still inherently subjective, and this subjectivity could have influenced the factor calculated in Cohen's kappa that models chance. Therefore, even though our Cohen's kappa value was low, we proceed based on the high agreement value of 71.3% across 1296 generations.
Figure 2: For Experiment 1, annotators judged 3x3 grids where generations from different prompt permutations were arranged randomly. Annotators evaluated 143 grids of generations for significantly better generations as well as significantly worse generations (outliers in generation quality). We found no significant difference between the quality of the images that these nine prompt permutations generated, and therefore no significant difference between different prompt permutations.
We assembled a contingency table based upon the following categories of annotation possibilities: same-same, same-outlier, outlier-same, and outlier-outlier. We performed a chi-square test based upon this contingency table. We found that, with a chi-squared test statistic of 0.354 and a p-value of 0.55, the number of prompt permutations judged as outliers was insignificant when compared to the number of prompt permutations deemed not outliers.
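As an illustration of this style of analysis, the sketch below computes a chi-square test, percent agreement, and Cohen's kappa over a 2x2 contingency table; it assumes SciPy and scikit-learn, and the counts are placeholders rather than our annotation data:

```python
# Sketch: agreement and outlier analysis for paired annotations.
# The 2x2 contingency counts below are placeholders, not our real data.
import numpy as np
from scipy.stats import chi2_contingency
from sklearn.metrics import cohen_kappa_score

# Rows: annotator 1 (same, outlier); columns: annotator 2 (same, outlier).
table = np.array([[900, 160],
                  [180, 56]])

chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2={chi2:.3f}, p={p:.3f}")

# Percent agreement and Cohen's kappa from per-image labels matching the table.
ann1 = ["same"] * 1060 + ["outlier"] * 236
ann2 = ["same"] * 900 + ["outlier"] * 160 + ["same"] * 180 + ["outlier"] * 56
agreement = np.mean([a == b for a, b in zip(ann1, ann2)])
kappa = cohen_kappa_score(ann1, ann2)
print(f"agreement={agreement:.3f}, kappa={kappa:.3f}")
```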
Hence, we concluded that there was no significant difference between the nine prompt permutations that we tried. We synthesized the following guideline from this experiment: When picking the prompt, focus on subject and style keywords instead of connecting words. The connecting words (i.e. function words, punctuation, and words for ordering) did not contribute statistically meaningful differences in generation quality. Accordingly, we moved forward in the following experiments testing only one prompt permutation per subject and style combination rather than multiple rephrasings of the same combination.
4 EXPERIMENT 2. RANDOM SEEDS
A common parameter in generative models is the random seed. Generative models are stochastic and highly dependent upon their initializations, which means that it is often hard to reproduce results. To mitigate this, people often use seeds to obtain reproducible results and behavior. We noticed that using different seeds with VQGAN+CLIP resulted in generations that would differ in composition. We wanted to understand: do different seeds using the same prompt yield significantly different generations? The motivation behind this question was to understand whether or not users would need to try multiple seeds before moving on to new combinations of keywords. Our hypothesis was that no seed would do significantly better or worse than the rest, because changing seeds and altering the random initialization of the model should not produce any consistent or significant signal.

4.1 Generation Methodology
To study the effect of seeds, we generated 1296 images from 12 subjects, 12 styles, and 9 seeds. Because neither subject nor style was the main focus of this experiment, we chose to use the same set of subjects justified in the previous section. Likewise, we chose to use the same set of styles. What we did vary was the seed chosen. We generated images using nine randomly generated seeds (796, 324, 697, 11, 184, 982, 724, 962, and 805) and the prompt "SUBJECT in the style of STYLE".
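Reproducing a given generation requires fixing every random number generator that feeds the latent initialization before each run. A minimal sketch for a PyTorch-based setup follows; generate() is a hypothetical stand-in for the actual VQGAN+CLIP loop and is left commented out:

```python
# Sketch: reproducible VQGAN+CLIP-style runs over a fixed list of seeds.
# generate() is a hypothetical stand-in for the real optimization loop.
import random
import numpy as np
import torch

SEEDS = [796, 324, 697, 11, 184, 982, 724, 962, 805]

def set_seed(seed: int) -> None:
    """Fix every RNG that influences the latent initialization."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)

for seed in SEEDS:
    set_seed(seed)
    # image = generate(prompt="love in the style of abstract art",
    #                  size=(256, 256), iterations=300)
    # image.save(f"love_abstract_seed{seed}.png")
```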
4.2 Annotation Methodology
Two annotators were shown the 1296 generations in 3x3 grids in which the seeds were arranged randomly, so that the nine seed-varied generations for each combination of subject and style were rated by two people. Annotators were again asked to note which images in the grid were significantly better or significantly worse generations than the rest of the group, if any.

4.3 Results
We used a Fisher's exact test to evaluate how many generations were judged as approximately the same versus how many were judged outliers. We found that, with a p-value of <0.01, the number of generations judged as outliers was significant when compared to the number of generations deemed not outliers. Our annotators shared an inter-rater reliability of 0.13, which indicates slight agreement; we again consider this valid given the highly subjective nature of the task (picking 'better' or 'worse' images).
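A sketch of this test follows (assuming SciPy); the 2x2 layout shown is one plausible arrangement of same/outlier judgments, since the paper does not specify the exact contingency design, and the counts are placeholders rather than our data:

```python
# Sketch: Fisher's exact test on a placeholder 2x2 outlier table.
# Rows: annotator 1 (same, outlier); columns: annotator 2 (same, outlier).
from scipy.stats import fisher_exact

table = [[820, 240],
         [200, 36]]
odds_ratio, p = fisher_exact(table)
print(f"odds ratio={odds_ratio:.2f}, p={p:.4f}")
```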
Figure 3: For Experiment 2, annotators judged 3x3 grids such as the ones above, where generations utilizing different seeds were arranged in random order. These 143 grids were judged for significantly better generations as well as significantly worse generations. We found that the number of generations judged as outliers in generation quality was significant, meaning that the choice of initializing seed can significantly vary the quality of the generation.
This result was surprising to us, because it demonstrated that even outside of the prompt, there are stochastic components of the generation that can significantly vary its quality. We conclude from this experiment that it is prudent to try multiple seeds during prompt engineering. A design guideline that follows is to generate between 3 to 9 different seeds to get a representative idea of what a prompt can return.
5 EXPERIMENT 3. LENGTH OF OPTIMIZATION
A free parameter during each run of text-to-image models is the length of optimization: the number of iterations the networks are run for. Typically, we can expect that the more iterations, the lower and more stable the loss, and ideally the better the image. We wanted to investigate on average how many iterations are needed to get a decent result. We also wanted to see if runs with lower iterations could produce images with just as good generation quality as runs with higher iterations; a lower number of iterations means faster results, and for future systems involving text-to-image generation, we would want to know an average number of iterations needed to arrive at a favorable result. Our specific research question was: does the length of optimization correlate with better evaluated generations?

5.1 Generation Methodology
To investigate this, we tested 6 subjects (happiness, sadness, man, cat, ocean, and forest) across 12 styles, with a constant seed and one variety of prompt permutation. We ran the generations for 1000 iterations, and had users evaluate the generations every 100 iterations. We chose 1000 iterations as the maximum because we wanted to try a number of short to moderate wait times, and 1000 is a suggested default.
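Saving these per-step generations amounts to checkpointing the optimization loop. Below is a minimal sketch of that pattern; optimize_step() and decode() are no-op stand-ins we introduce for illustration, not functions from VQGAN+CLIP itself:

```python
# Sketch: snapshot a 1000-step optimization every 100 steps.
# optimize_step() and decode() are no-op stand-ins for one VQGAN+CLIP
# optimizer step and the latent-to-image decode; swap in the real calls.
MAX_ITERS, SNAPSHOT_EVERY = 1000, 100

def optimize_step(latents, prompt):
    return latents  # stand-in: real code nudges latents toward the prompt

def decode(latents):
    return latents  # stand-in: real code renders latents to an RGB image

latents = None  # stand-in for the seeded latent initialization
for step in range(1, MAX_ITERS + 1):
    latents = optimize_step(latents, "happiness in the style of cyberpunk")
    if step % SNAPSHOT_EVERY == 0:
        image = decode(latents)  # real code: image.save(f"iter{step:04d}.png")
```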
Figure 4: For Experiment 3, annotators were shown rows such as the row above. These images represented different iteration steps of the optimization process. Annotators chose the iteration step that they most preferred from these sets of 10.

5.2 Annotation Methodology
We had annotators annotate rows of generations saved at different steps of the iteration. These were specifically steps that were multiples of 100, up to 1000. The 0th iteration was not shown because the generation always began from random noise. Annotators annotated 72 rows, choosing the generation they most preferred from each set of 10.

5.3 Results
We found that the differences between the chosen iteration steps were significant upon performing a chi-squared test (p-value = 0.01). We include the observed frequencies for the preferred iteration steps in Figure 5, where we can see that 200, 100, and 500 iterations were chosen as the most preferred.
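This test is a goodness-of-fit comparison of how often each snapshot step was preferred against a uniform expectation. A sketch follows, assuming SciPy; the counts are placeholders, not the observed frequencies of Figure 5:

```python
# Sketch: chi-squared goodness-of-fit over preferred iteration steps.
# The counts are placeholders for the observed frequencies in Figure 5.
from scipy.stats import chisquare

steps = [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000]
observed = [28, 30, 12, 10, 22, 10, 8, 8, 8, 8]  # placeholder preferences

# Null hypothesis: every snapshot step is equally likely to be preferred.
stat, p = chisquare(observed)
print(f"chi2={stat:.2f}, p={p:.3f}")
```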
We report a Cohen's kappa score of 0.33, which represents fair agreement; we think this is valid considering the highly subjective nature of picking a preferred iteration step (though this task was more conducive to agreement than picking intuitive favorites in Experiment 2, which evaluated seeds). This demonstrates that a higher number of iterations did not necessarily correlate with a more desirable generation, as one might have expected considering that more iterations optimize the image and text representations towards one another. This is a non-intuitive result, meaning that the current multimodal methods are not necessarily optimizing for generations that we prefer. Possible explanations include the fact that lower iterations tended to have a softer quality compared to higher iterations, where the differences and contrast seemed to be more exaggerated.

We conclude from this experiment that when generating with fast iteration in mind, using shorter lengths of optimization between 100 and 500 iterations is sufficient. However, as can be seen in Figure 4, at the lower end of this range (i.e. 100 iterations),
Medium    | West Fig. Pre.   | West Fig. Mod.          | Non-West Fig.             | Non-West Abs.         | West Abs. Mod.  | Internet Aesthetics
painting  | Baroque          | Pop Art                 | Ukiyo-e                   | Mola art              | action painting | fractal
photo     | High Renaissance | Surrealism              | Chinese ink wash painting | Geometric Islamic art | Op art          | VSCO
sketch    | Impressionism    | documentary photography | Kerala mural              | Mexican Otomi         | Bauhaus         | unreal engine
cartoon   | Medieval         | Art deco                | Mayan art                 | Andean textile        | Cubism          | ASCII art
icon      | Pointillism      | Hippie movement         | African masks             | Aboriginal art        | Dadaism         | Disney
vector    | Neoclassicism    | photorealism            | ancient Egyptian art      |                       | Futurism        | Studio Ghibli
graffiti  |                  |                         | thangka                   |                       |                 | glitch
3D render |                  |                         |                           |                       |                 | Cottagecore
          |                  |                         |                           |                       |                 | Dark academia
          |                  |                         |                           |                       |                 | Cyberpunk
          |                  |                         |                           |                       |                 | Pixar
          |                  |                         |                           |                       |                 | Pokemon

Table 1: In Experiment 4, we generated generations from the 51 styles listed in the table above. Eight styles were general mediums of visual art. Twelve styles were Internet aesthetics. The remaining 33 were styles that were balanced for representation on both sides of the following partitions: abstract-figurative, Western and non-Western, and lastly premodern, modern, and digital.
• 5: Excellent representation of the style, very high number of motifs were present

Each annotator was instructed to judge how well the style was represented, irrespective of how well the subject turned out in the generation.

6.3 Results
Figure 6 shows average ratings for all 51 styles and illustrates that the model performed better on some styles than others. We first elaborate on the success modes and failure modes across all the styles as a whole before approaching the partition experiments in depth. Refer to Figure 7 for visual depictions of the success modes we observed.

6.4 Success and Failure Modes for Styles
6.4.1 Success Mode. Salient color palettes and relevant textures. The first recurring theme across successful styles was the presence of salient color schemes. This was apparent in some of the most positively judged styles such as Ukiyo-e, glitch art, cyberpunk, and thangka. These generations, pictured in Figure 7, demonstrate that text-to-image models can match styles to some of their signature color palettes without explicitly involving color details in the prompt. Cyberpunk consistently returned a global aesthetic dominated by halogen colors like cyan and magenta, and glitch art always pulled together colors reminiscent of TV static. Likewise, in generations such as "tree in a thangka style" or "man in the pop art style", we see different but correct understandings of the way primary colors can be saturated, contrasted, and complemented.

Texture was another element that came across in many styles. The most successful style seen from the annotations was ink wash painting. All generations of ink wash painting were done in wide swathes of ink that captured the watercolor quality of ink on paper. In many premodern styles such as Ukiyo-e, ancient Egyptian art, and Medieval art, the textures of aged paper and papyrus backgrounding the image as hints of canvas also helped express the style.

6.4.2 Success Mode. Technique. Another theme across successful styles was the emulation of correct technique. Many generations exhibited choices of line, texture, and elementary brush strokes congruent with their style. Across generations of the same style, the model showed the ability to use correct and consistent choices of lines. For styles such as sketch, the model produced thin lines suggestive of pencil, while for styles such as Disney or Pokemon, the model consistently produced thick black outlines characteristic of cartoons. These lines hardly appear at all in styles such as Impressionism, which were composed instead of a patchwork of small and short strokes reminiscent of the style's broken-color technique.

In Impressionism, as well as other styles like Pointillism and Cubism, the model showed its ability to find the right style primitives. Pointillism was composed of dots and points and Cubism of deconstructed shapes. In certain styles, however, such as aboriginal art, which often uses dots as its primitives, the model was only able to generate textured patterns suggestive of the dot primitives.

6.4.3 Success Mode. Depicting Space. The model was generally able to capture the right perspective, which we refer to as whether the image was done in two dimensions or three and how light and shadow were represented in the image. For example, in the Medieval style, light and shadow tended to come across flatly, while on the other extreme, in styles such as unreal engine, the 3D scene lighting was very apparent in the pronounced light glares and raycast quality of reflections off elements like metal or hair.

In looking at how the generations depicted space, we also assessed composition. We found that styles where patterns and deconstructed gestalts were common were rated favorably. For example, the alternating patterns of swirling and swelling black and white strips canonical to Op art were present in all Op art generations. Other examples of this success mode include the multiplicity of shapes in Cubism, recursive details in fractals, and concentric variegation in aboriginal art. Figurative styles with a high tolerance for deconstructed objects such as Surrealism also performed well in the ranked annotation. This success mode could have potentially been influenced by the convolutional components of VQGAN+CLIP's architecture. Convolutional representations are inherently focused on local neighborhoods, and this could have been latently optimized for in the patterns and repetition we observed.
Figure 6: For Experiment 4, 51 styles were tested across 12 subjects. This plot aggregated the ratings across all subjects in a style and ranked the styles from low to high by mean subjective rating.
6.4.4 Success Mode. Motifs of the style. Certain styles placed motifs, or distinctive details, within the generation that could immediately evoke the style. This was especially true in styles such as the Baroque style, for which the model constantly incorporated lavish details such as ornate swirls and heavy moldings, or the Neoclassicism style, for which the model generated Grecian pillars and drapery within the shapes and contours of the image. It is interesting to note that while these motifs are relevant to the styles, they borrow from different facets of the meanings of Baroque and Neoclassicism that do not necessarily represent the visual arts version of the style that we had intended. We explore the ways the model misunderstood styles in the next section on failure modes.

6.5 Failure Modes
6.5.1 Failure mode: Style misunderstandings due to the multiplicity of meanings in text. As mentioned in the previous section, the model interpreted Baroque and Neoclassical styles through the lens of decor, architecture, and sculpture, generating motifs from Baroque furniture and Neoclassical architecture and sculpture as opposed to Baroque and Neoclassical painting.

Many of the styles that performed the most poorly in the annotations were misunderstood by VQGAN+CLIP in some dimension. For example, dark academia, a social media aesthetic captured by a romanticized, Gothic approach to esoteric motifs, often returned generations that contained components of a cartoon character under dramatic lighting. One possible explanation is that the model was influenced by another popular entity on the Internet that also involved the word 'academia': the anime My Hero Academia (the characters emerging in many of those generations shared a distinct green hair color).

Another case was the style of mola, a folk art form from Latin America with a vividly saturated color palette and heavily stylized characterization of subjects. Mola the art style was misinterpreted as mola, a species of fish. All generations of mola had a predominantly blue color palette that evoked something aquatic, and many of the subjects were blended to look like fish (see "man in the mola style" in Figure 8). A potential cause for understanding mola as a fish species could be attributed to the bias towards animal species from ImageNet1000, a significant subset of which were animal species. However, the mola that were represented also were not photorealistic but rendered in a stylized form. These examples represent how conflicting interpretations of a prompt can lead to misinterpretation within the generated image. Misunderstandings could also arise from different parsings of the prompt. For example, take the action painting art style, which is meant to refer to artists who painted dynamically using random drips and splatters. When the model generated for "a man" or "woman" in the action painting style, it created a generation that implied a man in action or a woman in action.

The many different instantiations of this failure mode suggest that the multiplicity of meanings within language, both at the word level and the sentence level, can present a problem for text-to-image generation. It also represents a fundamental shift in the thought process behind the creation of a visual artwork. Visual artwork usually involves thinking about the spatial specifics of the composition, which text-to-image generation does not lend itself well to.

6.5.2 Failure mode. Inability to capture styles in a complete sense. Another theme within unsuccessful styles was the inability to express styles that were more symbolic than visual. For example, Dadaism, a style representing the rejection of capitalism and embrace of the avant-garde and nonsense, was rated very poorly, with a mean subjective rating of 1.42.
Figure 7: In Experiment 4, we qualify different modes of success seen across the 51 styles we generated for. These modes were color, technique, relationships in space, and motifs. Generations were able to express color well in terms of stereotypical color palettes, textures, and contrasts. The basics of technique were also captured in the variety of lines, elementary strokes, and shapes expressed. Generations also demonstrated proficiency in setting lighting and forming patterns, establishing good relationships in space. Motifs from styles were also readily accessible; for example, we can see ornate swirls in Baroque generations and the dimensional features relevant to Mayan relief sculptures in the last row as motifs of their styles.
Dadaism traditionally was expressed through satire and collaging, and it tended to involve cultural knowledge and nuanced symbolism relating to pop culture and politics. Likewise, Bauhaus was another abstract style that was heavily influenced by abstract values like harmony and utility rather than visual abstraction. While Bauhaus has a characteristic visual style rooted in geometric shapes, much of that is illustrated in architecture as opposed to image. These styles and their poor performance in the annotation study illustrate that there are still abstractions and pools of cultural knowledge that are either not well understood or not visually representable within text-to-image models, potentially because of their different angle of abstractness. (An example of "happiness in the Dadaism style" is shown in row 4, column 1 of Figure 8.)
Figure 8: For Experiment 4, we illustrate the failure modes we observed across styles: misunderstandings owing to the multiple interpretations of the text prompt, an inability to correctly capture the style, style incongruencies, and defaulting to certain motifs. Style incongruencies occurred when text or elements otherwise incongruent with the style would emerge in the middle of the generation. The area shaded in green shows reference images of styles that give context for why certain generations were failures.
Other styles were simply insufficiently captured. For example, if we refer to any of the images of Kerala mural style generations in Figure 8, we see that they never reached the vividly saturated and stylized look of actual Kerala mural frescoes. However, they approached it, evoking color combinations and motifs such as traditional dress that established an association of Kerala murals with India.

6.5.3 Failure mode. Style incongruency, often in the form of emerging photos or text. Sometimes elements that interrupted the style would come through the generation. These elements could be bucketed into two cases: photorealistic elements or text elements.

For example, in Figure 8 we can see a cat done in the style of Cubism. The cat's fur is entirely photorealistic, which is out of place in an otherwise abstract image. Likewise, in images of river, a photorealistic texture of a river surface would often surface even if the style was intended to be a sketch. This phenomenon tended to occur for concrete subjects such as dogs, cats, rivers, and oceans.

The second case was when text began to emerge within images across iterations. For example, in the generation "happiness in the Dadaism style", we see "DADA" explicitly written out across the generation, potentially as a compensatory technique on the model's part to optimize towards the prompt.
Curiously, for ASCII art, text never manifested, and each generation was composed of similarly sized blurs of alphanumerical literals.
6.5.4 Failure mode: Defaulting to motifs. Another failure mode expressed within styles such as unreal engine and High Renaissance was a defaulting to certain motifs. For example, for the style unreal engine, most of the text prompts returned a scene with textured rocks and grass akin to what might be rendered by a game engine.
6.6 Results and Discussion of Partitions
6.6.1 Abstract versus figurative. To investigate whether the model performed better on abstract styles or figurative ones, we looked at a subset of 33 specific styles, excluding styles such as more general mediums (i.e. a painting, a photo) and Internet aesthetics (i.e. dark academia). We looked at a subset because certain styles generally did not fall cleanly between abstract and figurative styles.

We found that abstract styles averaged a 2.63 rating (standard error 0.06), while figurative styles averaged a 3.16 rating (standard error 0.06). After running a chi-squared test on the frequencies of ratings, we found the difference between these ratings was significant at a p-value of < 0.01. In Figure 10, we visualize the top 4 styles and worst 3 styles for abstract and figurative styles. In Figure 9, we color code the ranked styles by their abstract or figurative nature.

Our original hypothesis was that abstract styles would perform better because we thought they would be more tolerant of the deconstructed, global incoherence of many generations. We found that while our original hypothesis was correct for styles such as glitch, Cubism, and Andean textiles for reasons we expected (such as a high tolerance for deconstruction), abstract styles were prone to a wide range of failure modes. These failure modes included misunderstandings due to misinterpretation and an inability to access higher-order cultural knowledge.

The figurative styles that performed in the top 4 (Ukiyo-e, Impressionism, documentary photography, and cyberpunk) displayed a diverse range of stylistic details, from line to texture to perspective. The worst performing figurative styles suffered from different modes of failure, such as an inability to capture the style (Kerala mural style generations) or a defaulting to unconvincing motifs (as in the case of the generation dog in the art deco style seen in Figure 10).

6.6.2 Western versus non-Western. To investigate whether the model would perform better on Western or non-Western art styles, we looked again at a subset of specific styles, excluding mediums and Internet aesthetics (as they tended to be more globalized).

We found that Western art styles averaged 2.92 (standard error: 0.07), while non-Western art styles averaged 2.95 (standard error: 0.06). Using a Mann-Whitney test, we found that there was an insignificant difference between the distribution of ratings for Western styles and the distribution of ratings for non-Western styles (p-value: 0.377).

We illustrate the top performing styles in Figure 9, where we show the ranking colored for Western and non-Western styles. Our findings suggest that the difference between Western and non-Western styles was actually insignificant. One straightforward reason is that the Internet-scale data could have compensated for the relative obscurity of any style. The alternating and even spread of the Western and non-Western styles over the x-axis of the bar graph illustrating individual styles by ranking is also visually suggestive of the same result.

6.6.3 Time period: premodern, modern, versus digital. We investigated whether the model would perform better on digital styles (Internet aesthetics), modern, or premodern art styles. We partitioned the styles into these time periods and colored these ranked styles by category in Figure 9.

We found that digital styles performed the worst, then modern styles, and then premodern styles, with aggregate annotator ratings of 2.41, 2.83, and 3.11 respectively. Using a Kruskal-Wallis test, we found these differences to be highly significant (p-value < 0.001). One potential reason why digital styles performed the worst could be that the digital styles we covered had more inherent stylistic range. Some digital styles, such as Tumblr, could be represented by multiple photo filter palettes, while others such as cottagecore and dark academia could be represented through different aesthetic forms (i.e. an outfit in fashion, a piece of furniture). Still other styles like Disney encompassed a range of visual styles within themselves, even though the generations came across with colors and lines reminiscent of the Disney Renaissance.

Given the results from Figure 6, we can see that the model is able to capture an extensive range of styles even if it performs differently dependent upon the nature of the style. Many perform well so long as they are not prone to misinterpretation or other aforementioned failure modes. We conclude from this experiment the following design guideline: when choosing the style of the generation, feel free to try any styles, no matter how niche or broad.
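For reference, the partition comparisons above map onto standard nonparametric tests. A sketch follows, assuming SciPy; the rating lists are placeholders, not our annotation data:

```python
# Sketch: the partition comparisons as nonparametric tests.
# The rating lists are placeholders, not our annotation data.
from scipy.stats import mannwhitneyu, kruskal

western = [3, 2, 4, 3, 2, 5, 3, 4]      # placeholder 1-5 ratings
non_western = [3, 3, 4, 2, 3, 4, 2, 5]

# Western vs. non-Western: two independent samples.
u_stat, p = mannwhitneyu(western, non_western)
print(f"Mann-Whitney U={u_stat}, p={p:.3f}")

digital = [2, 3, 2, 1, 3, 2]            # placeholder ratings per era
modern = [3, 3, 2, 4, 3, 2]
premodern = [4, 3, 3, 4, 2, 5]

# Premodern vs. modern vs. digital: three independent samples.
h_stat, p = kruskal(digital, modern, premodern)
print(f"Kruskal-Wallis H={h_stat:.2f}, p={p:.3f}")
```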
7 EXPERIMENT 5: INTERACTION BETWEEN SUBJECT AND STYLE
Given the varied but still successful application of style as a steering keyword within prompts, we wanted to investigate the subject keywords similarly and then observe how subject and style as parameters would interact with each other. We first ran a pilot experiment studying subject alone. However, we chose not to take this experiment further, because the generations yielded were too consistently poor due to the underconstrained nature of the prompt. See Figure 20 in the Appendix for further examples of this pilot. We focus on the interaction of subject and style in this experiment, and pursue the following research questions: To what degree do categories of subject and style influence one another? Do categories of styles, such as abstract or figurative styles, perform better on certain categories of subjects, such as abstract or concrete subjects?

7.1 Methodology
To study the effect of the interaction of subject and style, we generated 1581 images from 51 subjects and 31 styles. The full list of subjects and styles is in the Appendix, but follows the same rationale as previous experiments for coverage across the abstract-concreteness spectrum (for subjects) and diversity of styles in terms of time, schools of art, and levels of abstraction.
Figure 9: For Experiment 4, in the left subgraphs, averages are reported for each category of the three partitions studied: (Abstract, Figurative), (West, Non-West), (Digital, Modern, Premodern). The right figures are bar graphs which rank each style included in the partitions by their aggregate means from low to high, left to right, colored by their respective categories. We found significant differences between the abstract and figurative styles in aggregate, as well as between the digital, modern, and premodern styles in aggregate.
Figure 10: For Experiment 4, we illustrate some of the best and worst styles along the abstract and figurative style partition.

Figure 11: For Experiment 4, we illustrate some of the best and worst styles along the Western and non-Western style partition.

Figure 12: Pictured are some of the best and worst styles along the Western and non-Western style partition.
Annotators annotated each generated image for the coherency of subject and style within the image as per the following rubric:
Figure 13: For Experiment 4, we illustrate some of the best and worst styles along the style partition for different time periods: digital, modern, and premodern.

Figure 14: Pictured are some of the best and worst styles along the time periods partition: digital, modern, and premodern.
7.3 Results
Two variables we wanted to test in the experiment were the abstract or concrete nature of the subject and the abstract or figurative nature of the style.

We first studied just the abstract or concrete nature of the noun alone, aggregating results by subject. We found that the top ten subjects were all categorically concrete, with an average concreteness value of 4.47. They were all subjects that were universal across most cultures: ocean, forest, house, eye, bird. Examples of these top subjects crossed with different styles are illustrated in Figure 15. We found that when we compare the abstractness of the noun to the quality of the generation, there is an r value (Pearson's coefficient) of 0.62, which implies a moderate to strong positive association. This means that on average there is a trend where concrete subjects tend to do better.

We then considered the influence of the abstract or figurative nature of the style as well, by looking at the generations from a factorial 2x2 lens. We found the following aggregate rankings for the enumerated categories: abstract-abstract (3.05), abstract-concrete (3.17), figurative-abstract (3.49), and figurative-concrete (3.54). In running a two-way ANOVA on the annotations, we found that all p-values were significant, being well below 0.01. This allows us to conclude that both factors have a significant effect on the rating of the generation. Likewise, we saw that their interaction is also significant, with a p-value well below 0.01.
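The factorial analysis here can be reproduced with standard tooling; the sketch below mirrors the 2x2 design with placeholder ratings, assuming pandas and statsmodels:

```python
# Sketch: two-way ANOVA over subject nature x style nature.
# The ratings are placeholders, not our annotation data.
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

df = pd.DataFrame({
    "subject_nature": ["abstract"] * 4 + ["concrete"] * 4,
    "style_nature": ["abstract", "abstract", "figurative", "figurative"] * 2,
    "rating": [3, 3, 4, 3, 3, 4, 4, 5],  # placeholder 1-5 ratings
})

# Model main effects of both factors plus their interaction.
model = ols("rating ~ C(subject_nature) * C(style_nature)", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))
```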
7.4 Success and Failure Modes for Subjects Crossed With Styles
In the following section, we perform a qualitative analysis of the success and failure modes we observed for subject-style generations.

7.4.1 Success mode. Correct applications of symbolism. In many subjects, the text-to-image framework was able to demonstrate that it could access and apply symbols. For example, in most generations for hearts, heart symbols emerged out of the image (even if the symbol was incongruent with the style, for example as a heart symbol would be in Ukiyo-e art).

However, generations also showed a flexible understanding of love in the form of kisses, proposals, and hugs. Generations in the subject of sadness also demonstrated an expressive range of symbols for sadness such as blueness, frowns, tears, and lonely figures.
Figure 15: For Experiment 5, generations from the top five subjects of 51 subjects are visualized above in various styles.
For other abstract subjects such as freedom, relaxation, or serenity, the model was able to demonstrate that it could connect freedom with American flags and relaxation with reclining. These associations are intuitive, even if certain connections, such as freedom with the United States, have overtones of bias.

This success mode is primarily what makes the difference between good generations and bad generations for abstract subjects. A generation from an abstract subject is successful only when it is able to find purchase in the image as a symbol. Using a symbol to stand in for an abstract subject is apparent in both abstract and figurative styles.

7.4.2 Success mode. Integration of motifs with elements of the subject. Another mode of success, which we could see in generations from Figure 15 such as "eye in the style of Op art" from the abstract style, concrete subject category or "intelligence in the fractal style" from the abstract style, abstract subject category, was when components of the subject and style matched and blended well. In "intelligence in the fractal style", intelligence is symbolized in a brain which has recursive convolutions of gray matter, which elicits the idea that a brain is a fractal.
Figure 16: For Experiment 5, 51 subjects were crossed with 31 styles. When mean rankings were aggregated across styles, the top 10 subjects all were concrete subjects. The top five specifically were ocean, forest, house, eye, and dancing.
Other examples, such as "flower in the cyberpunk style" or "nostalgia in the VSCO style" in the other quadrants of Figure 17, demonstrate how the color palette of a style colored the subject. In the former, the flower took on the magenta trademarks of cyberpunk, and in the latter, nostalgia was established through sepia and pastel tones reminiscent of filters.

What makes generations with concrete subjects successful is when the subject is able to emerge from a style without disrupting it. For example, in the generation rain in the High Renaissance style in Figure 17, we see that the rain is pervasive but drawn in fine, white strokes that are characteristic of the style. Likewise, we see that the car in the unreal engine style image applies the same effects of depth of field and scene lighting prevalent in all CG renders.

The most poorly rated of the subjects was website. This one is interesting because it represents a challenge to the framework, because websites and digital media are anachronistic subjects for many modern and all premodern styles. We found that the model sometimes simply dropped website from the generated image, which we would say is neither correct nor incorrect, as the subject could have conflicted with the style. This is another outcome that suggests that relevancy should be a consideration for users interacting with text-to-image generative frameworks. However, in these generations we could also see positive outliers where the style adapted to the subject. For example, for "website in the ancient Egyptian style" seen in Figure 19, we can see a keyboard and a person using a computer screen.
Figure 17: In Experiment 5, we looked at both subject and style words in the prompt. We crossed abstract and concrete subjects
with abstract and fgurative styles. In this fgure above, we show some success cases within each crossed category. In running a
two-way ANOVA, we found that both subject and style have a signifcant efect on the rating of the generation. Likewise, their
interaction was also statistically signifcant.
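The two-way ANOVA described above is straightforward to reproduce. Below is a minimal sketch, assuming a hypothetical annotations table with one row per rating and columns named rating, subject_type, and style_type; the file name and column names are our own stand-ins, not the paper's actual pipeline.

# Minimal sketch of a two-way ANOVA over generation ratings.
# Assumes a hypothetical CSV with one row per annotation:
# rating (numeric), subject_type (abstract/concrete),
# and style_type (abstract/figurative).
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

ratings = pd.read_csv("experiment5_annotations.csv")

# Model rating as a function of both factors and their interaction.
model = ols("rating ~ C(subject_type) * C(style_type)", data=ratings).fit()
print(sm.stats.anova_lm(model, typ=2))  # F statistics and p-values per effect

A significant interaction term here corresponds to the finding that how well a subject renders depends on which style it is paired with.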
emergent forms from parts and patterns) and maturity warnings. Text-to-image generation likely translates the biases learned from the Internet into imagery. These images are excluded from figures for the sake of propriety.

In summary, for Experiment 5, we concluded the following design guideline: When picking the subject of the generation, pick subjects that can complement the chosen style in level of abstractness and relevance.

8 DISCUSSION
In a series of experiments, we demonstrated that a range of "SUBJECT in the style of STYLE" generations can be arrived at quickly and easily with a text-to-image generative framework. We looked at different parameters for prompt engineering, such as subject and style (Experiments 1, 4, 5), and studied the effects of modulating hyperparameters like the number of iterations and random initializations (Experiments 2, 3).

We condense our findings from the previous experiments into design guidelines that elaborate default parameters and methods for end users interacting with text-to-image models.

• When picking the prompt, focus on subject and style keywords instead of connecting words. Rephrasings using the same keywords do not make a significant difference in the quality of the generation, as no prompt permutation consistently succeeds over the rest.
• When generating, generate between 3 and 9 different seeds to get a representative idea of what a prompt can return. Generations may be significantly different owing to the stochastic nature of hyperparameters such as random seeds and initializations. Returning multiple results acknowledges this stochastic nature to users (a short code sketch of these defaults follows this list).
• When generating, for fast iteration, using shorter lengths of optimization between 100 and 500 iterations is sufficient. We found that the number of iterations and length of optimization did not significantly correlate with user satisfaction with the generation.
• When choosing the style of the generation, feel free to try any style, no matter how niche or broad. The deep learning frameworks capture an impressive breadth of style information and can be surprisingly good even for niche styles. However, avoid style keywords that may be prone to misinterpretation.
• When picking the subject of the generation, pick subjects that can complement the chosen style in level of abstractness. This could be done by picking subjects for styles considering how abstract or concrete both are, or by pairing subjects that are easily interpretable or highly relevant to the style.
• When looking at the results, present users with trigger warnings for pareidolia and offensive content. The models currently do not acknowledge the possibility of offensive content.
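As a concrete reading of the seed and iteration guidelines above, the sketch below wraps a hypothetical generate function standing in for a VQGAN+CLIP-style pipeline. The function name, its signature, and the defaults are our own assumptions for illustration, not an API from any particular notebook.

# Sketch of the defaults suggested by the guidelines above.
# `generate` is a hypothetical stand-in for a VQGAN+CLIP-style call;
# its signature (prompt, seed, iterations) is our assumption.
def sample_prompt(prompt, generate, num_seeds=9, iterations=300):
    """Run one generation per seed (guideline: 3 to 9 seeds,
    100 to 500 iterations) so users see a prompt's stochastic spread."""
    return [generate(prompt, seed=seed, iterations=iterations)
            for seed in range(num_seeds)]

# Usage: images = sample_prompt("ocean in the style of Ukiyo-e", generate)

Returning the full list of results, rather than a single image, surfaces the stochastic spread that the guidelines recommend showing to users.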
Our experiments gave us empirical grounding to focus many of the hyperparameters and free parameters for prompts that would otherwise make prompt engineering and text-to-image generation overwhelming, unbounded, and inexhaustible.

8.1 Implications of Borrowing Styles
While text-to-image interaction presents a novel and emerging form of human-computer interaction for media creation, this advancement presents us with new sets of concerns. Suggesting that we use pre-existing styles is at once intuitive and controversial. There are many implications to borrowing styles as keywords, one of which is that we are relying on a machine's non-expert understanding of a style to generate outputs. This makes it possible for text-to-image models to return generations that err towards stereotypes and other misrepresentations.

For example, one of the top three styles in terms of ratings for Experiment 4 was Ukiyo-e. These generations tended to employ beige, black, and muted primary colors suggestive of woodblock prints. However, Ukiyo-e work in the past was not confined to this range of color. Ukiyo-e as a style spanned centuries, during which it exhibited different approaches to color, ranging from monochromatic ink to brilliant brocades. This implies that the model could only shallowly summarize Ukiyo-e. Likewise, sketches tended to return black and white images, betraying a stereotypical understanding of sketches as generally black and white.

While styles can be borrowed as keywords to prompt generations, styles can also be misrepresented. For example, in our analysis of failure modes for styles in Experiment 4, we found that certain
8.2 Limitations and Future Work
Our focus in this paper was on prompt engineering a text-to-image framework with text. However, the model could also have received images to start optimizing from. We believe studying how the model can be conditioned on an image and text together is interesting future work that could provide insights into how people can move between different modes of interaction. Interacting with text is high-level interaction, while working with images is low-level and more conducive to directly manipulating the generation. Similarly, along the lines of user control, another line of work would be to improve the capacity of this framework for iteration. Currently, users can only regenerate by rerunning the framework on previous generations, but usability could be improved if more controls for steering the generation at intermediate stages were exposed [30].

Our qualitative analysis in Experiments 4 and 5 demonstrates that more work could be done to explore the nuances of what certain styles can elicit. Styles exist with respect to cultural contexts and histories, and it is valuable to understand how generations can be pushed to be more than flat reproductions of styles. For example, one could say that generations in the style of Impressionism or Cubism emulate these respective styles at the surface level, in terms of technique and color palettes. However, it remains to be explored to what degree these generations can channel the nuances of these styles, such as their conceptual values or messages.

Another limitation of this paper is that for most of the experiments we only looked at one prompt ("SUBJECT in the style of STYLE") for VQGAN+CLIP. We looked at this prompt and framework because it had traction within creative technologist communities, but further research could look into other prompts and models. For example, what would happen if we typed in the first line of a poem, a news headline, or a design goal for a moodboard? Additionally, for this prompt and others, there are modifiers that we could have explored to increase the realistic quality of the generation. For example, we could have added and systematically explored modifiers like "4k" or "2048px".

Given that text-to-image generation is an emerging paradigm of interaction, there are many avenues of prompt engineering for visual generation tasks that future work can explore.

ACKNOWLEDGMENTS
Vivian Liu is supported by the NSF Graduate Research Fellowship (DGE-1644869).

REFERENCES
[1] 2021. Introduction to VQGAN CLIP. https://docs.google.com/document/d/1Lu7XPRKlNhBQjcKr8k8qRzUzbBW7kzxb5Vu72GMRn2E/edit
[2] 2021. VQGANCLIP Hashtag. https://twitter.com/hashtag/vqganclip?src=hashtag_click
[3] Adverb. 2021. Advadnoun. https://twitter.com/advadnoun
[4] Aesthetics Wiki Community. 2022. Aesthetics Wiki. https://aesthetics.fandom.com/wiki/
[5] Gwern Branwen. 2020. GPT-3 Creative Fiction. https://www.gwern.net/GPT-3
[6] Andrew Brock, Jeff Donahue, and Karen Simonyan. 2019. Large Scale GAN Training for High Fidelity Natural Image Synthesis. arXiv:1809.11096 [cs.LG]
[7] Marc Brysbaert, Amy Warriner, and Victor Kuperman. 2013. Concreteness ratings for 40 thousand generally known English word lemmas. Behavior Research Methods 46 (2013). https://doi.org/10.3758/s13428-013-0403-5
[8] Aylin Caliskan, Joanna J. Bryson, and Arvind Narayanan. 2017. Semantics derived automatically from language corpora contain human-like biases. Science 356, 6334 (Apr 2017), 183–186. https://doi.org/10.1126/science.aal4230
[9] Andrew Cantino. 2021. Prompt Engineering Tips and Tricks with GPT-3. https://blog.andrewcantino.com/blog/2021/04/21/prompt-engineering-tips-and-tricks/
[10] Siddhartha Chaudhuri, Evangelos Kalogerakis, Stephen Giguere, and Thomas Funkhouser. 2013. Attribit: Content Creation with Semantic Attributes. In Proceedings of the 26th Annual ACM Symposium on User Interface Software and Technology (St. Andrews, Scotland, United Kingdom) (UIST '13). Association for Computing Machinery, New York, NY, USA, 193–202. https://doi.org/10.1145/2501988.2502008
[11] Katherine Crowson. 2021. afiaka87/clip-guided-diffusion: A CLI tool/python module for generating images from text using guided diffusion and CLIP from OpenAI. https://github.com/afiaka87/clip-guided-diffusion
[12] Katherine Crowson. 2021. Rivers Have Wings. https://twitter.com/RiversHaveWings
[13] Bestiario del Hypogripho. 2021. Ayuda:Generar imágenes con VQGAN CLIP/English. https://tuscriaturas.miraheze.org/w/index.php?title=Ayuda:Generar_imágenes_con_VQGANCLIP/English
[14] Ruta Desai, Fraser Anderson, Justin Matejka, Stelian Coros, James McCann, George Fitzmaurice, and Tovi Grossman. 2019. Geppetto: Enabling Semantic Design of Expressive Robot Behaviors. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems (Glasgow, Scotland, UK) (CHI '19). ACM, New York, NY, USA, Article 369, 14 pages. https://doi.org/10.1145/3290605.3300599
[15] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv:1810.04805 [cs.CL]
[16] Leon A. Gatys, Alexander S. Ecker, and Matthias Bethge. 2015. A Neural Algorithm of Artistic Style. arXiv:1508.06576 [cs.CV]
[17] Songwei Ge and Devi Parikh. 2021. Visual Conceptual Blending with Large-scale Language and Vision Models. arXiv:2106.14127 [cs.CL]
[18] Katy Ilonka Gero, Zahra Ashktorab, Casey Dugan, Qian Pan, James Johnson, Werner Geyer, Maria Ruiz, Sarah Miller, David R. Millen, Murray Campbell, Sadhana Kumaravel, and Wei Zhang. 2020. Mental Models of AI Agents in a Cooperative Game Setting. Association for Computing Machinery, New York, NY, USA, 1–12. https://doi.org/10.1145/3313831.3376316
[19] Katy Ilonka Gero and Lydia B. Chilton. 2019. Metaphoria: An Algorithmic Companion for Metaphor Creation. Association for Computing Machinery, New York, NY, USA, 1–12. https://doi.org/10.1145/3290605.3300526
[20] Arnab Ghosh, Richard Zhang, Puneet K. Dokania, Oliver Wang, Alexei A. Efros, Philip H. S. Torr, and Eli Shechtman. 2019. Interactive Sketch and Fill: Multiclass Sketch-to-Image Translation. arXiv:1909.11081 [cs.CV]
[21] Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative Adversarial Networks. arXiv:1406.2661 [stat.ML]
[22] Cheng-Zhi Anna Huang, Hendrik Vincent Koops, Ed Newton-Rex, Monica Dinculescu, and Carrie J. Cai. 2020. AI Song Contest: Human-AI Co-Creation in Songwriting. arXiv:2010.05388 [cs.SD]
[23] Youngseung Jeon, Seungwan Jin, Patrick C. Shih, and Kyungsik Han. 2021. FashionQ: An AI-Driven Creativity Support Tool for Facilitating Ideation in Fashion Design. Association for Computing Machinery, New York, NY, USA. https://doi.org/10.1145/3411764.3445093
[24] Justin. 2021. Somewhere Systems Twitter. https://twitter.com/somewheresy
[25] Tero Karras, Miika Aittala, Janne Hellsten, Samuli Laine, Jaakko Lehtinen, and Timo Aila. 2020. Training Generative Adversarial Networks with Limited Data. CoRR abs/2006.06676 (2020). arXiv:2006.06676 https://arxiv.org/abs/2006.06676
[26] Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. 2020. Analyzing and Improving the Image Quality of StyleGAN. arXiv:1912.04958 [cs.CV]
[27] Aran Komatsuzaki. 2021. When you generate images with VQGAN CLIP, the image quality dramatically improves if you add "unreal engine" to your prompt. People are now calling this "unreal engine trick" lol e.g. "the angel of air. unreal engine" pic.twitter.com/G4xBgVLyiv. https://twitter.com/arankomatsuzaki/status/1399471244760649729
[28] Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, and Graham Neubig. 2021. Pre-train, Prompt, and Predict: A Systematic Survey of Prompting Methods in Natural Language Processing. arXiv:2107.13586 [cs.CL]
[29] Yang Liu, Eunice Jun, Qisheng Li, and Jeffrey Heer. 2019. Latent Space Cartography: Visual Analysis of Vector Space Embeddings. Computer Graphics Forum 38, 3 (2019), 67–78. https://doi.org/10.1111/cgf.13672
[30] Ryan Louie, Andy Coenen, Cheng Zhi Huang, Michael Terry, and Carrie J. Cai. 2020. Novice-AI Music Co-Creation via AI-Steering Tools for Deep Generative Models. In Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems. Association for Computing Machinery, New York, NY, USA, 1–13. https://doi.org/10.1145/3313831.3376739
[31] Justin Matejka, Michael Glueck, Erin Bradner, Ali Hashemi, Tovi Grossman, and George Fitzmaurice. 2018. Dream Lens: Exploration and Visualization of Large-Scale Generative Design Datasets. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems. Association for Computing Machinery, New York, NY, USA, 1–12. https://doi.org/10.1145/3173574.3173943
[32] Alexander Mordvintsev, Michael Tyka, and Christopher Olah. 2015. google/deepdream. https://github.com/google/deepdream
[33] Ryan Murdock. [n.d.]. lucidrains/big-sleep: A simple command line tool for text to image generation, using OpenAI's CLIP and a BigGAN. Technique was originally created by https://twitter.com/advadnoun. https://github.com/lucidrains/big-sleep
[34] Ryan Murdock. 2022. lucidrains/big-sleep: A simple command line tool for text to image generation, using OpenAI's CLIP and a BigGAN. Technique was originally created by https://twitter.com/advadnoun. https://github.com/lucidrains/big-sleep
[35] nerdyrodent. 2022. nerdyrodent/VQGAN-CLIP. https://github.com/nerdyrodent/VQGAN-CLIP
[36] OpenAI. 2022. Prompt_Engineering_for_ImageNet.ipynb. https://github.com/openai/CLIP/blob/main/notebooks/Prompt_Engineering_for_ImageNet.ipynb
[37] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. 2021. Learning Transferable Visual Models From Natural Language Supervision. arXiv:2103.00020 [cs.CV]
[38] Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. 2021. Zero-Shot Text-to-Image Generation. arXiv:2102.12092 [cs.CV]
[39] reddit.com. 2021. https://www.reddit.com/r/bigsleep/
[40] Laria Reynolds and Kyle McDonell. 2021. Prompt Programming for Large Language Models: Beyond the Few-Shot Paradigm. arXiv:2102.07350 [cs.CL]
[41] Anna Rogers, Olga Kovaleva, and Anna Rumshisky. 2020. A Primer in BERTology: What we know about how BERT works. arXiv:2002.12327 [cs.CL]
[42] Victor Sanh, Albert Webson, Colin Raffel, Stephen H. Bach, Lintang Sutawika, Zaid Alyafeai, Antoine Chaffin, Arnaud Stiegler, Teven Le Scao, Arun Raja, Manan Dey, M Saiful Bari, Canwen Xu, Urmish Thakker, Shanya Sharma Sharma, Eliza Szczechla, Taewoon Kim, Gunjan Chhablani, Nihal Nayak, Debajyoti Datta, Jonathan Chang, Mike Tian-Jian Jiang, Han Wang, Matteo Manica, Sheng Shen, Zheng Xin Yong, Harshit Pandey, Rachel Bawden, Thomas Wang, Trishala Neeraj, Jos Rozen, Abheesht Sharma, Andrea Santilli, Thibault Fevry, Jason Alan Fries, Ryan Teehan, Stella Biderman, Leo Gao, Tali Bers, Thomas Wolf, and Alexander M. Rush. 2021. Multitask Prompted Training Enables Zero-Shot Task Generalization. arXiv:2110.08207 [cs.LG]
[43] Evan Shimizu, Matthew Fisher, Sylvain Paris, James McCann, and Kayvon Fatahalian. 2020. Design Adjectives: A Framework for Interactive Model-Guided Exploration of Parameterized Design Spaces. In Proceedings of the 33rd Annual ACM Symposium on User Interface Software and Technology (Virtual Event, USA) (UIST '20). Association for Computing Machinery, New York, NY, USA, 261–278. https://doi.org/10.1145/3379337.3415866
[44] Ian Tenney, Dipanjan Das, and Ellie Pavlick. 2019. BERT Rediscovers the Classical NLP Pipeline. arXiv:1905.05950 [cs.CL]
[45] The Metropolitan Museum of Art. 2022. Heilbrunn Timeline of Art History. https://www.metmuseum.org/toah/
[46] @someheresy Twitter. 2021. VQGAN CLIP Colab Notebook. https://colab.research.google.com/drive/1_4Jl0a7WIJeqy5LTjPJfZOwMZopG5C-W?usp=sharing#scrollTo=ZdlpRFL8UAlW
[47] wikiart.org. 2022. Visual Art Encyclopedia. https://www.wikiart.org/
[48] Qian Yang, Aaron Steinfeld, Carolyn Rosé, and John Zimmerman. 2020. Re-Examining Whether, Why, and How Human-AI Interaction Is Uniquely Difficult to Design. Association for Computing Machinery, New York, NY, USA, 1–13. https://doi.org/10.1145/3313831.3376301
[49] Minfeng Zhu, Pingbo Pan, Wei Chen, and Yi Yang. 2019. DM-GAN: Dynamic Memory Generative Adversarial Networks for Text-to-Image Synthesis. arXiv:1904.01310 [cs.CV]
A EXPERIMENT 1. PROMPT PERMUTATIONS
We chose the listed prompt permutations for the following reasons (a code sketch enumerating these templates follows the list):

• A MEDIUM of SUBJECT in the STYLE style – We wanted to test this prompt because the authors of CLIP noted that incorporating MEDIUM words could help return better generations [36]. For example, inputting a prompt such as "a painting of a dog in the Cubism style" would lead to better results than "dog in the Cubism style".
• A STYLE MEDIUM of a SUBJECT – We wanted to test this prompt because it was a reordering of prompt permutation #1.
• SUBJECT STYLE – We wanted to test this prompt because it was the most minimal amount of information to prompt the machine with. At the same time, this sort of text prompt is how we regularly query image search engines.
• SUBJECT. STYLE – We wanted to test this prompt, a close cousin of prompt permutation #3, to observe the effect of punctuation.
• SUBJECT in the style of STYLE – We wanted to test this prompt because it had traction within the creative technologist community [27].
• SUBJECT in the STYLE style – We wanted to test this prompt because it was a rephrasing of prompt permutation #5.
• SUBJECT VERB in the STYLE style – We wanted to test this prompt permutation to see the influence of verbs. The authors of CLIP noted in their repository that the model performed better on nouns than verbs, given the noun-centric supervision of ImageNet.
• SUBJECT made/done/verb in the STYLE art style – We tested this prompt permutation as a rephrasing of prompt permutation #7.
• SUBJECT with a STYLE style. – We tested this prompt permutation to test a different ordering with a different function word.
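For concreteness, the nine permutations above can be expressed as plain format strings. The sketch below is our own illustration rather than the study's code; it collapses the made/done/verb variant into a single "made" instance, and the filler values (medium="painting", verb="standing") are illustrative placeholders.

# The nine prompt permutations from Experiment 1 as format strings.
TEMPLATES = [
    "a {medium} of {subject} in the {style} style",
    "a {style} {medium} of a {subject}",
    "{subject} {style}",
    "{subject}. {style}",
    "{subject} in the style of {style}",
    "{subject} in the {style} style",
    "{subject} {verb} in the {style} style",
    "{subject} made in the {style} art style",
    "{subject} with a {style} style.",
]

def permutations(subject, style, medium="painting", verb="standing"):
    # str.format ignores unused keyword arguments, so every template
    # can be filled with the same call.
    return [t.format(subject=subject, style=style, medium=medium, verb=verb)
            for t in TEMPLATES]

print(permutations("dog", "Cubism"))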
B EXPERIMENT 3. RANDOM SEEDS

C EXPERIMENT 5. SUBJECTS
We chose to expand our list of subjects to include a broader range of concreteness values, such that they included: love, hate, peace, progress, relaxation, loyalty, compassion, beauty, pain, dream, thought, trust, freedom, chaos, success, courage, happiness, nostalgia, intelligence, kindness, time, concern, sadness, reality, serenity, fear, victory, alien, car, house, apple, singing, dancing, sleeping, mountain, rain, ocean, forest, flower, fish, bird, snake, boy, woman, eye, computer, website, universe, reflection. Likewise, we chose to use the same set of styles: photorealism, Studio Ghibli, Neoclassicism, African mask, thangka, fractal, Hippie movement, ancient Egyptian, art deco, unreal engine, Disney, cartoon, Pop art, VSCO, Futurism, 3D render, Pointillism, sketch, Surrealism, Andean textile, Aboriginal art, Ukiyo-e, High Renaissance, Mayan, graffiti, Cubism, Impressionism, Baroque, Op art, cyberpunk, and painting.
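As a small illustration of how these lists scale, the sketch below (our own, not the study's code) crosses truncated versions of the subject and style lists into prompts using the paper's "SUBJECT in the style of STYLE" template.

# Cross the Experiment 5 subject and style lists into prompts.
# Truncated lists for brevity; fill in the full 51 subjects and
# 31 styles from the paragraph above.
from itertools import product

subjects = ["love", "sadness", "ocean", "forest", "house", "eye", "dancing"]
styles = ["photorealism", "Studio Ghibli", "Ukiyo-e", "Op art", "cyberpunk"]

prompts = [f"{subject} in the style of {style}"
           for subject, style in product(subjects, styles)]

print(len(prompts))  # at full scale: 51 subjects x 31 styles = 1581 prompts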
Figure 20: In a pilot before Experiment 5, we found that using only subjects as keyword dimensions was insufficient. The underconstrained nature of the generation made generations too poor to evaluate, because they were not grounded in any sort of aesthetic.