DALL·E: Creating Images from Text
We’ve trained a neural network called DALL·E that creates
images from text captions for a wide range of concepts
expressible in natural language.
January 5, 2021
27 minute read
DALL·E[1] is a 12-billion parameter version of GPT-3 trained to generate images from text
descriptions, using a dataset of text–image pairs. We’ve found that it has a diverse set of
capabilities, including creating anthropomorphized versions of animals and objects,
combining unrelated concepts in plausible ways, rendering text, and applying
transformations to existing images.
[Interactive visuals omitted: three text-prompt examples and one text-and-image prompt example.]
GPT-3 showed that language can be used to instruct a large neural network to perform a
variety of text generation tasks. Image GPT showed that the same type of neural network
can also be used to generate images with high fidelity. We extend these findings to show
that manipulating visual concepts through language is now within reach.
Overview
Like GPT-3, DALL·E is a transformer language model. It receives both the text and the
image as a single stream of data containing up to 1280 tokens, and is trained using
maximum likelihood to generate all of the tokens, one after another. [2] This training
procedure allows DALL·E to not only generate an image from scratch, but also to
regenerate any rectangular region of an existing image that extends to the bottom-right
corner, in a way that is consistent with the text prompt.
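To make the single-stream setup concrete, here is a minimal PyTorch sketch of this kind of maximum-likelihood training step. Only the token budget (256 text tokens plus 1024 image tokens, 1280 in total) and the two vocabulary sizes are taken from the footnotes below; the `transformer` module, the vocabulary-offset convention, and all names are placeholder assumptions, not the actual implementation.

```python
import torch
import torch.nn.functional as F

TEXT_LEN, IMAGE_LEN = 256, 1024        # up to 1280 tokens per example
TEXT_VOCAB, IMAGE_VOCAB = 16384, 8192  # text BPE vocab and image-code vocab

def make_stream(text_tokens, image_tokens):
    """Concatenate BPE text tokens and image-grid tokens into one stream.

    Captions are assumed padded/truncated to TEXT_LEN tokens. Image codes are
    offset past the text vocabulary so both can share one embedding table of
    size TEXT_VOCAB + IMAGE_VOCAB (one possible convention, not necessarily
    the one used in the actual model).
    """
    assert text_tokens.size(1) == TEXT_LEN and image_tokens.size(1) == IMAGE_LEN
    return torch.cat([text_tokens, image_tokens + TEXT_VOCAB], dim=1)

def training_step(transformer, text_tokens, image_tokens):
    """One maximum-likelihood step: predict every token from its prefix."""
    stream = make_stream(text_tokens, image_tokens)   # (batch, 1280)
    logits = transformer(stream[:, :-1])              # (batch, 1279, vocab)
    targets = stream[:, 1:]                           # next-token targets
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           targets.reshape(-1))
```

Because the image tokens follow the text tokens in the stream, sampling them left to right given a fixed prefix is what allows the model to complete the lower-right portion of a partially specified image, as described above.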
We recognize that work involving generative models has the potential for significant,
broad societal impacts. In the future, we plan to analyze how models like DALL·E relate to
societal issues like economic impact on certain work processes and professions, the
potential for bias in the model outputs, and the longer term ethical challenges implied by
this technology.
Capabilities
We find that DALL·E is able to create plausible images for a great variety of sentences that
explore the compositional structure of language. We illustrate this using a series of
interactive visuals in the next section. The samples shown for each caption in the visuals
are obtained by taking the top 32 of 512 after reranking with CLIP, but we do not use any
manual cherry-picking, aside from the thumbnails and standalone images that appear
outside. [3]
Controlling attributes
We test DALL·E’s ability to modify several of an object’s attributes, as well as the number
of times that it appears.
a stack of 3 cubes. a red cube is on the top, sitting on a green cube. the green cube
is in the middle, sitting on a blue cube. the blue cube is on the bottom.
an emoji of a baby penguin wearing a blue hat, red gloves, green shirt, and yellow
pants
While DALL·E does offer some level of controllability over the attributes and positions of
a small number of objects, the success rate can depend on how the caption is phrased. As
more objects are introduced, DALL·E is prone to confusing the associations between the
objects and their colors, and the success rate decreases sharply. We also note that DALL·E
is brittle with respect to rephrasing of the caption in these scenarios: alternative,
semantically equivalent captions often yield no correct interpretations.
To push this further, we test DALL·E’s ability to repeatedly draw the head of a well-known
figure at each angle from a sequence of equally spaced angles, and find that we can
recover a smooth animation of the rotating head.
DALL·E appears to be able to apply some types of optical distortions to scenes, as we see
with the options “fisheye lens view” and “a spherical panorama.” This motivated us to
explore its ability to generate reflections.
a plain white cube looking at its own reflection in a mirror. a plain white cube gazing
at itself in a mirror.
a painting of a capybara sitting in a field at sunrise
a store front that has the word ‘openai’ written on it. a store front that has the word
‘openai’ written on it. a store front that has the word ‘openai’ written on it. ‘openai’
store front.
With varying degrees of reliability, DALL·E provides access to a subset of the capabilities
of a 3D rendering engine via natural language. It can independently control the attributes
of a small number of objects, and to a limited extent, how many there are, and how they
are arranged with respect to one another. It can also control the location and angle from
which a scene is rendered, and can generate known objects in compliance with precise
specifications of angle and lighting conditions.
a female mannequin dressed in a black leather jacket and gold pleated skirt
a living room with two white armchairs and a painting of the colosseum. the
painting is mounted above a modern fireplace.
a loft bedroom with a white bed next to a nightstand. there is a fish tank beside the
bed.
Animal illustrations
In the previous section, we explored DALL·E’s ability to combine unrelated concepts
when generating images of real-world objects. Here, we explore this ability in the context
of art, for three kinds of illustrations: anthropomorphized versions of animals and
objects, animal chimeras, and emojis.
a professional high quality emoji of a lovestruck cup of boba
the exact same teapot on the top with ’gpt’ written on it on the bottom
We did not anticipate that this capability would emerge, and made no modifications to
the neural network or training procedure to encourage it. Motivated by these results, we
measure DALL·E’s aptitude for analogical reasoning problems by testing it on Raven’s
progressive matrices, a visual IQ test that saw widespread use in the 20th century.
Geographic knowledge
We find that DALL·E has learned about geographic facts, landmarks, and neighborhoods.
Its knowledge of these concepts is surprisingly precise in some ways and flawed in others.
Temporal knowledge
In addition to exploring DALL·E’s knowledge of concepts that vary over space, we also
explore its knowledge of concepts that vary over time.
Text-to-image synthesis has been an active area of research since the pioneering work of Reed et al.,[1] whose approach uses a GAN conditioned on text embeddings. The embeddings are produced by an encoder pretrained using a contrastive loss, not unlike CLIP. StackGAN[3] and StackGAN++[4] use multi-scale GANs to scale up the image resolution and improve visual fidelity. AttnGAN[5] incorporates attention between the text and image features, and proposes a contrastive text–image feature matching loss as an auxiliary objective. This is interesting to compare to our reranking with CLIP, which is done offline. Other work[2,6,7] incorporates additional sources of supervision during training to improve image quality. Finally, work by Nguyen et al.[8] and Cho et al.[9] explores sampling-based strategies for image generation that leverage pretrained multimodal discriminative models.
Similar to the rejection sampling used in VQ-VAE-2, we use CLIP to rerank the top 32 of 512 samples for each caption in all of the interactive visuals. This procedure can also be seen as a kind of language-guided search,[16] and can have a dramatic impact on sample quality.
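As a rough illustration of this reranking step, the sketch below draws candidate images for one caption, scores each against the caption with the open-source CLIP package, and keeps the best 32 of 512 by cosine similarity. The `generate_images` callable is a stand-in for sampling from DALL·E, the `ViT-B/32` checkpoint is only an example, and batching and other practical details are simplified; the actual pipeline used for the visuals may differ.

```python
import torch
import clip  # OpenAI's open-source CLIP package: https://github.com/openai/CLIP

N_SAMPLES, TOP_K = 512, 32

def rerank_with_clip(generate_images, caption, device="cuda"):
    """Keep the TOP_K of N_SAMPLES candidates that CLIP scores highest."""
    model, preprocess = clip.load("ViT-B/32", device=device)
    images = generate_images(caption, n=N_SAMPLES)    # list of PIL images
    image_batch = torch.stack([preprocess(im) for im in images]).to(device)
    text = clip.tokenize([caption]).to(device)

    with torch.no_grad():
        image_feats = model.encode_image(image_batch)
        text_feats = model.encode_text(text)
        image_feats = image_feats / image_feats.norm(dim=-1, keepdim=True)
        text_feats = text_feats / text_feats.norm(dim=-1, keepdim=True)
        scores = (image_feats @ text_feats.T).squeeze(-1)   # cosine similarities

    best = scores.topk(TOP_K).indices.tolist()
    return [images[i] for i in best]
```

Because the generator's own likelihood is never consulted in this sketch, the step acts as an offline filter over finished samples, which is why it can be viewed as rejection sampling or as a simple language-guided search.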
Footnotes
1. We decided to name our model using a portmanteau of the artist Salvador Dalí and Pixar’s WALL·E. ↩
2. A token is any symbol from a discrete vocabulary; for humans, each English letter is a token from a 26-letter alphabet. DALL·E’s vocabulary has tokens for both text and image concepts. Specifically, each image caption is represented using a maximum of 256 BPE-encoded tokens with a vocabulary size of 16384, and the image is represented using 1024 tokens with a vocabulary size of 8192.
The images are preprocessed to 256×256 resolution during training. Similar to VQ-VAE,[14,15] each image is compressed to a 32×32 grid of discrete latent codes using a discrete VAE[10,11] that we pretrained using a continuous relaxation.[12,13] We found that training using the relaxation obviates the need for an explicit codebook, EMA loss, or tricks like dead code revival, and can scale up to large vocabulary sizes. (A toy sketch of such a tokenizer appears after these footnotes.) ↩
3. Further details provided in a later section. ↩
4. This task is called variable binding, and has been extensively studied in the literature.[17,18,19,20] ↩
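For readers who want a concrete picture of the image tokenizer described in footnote 2 above, here is a toy discrete VAE trained with a Gumbel-softmax continuous relaxation. Only the 256×256 input, the 32×32 grid of codes, the 8192-way vocabulary, and the use of a relaxation are taken from the footnote; the layer sizes, the embedding-based decoder, and all names are illustrative guesses, not the actual dVAE used for DALL·E.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, GRID = 8192, 32   # 8192-way code vocabulary on a 32x32 latent grid

class ToyDiscreteVAE(nn.Module):
    def __init__(self, hidden=256):
        super().__init__()
        # 256 -> 32 spatial resolution via three stride-2 convolutions.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, hidden, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, hidden, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, VOCAB, 4, stride=2, padding=1),  # per-cell code logits
        )
        self.codebook = nn.Embedding(VOCAB, hidden)             # learned code embeddings
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(hidden, hidden, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(hidden, hidden, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(hidden, 3, 4, stride=2, padding=1),
        )

    def forward(self, images, tau=1.0):
        logits = self.encoder(images)                           # (B, 8192, 32, 32)
        # Relaxed one-hot samples; gradients flow through the softmax, so no
        # explicit codebook loss, EMA update, or dead-code revival is needed.
        soft_codes = F.gumbel_softmax(logits, tau=tau, hard=False, dim=1)
        latents = torch.einsum("bvhw,vc->bchw", soft_codes, self.codebook.weight)
        return self.decoder(latents), logits

    @torch.no_grad()
    def tokenize(self, images):
        """Hard codes that would fill DALL·E's 1024 image-token slots."""
        return self.encoder(images).argmax(dim=1).flatten(1)    # (B, 1024)
```

Training such a model would minimize a reconstruction loss (for example `F.mse_loss(recon, images)`) while annealing the temperature `tau`; the hard `tokenize` output is what would occupy the 1024 image-token positions of the transformer's input stream.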
References
1. Reed, S., Akata, Z., Yan, X., Logeswaran, L., Schiele, B., Lee, H. (2016). “Generative adversarial text to image synthesis”. In ICML 2016. ↩
2. Reed, S., Akata, Z., Mohan, S., Tenka, S., Schiele, B., Lee, H. (2016). “Learning what and where to draw”. In NIPS 2016. ↩
3. Zhang, H., Xu, T., Li, H., Zhang, S., Wang, X., Huang, X., Metaxas, D. (2016). “StackGAN: Text to photo-realistic image synthesis with stacked generative adversarial networks”. In ICCV 2017. ↩
4. Zhang, H., Xu, T., Li, H., Zhang, S., Wang, X., Huang, X., Metaxas, D. (2017). “StackGAN++: Realistic image synthesis with stacked generative adversarial networks”. In IEEE TPAMI 2018. ↩
5. Xu, T., Zhang, P., Huang, Q., Zhang, H., Gan, Z., Huang, X., He, X. (2017). “AttnGAN: Fine-grained text to image generation with attentional generative adversarial networks”. ↩
6. Li, W., Zhang, P., Zhang, L., Huang, Q., He, X., Lyu, S., Gao, J. (2019). “Object-driven text-to-image synthesis via adversarial training”. In CVPR 2019. ↩
7. Koh, J. Y., Baldridge, J., Lee, H., Yang, Y. (2020). “Text-to-image generation grounded by fine-grained user attention”. In WACV 2021. ↩
8. Nguyen, A., Clune, J., Bengio, Y., Dosovitskiy, A., Yosinski, J. (2016). “Plug & play generative networks: Conditional iterative generation of images in latent space”. ↩
9. Cho, J., Lu, J., Schwenk, D., Hajishirzi, H., Kembhavi, A. (2020). “X-LXMERT: Paint, caption, and answer questions with multi-modal transformers”. In EMNLP 2020. ↩
10. Kingma, D. P., Welling, M. (2013). “Auto-encoding variational Bayes”. arXiv preprint. ↩
11. Rezende, D. J., Mohamed, S., Wierstra, D. (2014). “Stochastic backpropagation and approximate inference in deep generative models”. arXiv preprint. ↩
12. Jang, E., Gu, S., Poole, B. (2016). “Categorical reparameterization with Gumbel-softmax”. ↩
13. Maddison, C., Mnih, A., Teh, Y. W. (2016). “The Concrete distribution: A continuous relaxation of discrete random variables”. ↩
14. van den Oord, A., Vinyals, O., Kavukcuoglu, K. (2017). “Neural discrete representation learning”. ↩
15. Razavi, A., van den Oord, A., Vinyals, O. (2019). “Generating diverse high-fidelity images with VQ-VAE-2”. ↩
16. Andreas, J., Klein, D., Levine, S. (2017). “Learning with Latent Language”. ↩
17. Smolensky, P. (1990). “Tensor product variable binding and the representation of symbolic structures in connectionist systems”. ↩
18. Plate, T. (1995). “Holographic reduced representations: Convolution algebra for compositional distributed representations”. ↩
19. Gayler, R. (1998). “Multiplicative binding, representation operators & analogy”. ↩
20. Kanerva, P. (1997). “Fully distributed representations”. ↩
Authors
Aditya Ramesh, Mikhail Pavlov, Gabriel Goh & Scott Gray
(Primary authors)
Contributions
Aditya Ramesh was the project lead: he developed the approach, trained the models, and wrote most of the
blog copy.
Aditya Ramesh, Mikhail Pavlov, and Scott Gray worked together to scale up the model to 12 billion parameters,
and designed the infrastructure used to draw samples from the model.
Aditya Ramesh, Gabriel Goh, and Justin Jay Wang worked together to create the interactive visuals for
the blog.
Mark Chen and Aditya Ramesh created the images for Raven’s Progressive Matrices.
Pamela Mishkin, Gretchen Krueger, and Sandhini Agarwal advised on broader impacts of the work and assisted
in writing the blog.
Acknowledgments
Thanks to the following for their feedback on this work and contributions to this release: Alec Radford, Andrew
Mayne, Jeff Clune, Ashley Pilipiszyn, Steve Dowling, Jong Wook Kim, Lei Pan, Heewoo Jun, John Schulman,
Michael Tabatowski, Preetum Nakkiran, Jack Clark, Fraser Kelton, Jacob Jackson, Greg Brockman, Wojciech
Zaremba, Justin Mao-Jones, David Luan, Shantanu Jain, Prafulla Dhariwal, Sam Altman, Pranav Shyam, Miles
Brundage, Jakub Pachocki, and Ryan Lowe.
Cover Artwork
Justin Jay Wang
Filed Under
Research, Milestones, Multimodal