
Kandinsky: an Improved Text-to-Image Synthesis with Image Prior and Latent Diffusion

Anton Razzhigaev1,2, Arseniy Shakhmatov3, Anastasia Maltseva3, Vladimir Arkhipkin3,
Igor Pavlov3, Ilya Ryabov3, Angelina Kuts3, Alexander Panchenko2,1,
Andrey Kuznetsov3,1, and Denis Dimitrov3,1

1AIRI, 2Skoltech, 3Sber AI
{razzhigaev, kuznetsov, dimitrov}@airi.net

arXiv:2310.03502v1 [cs.CV] 5 Oct 2023

Abstract

Text-to-image generation is a significant domain in modern computer vision and has achieved substantial improvements through the evolution of generative architectures. Among these, diffusion-based models have demonstrated essential quality enhancements. These models are generally split into two categories: pixel-level and latent-level approaches. We present Kandinsky1, a novel exploration of latent diffusion architecture, combining the principles of image prior models with latent diffusion techniques. The image prior model is trained separately to map text embeddings to image embeddings of CLIP. Another distinct feature of the proposed model is the modified MoVQ implementation, which serves as the image autoencoder component. Overall, the designed model contains 3.3B parameters. We also deployed a user-friendly demo system that supports diverse generative modes such as text-to-image generation, image fusion, text and image fusion, image variations generation, and text-guided inpainting/outpainting. Additionally, we released the source code and checkpoints for the Kandinsky models. Experimental evaluations demonstrate a FID score of 8.03 on the COCO-30K dataset, marking our model as the top open-source performer in terms of measurable image generation quality.

1 Introduction

In quite a short period of time, the generative abilities of text-to-image models have improved substantially, providing users with photorealistic quality, near real-time inference speed, and a great number of applications and features, including simple easy-to-use web-based platforms and sophisticated AI graphics editors.

This paper presents our investigation of latent diffusion architecture design, offering a fresh and innovative perspective on this dynamic field of study. First, we describe the new architecture of Kandinsky and its details. The demo system with the implemented features of the model is also described. Second, we report experiments on image generation quality, which show that Kandinsky achieves the best FID score among existing open-source models. Additionally, we present a rigorous ablation study of prior setups, enabling us to carefully analyze and evaluate various configurations and arrive at the most effective and refined model design.

Our contributions are as follows:

• We present the first text-to-image architecture designed using a combination of image prior and latent diffusion.

• We demonstrate experimental results comparable to the state-of-the-art (SotA) models such as Stable Diffusion, IF, and DALL-E 2 in terms of the FID metric, and achieve the SotA score among all existing open-source models.

• We provide a software implementation of the proposed state-of-the-art method for text-to-image generation and release pre-trained models, which is unique among the top-performing methods. The Apache 2.0 license makes it possible to use the model for both non-commercial and commercial purposes.2,3

• We create a web image editor application that can be used for interactive generation of images by text prompts (English and Russian languages are supported) on the basis of the proposed method, and provides inpainting/outpainting functionality.4 A video demonstration is available on YouTube.5

1 The system is named after Wassily Kandinsky, a famous painter and an art theorist.
2 https://github.com/ai-forever/Kandinsky-2
3 https://huggingface.co/kandinsky-community
4 https://fusionbrain.ai/en/editor
5 https://www.youtube.com/watch?v=c7zHPc59cWU
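The released code and checkpoints (footnotes 2 and 3) can also be driven through the Hugging Face diffusers library. The snippet below is a minimal sketch of the two-stage inference flow described in this paper (image prior, then latent diffusion with MoVQ decoding); the repository ids, class names, and arguments reflect the community diffusers integration and are assumptions that may differ across library versions, not the authors' reference implementation.

```python
# Minimal sketch of two-stage Kandinsky inference via Hugging Face diffusers.
# Assumes a GPU and the kandinsky-community checkpoints; exact argument names
# may differ between diffusers versions.
import torch
from diffusers import KandinskyPriorPipeline, KandinskyPipeline

# Stage 1: the diffusion image prior maps a text prompt to a CLIP image embedding.
prior = KandinskyPriorPipeline.from_pretrained(
    "kandinsky-community/kandinsky-2-1-prior", torch_dtype=torch.float16
).to("cuda")

# Stage 2: the latent diffusion U-Net plus MoVQ decoder turn the embedding into pixels.
decoder = KandinskyPipeline.from_pretrained(
    "kandinsky-community/kandinsky-2-1", torch_dtype=torch.float16
).to("cuda")

prompt = "a corgi gliding on the wave"
image_embeds, negative_image_embeds = prior(prompt).to_tuple()

image = decoder(
    prompt,
    image_embeds=image_embeds,
    negative_image_embeds=negative_image_embeds,
    height=768,
    width=768,
    num_inference_steps=100,
).images[0]
image.save("corgi.png")
```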
Figure 1: Image prior scheme and inference regimes of the Kandinsky model (image prior training, text-to-image generation, image variations, image fusion, fusion with text, and inpainting).

2 Related Work

Early text-to-image generative models, such as DALL-E (Ramesh et al., 2021) and CogView (Ding et al., 2021), or later Parti (Yu et al., 2022), employed autoregressive approaches but often suffered from significant content-level artifacts. This led to the development of a new breed of models that utilized the diffusion process to enhance image quality. Diffusion-based models, such as DALL-E 2 (Ramesh et al., 2022), Imagen (Saharia et al., 2022b), and Stable Diffusion6, have since become cornerstones in this domain. These models are typically divided into pixel-level (Ramesh et al., 2022; Saharia et al., 2022b) and latent-level (Rombach et al., 2022) approaches.

This surge of interest has led to the design of innovative approaches and architectures, paving the way for numerous applications based on open-source generative models, such as DreamBooth (Ruiz et al., 2023) and DreamPose (Karras et al., 2023). These applications exploit image generation techniques to offer remarkable features, further fueling the popularity and the rapid development of diffusion-based image generation approaches. This enabled a wide array of applications like 3D object synthesis (Poole et al., 2023; Tang et al., 2023; Lin et al., 2022; Chen et al., 2023), video generation (Ho et al., 2022b; Luo et al., 2023; Ho et al., 2022a; Singer et al., 2023; Blattmann et al., 2023; Esser et al., 2023), controllable image editing (Hertz et al., 2023; Parmar et al., 2023; Liew et al., 2022; Mou et al., 2023; Lu et al., 2023), and more, which are now at the forefront of this domain.

Diffusion models achieve state-of-the-art results in image generation, both unconditional (Ho et al., 2020; Nichol and Dhariwal, 2021) and conditional (Peebles and Xie, 2022). They beat GANs (Goodfellow et al., 2014) by generating images with better fidelity and diversity scores without adversarial training (Dhariwal and Nichol, 2021). Diffusion models also show the best performance in various image processing tasks like inpainting, outpainting, and super-resolution (Batzolis et al., 2021; Saharia et al., 2022a).

Text-to-image diffusion models have become a popular research direction due to the high performance of diffusion models and the ability to simply integrate text conditions via the classifier-free guidance algorithm (Ho and Salimans, 2022). Early models like GLIDE (Nichol et al., 2022), Imagen (Saharia et al., 2022b), DALL-E 2 (Ramesh et al., 2022), and eDiff-I (Balaji et al., 2022) generate a low-resolution image in pixel space and then upsample it with additional super-resolution diffusion models. They also use different text encoders: the large language model T5 (Raffel et al., 2020) in Imagen, and CLIP (Radford et al., 2021) in GLIDE and DALL-E 2.

6 https://github.com/CompVis/stable-diffusion

3 Demo System

We implemented a set of user-oriented solutions where the Kandinsky model is embedded as a core imaging service. This was done due to a variety of inference regimes, some of which need specific front-end features to perform properly. Overall, we implemented two main inference resources: a Telegram bot and the FusionBrain website.
Figure 2: Examples of inference regimes using Kandinsky model.

FusionBrain represents a web-based image editor with such features as loading and saving images, a sliding location window, erasing tools, zooming in/out, a selector of various styles, etc. (cf. Figure 3). In terms of image generation, the three following options are implemented on this side:

• text-to-image generation – the user inputs a text prompt in Russian or English, then selects an aspect ratio from the list (9:16, 2:3, 1:1, 16:9, 3:2), and the system generates an image;

• inpainting – using a dedicated erasing tool, the user can remove any arbitrary part of the input image and fill it, guided by a text prompt or without any guidance;

• outpainting – the input image can be extended with a sliding window that is used as a mask for the subsequent generation: if the window intersects an already imaged area, the empty part of the window is generated with or without text prompt guidance (a mask-construction sketch is given below).

Inpainting and outpainting are the main image editing features of the model. Architectural details about these generation types can also be found in Figure 1.
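As an illustration of the outpainting masking just described, here is a minimal sketch that extends the canvas to the right; the layout, file path, and variable names are illustrative and not the editor's front-end code.

```python
# Sketch: build an outpainting canvas and mask for a window that extends
# the original image to the right (illustrative only).
import numpy as np
from PIL import Image

image = Image.open("input.png").convert("RGB")  # placeholder path
w, h = image.size
extend = 256  # how far the sliding window reaches beyond the image

# New canvas: original pixels on the left, empty region on the right.
canvas = np.zeros((h, w + extend, 3), dtype=np.uint8)
canvas[:, :w] = np.array(image)

# Mask: 1 where content must be generated, 0 where the input image is kept.
mask = np.zeros((h, w + extend), dtype=np.uint8)
mask[:, w:] = 1

# The (canvas, mask) pair, optionally with a text prompt, is then passed to
# the inpainting/outpainting generation regime.
```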
The Telegram bot contains the following image generation features (cf. Figure 2):

• text-to-image generation;

• image and text fusion – the user inputs an image and a text prompt to create a new image guided by this prompt;

• image fusion – the user inputs an image as the main one and another ’guiding’ image, and the system generates their fusion;

• image variations – the user inputs an image, and the system generates several new images similar to the input one.

Figure 3: Kandinsky web interface for “a corgi gliding on the wave”: generation (left) and in/outpainting (right).

4 Kandinsky Architecture

In our work, we opted to deliver state-of-the-art text-to-image synthesis. In the initial stages of our research, we experimented with multilingual text encoders, such as mT5 (Xue et al., 2021), XLMR (Conneau et al., 2020), and XLMR-CLIP7, to facilitate robust multilingual text-to-image generation. However, we discovered that using CLIP-image embeddings instead of standalone text encoders resulted in improved image quality. As a result, we adopted an image prior approach, utilizing diffusion and linear mappings between the text and image embedding spaces of CLIP, while keeping additional conditioning with XLMR text embeddings. That is why Kandinsky uses two text encoders: CLIP-text with image prior mapping, and XLMR. Both encoders are kept frozen during the training phase.

The significant factor that influenced our design choice was the efficiency of training latent diffusion models compared to pixel-level diffusion models (Rombach et al., 2022). This led us to focus our efforts on the latent diffusion architecture. Our model essentially comprises three stages: text encoding, embedding mapping (image prior), and latent diffusion.

7 https://github.com/FreddeFrallan/Multilingual-CLIP

Table 1: Proposed architecture comparison by FID on the COCO-30K validation set at 256×256 resolution. *For the IF model we report reproduced results on COCO-30K; the authors provide a FID of 7.19.

Model                              FID-30K
Open-Source Technologies
Kandinsky (Ours)                   8.03
Stable Diffusion 2.1 (2022)8       8.59
GLIDE (Nichol et al., 2022)        12.24
IF* (2023)12                       15.10
Kandinsky 1.0 (2022)9              15.40
ruDALL-E Malevich (2022)9          20.00
GLIGEN10 (Li et al., 2023)         21.04
Proprietary Technologies
eDiff-I (Balaji et al., 2022)      6.95
Imagen (Saharia et al., 2022b)     7.27
GigaGAN (Kang et al., 2023)        9.09
DALL-E 2 (Ramesh et al., 2022)     10.39
DALL-E (Ramesh et al., 2021)       17.89

8 https://github.com/Stability-AI/stablediffusion
9 https://github.com/ai-forever/ru-dalle
10 https://github.com/gligen/GLIGEN
At the embedding mapping step, which we also refer to as the image prior, we use a transformer-encoder model. This model was trained from scratch with a diffusion process on text and image embeddings provided by the CLIP-ViT-L14 model. A noteworthy feature of our training process is the use of element-wise normalization of visual embeddings. This normalization is based on full-dataset statistics and leads to faster convergence of the diffusion process. We implemented inverse normalization to revert to the original CLIP-image embedding space at the inference stage.

The image prior model is trained on text and image embeddings provided by the CLIP models. We conducted a series of experiments and ablation studies on the specific architecture design of the image prior model (Table 3, Figure 6). The model with the best human evaluation score is based on a 1D diffusion process and a standard transformer encoder with the following parameters: num_layers=20, num_heads=32, and hidden_size=2048.
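To make the image prior stage more concrete, below is a simplified PyTorch sketch of the ideas described above: CLIP image embeddings are element-wise normalized with full-dataset statistics, and a transformer encoder denoises them conditioned on the text embedding and the timestep. All module names, the toy noise schedule, and the placeholder data are illustrative assumptions; this is not the released implementation.

```python
# Illustrative sketch of a diffusion image prior (not the released code).
import torch
import torch.nn as nn


class DiffusionImagePrior(nn.Module):
    def __init__(self, clip_dim=768, hidden_size=2048, num_layers=20, num_heads=32):
        super().__init__()
        # Full-dataset statistics used for element-wise (de)normalization.
        self.register_buffer("emb_mean", torch.zeros(clip_dim))
        self.register_buffer("emb_std", torch.ones(clip_dim))
        self.in_proj = nn.Linear(clip_dim, hidden_size)
        self.text_proj = nn.Linear(clip_dim, hidden_size)
        self.time_proj = nn.Linear(1, hidden_size)
        layer = nn.TransformerEncoderLayer(d_model=hidden_size, nhead=num_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.out_proj = nn.Linear(hidden_size, clip_dim)

    def normalize(self, image_emb):
        return (image_emb - self.emb_mean) / self.emb_std

    def denormalize(self, image_emb):
        # Inverse normalization back to the original CLIP-image embedding space.
        return image_emb * self.emb_std + self.emb_mean

    def forward(self, noisy_image_emb, text_emb, t):
        # Tokens: [noisy image embedding, text embedding, timestep]; the first
        # output token is read out as the predicted clean image embedding.
        tokens = torch.stack(
            [self.in_proj(noisy_image_emb),
             self.text_proj(text_emb),
             self.time_proj(t.float().unsqueeze(-1))],
            dim=1,
        )
        return self.out_proj(self.encoder(tokens)[:, 0])


# One training step (simplified DDPM-style objective on embeddings).
# Small config for the demo; the paper's setting is num_layers=20,
# num_heads=32, hidden_size=2048.
prior = DiffusionImagePrior(hidden_size=256, num_layers=2, num_heads=8)
text_emb = torch.randn(4, 768)    # CLIP-text embeddings (placeholder)
image_emb = torch.randn(4, 768)   # CLIP-image embeddings (placeholder)
t = torch.randint(0, 1000, (4,))
alpha = 1.0 - t.float().unsqueeze(-1) / 1000.0   # toy noise schedule
target = prior.normalize(image_emb)
noisy = alpha.sqrt() * target + (1 - alpha).sqrt() * torch.randn_like(target)
loss = nn.functional.mse_loss(prior(noisy, text_emb, t), target)
loss.backward()
```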
The latent diffusion part employs a UNet model along with a custom pre-trained autoencoder. Our diffusion model uses a combination of multiple condition signals: CLIP-image embeddings, CLIP-text embeddings, and XLMR-CLIP text embeddings. The CLIP-image and XLMR-CLIP embeddings are merged and utilized as an input to the latent diffusion process. We also conditioned the diffusion process on these embeddings by adding all of them to the time embedding. Notably, we did not skip the quantization step of the autoencoder during diffusion inference, as it leads to an increase in the diversity and the quality of generated images (cf. Figure 4). In total, our model comprises 3.3B parameters (Table 2).

Table 2: Kandinsky model parameters.

Architecture part               Params   Freeze
Diffusion Mapping               1B       False
CLIP image encoder (ViT-L14)    427M     True
CLIP text encoder               340M     True
Text encoder (XLM-R-L)          560M     True
Latent Diffusion UNet           1.22B    False
MoVQ image autoencoder          67M      True
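The conditioning scheme described above can be sketched as follows: the CLIP-image embedding and the XLMR-CLIP text tokens are merged into the cross-attention context for the U-Net, while pooled projections of all condition embeddings are added to the timestep embedding. The dimensions, module names, and pooling choice are hypothetical, not the released U-Net.

```python
# Illustrative sketch of combining condition signals for the latent diffusion
# U-Net (hypothetical shapes and names, not the released code).
import torch
import torch.nn as nn


class ConditionMixer(nn.Module):
    def __init__(self, clip_dim=768, xlmr_dim=1024, ctx_dim=1024, time_dim=1280):
        super().__init__()
        # Projections into a shared width for the U-Net cross-attention context.
        self.clip_ctx = nn.Linear(clip_dim, ctx_dim)
        self.xlmr_ctx = nn.Linear(xlmr_dim, ctx_dim)
        # Pooled projections that are added to the timestep embedding.
        self.img_to_time = nn.Linear(clip_dim, time_dim)
        self.txt_to_time = nn.Linear(clip_dim, time_dim)
        self.xlmr_to_time = nn.Linear(xlmr_dim, time_dim)

    def forward(self, time_emb, clip_image_emb, clip_text_emb, xlmr_text_tokens):
        # time_emb:         (B, time_dim)      timestep embedding
        # clip_image_emb:   (B, clip_dim)      output of the image prior
        # clip_text_emb:    (B, clip_dim)      pooled CLIP-text embedding
        # xlmr_text_tokens: (B, T, xlmr_dim)   XLMR-CLIP token embeddings
        context = torch.cat(
            [self.clip_ctx(clip_image_emb).unsqueeze(1),
             self.xlmr_ctx(xlmr_text_tokens)],
            dim=1,
        )  # merged cross-attention input for the U-Net
        time_emb = (
            time_emb
            + self.img_to_time(clip_image_emb)
            + self.txt_to_time(clip_text_emb)
            + self.xlmr_to_time(xlmr_text_tokens.mean(dim=1))
        )  # all condition embeddings added to the time embedding
        return time_emb, context


mixer = ConditionMixer()
t_emb, ctx = mixer(torch.randn(2, 1280), torch.randn(2, 768),
                   torch.randn(2, 768), torch.randn(2, 64, 1024))
print(t_emb.shape, ctx.shape)  # torch.Size([2, 1280]) torch.Size([2, 65, 1024])
```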
We observed that image decoding was our main bottleneck in terms of generated image quality; hence, we developed Sber-MoVQGAN, our custom implementation of MoVQGAN (Zheng et al., 2022) with minor modifications. We trained this autoencoder on the LAION HighRes dataset (Schuhmann et al., 2022), obtaining SotA results in image reconstruction. We released the weights and code for these models under an open source licence11. The comparison of our autoencoder with competitors can be found in Table 4.

Table 3: Ablation study: FID on the COCO-30K validation set at 256×256 resolution.

Setup                               FID-30K   CLIP
Diffusion prior with quantization   9.86      0.287
Diffusion prior w/o quantization    9.87      0.286
Linear prior                        8.03      0.261
Residual prior                      8.61      0.249
No prior                            25.92     0.256

Figure 4: CLIP-FID curves for different setups.

5 Experiments
We sought to evaluate and refine the performance of our proposed latent diffusion architecture in our experimental analysis. To this end, we employed automatic metrics, specifically FID-CLIP curves on the COCO-30K dataset, to obtain the optimal guidance-scale value and to compare Kandinsky with competitors (cf. Figure 4). Furthermore, we conducted investigations with various image prior setups, exploring the impact of different configurations on performance. These setups included: no prior, utilizing text embeddings directly; a linear prior, implementing one linear layer; a ResNet prior, consisting of 18 residual MLP blocks; and a transformer diffusion prior.

Figure 5: Image generation results with the prompt "astronaut riding a horse" for the original image prior and a linear prior trained on 500 pairs of images with cats.
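The no-prior, linear, and ResNet-style setups listed above can be summarized with the following illustrative module definitions; block details and names are assumptions, and the transformer diffusion prior corresponds to the model sketched in Section 4.

```python
# Illustrative definitions of the ablated prior setups (assumed details).
import torch
import torch.nn as nn


class LinearPrior(nn.Module):
    """One linear layer from CLIP-text to CLIP-image embedding space."""

    def __init__(self, dim=768):
        super().__init__()
        self.map = nn.Linear(dim, dim)

    def forward(self, text_emb):
        return self.map(text_emb)


class ResidualMLPBlock(nn.Module):
    def __init__(self, dim=768, hidden=2048):
        super().__init__()
        self.net = nn.Sequential(
            nn.LayerNorm(dim), nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim)
        )

    def forward(self, x):
        return x + self.net(x)


class ResNetPrior(nn.Module):
    """A stack of 18 residual MLP blocks, as in the ablation."""

    def __init__(self, dim=768, num_blocks=18):
        super().__init__()
        self.blocks = nn.Sequential(*[ResidualMLPBlock(dim) for _ in range(num_blocks)])

    def forward(self, text_emb):
        return self.blocks(text_emb)


# "No prior" simply feeds the CLIP-text embedding to the decoder unchanged.
no_prior = nn.Identity()
text_emb = torch.randn(2, 768)
print(LinearPrior()(text_emb).shape, ResNetPrior()(text_emb).shape, no_prior(text_emb).shape)
```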
An essential aspect of our experiments was the exploration of the effect of latent quantization within the MoVQ autoencoder. We examined the outputs with latent quantization both enabled and disabled, to better comprehend its influence on image generation quality.

To ensure a comprehensive evaluation, we also included an assessment of the IF model12, which is the closest open-source competitor to our proposed model. For this purpose, we computed FID scores for the IF model13 (Table 1).

However, we acknowledged the limitations of automatic metrics that become obvious when it comes to capturing user experience nuances. Hence, in addition to the FID-CLIP curves, we conducted a blind human evaluation to obtain insightful feedback and validate the quality of the generated images from the perspective of human perception, based on the DrawBench dataset (Saharia et al., 2022b).

The combination of automatic metrics and human evaluation provides a comprehensive assessment of Kandinsky performance, enabling us to make informed decisions about the effectiveness and usability of our proposed image prior design.

11 https://github.com/ai-forever/MoVQGAN
12 https://github.com/deep-floyd/IF
13 https://github.com/mseitzer/pytorch-fid
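For reference, the automatic metrics above can be reproduced with standard tooling: FID with the pytorch-fid package (footnote 13) on folders of generated and reference COCO images, and the CLIP score as the prompt-image cosine similarity. The sketch below uses the Hugging Face transformers CLIP implementation; the file paths and guidance-scale naming are placeholders, not the authors' evaluation scripts.

```python
# Sketch: one point of a CLIP-FID curve for a given guidance scale.
# FID (lower is better) is computed with the pytorch-fid CLI, e.g.:
#   python -m pytorch_fid coco30k_reference/ generated_gs_4.0/
# CLIP score (higher is better) is the prompt-image cosine similarity.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")


@torch.no_grad()
def clip_score(prompts, image_paths):
    images = [Image.open(p).convert("RGB") for p in image_paths]
    inputs = processor(text=prompts, images=images, return_tensors="pt",
                       padding=True, truncation=True)
    out = model(**inputs)
    text_emb = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    image_emb = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    return (text_emb * image_emb).sum(dim=-1).mean().item()


# Placeholder lists; in practice these are the COCO-30K prompts and the images
# generated at a particular guidance scale.
print(clip_score(["astronaut riding a horse"], ["generated_gs_4.0/000000.png"]))
```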
6 Results

Our experiments and evaluations have showcased the capabilities of the Kandinsky architecture in text-to-image synthesis. Kandinsky achieved a FID score of 8.03 on the COCO-30K validation set at a resolution of 256×256, which puts it in close competition with the state-of-the-art models and among the top performers within open-source systems. Our methodical ablation studies further dissected the performance of different configurations: quantization of latent codes in MoVQ slightly improves the quality of images (FID 9.86 vs 9.87), and the best CLIP score and human-eval score are obtained by the diffusion prior.
Figure 6: Human evaluation: competitors vs Kandinsky with diffusion prior on DrawBench. The total count of votes is 5000.

Table 4: Sber-MoVQGAN comparison with competitors on the ImageNet dataset.

Model Latent size Num Z Train steps FID ↓ SSIM ↑ PSNR ↑ L1 ↓


ViT-VQGAN* 32x32 8192 500,000 1.28 - - -
RQ-VAE* 8x8x16 16384 10 epochs 1.83 - - -
Mo-VQGAN* 16x16x4 1024 40 epochs 1.12 0.673 22.42 -
VQ CompVis 32x32 16384 971,043 1.34 0.650 23.85 0.0533
KL CompVis 32x32 - 246,803 0.968 0.692 25.11 0.0474
Sber-VQGAN 32x32 8192 1 epoch 1.44 0.682 24.31 0.0503
Sber-MoVQGAN 67M 32x32 1024 5,000,000 1.34 0.704 25.68 0.0451
Sber-MoVQGAN 67M 32x32 16384 2,000,000 0.965 0.725 26.45 0.0415
Sber-MoVQGAN 102M 32x32 16384 2,360,000 0.776 0.737 26.89 0.0398
Sber-MoVQGAN 270M 32x32 16384 1,330,000 0.686 0.741 27.04 0.0393

The best FID score is achieved using the linear prior. This configuration stands out with the best FID score of 8.03. It is an intriguing outcome: the simplest linear mapping showcased the best FID, suggesting that there might exist a linear relationship between the visual and textual embedding vector spaces. To further scrutinize this hypothesis, we trained a linear mapping on a subset of 500 cat images and termed it the "cat prior". Astonishingly, this mapping displayed high proficiency (cf. Figure 5).
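The "cat prior" experiment above amounts to fitting a single linear map from CLIP-text to CLIP-image embeddings on a small set of paired examples. A minimal sketch of such a fit, with random placeholder embeddings standing in for real CLIP features, is shown below.

```python
# Sketch: fit a linear "prior" from CLIP-text to CLIP-image embeddings
# on a small paired subset (e.g. ~500 cat image-caption pairs).
# Embeddings here are random placeholders; real ones would come from CLIP.
import torch
import torch.nn as nn

text_embs = torch.randn(500, 768)    # CLIP-text embeddings of the captions
image_embs = torch.randn(500, 768)   # CLIP-image embeddings of the images

linear_prior = nn.Linear(768, 768)
opt = torch.optim.Adam(linear_prior.parameters(), lr=1e-3)

for step in range(1000):
    opt.zero_grad()
    pred = linear_prior(text_embs)
    loss = nn.functional.mse_loss(pred, image_embs)  # MSE between embedding spaces
    loss.backward()
    opt.step()

# At inference, the predicted image embedding replaces the diffusion prior's
# output and is passed to the latent diffusion decoder.
predicted_image_emb = linear_prior(torch.randn(1, 768))
```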
7 Conclusion

We presented Kandinsky, a system for various image generation and processing tasks based on a novel latent diffusion model. Our model yielded SotA results among open-sourced systems. Additionally, we provided an extensive ablation study of image prior design choices. Our system is equipped with free-to-use interfaces in the form of a Web application and a Telegram messenger bot. The pre-trained models are available on Hugging Face, and the source code is released under a permissive license enabling various, including commercial, applications of the developed technology.

In future research, our goal is to investigate the potential of the latest image encoders. We plan to explore the development of more efficient UNet architectures for text-to-image tasks and focus on improving the understanding of textual prompts. Additionally, we aim to experiment with generating images at higher resolutions and to investigate new features extending the model: local image editing by a text prompt, attention reweighting, physics-based generation control, etc. The robustness against generating abusive content remains a crucial concern, warranting the exploration of real-time moderation layers or robust classifiers to mitigate undesirable, e.g. toxic or abusive, outputs.

8 Limitations

The current system produces images that appear natural; however, additional research can be conducted to (1) enhance the semantic coherence between the input text and the generated image, and (2) improve the absolute values of FID and image quality based on human evaluations.
9 Ethical Considerations

We performed multiple efforts to ensure that the generated images do not contain harmful, offensive, or abusive content by (1) cleansing the training dataset of samples that were marked as harmful/offensive/abusive, and (2) detecting abusive textual prompts.

While obvious queries, according to our tests, almost never generate abusive content, technically it is not guaranteed that certain carefully engineered prompts will not yield undesirable content. We therefore recommend using an additional layer of classifiers, depending on the application, to filter out the undesired content, and/or using image/representation transformation methods tailored to a given application.

Acknowledgements

As usual, we would like to thank the anonymous reviewers for their useful comments. We would also like to thank Sergey Markov and his team for helpful feedback and discussions, and for collaboration in multimodal dataset collection, labelling, and processing.
References

Yogesh Balaji, Seungjun Nah, Xun Huang, Arash Vahdat, Jiaming Song, Karsten Kreis, Miika Aittala, Timo Aila, Samuli Laine, Bryan Catanzaro, Tero Karras, and Ming-Yu Liu. 2022. eDiff-I: Text-to-image diffusion models with an ensemble of expert denoisers.

Georgios Batzolis, Jan Stanczuk, Carola-Bibiane Schönlieb, and Christian Etmann. 2021. Conditional image generation with score-based diffusion models. CoRR, abs/2111.13606.

Andreas Blattmann, Robin Rombach, Huan Ling, Tim Dockhorn, Seung Wook Kim, Sanja Fidler, and Karsten Kreis. 2023. Align your latents: High-resolution video synthesis with latent diffusion models. CoRR, abs/2304.08818.

Rui Chen, Yongwei Chen, Ningxin Jiao, and Kui Jia. 2023. Fantasia3D: Disentangling geometry and appearance for high-quality text-to-3D content creation. CoRR, abs/2303.13873.

Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2020. Unsupervised cross-lingual representation learning at scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020, pages 8440–8451. Association for Computational Linguistics.

Prafulla Dhariwal and Alexander Quinn Nichol. 2021. Diffusion models beat GANs on image synthesis. In Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6-14, 2021, virtual, pages 8780–8794.

Ming Ding, Zhuoyi Yang, Wenyi Hong, Wendi Zheng, Chang Zhou, Da Yin, Junyang Lin, Xu Zou, Zhou Shao, Hongxia Yang, and Jie Tang. 2021. CogView: Mastering text-to-image generation via transformers. In Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6-14, 2021, virtual, pages 19822–19835.

Patrick Esser, Johnathan Chiu, Parmida Atighehchian, Jonathan Granskog, and Anastasis Germanidis. 2023. Structure and content-guided video synthesis with diffusion models. CoRR, abs/2302.03011.

Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron C. Courville, and Yoshua Bengio. 2014. Generative adversarial nets. In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal, Quebec, Canada, pages 2672–2680.

Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. 2023. Prompt-to-prompt image editing with cross-attention control. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net.

Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao, Alexey A. Gritsenko, Diederik P. Kingma, Ben Poole, Mohammad Norouzi, David J. Fleet, and Tim Salimans. 2022a. Imagen Video: High definition video generation with diffusion models. CoRR, abs/2210.02303.

Jonathan Ho, Ajay Jain, and Pieter Abbeel. 2020. Denoising diffusion probabilistic models. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual.

Jonathan Ho and Tim Salimans. 2022. Classifier-free diffusion guidance. CoRR, abs/2207.12598.

Jonathan Ho, Tim Salimans, Alexey A. Gritsenko, William Chan, Mohammad Norouzi, and David J. Fleet. 2022b. Video diffusion models. In NeurIPS.

Minguk Kang, Jun-Yan Zhu, Richard Zhang, Jaesik Park, Eli Shechtman, Sylvain Paris, and Taesung Park. 2023. Scaling up GANs for text-to-image synthesis. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023, Vancouver, BC, Canada, June 17-24, 2023, pages 10124–10134. IEEE.

Johanna Karras, Aleksander Holynski, Ting-Chun Wang, and Ira Kemelmacher-Shlizerman. 2023. DreamPose: Fashion image-to-video synthesis via stable diffusion.

Yuheng Li, Haotian Liu, Qingyang Wu, Fangzhou Mu, Jianwei Yang, Jianfeng Gao, Chunyuan Li, and Yong Jae Lee. 2023. GLIGEN: Open-set grounded text-to-image generation. CoRR, abs/2301.07093.

Jun Hao Liew, Hanshu Yan, Daquan Zhou, and Jiashi Feng. 2022. MagicMix: Semantic mixing with diffusion models. CoRR, abs/2210.16056.

Chen-Hsuan Lin, Jun Gao, Luming Tang, Towaki Takikawa, Xiaohui Zeng, Xun Huang, Karsten Kreis, Sanja Fidler, Ming-Yu Liu, and Tsung-Yi Lin. 2022. Magic3D: High-resolution text-to-3D content creation. CoRR, abs/2211.10440.

Shilin Lu, Yanzhu Liu, and Adams Wai-Kin Kong. 2023. TF-ICON: Diffusion-based training-free cross-domain image composition. CoRR, abs/2307.12493.

Zhengxiong Luo, Dayou Chen, Yingya Zhang, Yan Huang, Liang Wang, Yujun Shen, Deli Zhao, Jingren Zhou, and Tieniu Tan. 2023. VideoFusion: Decomposed diffusion models for high-quality video generation. CoRR, abs/2303.08320.

Chong Mou, Xintao Wang, Jiechong Song, Ying Shan, and Jian Zhang. 2023. DragonDiffusion: Enabling drag-style manipulation on diffusion models. CoRR, abs/2307.02421.

Alexander Quinn Nichol and Prafulla Dhariwal. 2021. Improved denoising diffusion probabilistic models. In Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, volume 139 of Proceedings of Machine Learning Research, pages 8162–8171. PMLR.

Alexander Quinn Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. 2022. GLIDE: Towards photorealistic image generation and editing with text-guided diffusion models. In International Conference on Machine Learning, ICML 2022, 17-23 July 2022, Baltimore, Maryland, USA, volume 162 of Proceedings of Machine Learning Research, pages 16784–16804. PMLR.

Gaurav Parmar, Krishna Kumar Singh, Richard Zhang, Yijun Li, Jingwan Lu, and Jun-Yan Zhu. 2023. Zero-shot image-to-image translation. In ACM SIGGRAPH 2023 Conference Proceedings, SIGGRAPH 2023, Los Angeles, CA, USA, August 6-10, 2023, pages 11:1–11:11. ACM.

William Peebles and Saining Xie. 2022. Scalable diffusion models with transformers. CoRR, abs/2212.09748.

Ben Poole, Ajay Jain, Jonathan T. Barron, and Ben Mildenhall. 2023. DreamFusion: Text-to-3D using 2D diffusion. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net.

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. 2021. Learning transferable visual models from natural language supervision. In Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, volume 139 of Proceedings of Machine Learning Research, pages 8748–8763. PMLR.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res., 21:140:1–140:67.

Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. 2022. Hierarchical text-conditional image generation with CLIP latents.

Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. 2021. Zero-shot text-to-image generation. In Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, volume 139 of Proceedings of Machine Learning Research, pages 8821–8831. PMLR.

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. 2022. High-resolution image synthesis with latent diffusion models. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, June 18-24, 2022, pages 10674–10685. IEEE.

Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. 2023. DreamBooth: Fine tuning text-to-image diffusion models for subject-driven generation. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023, Vancouver, BC, Canada, June 17-24, 2023, pages 22500–22510. IEEE.

Chitwan Saharia, William Chan, Huiwen Chang, Chris A. Lee, Jonathan Ho, Tim Salimans, David J. Fleet, and Mohammad Norouzi. 2022a. Palette: Image-to-image diffusion models. In SIGGRAPH '22: Special Interest Group on Computer Graphics and Interactive Techniques Conference, Vancouver, BC, Canada, August 7-11, 2022, pages 15:1–15:10. ACM.

Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L. Denton, Seyed Kamyar Seyed Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, Jonathan Ho, David J. Fleet, and Mohammad Norouzi. 2022b. Photorealistic text-to-image diffusion models with deep language understanding.

Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, Patrick Schramowski, Srivatsa Kundurthy, Katherine Crowson, Ludwig Schmidt, Robert Kaczmarczyk, and Jenia Jitsev. 2022. LAION-5B: An open large-scale dataset for training next generation image-text models. In NeurIPS.

Uriel Singer, Adam Polyak, Thomas Hayes, Xi Yin, Jie An, Songyang Zhang, Qiyuan Hu, Harry Yang, Oron Ashual, Oran Gafni, Devi Parikh, Sonal Gupta, and Yaniv Taigman. 2023. Make-A-Video: Text-to-video generation without text-video data. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net.

Junshu Tang, Tengfei Wang, Bo Zhang, Ting Zhang, Ran Yi, Lizhuang Ma, and Dong Chen. 2023. Make-It-3D: High-fidelity 3D creation from a single image with diffusion prior. CoRR, abs/2303.14184.

Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, and Colin Raffel. 2021. mT5: A massively multilingual pre-trained text-to-text transformer. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2021, Online, June 6-11, 2021, pages 483–498. Association for Computational Linguistics.

Jiahui Yu, Yuanzhong Xu, Jing Yu Koh, Thang Luong, Gunjan Baid, Zirui Wang, Vijay Vasudevan, Alexander Ku, Yinfei Yang, Burcu Karagol Ayan, Ben Hutchinson, Wei Han, Zarana Parekh, Xin Li, Han Zhang, Jason Baldridge, and Yonghui Wu. 2022. Scaling autoregressive models for content-rich text-to-image generation. Trans. Mach. Learn. Res., 2022.

Chuanxia Zheng, Tung-Long Vuong, Jianfei Cai, and Dinh Phung. 2022. MoVQ: Modulating quantized vectors for high-fidelity image generation. In NeurIPS.
A Additional generation examples
