Image Harmonization with Diffusion Model

Jiajie Li1 Jian Wang2 Chen Wang3 Jinjun Xiong1


1 University at Buffalo    2 Snap Inc.    3 IBM Research

Abstract

Image composition in image editing involves merging a foreground image with a background image to create a composite. Inconsistent lighting conditions between the foreground and background often result in unrealistic composites. Image harmonization addresses this challenge by adjusting illumination and color to achieve visually appealing and consistent outputs. In this paper, we present a novel approach for image harmonization by leveraging diffusion models. We conduct a comparative analysis of two conditional diffusion models, namely classifier-guided and classifier-free. Our focus is on addressing the challenge of adjusting illumination and color in foreground images to create visually appealing outputs that seamlessly blend with the background. Through this research, we establish a solid groundwork for future investigations in the realm of diffusion model-based image harmonization.

1. Introduction

Image composition, a common operation in image editing, involves merging a foreground image with a background image to create a composite. However, inconsistent lighting conditions between the foreground and background often result in unrealistic composites.

Image harmonization aims to adjust the appearance of the foreground image to achieve compatibility with the background, creating a natural and realistic composite. Inharmonious elements, such as differences in color, illumination, and texture, can violate natural laws and disrupt visual coherence. Achieving image harmonization without altering the structure or semantics of the composite image is a crucial and challenging task.

Traditional methods for image harmonization focused on color transformations that match the foreground's color statistics with the background. While efficient, these methods often fail to capture realism adequately. Recent deep learning approaches have used end-to-end image transformation to improve harmonization quality. These methods [5, 20, 21], such as encoder-decoder frameworks with U-Net [16] structures, capture semantic and low-level features for generating more realistic images. However, they are often limited to supervised learning settings.

In this work, we explore both classifier-guided [6] and classifier-free [9] conditional diffusion models for image harmonization. By conditioning on unharmonized images, our image-to-image diffusion models generate high-quality outputs with realistic and consistent colors. Our classifier-free approach begins by training Denoising Diffusion Probabilistic Models (DDPM) [8] in an end-to-end manner. We then employ Latent Diffusion Models (LDM) [14], using ControlNet [23] to fine-tune the pre-trained Stable Diffusion [14] model. The LDM harmonizes the input image in the latent space, resulting in high-fidelity outputs. To overcome the loss of detail in the results obtained from Stable Diffusion, we combine this with the classifier-guidance method to ensure appearance consistency. We further propose a method to selectively transfer color information from the generated images, making it adaptable to other tasks. To enhance composite image harmonization, we integrate the background "light" using a straightforward brightness prediction method, which ensures the consistency of appearance in the generated images.

Our contributions are as follows: (1) we develop the first image harmonization frameworks based on diffusion models, using both DDPM and LDM; (2) we analyze and address the challenges of latent diffusion models in image editing tasks, proposing universal methods to maintain appearance consistency by leveraging the classifier-guidance method; (3) we present comprehensive experiments demonstrating the effectiveness of our diffusion model-based approach, achieving significantly superior performance compared to previous methods in image harmonization.

2. Related Works

2.1. Image Harmonization

Image harmonization aims to adjust the foreground to make it consistent with the background in order to produce a realistic composite image. Recent image harmonization methods can be categorized into traditional methods and deep learning-based methods.

Traditional methods use color transformation techniques to match the color statistics between the foreground and background.
These methods include color distribution matching [2, 12, 13], color histogram transformation [22], multi-scale statistics analysis [20], and color clustering [10]. The main difference among these methods is how the image is represented (e.g., as a color distribution or a histogram). They are fast and simple, but they often fail to handle complex scenarios and produce artifacts, since the realism of an image is usually not well reflected in such statistics.

Deep learning methods were adopted to mitigate the issue that hand-crafted features such as color statistics, used in earlier machine learning methods, cannot reflect the realism of an image well. [24] first proposed training a CNN classifier to predict the realism of the composite image and using a gradient-based method to optimize that realism. Recent works use end-to-end networks that output the harmonized composite image given the unharmonized one. [21] was the first to use a CNN for the end-to-end transformation, adopting an encoder-decoder structure to capture context and semantic information. [5] enhanced the network with a spatial-separated attention module that learns the features of specific regions in spatial space. [7] adopted Vision Transformers, leveraging their ability to model long-range context dependencies. Different from all existing methods, we devote ourselves to solving image harmonization with diffusion models.

2.2. Diffusion Models

Diffusion models are a family of generative models that can produce realistic images from random noise, and they have been shown to achieve state-of-the-art performance in image synthesis tasks [6]. Ho et al. [8] proposed denoising diffusion probabilistic models (DDPMs) based on a Markovian diffusion process that gradually adds noise to an image until it becomes pure noise; a deep neural network is then trained as a noise estimator that reverses the process to generate a new image from the noise. Song et al. [19] proposed a non-Markovian diffusion process to train and sample from DDIMs, which is faster and more flexible than the Markovian process used in DDPMs. Owing to their powerful ability to generate realistic images, diffusion models have been applied to a wide range of image synthesis tasks. [14] applied latent diffusion models (LDMs) to text-to-image generation; these are diffusion models operating in the latent space of pre-trained autoencoders. Palette [17] builds multi-task image-to-image diffusion models for end-to-end colorization, inpainting, uncropping, and JPEG restoration. SR3 [18] presents a method for image super-resolution via repeated refinement. RePaint [11] uses a pre-trained unconditional DDPM as the generative prior and conditions the generation process by sampling the unmasked regions using the given image information. Blended Latent Diffusion (BLD) [1] uses LDMs for image editing; it allows users to control the generation process by specifying semantic attributes and blending them with the latent codes of existing images. Similar to [11, 14], our diffusion process operates in the latent space, and image harmonization is treated much like image-to-image translation.

3. Proposed Method

In this section, we first discuss the proposed mechanism for enforcing appearance consistency throughout the diffusion process of image harmonization. We then introduce a one-time color transfer method that does not change the diffusion process and ensures appearance consistency after the diffusion process.

3.1. Appearance Consistency Discriminator

The appearance consistency discriminator evaluates how similar two colored images are in terms of appearance. By converting a colored image to grayscale, the value of each pixel gives the brightness information, which can be seen as a representation of the appearance. For simplicity, we derive the grayscale image by averaging the RGB channels of the colored image:

Y = C(X) = (X_R + X_G + X_B) / 3,

where Y is the illuminance of the grayscale image and C is the function that converts images to grayscale.

Our non-parametric appearance consistency discriminator is defined as

D(X_1, X_2) = (C(X_1) − C(X_2))².

The discriminator captures only the difference in appearance between two images, which can guide the diffusion process to ensure appearance consistency while leaving room for the noise estimator to adjust the color.
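The discriminator is straightforward to implement. A minimal PyTorch-style sketch of C(X) and D(X_1, X_2) is shown below; the (B, 3, H, W) tensor layout and the per-pixel (unreduced) output are illustrative choices rather than details fixed by the method.

import torch

def to_grayscale(x: torch.Tensor) -> torch.Tensor:
    # C(X): average the R, G, B channels of a (B, 3, H, W) image tensor.
    return x.mean(dim=1, keepdim=True)

def appearance_discriminator(x1: torch.Tensor, x2: torch.Tensor) -> torch.Tensor:
    # D(X1, X2): per-pixel squared difference between the grayscale images.
    return (to_grayscale(x1) - to_grayscale(x2)) ** 2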
3.2. Classifier-guided LDM

We derive the classifier-guided LDM from the classifier-guided DDIM algorithm [6]. Different from the original algorithm, the gradients from the appearance consistency discriminator are back-propagated through the decoder to the latent code.

One challenge of using classifier guidance in an LDM is that the classifier must also work on noisy inputs. Since our discriminator in Section 3.1 is non-parametric, we cannot further train it on noisy data. We therefore propose a method to enhance the capability of the discriminator on noisy inputs. Instead of calculating the guidance gradients between the reconstructed noisy image x_t and a single noisy guidance image y_t, we add noise to the guidance image multiple times to obtain multiple noisy guidance images y_t^1, y_t^2, ..., y_t^n, and then calculate the gradients using all of them:

G(x_t, y) = Σ_{i=1}^{n} D(x_t, y_t^i),    y_t^i ∼ N(√(ᾱ_t)·x_0, (1 − ᾱ_t)·I).

This prevents the reconstructed image from being guided by the random noise itself; the guidance instead focuses only on the useful information in the noisy guidance image, which remains the same across all possible noisy guidance images at the same timestep.
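As a rough illustration, the guidance term G used in Algorithm 1 below can be computed as in the following PyTorch-style sketch, reusing appearance_discriminator from the previous sketch. The number of noisy copies n, the use of the clean guidance image y0, and the sum reduction are assumptions consistent with the formula above rather than reported implementation details.

import torch

def guidance_gradient(x_t, y0, alpha_bar_t, n=8):
    # G(x_t, y): sum the appearance discriminator over n independently
    # noised copies of the clean guidance image y0, then differentiate
    # the sum with respect to x_t. All tensors are (B, 3, H, W).
    x_t = x_t.detach().requires_grad_(True)
    total = 0.0
    for _ in range(n):
        noise = torch.randn_like(y0)
        y_t = (alpha_bar_t ** 0.5) * y0 + ((1.0 - alpha_bar_t) ** 0.5) * noise
        total = total + appearance_discriminator(x_t, y_t).sum()
    return torch.autograd.grad(total, x_t)[0]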
Algorithm 1  Classifier-guided LDM sampling, given a conditional diffusion model ϵ_θ(h_t, c), classifier p_φ(y|x_t), encoder E(x), decoder D(h), and gradient scale s.

Input: condition image c, gradient scale s
  x_T ← sample from N(0, I)
  for all t from T to 0 do
      h_t ← E(x_t)
      for all i from 1 to n do
          y_t^i ← N(√(ᾱ_t)·x_0, (1 − ᾱ_t)·I)
      end for
      ϵ̂ ← ϵ_θ(h_t) − s·√(1 − ᾱ_t)·∇_{x_t} Σ_{i=1}^{n} D(D(h_t), y_t^i)
      h_{t−1} ← √(ᾱ_{t−1})·(h_t − √(1 − ᾱ_t)·ϵ̂) / √(ᾱ_t) + √(1 − ᾱ_{t−1})·ϵ̂
      x_{t−1} ← D(h_{t−1})
  end for
Output: x_0

3.3. Color Transfer

To transfer only the color information from one image to another, we need to separate the information contained in an image into appearance information and color information.

Our method first converts the generated image from RGB space to HSV/HSL space. Since HSV/HSL space contains a channel that represents the lightness information, which carries the appearance of the image, we replace that channel with the corresponding channel of the unharmonized composite image. We then perform a linear transformation on the replaced channel, based on the lightness channel of the background image, to ensure that the background and foreground have consistent brightness.

4. Experiments

In this section, we demonstrate the harmonization capacity of our method. We evaluate it under several experimental settings, following the most representative latent diffusion model, Stable Diffusion [15], fine-tuned within the ControlNet [23] framework.

Datasets  We evaluate our method on the iHarmony4 dataset, a comprehensive collection of synthesized composite images specifically designed for image harmonization research. It comprises four sub-datasets: HCOCO, HAdobe5k, HFlickr, and Hday2night, each presenting a distinct set of challenges and characteristics. The training set comprises a total of 65,742 samples, while the test set contains 7,404 samples. The synthesized composite images are generated with color transfer methods that transfer color information from reference images to real images; four representative methods, selected from different categories based on parametric/non-parametric and correlated/decorrelated color spaces, were used.

Models  For DDPM, we use the same U-Net model as in [6] as the noise predictor. For LDM, we follow the network architecture of ControlNet [23]: we add four stable diffusion encoder blocks and a middle block to the original network with skip connections, where each block has the same shape as in the original model. During training, the weights of all layers from the original stable diffusion model are frozen, and we train only the additional encoder and middle blocks. We use the Adam optimizer with a learning rate of 1e-5 and a batch size of 4. We use Mean-Squared Error (MSE) and PSNR scores on the RGB channels as the evaluation metrics and report their averages over the test set. We resize the input images to 256 × 256 during both training and testing; MSE and PSNR are also calculated on 256 × 256 images.

5. Results

5.1. Comparing with existing methods

Table 1. Quantitative comparison on iHarmony4.

Dataset      Metric   Composite   DIH      S2AM     DoveNet   Ours
HCOCO        PSNR     33.70       33.59    35.09    35.83     34.33
             MSE      70.39       56.17    35.65    34.26     59.55
HAdobe5k     PSNR     28.31       32.36    34.23    35.13     33.18
             MSE      345.54      94.89    53.93    56.86     161.36
HFlickr      PSNR     28.43       29.08    30.53    30.75     29.21
             MSE      264.35      168.35   123.36   125.85    224.05
Hday2night   PSNR     34.36       33.59    34.48    34.87     34.08
             MSE      109.65      86.25    54.39    57.17     122.41
All          PSNR     31.78       32.73    34.32    35.04     32.70
             MSE      172.47      80.55    51.13    51.51     141.84
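The MSE and PSNR values reported in Table 1 can be computed as in the NumPy sketch below; the exact resizing and interpolation settings used to produce the 256 × 256 inputs are left open here and should be treated as implementation choices.

import numpy as np

def mse_psnr(pred_rgb: np.ndarray, target_rgb: np.ndarray):
    # MSE and PSNR over the RGB channels of 8-bit images (e.g., 256 x 256 x 3).
    pred = pred_rgb.astype(np.float64)
    target = target_rgb.astype(np.float64)
    mse = float(np.mean((pred - target) ** 2))
    psnr = 10.0 * np.log10((255.0 ** 2) / max(mse, 1e-12))
    return mse, psnr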
In Table 1, we report the MSE and PSNR scores of the harmonized images generated with the fine-tuned stable diffusion model, guided by the appearance consistency discriminator of Section 3.2 and post-processed with the color transfer method of Section 3.3. We compare the results with existing SOTA methods such as S2AM [5] and DoveNet [4]. Visual results are presented in Figure 1.

Figure 1. Image harmonization results on the HCOCO dataset. From left to right: mask, ground-truth image, composite image, and harmonized image (our method).
Unlike traditional deep learning approaches, the optimization objective of the diffusion model is not to directly minimize the difference between the generated image and the target image; instead, it focuses on predicting the distribution of noise. As a result, the diffusion model may not outperform traditional methods on conventional evaluation metrics. However, it still demonstrates the ability to generate high-quality harmonization results.

Furthermore, the iHarmony4 dataset has a particular characteristic: composite images are generated by altering the colors of a portion of real images to create the foreground. While this provides a favorable learning target for the diffusion model, it also introduces a discrepancy between the dataset's input conditions and real-world usage scenarios, where foreground images typically originate from separate sources. To address this limitation and assess the performance of the models in realistic scenarios, we conducted additional experiments using synthesized composite images generated from the Open Images Dataset V6 and the Flickr dataset. We compare our results with the current SOTA image harmonization model CDTNet [3]. The results of these experiments are presented in Figure 2.

Figure 2. Image harmonization results on real composite images. From left to right: mask, composite image, our result, and CDTNet's result [3]. Our results outperform the SOTA CDTNet in matching lightness, hat rendering, and object integration. We excel in accurately matching the lightness of persons and animals in the first four rows. Additionally, the last two images demonstrate improved object integration, with better alignment of the bottle and the woman to the background.
Figure 3. Effectiveness of color transfer. From left to right: mask, ground-truth image, composite image, and harmonized image. The top row shows results with color transfer, while the bottom row shows results without it. The foreground in these images is the glove; without color transfer, the characters on the glove are hard to read, whereas applying the color transfer method makes them clear again.
5.2. Effectiveness of color transfer

When applying LDMs to image editing tasks, one issue is that the reconstructed image suffers from the reconstruction loss introduced by the autoencoder. We show in Figure 3 that our proposed method in Section 3.3 can guide the stable diffusion model to maintain consistency in appearance while changing only the color of the composite image.
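A minimal sketch of the Section 3.3 color transfer post-process evaluated here is given below (NumPy/OpenCV). The HSV value channel is used as the lightness channel, and the linear transformation is realized as mean/standard-deviation matching of the foreground against the background; since the exact form of the linear step is not spelled out above, these choices are assumptions for illustration.

import cv2
import numpy as np

def transfer_color(generated_rgb, composite_rgb, fg_mask):
    # Keep the generated image's hue/saturation, restore the composite's
    # lightness (V channel), then linearly rescale the foreground V so its
    # brightness statistics match the background (assumed linear transform).
    # Inputs: uint8 RGB arrays (H, W, 3); fg_mask: boolean (H, W) foreground mask.
    gen_hsv = cv2.cvtColor(generated_rgb, cv2.COLOR_RGB2HSV).astype(np.float32)
    comp_hsv = cv2.cvtColor(composite_rgb, cv2.COLOR_RGB2HSV).astype(np.float32)

    v = comp_hsv[..., 2].copy()          # lightness taken from the composite
    bg_v = v[~fg_mask]
    fg_v = v[fg_mask]
    scale = (bg_v.std() + 1e-6) / (fg_v.std() + 1e-6)
    v[fg_mask] = (fg_v - fg_v.mean()) * scale + bg_v.mean()

    out_hsv = gen_hsv
    out_hsv[..., 2] = np.clip(v, 0.0, 255.0)
    return cv2.cvtColor(out_hsv.astype(np.uint8), cv2.COLOR_HSV2RGB)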
5.3. Multi-Output

Our diffusion model has an advantage over traditional end-to-end methods in that it exhibits inherent stochasticity. This enables us to generate multiple different results for the same input, providing users with a range of options to choose from. In Section 5.1, we showcase several practical examples that utilize multiple output results from our model. By leveraging the stochastic nature of the diffusion model, we can offer users increased flexibility and control over the harmonization process. This capability allows for personalized and subjective adjustments, empowering users to select the output that best aligns with their preferences or specific requirements. The examples presented in Figure 4 highlight the diverse range of harmonized results that can be achieved with our model, demonstrating its effectiveness and potential for creative applications.

6. Conclusion

In this paper, we have proposed a novel method for image harmonization based on diffusion models. Our method can effectively adjust the foreground image to match the background image in terms of illumination and color, resulting in realistic and harmonious composite images. We have conducted extensive experiments and ablation studies on synthesized image harmonization datasets and compared our method with existing methods. The results show that our method achieves state-of-the-art performance and outperforms the baselines by a large margin. Our method can be applied to various image editing tasks that require consistent lighting conditions. In the future, we plan to extend our method to handle real-world image harmonization scenarios, where the foreground and background images may have complex and diverse lighting conditions.
Figure 4. Examples of multiple output results from the diffusion model. We provide five results from our method for each composite image and compare them with CDTNet's result [3].
References

[1] Omri Avrahami, Ohad Fried, and Dani Lischinski. Blended latent diffusion. arXiv preprint arXiv:2206.02779, 2022.
[2] Daniel Cohen-Or, Olga Sorkine, Ran Gal, Tommer Leyvand, and Ying-Qing Xu. Color harmonization. In ACM SIGGRAPH 2006 Papers, pages 624–630, 2006.
[3] Wenyan Cong, Xinhao Tao, Li Niu, Jing Liang, Xuesong Gao, Qihao Sun, and Liqing Zhang. High-resolution image harmonization via collaborative dual transformations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18470–18479, 2022.
[4] Wenyan Cong, Jianfu Zhang, Li Niu, Liu Liu, Zhixin Ling, Weiyuan Li, and Liqing Zhang. DoveNet: Deep image harmonization via domain verification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8394–8403, 2020.
[5] Xiaodong Cun and Chi-Man Pun. Improving the harmony of the composite image by spatial-separated attention module. IEEE Transactions on Image Processing, 29:4759–4771, 2020.
[6] Prafulla Dhariwal and Alexander Nichol. Diffusion models beat GANs on image synthesis. Advances in Neural Information Processing Systems, 34:8780–8794, 2021.
[7] Zonghui Guo, Dongsheng Guo, Haiyong Zheng, Zhaorui Gu, Bing Zheng, and Junyu Dong. Image harmonization with transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 14870–14879, 2021.
[8] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020.
[9] Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598, 2022.
[10] Jean-Francois Lalonde and Alexei A. Efros. Using color compatibility for assessing image realism. In 2007 IEEE 11th International Conference on Computer Vision, pages 1–8. IEEE, 2007.
[11] Andreas Lugmayr, Martin Danelljan, Andres Romero, Fisher Yu, Radu Timofte, and Luc Van Gool. RePaint: Inpainting using denoising diffusion probabilistic models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11461–11471, 2022.
[12] Francois Pitie, Anil C. Kokaram, and Rozenn Dahyot. N-dimensional probability density function transfer and its application to color transfer. In Tenth IEEE International Conference on Computer Vision (ICCV'05), volume 2, pages 1434–1439. IEEE, 2005.
[13] Erik Reinhard, Michael Adhikhmin, Bruce Gooch, and Peter Shirley. Color transfer between images. IEEE Computer Graphics and Applications, 21(5):34–41, 2001.
[14] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022.
[15] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10684–10695, June 2022.
[16] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015, pages 234–241. Springer, 2015.
[17] Chitwan Saharia, William Chan, Huiwen Chang, Chris Lee, Jonathan Ho, Tim Salimans, David Fleet, and Mohammad Norouzi. Palette: Image-to-image diffusion models. In ACM SIGGRAPH 2022 Conference Proceedings, pages 1–10, 2022.
[18] Chitwan Saharia, Jonathan Ho, William Chan, Tim Salimans, David J. Fleet, and Mohammad Norouzi. Image super-resolution via iterative refinement. arXiv preprint arXiv:2104.07636, 2021.
[19] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020.
[20] Kalyan Sunkavalli, Micah K. Johnson, Wojciech Matusik, and Hanspeter Pfister. Multi-scale image harmonization. ACM Transactions on Graphics (Proc. ACM SIGGRAPH), 29(4), 2010.
[21] Yi-Hsuan Tsai, Xiaohui Shen, Zhe Lin, Kalyan Sunkavalli, Xin Lu, and Ming-Hsuan Yang. Deep image harmonization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3789–3797, 2017.
[22] Su Xue, Aseem Agarwala, Julie Dorsey, and Holly Rushmeier. Understanding and improving the realism of image composites. ACM Transactions on Graphics (TOG), 31(4):1–10, 2012.
[23] Lvmin Zhang and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models, 2023.
[24] Jun-Yan Zhu, Philipp Krahenbuhl, Eli Shechtman, and Alexei A. Efros. Learning a discriminative model for the perception of realism in composite images. In Proceedings of the IEEE International Conference on Computer Vision, pages 3943–3951, 2015.
