Deep Reinforcement Learning For Image-to-Image Translation
Abstract—Most existing Image-to-Image Translation (I2IT) methods generate images in a single run of a deep learning (DL) model. However, designing such a single-step model often requires many parameters and suffers from overfitting. Inspired by the analogy between diffusion models and reinforcement learning, we reformulate I2IT as an iterative decision-making problem via deep reinforcement learning (DRL) and propose a computationally efficient RL-based I2IT (RL-I2IT) framework. The key feature of the RL-I2IT framework is to decompose a monolithic learning process into small steps with a lightweight model that progressively transforms the source image into the target image. Considering the challenge of handling high-dimensional continuous state and action spaces in the conventional RL framework, we introduce a meta policy with a new concept, the "plan", to the standard actor-critic model. The plan is of lower dimension than the original image, which makes it easier for the actor to generate a tractable high-dimensional action. In the RL-I2IT framework, we also employ a task-specific auxiliary learning strategy to stabilize the training process and improve the performance of the corresponding task. Experiments on several I2IT tasks demonstrate the effectiveness and robustness of the proposed method when facing high-dimensional continuous action space problems. Our implementation of the RL-I2IT framework is available at https://github.com/Algolzw/SPAC-Deformable-Registration.
Index Terms—Image-to-Image Translation, Deep Reinforcement Learning, Meta Policy, Auxiliary Learning.
1 INTRODUCTION
action spaces are usually discrete, making them unsuitable for I2IT, which requires continuous action spaces. A promising direction for learning continuous actions is maximum entropy reinforcement learning (MERL), which improves both exploration and robustness by maximizing a standard RL objective augmented with an entropy term [12]. Soft actor-critic (SAC) [12] is an instance of MERL and has been applied to solve continuous action tasks [13]. However, the main issue hindering the applicability of SAC to I2IT is its inability to handle high-dimensional states and actions effectively. Recently, RAE [14] tried to address this problem by combining SAC with a regularized autoencoder, but it only provides an auxiliary loss for end-to-end RL training and cannot handle I2IT tasks. Besides, high-dimensional states and actions require an I2IT-based RL model to perform much more exploration and exploitation, which leads to unstable training [14]. One solution to stabilize training is to extract a lower-dimensional visual representation with a separately pre-trained DNN model and learn the value function and the corresponding policy in the latent space [15]. However, this approach cannot be trained from scratch; otherwise, it can lead to state representations that are inconsistent with an optimal policy.

Inspired by the analogy between diffusion models and RL [16], we propose a new DRL framework, named RL-I2IT, for I2IT problems to handle high-dimensional continuous state and action spaces. As shown in Fig. 2, the RL-I2IT framework comprises three core deep neural networks: a planner, an actor, and a critic. We introduce a new concept, the "plan", to decompose the decision-making process into two steps, state → plan and plan → action. We call this process a meta policy. The plan is a subspace of appropriate actions based on the current state. It is not applied to the state directly. Instead, it is used to guide the actor to generate a tractable high-dimensional action that interacts with the environment. The plan can be considered an intermediate transition between state and action. As the input of the actor, the plan has a much lower dimension than the state, making it easier for the actor to learn to predict actions. Meanwhile, the plan can be evaluated by the critic efficiently, since the Q function is easier to learn in the low-dimensional latent space. Furthermore, compared with training a one-step differentiable DL-based model, such a high-dimensional continuous control problem is much harder to learn with traditional RL frameworks. To address this, we also employ a task-specific auxiliary learning strategy to stabilize the training process and improve the performance of the corresponding task. The auxiliary learning component is flexible and can readily leverage any advanced losses or objectives; for example, we use the standard L2 reconstruction loss as the auxiliary objective in many I2IT tasks. Our main contributions can be summarized as follows:

• A new DRL framework, RL-I2IT, is proposed to handle the complex I2IT problem with high-dimensional continuous actions by decomposing the monolithic learning process into small steps.

• To tackle the high-dimensional continuous action learning problem, we propose a stochastic meta policy that divides the decision-making process into two steps: state → low-dimensional plan and plan → action. The plan guides the actor to predict a tractable action, and the critic evaluates the plan. This approach makes the whole learning process feasible and computationally efficient.

• Compared to existing DL-based models, our DRL-based model is lightweight, making it simple and computationally efficient. For example, compared to the recent one-step I2IT model pix2pixHD with 45.9M parameters [17], our model has only 9.7M parameters. The training speed of RL-I2IT is estimated to be at least one order of magnitude faster than that of Palette [5].

• Our RL-I2IT framework is flexible in incorporating many advanced auxiliary learning methods for various complex I2IT applications. Experimental results on a variety of applications, from face inpainting and neural style transfer to digits transform and deformable image registration, show that our approach achieves state-of-the-art performance.

This paper substantially extends our previous conference papers [18], [19], and [20] in the following aspects: (i) We propose an efficient, general RL-based framework for the I2IT problem; in this regard, our previous works [18], [19], and [20] can be considered special cases of the general framework in this paper. (ii) We provide more technical details for each application of the RL-I2IT framework, such as the detailed network architectures. (iii) We provide additional diagnostic experiments for each application to demonstrate the effectiveness of our RL-I2IT framework in computer vision and medical image applications. In the neural style transfer task, we add experiments to evaluate the necessity of the high-dimensional latent space, together with a user study. In the medical image registration task, we add more experimental analysis of hyper-parameters, the trade-off between performance and inference time, etc.

The remainder of this paper is organized as follows. After introducing the background in Section 2, we describe the RL-I2IT framework for step-wise I2IT in Section 3. In Sections 4 and 5, we demonstrate experimentally the effectiveness and robustness of the RL-I2IT framework on computer vision applications (digits transform 4.1, face inpainting 4.2, realistic image translation 4.3, and neural style transfer 4.4) and medical image applications (deformable medical image registration 5.1), respectively. We conclude the paper in Section 6 with a discussion of the limitations of the framework and future work.

2 BACKGROUND

2.1 Image-to-Image Translation

Image-to-image translation (I2IT) aims to translate input images from a source domain to a target domain, such as generating realistic photos from semantic segmentation labels [4], synthesizing complete visual targets from images with missing regions [21], deformable image registration [22], and neural style transfer [23]. Most research works leverage an autoencoder to learn this process by minimizing the reconstruction error between the predicted image and the target. In addition, generative adversarial networks (GANs) are also vigorously studied in I2IT to synthesize realistic images [4]. Subsequent works enhance I2IT performance by using a coarse-to-fine deep learning framework [24] that recursively sets the output of the previous stage as the input of the next stage. In this way, the I2IT task is transformed into a multi-stage, coarse-to-fine solution. Although the recursion can in principle be applied indefinitely, it is limited by the increasing model size and training instability. More I2IT-related works can be found in a recent survey paper [1].

More recently, diffusion models have found successful applications in many vision tasks, including I2IT [5], [6], [25].
Fig. 2: Our RL-I2IT framework with a planner-actor-critic structure. Left: At time step t, the environment receives the executable action a_t and outputs the state and reward (s_t, r_t). In our meta policy, a latent plan p_t is sampled from the planner to guide the actor to generate the executable action a_t that interacts with the environment. The plan is also evaluated by the critic. The nature of a_t is task-dependent: for tasks like deformable image registration, a_t may be a deformation field applied to the current state, while for tasks aiming at realistic image generation, such as face inpainting or neural style transfer, a_t can directly be the target image. Right: Task-specific auxiliary learning objectives depend on the specific task and serve various purposes, such as stabilizing the training process or improving performance.
In Palette [5], diffusion models (DM) outperform strong GAN and regression baselines on four I2IT tasks without task-specific hyper-parameter tuning, architecture customization, or any auxiliary loss. This work has inspired several DM-based approaches to I2IT, such as the Brownian Bridge Diffusion Model (BBDM) [25] and score-decomposed diffusion models (SSDM) [6]. Inspired by the success of vision-language models, text-driven I2IT based on plug-and-play diffusion features [7] has shown high fidelity to the input structure and scene layout while significantly changing the perceived semantic meaning of objects and their appearance.

More recently, the stochastic latent actor-critic (SLAC) [26] improves on SAC by learning representation spaces with a latent variable model, which is more stable and efficient for complex continuous control tasks and improves both the exploration and robustness of the learned model. However, the capability of SLAC is limited in a continuous action space: its latent state representation is only used to facilitate the training of the critic, so it cannot handle tasks with a high-dimensional action space.
We decompose the decision-making process into two distinct steps: first state → plan, and then plan → action. We term this new two-step mapping process a "meta policy", which allows for more intricate and layered decision-making compared to standard reinforcement learning models. The new MDP for our RL-I2IT can be represented by the tuple (S, P, A, U, r, γ), where S is a set of states, P is a continuous plan space, A is a continuous action space, and U : S × P × S × A → [0, ∞) represents the state transition probability density of the next state s_{t+1} given state s_t ∈ S, plan p_t ∈ P, and action a_t ∈ A.

Our RL-I2IT framework is shown in Fig. 2. It comprises three core deep neural networks: the planner, the actor, and the critic, with parameters ψ, ϕ, and θ, respectively. The planner aims to generate a high-level plan in a low-dimensional latent space to guide the actor. In some sense, the plan can be considered as action clusters or action templates, i.e., high-level crude actions. Unlike classic policy models, the input of the actor is a stochastic plan instead of the state. That is, the generated plan is forwarded to the actor to create the high-dimensional action in our meta-policy model. Meanwhile, this plan is evaluated by the critic. By using the meta policy and the stochastic planner-actor-critic structure, RL-I2IT makes the learning process of a complex I2IT task easier.

Formally, suppose a meta policy is defined as (κ, π). The stochastic plan is modeled as a subspace of the deformation field that gives a low-dimensional vector p_t based on the state s_t, while the action a_t is determined by the plan p_t. Given a parameterized planner κ_ψ and actor π_ϕ, the stochastic plan is sampled as a representation, p_t ∼ κ_ψ(p_t|s_t), and the action is generated by decoding the plan vector p_t into a high-dimensional executable action, a_t = π_ϕ(a_t|p_t). In practice, we reparameterize the planner and the stochastic plan jointly using a neural network approximation p_t = f_ψ(ϵ_t, s_t), known as the reparameterization trick [2], where ϵ_t is an input noise vector sampled from a fixed Gaussian distribution. Moreover, we maximize the entropy of the plan to improve exploration and robustness. The augmented objective function is formulated as follows:

    max_{ψ,ϕ} Σ_{t=1}^{T} E_{(s_t, p_t, a_t) ∼ ρ_{(κ,π)}} [ r_t(s_t, p_t, a_t) + α H(κ_ψ(·|s_t)) ],    (1)

where α is the temperature and ρ_{(κ,π)} is a trajectory distribution under κ_ψ(p_t|s_t) and π_ϕ(a_t|p_t).

3.3 Learning Planner and Critic

Unlike conventional RL algorithms, the critic Q_θ in our framework evaluates the plan p_t instead of the action a_t, since learning a low-dimensional plan in an I2IT problem is easier and more effective. Specifically, the low-dimensional plan is concatenated with the critic's downsampled feature vector, and the critic outputs the soft Q function Q_θ(s_t, p_t), an estimate of the current state-plan value, as shown in Fig. 2.

When the critic is used to evaluate the planner, rewards and soft Q values are used to iteratively guide the stochastic meta-policy improvement. In the evaluation step, following SAC [12], RL-I2IT learns κ_ψ (planner) and fits the parametric Q-function Q_θ(s_t, p_t) (critic) using transitions sampled from the replay pool D by minimizing the soft Bellman residual:

    J_Q(θ) = E_{(s_t,p_t)∼D} [ (1/2) ( Q_θ(s_t, p_t) − ( r_t + γ E_{s_{t+1}} [ V_θ̄(s_{t+1}) ] ) )² ],

where V_θ̄(s_t) = E_{p_t∼κ_ψ} [ Q_θ̄(s_t, p_t) − α log κ_ψ(p_t|s_t) ]. We use a target network Q_θ̄ to stabilize training, whose parameters θ̄ are obtained as an exponential moving average of the critic network's parameters [28], θ̄ ← τθ + (1 − τ)θ̄, with hyper-parameter τ ∈ [0, 1]. To optimize J_Q(θ), we perform stochastic gradient descent with respect to the parameters θ:

    θ = θ − η_Q ∇_θ Q_θ(s_t, p_t) ( Q_θ(s_t, p_t) − r_t − γ [ Q_θ̄(s_{t+1}, p_{t+1}) − α log κ_ψ(p_{t+1}|s_{t+1}) ] ).    (2)

Since the critic works on the planner, the optimization procedure will also influence the planner's decisions. Following [12], we can use the following objective to minimize the KL divergence between the policy and a Boltzmann distribution induced by the Q-function:

    J_κ(ψ) = E_{s_t∼D} E_{p_t∼κ_ψ} [ α log κ_ψ(p_t|s_t) − Q_θ(s_t, p_t) ]
           = E_{s_t∼D, ϵ_t∼N(µ,σ)} [ α log κ_ψ(f_ψ(ϵ_t, s_t)|s_t) − Q_θ(s_t, f_ψ(ϵ_t, s_t)) ].

The last equality holds because p_t can be evaluated by f_ψ(ϵ_t, s_t), as discussed above. Note that the hyperparameter α can be adjusted automatically using the method proposed in [12]. We then apply the stochastic gradient method to optimize the parameters:

    ψ = ψ − η_ψ ( ∇_ψ α log κ_ψ(p_t|s_t) + ( ∇_{p_t} α log κ_ψ(p_t|s_t) − ∇_{p_t} Q_θ(s_t, p_t) ) ∇_ψ f_ψ(ϵ_t, s_t) ).    (3)

The derivation for the case where the critic evaluates the actor can be found in Appendix A. Moreover, our experimental results (reported in Table 10) show that having the critic evaluate the actor's action yields inferior performance compared to having the critic evaluate the planner.

3.4 Task Specific Auxiliary Learning

Following our meta policy (κ_ψ, π_ϕ), the framework derives the executable action a_t. To enhance convergence and performance, we adopt auxiliary learning for the planner and actor, tailored to specific tasks. This approach is highly adaptable and capable of integrating various advanced losses and techniques.

For instance, in face inpainting tasks, we focus on reconstructing the predicted faces to match the original ones while also synthesizing more realistic images. This is achieved by employing a discriminator on predicted images with an adversarial loss. The nature of a_t is also task-dependent: for tasks like deformable image registration, a_t may be a deformation field applied to the current state. In contrast, for tasks aiming at realistic image generation, such as face inpainting or neural style transfer, a_t can directly be the target image y. Detailed explanations and examples of these applications are provided in the experimental sections. We elaborate on the auxiliary learning process using face inpainting as an example.
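To make these updates concrete, the following is a minimal PyTorch-style sketch of one planner-critic training step implementing the soft Bellman residual and the planner objective (Eqs. (2) and (3)). The Planner/Actor/Critic modules and their rsample interface are hypothetical placeholders, not the released implementation.

```python
# A minimal sketch of one RL-I2IT planner-critic update, assuming hypothetical
# planner/actor/critic modules with the interfaces shown below.
import torch
import torch.nn.functional as F

def update_step(planner, actor, critic, target_critic,
                opt_planner, opt_critic, batch,
                alpha=0.2, gamma=0.99, tau=0.005):
    s, r, s_next = batch["state"], batch["reward"], batch["next_state"]

    # Reparameterized plan p_t = f_psi(eps_t, s_t) and log kappa_psi(p_t|s_t).
    p, log_p = planner.rsample(s)

    # Critic: soft Bellman residual J_Q(theta).
    with torch.no_grad():
        p_next, log_p_next = planner.rsample(s_next)
        v_next = target_critic(s_next, p_next) - alpha * log_p_next
        q_target = r + gamma * v_next
    critic_loss = 0.5 * F.mse_loss(critic(s, p.detach()), q_target)
    opt_critic.zero_grad(); critic_loss.backward(); opt_critic.step()

    # Planner: KL to the Boltzmann distribution induced by Q, J_kappa(psi).
    planner_loss = (alpha * log_p - critic(s, p)).mean()
    opt_planner.zero_grad(); planner_loss.backward(); opt_planner.step()

    # Target network EMA: theta_bar <- tau*theta + (1 - tau)*theta_bar.
    with torch.no_grad():
        for tp, cp in zip(target_critic.parameters(), critic.parameters()):
            tp.lerp_(cp, tau)

    # The actor decodes the plan into the executable action a_t; it is
    # trained with the task-specific auxiliary loss of Section 3.4.
    return actor(p.detach())
```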
Fig. 4: Top Left: examples of using RL-I2IT to gradually transform the source digits (leftmost) to the target digits (rightmost). Bottom
Left: Given a random set of digits, continuously transform the leftmost digit to the rightmost digit sequentially. Right: comparison
between our RL-I2IT and VoxelMorph (VM) [33].
which is a state-of-the-art DL-based method for transformation tasks. It can be seen clearly that our method is better at recovering the details and shapes of the target digit than VM.

Fig. 5 shows the average Dice scores on the random transform and some transformed results. In the random transform experiment, the moving images are randomly scaled and rotated, which results in a much larger and more complex deformation field. Our method significantly outperforms VM over all digits, both quantitatively and qualitatively, which indicates that the proposed method has better generalizability and can work well on images with large deformations.

Method | PSNR ↑ | SSIM ↑ | LPIPS ↓ | FID ↓
CE [21] | 25.764 | 0.850 | 0.0955 | 14.454
CA [24] | 24.556 | 0.840 | 0.0715 | 9.950
PIC [37] | 26.703 | 0.870 | 0.0844 | 12.470
PEN [36] | 23.196 | 0.634 | 0.1342 | 35.422
RN [38] | 25.123 | 0.835 | 0.0698 | 7.388
Shift-Net [39] | 26.476 | 0.851 | 0.0703 | 7.597
ILO [40] | 22.709 | 0.783 | 0.0958 | 13.122
Palette [5] | 24.926 | 0.850 | 0.0567 | 4.909
Ours (PSNR) | 27.351 | 0.897 | 0.0439 | 4.697
Ours (SSIM) | 27.598 | 0.899 | 0.0433 | 4.917

TABLE 1: Quantitative results of all methods on Celeba-HQ. We use SNGAN + PSNR and SNGAN + SSIM as rewards, respectively.
4.2 Face Inpainting

In this section, we apply our RL-I2IT framework to the face inpainting task, which aims to fill a cropped region in the central area of a face with synthesized content that is both semantically consistent with the original face and visually realistic.

4.2.1 RL-I2IT Setting

For the state, we use the original image with a missing (center-cropped) region as the initial state, and the next state is obtained by adding the newly predicted image to the missing region. We use the peak signal-to-noise ratio (PSNR) as the reward. For auxiliary learning, we apply the L1 loss together with an adversarial loss, which encourages the predicted image to be more realistic and closer to the ground-truth image. The λ_rec and λ_adv in Eq. 6 are set to 1.0 and 0.02, respectively.

The network architecture for face inpainting is shown in Fig. 6. For the planner-actor, we use an architecture similar to the context-encoder [21], except for the skip connections and the stochastic sampling operation in the planner. We use the same network structure for all types of discriminators, with minor changes for the different GANs. Specifically, for WGAN-GP, the sigmoid function is removed from the final output layer, and spectral normalization is added to each layer of the SNGAN discriminator [35]. Moreover, the convolution layers of the planner, critic, and discriminator use 4 × 4 kernels, and downsampling is performed by convolution with a stride of 2. In this application, the latent action dimension is set to 256.

4.2.2 Experiment

We use the Celeba-HQ dataset in this task, which includes 28,000 images for training and 2,000 images for testing. All images have a cropped region of 64 × 64 pixels in the center. We compare our method with several recent face inpainting methods, including CE [21], CA [24], PEN [36], PIC [37], RN [38], Shift-Net [39], ILO [40], and Palette [5]. Following previous work [21], [24], [37], we use PSNR and SSIM as the evaluation metrics.

Results and Analysis. The qualitative results produced by our framework and existing state-of-the-art methods are shown in Fig. 7. RL-I2IT gives a clear visual improvement in synthesizing realistic faces: the results are very reasonable, and the generated faces are sharper and more natural. This may be attributed to the high-level latent plan p_t, which focuses on learning the global semantic structure and then directs the actor, together with auxiliary learning, to further improve the local details of the generated image. We can also see that the images synthesized by RL-I2IT can have very different appearances from the ground truth, which indicates that, although our training is based on paired images, RL-I2IT can successfully explore and exploit the data to produce diverse results.

The quantitative comparison is shown in Table 1. Our method achieves the best PSNR and SSIM scores compared with the existing state-of-the-art methods. As mentioned before, the reward function in our RL framework is very flexible: both PSNR- and SSIM-based rewards are suitable for face inpainting with the RL-I2IT framework.
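As a concrete illustration of this setting, here is a hedged sketch of the PSNR reward and the weighted L1-plus-adversarial auxiliary objective with the λ values stated above; the discriminator `disc` is a placeholder, and the adversarial term shown is one common non-saturating choice rather than necessarily the paper's exact form.

```python
# A sketch of the face-inpainting reward and auxiliary objective described
# above; `disc` is a placeholder discriminator, not the paper's code.
import torch

def psnr_reward(pred, target, max_val=1.0):
    # PSNR between the current prediction and ground truth, used as r_t.
    mse = torch.mean((pred - target) ** 2)
    return 10.0 * torch.log10(max_val ** 2 / mse)

def auxiliary_loss(pred, target, disc, lambda_rec=1.0, lambda_adv=0.02):
    rec = torch.abs(pred - target).mean()   # L1 reconstruction term
    adv = -disc(pred).mean()                # adversarial term (one common choice)
    return lambda_rec * rec + lambda_adv * adv
```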
Fig. 5: Left: Box plots of the Dice scores on the 10 digits. Center: Step-wise comparison between our RL-I2IT and VoxelMorph (VM) [33]. Right: Visual comparison of our method with VM (columns: Target, Source, VM, Ours, DF). The scaled and rotated digits are transformed to the fixed (target) digits. The Deformation Field (DF) column visualizes the deformation fields estimated by RL-I2IT.

Fig. 6: The network architecture of RL-I2IT for face inpainting. Each rectangle represents a 2D image (or feature map); the number of channels is shown inside the rectangle, and the corresponding resolution is printed underneath (or on the left for the discriminator).

Method | PSNR ↑ | SSIM ↑
PA + SNGAN | 26.884 | 0.871
Ours (+ WGAN-GP) | 27.091 | 0.875
Ours (+ RaGAN) | 27.080 | 0.873
Ours (+ SNGAN) | 27.176 | 0.882

TABLE 2: Ablation study of our RL-I2IT framework on the Celeba-HQ testing dataset (all trained with the PSNR reward).

Ablation Study. To illustrate the stability of training GANs in our framework, we jointly use the L1 loss and several advanced GAN losses, i.e., WGAN-GP [41], RaGAN [42], and SNGAN [35], for auxiliary learning. We also separately train a planner-actor (PA) model by jointly optimizing the L1 and SNGAN losses. The results are shown in Table 2, which indicates that the RL-I2IT framework is stable with different GANs and significantly improves over training the planner-actor with SNGAN alone, further demonstrating the power of the RL-I2IT framework.

4.3 Realistic Photo Translation

In this section, we evaluate our RL-I2IT framework on the general realistic photo translation task.

4.3.2 Experiment

We use three realistic photo translation tasks to evaluate our framework: (1) segmentation labels→images with the CMP Facades dataset [43], (2) segmentation labels→images and images→labels with the Cityscapes dataset [44], and (3) edges→shoes with the Edges and Shoes dataset [45].

We compare our framework with the existing methods pix2pix [4] and PAN [46], and with methods designed for the high-quality I2IT task: pix2pixHD [17], DRPAN [47], and CHAN [48]. Moreover, we replace MERL with PPO [49], denoted as Ours-PPO. We use PSNR, SSIM, and LPIPS [50] as the evaluation metrics.

Results and Analysis. The quantitative results are shown in Table 3. With a similar network structure, the proposed method significantly outperforms the pix2pix and PAN models on PSNR, SSIM, and LPIPS over all the datasets and tasks. Our method even achieves comparable or better performance than the high-quality pix2pixHD and DRPAN models, which have much more complex architectures and training strategies. Moreover, using MERL instead of PPO clearly improves performance on most tasks. These experiments illustrate that the proposed RL-I2IT framework is a robust and effective solution for I2IT.

More importantly, our model is much simpler, with the same architecture as pix2pix. The number of parameters and the computational complexity are shown in Table 4. RL-I2IT has far fewer parameters and lower computational complexity; we conclude that our model is lightweight, efficient, and effective.

The qualitative results of our RL-I2IT and the other I2IT methods on different tasks are shown in Fig. 8. We observe that pix2pix and PAN sometimes suffer from mode collapse and yield blurry outputs. pix2pixHD is unstable across datasets, especially on Facades and Cityscapes. DRPAN is more likely to produce blurred artifacts in several parts of the predicted image on Cityscapes. In contrast, RL-I2IT produces more stable and realistic results.
Fig. 7: Visual comparison of different face inpainting methods. GT means ground truth. Our RL-I2IT uses SNGAN for auxiliary learning. # indicates the reward used for RL training. Our results have good visual quality even for a face with a large pose.

Fig. 8: Visual comparison of our RL-I2IT with pix2pix, PAN, pix2pixHD, and DRPAN over the photo translation tasks.

TABLE 3: Quantitative results of our RL-I2IT and other methods over all datasets. ↓ means lower is better; Ours-PPO means our RL-I2IT using PPO.

Method | pix2pixHD | DRPAN | CHAN | Ours
#Params | 45.874M | 11.378M | 59.971M | 9.730M
#FLOPs | 10.340G | 14.208G | 19.743G | 3.519G

TABLE 4: Comparison of the number of parameters and FLOPs (floating-point operations, which represent the computational complexity of the model).

Using the stochastic meta policy and MERL helps explore more possible solutions so as to seek out the best generation strategy by trial and error during training, leading to a more robust agent across different datasets and tasks.

Evaluation of RL Algorithms. To demonstrate the effectiveness of the stochastic meta policy and MERL, we substitute the key components of RL-I2IT with other structures or other state-of-the-art RL algorithms, namely DDPG and PPO, to test their importance. The learning curves of the different variants on the four tasks are shown in Fig. 9, which indicates that using the stochastic meta policy and the maximum entropy framework significantly improves the training process.

4.4 Image Style Transfer

Neural Style Transfer (NST) refers to the generation of a pastiche image combining the semantic content of one image (the content image) and the visual style of another (the style image) using a deep neural network. NST can be used to create stylized, non-photorealistic renderings of digital images with enriched expressiveness and artistic flavor.

The one-step DL approach has an apparent limitation: it is hard to determine a proper level of stylization for different users, since the ultimate metric of style transfer is subjective. It has been observed that images generated by current NST methods tend to be under- or over-stylized [51].
Fig. 10: Illustration of our step-wise style transfer process using the RL-I2IT framework. The content images are stylized more strongly, and smoothly, as the prediction steps proceed. The model tends to preserve more details and structures of the content in the early steps and synthesize more style patterns in the later steps. Our step-wise framework allows a user to control the stylization degree easily.

Fig. 11: Details of the RL-I2IT framework for NST. The state is initialized with the content image. After the first iteration, we use only the moving image as the state. The plan is sampled from a 2D Gaussian distribution and is concatenated with the critic's features. The predicted moving image is generated by the actor. Note that the VGG networks are pre-trained and fixed for feature extraction during the training process.

A remedy for under-stylization is to apply the DL model multiple times, taking the output of the previous round as the input of the current round. However, this may incur a high computation cost due to the intrinsic complexity of one-step DL models. Other existing methods, like [23] and [52], trade off content against style by adjusting hyper-parameters, but this approach is inefficient and hard to control.

Our RL-I2IT framework provides a good solution for NST. It can be used to learn a lightweight NST model that is applied iteratively. To preserve the spatial structure of images, the latent plans in our model are sampled from a 2D Gaussian distribution that is estimated by the planner and forwarded to the actor to generate intermediate images. In addition, we develop a Fully Convolutional Network (FCN) based planner-actor structure so that the model can process input images of any size. Fig. 10 shows some examples of our step-wise NST. Our RL-based step-wise method tends to preserve more details and structures of the content image in the early steps and synthesize more style patterns in the later steps, resulting in more flexible control of the stylization degree. Furthermore, our model is lightweight and flexible compared to existing NST methods, making it more computationally efficient. To the best of our knowledge, this is the first work that successfully leverages RL for the NST scenario.

4.4.1 RL-I2IT Setting

We set the moving image as the state s_t, which is initialized with the content image. The moving image at time t, i.e., the state image s_{t+1}, is created by the actor from the current state image s_t and plan p_t. The reward is obtained by measuring the difference between the current state s_t and the style image: the higher the difference, the lower the reward.
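As an illustration of such a style reward, the sketch below scores the negative Gram-matrix distance between VGG features of the current state and the style image; the chosen VGG-19 layers and the unweighted sum are assumptions, not the paper's exact design.

```python
# A hedged sketch of a Gram-matrix style reward: a larger feature-statistics
# difference between the state and the style image yields a lower reward.
import torch
import torchvision.models as models

vgg = models.vgg19(weights=models.VGG19_Weights.DEFAULT).features.eval()
for p in vgg.parameters():
    p.requires_grad_(False)

def gram(feat):  # (N, C, H, W) -> (N, C, C), normalized Gram matrix
    n, c, h, w = feat.shape
    f = feat.reshape(n, c, h * w)
    return f @ f.transpose(1, 2) / (c * h * w)

def style_reward(state, style, layers=(1, 6, 11, 20)):  # assumed ReLU indices
    diff, x, y = 0.0, state, style
    for i, layer in enumerate(vgg):
        x, y = layer(x), layer(y)
        if i in layers:
            diff = diff + torch.mean((gram(x) - gram(y)) ** 2)
        if i >= max(layers):
            break
    return -diff  # larger style difference -> lower reward
```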
Fig. 12: Qualitative comparison. The first two columns show the content and style images, respectively. The remaining columns show the stylization results generated by different style transfer methods, with our step-wise results in the rightmost two columns.
Methods | Johnson et al. | AdaIN | WCT | SANet | LapStyle | ArtFlow | IEContraAST | AdaAttN | StyTR2 | Ours (step=1) | Ours (step=10)
Content loss | 1.597 | 2.222 | 2.322 | 1.941 | 2.292 | 1.538 | 1.668 | 1.447 | 1.510 | 0.868 | 1.387
Style loss | 1.985e-05 | 1.269e-05 | 1.626e-05 | 7.062e-06 | 2.117e-05 | 1.486e-05 | 8.863e-06 | 1.033e-05 | 9.178e-06 | 3.353e-06 | 1.594e-06
Time (s) | 0.014 (3.5×) | 0.140 (35×) | 0.690 (172.5×) | 0.010 (2.5×) | 0.047 (11.75×) | 0.127 (31.75×) | 0.019 (4.75×) | 0.025 (6.25×) | 0.058 (14.5×) | 0.004 | 0.089
#Params (M) | 1.68 (9.33×) | 7.01 (38.94×) | 34.24 (190.22×) | 20.91 (116.17×) | 7.79 (43.28×) | 6.46 (35.89×) | 21.12 (117.33×) | 13.63 (75.72×) | 35.39 (196.61×) | 0.18 | 0.18

TABLE 5: Quantitative comparison of our RL-NST with the baseline methods on the MS-COCO dataset. The speed is measured on a Pascal Tesla P100 GPU. (·×) is the ratio between the baseline and our method (step=1) on the same metric. The best results are shown in bold.
Fig. 17. In general, the results of our algorithm are favored by most subjects.
Fig. 18: Illustration of our step-wise video style transfer process using the RL-I2IT framework. The first and second rows show two different frames from the same video sequence. The frames are stylized more strongly, and smoothly, as the prediction steps proceed. The model tends to preserve more details and structures of the content in the early steps and synthesize more style patterns in the later steps. Our model is capable of generating stable stylized results across different degrees of stylization. Meanwhile, at the same level of stylization, the same object exhibits similar stylized characteristics in the two different frames.
Fig. 19: Details of the RL-I2IT framework for the video NST. The state is initialized with the frame. After the first iteration, we use
only the moving image as the state. Note that the VGG networks are pre-trained and fixed for feature extraction during the training
process.
Methods \ Styles | La muse | Sketch | En campo gris | Brushstrokes | Picasso | Trial | Asheville | Contrast | Average
LinearStyleTransfer | 2.602 | 1.792 | 1.795 | 2.321 | 2.947 | 1.451 | 5.043 | 4.524 | 2.809
ReReVST | 1.450 | 8.155 | 7.050 | 7.026 | 10.772 | 7.888 | 19.493 | 12.886 | 9.340
MCCNet | 4.493 | 2.050 | 2.759 | 2.591 | 2.854 | 2.486 | 6.750 | 4.820 | 3.600
AdaAttN | 3.442 | 1.976 | 2.660 | 2.561 | 2.941 | 1.698 | 5.775 | 3.587 | 3.080
Ours (Step=1) | 0.885 | 1.196 | 0.453 | 0.883 | 1.447 | 0.527 | 1.735 | 1.045 | 1.021
Ours (Step=5) | 1.436 | 1.509 | 0.855 | 1.499 | 1.980 | 0.704 | 2.327 | 1.550 | 1.483
Ours (Step=10) | 1.867 | 1.695 | 1.141 | 1.807 | 2.394 | 0.852 | 2.854 | 1.842 | 1.807

TABLE 7: Comparison of the average temporal losses (×10⁻²) on 23 different sequences between our method and the baseline methods on different styles. The last column shows the average score over all styles for each method.
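For reference, a hedged sketch of the kind of temporal-consistency loss reported in Tables 7 and 8: the previous stylized frame is warped to the current one with optical flow and compared where the flow is valid. The flow source, masking, and normalization here are assumptions rather than the paper's exact protocol.

```python
# A sketch of a flow-based temporal loss between consecutive stylized frames.
import torch
import torch.nn.functional as F

def flow_warp(image, flow):
    # Bilinear warp of `image` (N,C,H,W) by `flow` (N,2,H,W), where the flow
    # is expressed as offsets in normalized [-1, 1] grid coordinates.
    n, _, h, w = image.shape
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, h),
                            torch.linspace(-1, 1, w), indexing="ij")
    grid = torch.stack((xs, ys), dim=-1).expand(n, h, w, 2)
    return F.grid_sample(image, grid + flow.permute(0, 2, 3, 1),
                         align_corners=True)

def temporal_loss(prev_stylized, cur_stylized, flow, mask):
    # Mean squared difference on non-occluded pixels (mask in {0, 1}).
    warped = flow_warp(prev_stylized, flow)
    return (((cur_stylized - warped) ** 2) * mask).sum() / mask.sum().clamp(min=1.0)
```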
Fig. 20: Comparison of video style transfer between our method and the compared methods. For each method, the top portion shows the stylized video frames, and the bottom portion shows a heatmap of the differences between two adjacent video frames.
Fig. 21: Comparison of our method, our method without RL (the PA method), and our method without the Step-wise GRU. RL makes the results of our model more stable, and the Step-wise GRU gives the output higher quality.
some regions that are close to the edges of objects, such as the head and shoulder.

Quantitative Results. As shown in Table 7, we choose 23 different sequences from the MPI Sintel dataset [70] and eight different style images to calculate the average temporal losses for comparison. It is clear that our method (step=1 and 5) outperforms the compared methods in all style settings. Our method still has a low temporal error even at step=10.

Step | Method | La muse | Brushstrokes
Step=1 | Ours w/o FWG | 1.8939 | 1.0933
Step=1 | Ours | 1.1351 | 0.8679
Step=5 | Ours w/o FWG | 2.3991 | 1.4731
Step=5 | Ours | 1.8883 | 1.4331
Step=10 | Ours w/o FWG | 3.1329 | 1.8000
Step=10 | Ours | 2.3836 | 1.7053

TABLE 8: Comparison of our method with and without the Frame-wise GRU (FWG). The average temporal losses (×10⁻²) over eight sequences are reported on two styles.

Ablation Study. We investigate the effect of the individual parts of the network structure on the results, including RL, the Step-wise GRU, and the Frame-wise GRU.
(1) As shown in Fig. 21 (first and second rows), our method generates more stable and clearer stylized results than the method using only the Planner-Actor (PA) without RL. After step 5, PA no longer retains the content and style information, while our method with RL can still produce good results.

(2) We compare our method with the results produced when the Step-wise GRU is removed, shown in Fig. 21 (second and third rows). Most of the facial details of the protagonist and the dragon are lost at step 10 when our method is used without the Step-wise GRU, and the external details of the protagonist and the dragon are completely lost at step 20. Our method with the Step-wise GRU, on the other hand, obtains very fine results even at step 20.

(3) Table 8 compares the temporal loss of our method with and without the Frame-wise GRU (FWG). The temporal loss is very low when we use FWG, which means the final results are more consistent from frame to frame. The above experiments show that RL, the Step-wise GRU, and the FWG
all greatly improve the performance of the model.

[Figure: the registration environment — at each step, the current moving image is warped by the predicted deformation field and compared against the fixed image.]

5 APPLICATIONS ON MEDICAL IMAGES

5.1 Deformable Image Registration

Given a pair of images (I_F, I_M), both from the image domain, we minimize

    min_{ψ,ϕ} E(ψ, ϕ) := (1/T) Σ_{t=1}^{T} G(I_F, I_{M_t} ∘ Ω^t_{ψ,ϕ}) + λ R(Ω^t_{ψ,ϕ}),    (13)

where we use our RL-I2IT framework to learn the tuple (ψ, ϕ) instead of the parameter w in Eq. (10).

5.1.2 Experiment

In this section, we evaluate our RL-I2IT framework on the 2D and 3D medical image registration tasks.

Datasets. For the 2D registration, we use 2,302 pre-processed 2D scans from ADNI [84], ABIDE [85], and ADHD [86] for training.
Method | 2D: LPBA | 2D: Time(s) | 2D: #Params | 3D: SLIVER | 3D: LSPIG | 3D: Time(s) | 3D: #Params
SyN [72] | 55.47±3.96 | 4.57 | - | 89.57±3.34 | 81.83±8.30 | 269 | -
Elastix [73] | 53.64±3.97 | 2.20 | - | 90.23±2.39 | 81.19±7.47 | 87.0 | -
LDDMM [74] | 52.18±3.48 | 3.27 | - | 83.94±3.44 | 82.33±7.14 | 41.4 | -
VM [75] | 55.36±3.94 | 0.02 | 105K | 86.37±4.15 | 81.13±7.28 | 0.13 | 356K
VM-diff [76] | 55.88±3.78 | 0.02 | 118K | 87.24±3.26 | 81.38±7.21 | 0.16 | 396K
R2N2 [77] | 51.84±3.30 | 0.46 | 3,591K | - | - | - | -
GMFlow [78] | 52.52±1.90 | 0.05 | 468K | - | - | - | -
COTR [79] | 52.53±1.89 | 2312.29 | 1,838K | - | - | - | -
RCN [80] | - | - | - | 89.59±3.18 | 82.87±5.69 | 2.44 | 21,291K
SYMNet [81] | - | - | - | 86.97±3.82 | 82.78±7.20 | 0.18 | 1,124K
RL-I2IT (t=20, SSIM reward) | 56.43±3.76 | 0.16 | 107K | 90.27±3.85 | 83.69±6.74 | 1.05 | 458K
RL-I2IT (t=1, Dice reward) | 55.21±3.55 | 0.02 | 107K | 84.81±4.42 | 80.61±7.94 | 0.07 | 458K
RL-I2IT (t=10, Dice reward) | 56.12±3.68 | 0.08 | 107K | 90.01±3.79 | 84.67±6.05 | 0.55 | 458K
RL-I2IT (t=20, Dice reward) | 56.57±3.71 | 0.16 | 107K | 90.28±3.66 | 84.40±6.24 | 1.05 | 458K
TABLE 9: The Dice score (%) results of our RL-I2IT (t indicates the t-th step) and the baseline methods. The execution time for the
3D registration is tested on the SLIVER dataset. Note that R2N2, GMFlow, and COTR work only for the 2D registration, and RCN and
SYMNet are only for the 3D registration.
Fig. 24: A step-wise registration example of RL-I2IT on the LPBA dataset. The first row is the visualized displacement field, where
deep color represents a large deformation. The red number at the top right corner is the Dice score.
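The step-wise behavior in Fig. 24 and Table 9 can be sketched as the loop below, where each action is a dense displacement field applied to the current moving image; the planner/actor interfaces and the bilinear warp are illustrative assumptions, not the released code.

```python
# A sketch of step-wise deformable registration with RL-I2IT, plus the Dice
# overlap used as the "Dice reward" in Table 9.
import torch
import torch.nn.functional as F

def warp(image, field):
    # Bilinear warp of `image` (N,C,H,W) by a displacement `field` (N,2,H,W)
    # given as offsets in normalized [-1, 1] grid coordinates.
    n, _, h, w = image.shape
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, h),
                            torch.linspace(-1, 1, w), indexing="ij")
    grid = torch.stack((xs, ys), dim=-1).expand(n, h, w, 2)
    return F.grid_sample(image, grid + field.permute(0, 2, 3, 1),
                         align_corners=True)

def register(planner, actor, fixed, moving, steps=20):
    state = moving
    for _ in range(steps):
        plan, _ = planner.rsample(torch.cat([fixed, state], dim=1))
        state = warp(state, actor(plan))  # a_t: deformation field
    return state

def dice(seg_a, seg_b, eps=1e-8):
    # Dice overlap between binary segmentation masks.
    inter = torch.logical_and(seg_a > 0, seg_b > 0).sum()
    return 2.0 * inter / ((seg_a > 0).sum() + (seg_b > 0).sum() + eps)
```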
Fig. 26: Learning curves (Dice score vs. training episodes) of several RL-based methods — DDPG, PPO, Ours w/o Reg, and Ours — on the LPBA dataset.

TABLE 11: Quantitative results of the Jacobian determinants.

and DL-based methods, such as VM-diff, RCN, and R2N2, using step-wise registration. As the number of steps increases, the performance of the DL-based methods becomes worse, while the RL-based methods remain quite stable. In addition, the Dice score of RL-I2IT keeps increasing on both the LPBA and SLIVER datasets.

Comparison with other RL methods. To demonstrate the effectiveness of our method on the reinforcement learning side, we replace the planner-critic learning process in our framework with other popular RL models, such as PPO [49] and DDPG [28]. We also compare with a variant that discards the DL-based unsupervised registration loss (RL-I2IT w/o Reg). The quantitative results are shown in Table 10, and the training curves are shown in Fig. 26. RL-I2IT w/o Reg and DDPG, which use a deterministic policy, fail to converge. This indicates that the RL agent can hardly deal with the DIR problem without the unsupervised registration loss. In addition, our RL-I2IT achieves better performance than PPO.

Ablation Experiments. We analyze some important components of our framework, such as reinforcement learning, unsupervised registration learning, and evaluating the plan with the critic. Note that when the registration loss is discarded, the deformation field is practically the only action, and both the planner and the actor are trained with the RL objective. As summarized in Table 10, the result is unsatisfactory if we train RL-I2IT without reinforcement learning, and it becomes worse if the training also discards the unsupervised registration loss. When the critic evaluates the actor's action (RL-I2IT-action), the performance is inferior to that of RL-I2IT with the critic evaluating the planner.

We also use the Jacobian determinant to assess the regularity of the predicted displacement field. The results are shown in Table 11. A small standard deviation of the Jacobian determinant indicates a smooth displacement. We can see from the table that our deformation fields are plausible and smooth. Furthermore, we are the first to use SSIM as a reward function for registration. The comparison between using SSIM and the Dice score as the reward is shown in Table 12: although SSIM performs well compared with other methods, its overall performance across all steps is still inferior to that of the Dice reward.
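The regularity check itself is straightforward; a sketch for a 2D displacement field follows, with the 3D case analogous. It computes the per-pixel determinant of J = I + ∇u, whose standard deviation is the statistic reported in Table 11 (under our assumptions about the field layout).

```python
# A sketch of the Jacobian-determinant smoothness check for a 2D field.
import numpy as np

def jacobian_determinant(field):
    # field: (2, H, W) displacement u with field[0] = u_y, field[1] = u_x
    # (pixel units). The deformation is phi(x) = x + u(x), so J = I + grad u.
    duy_dy = np.gradient(field[0], axis=0)
    duy_dx = np.gradient(field[0], axis=1)
    dux_dy = np.gradient(field[1], axis=0)
    dux_dx = np.gradient(field[1], axis=1)
    det = (1.0 + duy_dy) * (1.0 + dux_dx) - duy_dx * dux_dy
    return det  # det > 0 everywhere means no folding; std(det) measures smoothness
```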
Fig. 27: Trade-off between Dice score and inference time over all datasets. The three panels plot Dice score against inference time (s); the methods shown include VM, LDDMM, R2N2, Elastix, RCN, SYMNet, and Ours (t=1).
Reward | SLIVER t=1 | SLIVER t=10 | SLIVER t=20 | LSPIG t=1 | LSPIG t=10 | LSPIG t=20
SSIM | 83.69 | 89.71 | 90.27 | 79.14 | 83.80 | 83.69
DICE | 84.81 | 90.01 | 90.28 | 80.61 | 84.67 | 84.40

TABLE 12: Comparison of the SSIM and segmentation (Dice) rewards.

The trade-off between the Dice score and the inference time is shown in Fig. 27. We can see that the proposed RL-I2IT achieves a better trade-off between registration performance and computational efficiency.

6 CONCLUSION

In this paper, we propose a reinforcement learning-based framework, RL-I2IT, to handle the I2IT problem. Our RL-I2IT framework is an off-policy planner-actor-critic model. It can efficiently learn good policies in spaces with high-dimensional continuous states and actions. The core component of RL-I2IT is the proposed meta policy with a new component, the "plan", which is defined in a latent subspace and can guide the actor to generate high-dimensional executable actions. To the best of our knowledge, we are the first to propose an RL framework for the I2IT problem. Experiments on diverse applications demonstrate that this architecture achieves significant gains over existing state-of-the-art methods.

There are several potential limitations of our proposed framework. One is that our framework can perform only single-style NST tasks. Arbitrary style transfer methods usually use a pre-trained model to extract deep features, while our current RL-based framework directly interacts with the current state instead of using pre-extracted features as input. This difference means that our current framework cannot perform arbitrary style transfer. However, the main goal of the NST task in this paper is to show the effectiveness of stylization-level control with our RL-based method and the superiority of our method in achieving the best NST quality; a single-style NST model is sufficient for this purpose. That said, we can extend the current framework to support arbitrary style transfer by observing the deep features of the state. Another potential limitation is that the number of steps in the testing process is a predefined hyper-parameter, which could instead be learned by the model automatically.

In the future, we will try to address the aforementioned limitations of our proposed framework. We expect that the proposed architecture can potentially be extended to all I2IT tasks.

APPENDIX A
LEARNING WITH CRITIC ON ACTOR

When the critic is used to evaluate the actor, the rewards and the soft Q values are used to guide the stochastic policy improvement iteratively, where a_t is concatenated with the state s_t as the input of the critic. In the evaluation step, following SAC [12], RL-I2IT learns the actor π_ϕ and fits the parametric Q-function Q_θ(s_t, a_t) (critic) using transitions sampled from the replay pool D by minimizing the soft Bellman residual:

    J_Q(θ) = E_D [ (1/2) ( Q_θ(s_t, a_t) − ( r_t + γ E [ V_θ̄(s_{t+1}) ] ) )² ],    (15)

where V_θ̄(s_t) = E_{a_t∼π_ϕ} [ Q_θ̄(s_t, a_t) − α log π_ϕ(a_t|p_t) ] and γ is the discount factor. We use a target network Q_θ̄ to stabilize training, whose parameters θ̄ are obtained as an exponential moving average of the critic network's parameters [28], θ̄ ← τθ + (1 − τ)θ̄, with hyper-parameter τ ∈ [0, 1]. To optimize J_Q(θ), we perform stochastic gradient descent [12] with respect to the parameters θ:

    θ = θ − η_Q ∇_θ Q_θ(s_t, a_t) ( Q_θ(s_t, a_t) − r_t − γ [ Q_θ̄(s_{t+1}, a_{t+1}) − α log π_ϕ(a_{t+1}|p_{t+1}) ] ).    (16)

Since the critic works on the actor, the optimization procedure will also influence the planner's decisions. Therefore, the improvement step attempts to optimize both the actor and the planner parameters ϕ, ψ. Following [12], we can use the following objective to minimize the KL divergence between the policy and a Boltzmann distribution induced by the Q-function:

    J_{κ,π}(ψ, ϕ) = E_D [ α log π_ϕ(a_t|p_t) − Q_θ(s_t, a_t) ]
                 = E_D [ α log π_ϕ(a_t|f_ψ(ϵ_t, s_t)) − Q_θ(s_t, a_t) ].    (17)

The last equality holds because p_t can be replaced by f_ψ(ϵ_t, s_t), as discussed before. Note that the hyperparameter α can be adjusted automatically using the method proposed in [12]. We then apply the stochastic gradient method to optimize the parameters:

    ψ = ψ − η_ψ ( α ∇_{p_t} π_ϕ(a_t|p_t) · ∇_ψ f_ψ(ϵ_t, s_t) ) / π_ϕ(a_t|p_t),    (18)

    ϕ = ϕ − η_ϕ ( α ∇_{a_t} π_ϕ(a_t|p_t) ) / π_ϕ(a_t|p_t).    (19)
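In autograd terms, Eqs. (18) and (19) are ordinary log-likelihood gradients, since ∇π/π = ∇ log π; a hedged sketch of this critic-on-actor update, using the same placeholder interfaces as the Section 3 sketch, is:

```python
# A sketch of the critic-on-actor improvement step (Eq. (17)); gradients reach
# both phi (actor) and psi (planner) through a_t = pi_phi(. | f_psi(eps, s)).
def update_actor_variant(planner, actor, critic, batch, alpha,
                         opt_planner, opt_actor):
    s = batch["state"]
    p, _ = planner.rsample(s)            # p_t = f_psi(eps_t, s_t)
    a, log_pi = actor.rsample(p)         # a_t ~ pi_phi(a_t | p_t)
    loss = (alpha * log_pi - critic(s, a)).mean()
    opt_planner.zero_grad(); opt_actor.zero_grad()
    loss.backward()
    opt_planner.step(); opt_actor.step()
    return loss.item()
```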
[36] Y. Zeng, J. Fu, H. Chao, and B. Guo, "Learning pyramid-context encoder network for high-quality image inpainting," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 1486–1494.
[37] C. Zheng, T.-J. Cham, and J. Cai, "Pluralistic image completion," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 1438–1447.
[38] T. Yu, Z. Guo, X. Jin, S. Wu, Z. Chen, W. Li, Z. Zhang, and S. Liu, "Region normalization for image inpainting," in AAAI, 2020, pp. 12733–12740.
[39] Z. Yan, X. Li, M. Li, W. Zuo, and S. Shan, "Shift-net: Image inpainting via deep feature rearrangement," in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 1–17.
[40] G. Daras, J. Dean, A. Jalal, and A. G. Dimakis, "Intermediate layer optimization for inverse problems using deep generative models," arXiv preprint arXiv:2102.07364, 2021.
[41] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. Courville, "Improved training of wasserstein gans," Advances in Neural Information Processing Systems, 2017.
[42] A. Jolicoeur-Martineau, "The relativistic discriminator: a key element missing from standard gan," arXiv preprint arXiv:1807.00734, 2018.
[43] R. Tyleček and R. Šára, "Spatial pattern templates for recognition of objects with regular structure," in German Conference on Pattern Recognition. Springer, 2013, pp. 364–374.
[44] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele, "The cityscapes dataset for semantic urban scene understanding," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[45] A. Yu and K. Grauman, "Fine-grained visual comparisons with local learning," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 192–199.
[46] C. Wang, C. Xu, C. Wang, and D. Tao, "Perceptual adversarial networks for image-to-image transformation," IEEE Transactions on Image Processing, vol. 27, no. 8, pp. 4066–4079, 2018.
[47] C. Wang, W. Niu, Y. Jiang, H. Zheng, Z. Yu, Z. Gu, and B. Zheng, "Discriminative region proposal adversarial network for high-quality image-to-image translation," International Journal of Computer Vision, 2019.
[48] F. Gao, X. Xu, J. Yu, M. Shang, X. Li, and D. Tao, "Complementary, heterogeneous and adversarial networks for image-to-image translation," IEEE Transactions on Image Processing, vol. 30, pp. 3487–3498, 2021.
[49] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, "Proximal policy optimization algorithms," arXiv preprint arXiv:1707.06347, 2017.
[50] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang, "The unreasonable effectiveness of deep features as a perceptual metric," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 586–595.
[51] J. Cheng, A. Jaiswal, Y. Wu, P. Natarajan, and P. Natarajan, "Style-aware normalized loss for improving arbitrary style transfer," in CVPR, 2021.
[52] X. Huang and S. Belongie, "Arbitrary style transfer in real-time with adaptive instance normalization," in CVPR, 2017, pp. 1501–1510.
[53] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in CVPR, 2016, pp. 770–778.
[54] J. Johnson, A. Alahi, and L. Fei-Fei, "Perceptual losses for real-time style transfer and super-resolution," in ECCV. Springer, 2016.
[55] L. Gatys, A. S. Ecker, and M. Bethge, "Texture synthesis using convolutional neural networks," NeurIPS, vol. 28, pp. 262–270, 2015.
[56] F. Phillips and B. Mackintosh, "Wiki art gallery, inc.: A case for critical thinking," Issues in Accounting Education, vol. 26, no. 3, pp. 593–608, 2011.
[57] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, "Microsoft coco: Common objects in context," in ECCV. Springer, 2014, pp. 740–755.
[58] Y. Deng, F. Tang, X. Pan, W. Dong, C. Ma, and C. Xu, "StyTr²: Unbiased image style transfer with transformers," arXiv preprint arXiv:2105.14576, 2021.
[59] Y. Li, C. Fang, J. Yang, Z. Wang, X. Lu, and M.-H. Yang, "Universal style transfer via feature transforms," Advances in Neural Information Processing Systems, vol. 30, 2017.
[60] D. Y. Park and K. H. Lee, "Arbitrary style transfer with style-attentional networks," in CVPR, 2019, pp. 5880–5888.
[61] T. Lin, Z. Ma, F. Li, D. He, X. Li, E. Ding, N. Wang, J. Li, and X. Gao, "Drafting and revision: Laplacian pyramid network for fast high-quality artistic style transfer," in CVPR, 2021, pp. 5141–5150.
[62] J. An, S. Huang, Y. Song, D. Dou, W. Liu, and J. Luo, "Artflow: Unbiased image style transfer via reversible neural flows," in CVPR, 2021, pp. 862–871.
[63] H. Chen, Z. Wang, H. Zhang, Z. Zuo, A. Li, W. Xing, D. Lu et al., "Artistic style transfer with internal-external learning and contrastive learning," NeurIPS, vol. 34, 2021.
[64] S. Liu, T. Lin, D. He, F. Li, M. Wang, X. Li, Z. Sun, Q. Li, and E. Ding, "Adaattn: Revisit attention mechanism in arbitrary neural style transfer," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 6649–6658.
[65] X. Li, S. Liu, J. Kautz, and M.-H. Yang, "Learning linear transformations for fast image and video style transfer," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 3809–3817.
[66] Y. Deng, F. Tang, W. Dong, H. Huang, C. Ma, and C. Xu, "Arbitrary video style transfer via multi-channel correlation," AAAI, 2021.
[67] W. Wang, S. Yang, J. Xu, and J. Liu, "Consistent video style transfer via relaxation and regularization," IEEE Transactions on Image Processing, vol. 29, pp. 9125–9139, 2020.
[68] P. Mirowski, R. Pascanu, F. Viola, H. Soyer, A. J. Ballard, A. Banino, M. Denil, R. Goroshin, L. Sifre, K. Kavukcuoglu et al., "Learning to navigate in complex environments," ICLR, 2017.
[69] "Pexels," https://www.pexels.com/, 2022, accessed: 2022-03-12.
[70] D. J. Butler, J. Wulff, G. B. Stanley, and M. J. Black, "A naturalistic open source movie for optical flow evaluation," in ECCV. Springer, 2012, pp. 611–625.
[71] X. Yang, R. Kwitt, and M. Niethammer, "Fast predictive image registration," in Deep Learning and Data Labeling for Medical Applications. Springer, 2016, pp. 48–57.
[72] B. B. Avants, C. L. Epstein, M. Grossman, and J. C. Gee, "Symmetric diffeomorphic image registration with cross-correlation: evaluating automated labeling of elderly and neurodegenerative brain," Medical Image Analysis, vol. 12, no. 1, pp. 26–41, 2008.
[73] S. Klein, M. Staring, K. Murphy, M. A. Viergever, and J. P. Pluim, "Elastix: a toolbox for intensity-based medical image registration," IEEE Transactions on Medical Imaging, vol. 29, no. 1, pp. 196–205, 2009.
[74] M. F. Beg, M. I. Miller, A. Trouvé, and L. Younes, "Computing large deformation metric mappings via geodesic flows of diffeomorphisms," International Journal of Computer Vision, vol. 61, no. 2, pp. 139–157, 2005.
[75] G. Balakrishnan, A. Zhao, M. R. Sabuncu, J. Guttag, and A. V. Dalca, "Voxelmorph: a learning framework for deformable medical image registration," IEEE Transactions on Medical Imaging, vol. 38, no. 8, pp. 1788–1800, 2019.
[76] A. V. Dalca, G. Balakrishnan, J. Guttag, and M. R. Sabuncu, "Unsupervised learning of probabilistic diffeomorphic registration for images and surfaces," Medical Image Analysis, vol. 57, pp. 226–236, 2019.
[77] R. Sandkühler, S. Andermatt, G. Bauman, S. Nyilas, C. Jud, and P. C. Cattin, "Recurrent registration neural networks for deformable image registration," in Advances in Neural Information Processing Systems, vol. 32, 2019.
[78] H. Xu, J. Zhang, J. Cai, H. Rezatofighi, and D. Tao, "Gmflow: Learning optical flow via global matching," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 8121–8130.
[79] W. Jiang, E. Trulls, J. Hosang, A. Tagliasacchi, and K. M. Yi, "Cotr: Correspondence transformer for matching across images," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 6207–6217.
[80] S. Zhao, Y. Dong, E. I. Chang, Y. Xu et al., "Recursive cascaded networks for unsupervised medical image registration," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 10600–10610.
[81] T. C. Mok and A. Chung, "Fast symmetric diffeomorphic image registration with convolutional neural networks," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 4644–4653.
[82] J. MacQueen et al., "Some methods for classification and analysis of multivariate observations," in Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, vol. 1, 1967, pp. 281–297.
[83] L. I. Rudin, S. Osher, and E. Fatemi, "Nonlinear total variation based noise removal algorithms," Physica D: Nonlinear Phenomena, vol. 60, no. 1-4, pp. 259–268, 1992.
[84] S. G. Mueller, M. W. Weiner, L. J. Thal, R. C. Petersen, C. R. Jack, W. Jagust, J. Q. Trojanowski, A. W. Toga, and L. Beckett, "Ways toward an early diagnosis in Alzheimer's disease: The Alzheimer's Disease Neuroimaging Initiative (ADNI)," Alzheimer's & Dementia, vol. 1, no. 1, pp. 55–66, 2005.