

Deep Reinforcement Learning for Image-to-Image Translation

Xin Wang∗, Senior Member, IEEE, Ziwei Luo, Jing Hu∗, Chengming Feng, Shu Hu, Bin Zhu, Xi Wu, Xin Li, Fellow, IEEE, Siwei Lyu, Fellow, IEEE

arXiv:2309.13672v3 [cs.CV] 3 Feb 2024

Abstract—Most existing Image-to-Image Translation (I2IT) methods generate images in a single run of a deep learning (DL) model. However, designing a single-step model often requires many parameters and suffers from overfitting. Inspired by the analogy between diffusion models and reinforcement learning, we reformulate I2IT as an iterative decision-making problem via deep reinforcement learning (DRL) and propose a computationally efficient RL-based I2IT (RL-I2IT) framework. The key feature of the RL-I2IT framework is to decompose a monolithic learning process into small steps with a lightweight model that progressively transforms the source image into the target image. Considering the challenge of handling high-dimensional continuous state and action spaces in the conventional RL framework, we introduce a meta policy with a new concept, the "plan", to the standard Actor-Critic model. The plan is of lower dimension than the original image, which facilitates the actor in generating a tractable high-dimensional action. In the RL-I2IT framework, we also employ a task-specific auxiliary learning strategy to stabilize the training process and improve the performance of the corresponding task. Experiments on several I2IT tasks demonstrate the effectiveness and robustness of the proposed method when facing high-dimensional continuous action space problems. Our implementation of the RL-I2IT framework is available at https://github.com/Algolzw/SPAC-Deformable-Registration.

Index Terms—Image-to-Image Translation, Deep Reinforcement Learning, Meta Policy, Auxiliary Learning.

1 INTRODUCTION

Many computer vision problems, such as face inpainting, semantic segmentation, image registration, realistic photo generation from sketches, and neural style transfer, can be unified under the framework of learning image-to-image translation (I2IT) [1]. Existing approaches to I2IT can be categorized into either one-step deep-learning (DL) frameworks (e.g., Variational Autoencoders [2], U-Net [3], and conditional GANs [4]) or iterative diffusion models (e.g., Palette [5], SSDM [6], Plug-and-Play [7]). Directly learning I2IT with one-step DL models typically suffers from two major challenges. One is that, to handle high-dimensional I2IT problems, one-step DL models typically have complex structures and many parameters, making them difficult to train and hard to deploy in resource-limited scenarios such as mobile devices. The other is that many of these models do not generalize well [8] due to the abundance of global minima caused by the over-parameterized setting. Although these problems can potentially be alleviated by using multi-scale models or multi-stage pipelines such as diffusion models, we still face the challenge of prohibitive computational complexity.

Fig. 1: Top: the I2IT problem translates an image from a source domain to a target domain. Mid: example of the one-step method CGAN [4]. Bottom: our RL-based stepwise I2IT progressively transforms the source image, and the process is demonstrated clearly.

To address these limitations of existing methods, we explore solving I2IT problems by leveraging recent advances in deep reinforcement learning (DRL). The key idea is to decompose the monolithic learning process into small steps with a lightweight CNN, aiming to improve the quality of predicted results progressively (see Fig. 1). By decomposing a one-step complex task into a series of simpler tasks, our approach can handle each simplified task with a much simpler network rather than using a large, heavily parameterized network.

• Xin Wang is with the Department of Epidemiology and Biostatistics, School of Public Health, University at Albany, State University of New York (SUNY), NY 12222, USA (e-mail: [email protected]).
• Ziwei Luo is with Uppsala University, Sweden (e-mail: [email protected]).
• Jing Hu, Chengming Feng, and Xi Wu are with Chengdu University of Information Technology, China (e-mail: jing [email protected], [email protected], [email protected]).
• Shu Hu is with the Department of Computer and Information Technology, Purdue University, IN 46202, USA (e-mail: [email protected]).
• Bin Zhu is with Microsoft Research Asia (e-mail: [email protected]).
• Xin Li is with the Department of Computer Science, University at Albany, SUNY, NY 12222, USA (e-mail: [email protected]).
• Siwei Lyu is with the Department of Computer Science and Engineering, University at Buffalo, SUNY, USA (e-mail: [email protected]).
∗ Corresponding authors.

Although recent works have successfully applied DRL to solve several visual tasks [9], [10], [11], their action spaces are usually discrete, making them unsuitable for I2IT, which requires continuous action spaces. A promising direction for learning continuous actions is maximum entropy reinforcement learning (MERL), which improves both exploration and robustness by maximizing a standard RL objective with an entropy term [12]. Soft actor-critic (SAC) [12] is an instance of MERL and has been applied to solve continuous action tasks [13]. However, the main issue hindering the applicability of SAC to I2IT is its inability to handle high-dimensional states and actions effectively. Recently, RAE [14] tried to address this problem by combining SAC with a regularized autoencoder, but it only provides an auxiliary loss for end-to-end RL training and is incapable of handling I2IT tasks. Besides, high-dimensional states and actions require an I2IT-based RL model to perform much more exploration and exploitation during training, which leads to unstable training [14]. One solution to stabilize training is to extract a lower-dimensional visual representation with a separately pre-trained DNN model and learn the value function and the corresponding policy in the latent space [15]. However, this approach cannot be trained from scratch; otherwise, it can lead to state representations that are inconsistent with an optimal policy.

Inspired by the analogy between diffusion models and RL [16], we propose a new DRL framework, named RL-I2IT, for I2IT problems to handle high-dimensional continuous state and action spaces. As shown in Fig. 2, the RL-I2IT framework comprises three core deep neural networks: a planner, an actor, and a critic. We introduce a new concept, the "plan", to decompose the decision-making process into two steps, state → plan and plan → action. We call this process a meta policy. The plan is a subspace of appropriate actions based on the current state. It is not applied to the state directly. Instead, it is used to guide the actor to generate a tractable high-dimensional action that interacts with the environment. The plan can be considered an intermediate transition between state and action. As the input of the actor, the plan has a much lower dimension than the state, making it easier for the actor to learn to predict actions. Meanwhile, the plan can be evaluated by the critic efficiently, since the Q function is easier to learn in the low-dimensional latent space. Furthermore, compared with training a one-step differentiable DL-based model, it is much harder to learn such a high-dimensional continuous control problem with traditional RL frameworks. To address this, we also employ a task-specific auxiliary learning strategy to stabilize the training process and improve the performance of the corresponding task. The auxiliary learning part could be any learning technique; it is flexible and can readily leverage any other advanced losses or objectives. For example, we use the standard L2 reconstruction loss as auxiliary learning in many I2IT tasks. Our main contributions can be summarized as follows:

• A new DRL framework, RL-I2IT, is proposed to handle the complex I2IT problem with high-dimensional continuous actions by decomposing the monolithic learning process into small steps.
• To tackle the high-dimensional continuous action learning problem, we propose a stochastic meta policy that divides the decision-making process into two steps: state → low-dimensional plan and plan → action. The plan guides the actor to predict a tractable action, and the critic evaluates the plan. This approach makes the whole learning process feasible and computationally efficient.
• Compared to existing DL-based models, our DRL-based model is lightweight, making it simple and computationally efficient. For example, compared to a recent one-step I2IT model, pix2pixHD, of size 45.9M [17], the size of our model is only 9.7M. The training speed of RL-I2IT is estimated to be at least one order of magnitude faster than that of Palette [5].
• Our RL-I2IT framework is flexible in incorporating many advanced auxiliary learning methods for various complex I2IT applications. Experimental results on a variety of applications, from face inpainting and neural style transfer to digits transform and deformable image registration, show that our approach achieves state-of-the-art performance.

This paper extends our previous conference papers [18], [19], and [20] substantially in the following aspects: (i) We propose an efficient general RL-based framework for the I2IT problem. In this regard, our previous works [18], [19], and [20] can be considered special cases of the general framework in this paper. (ii) We provide more technical details for each application of the RL-I2IT framework, such as the detailed network architectures. (iii) We provide additional diagnostic experiments for each application to demonstrate the effectiveness of our RL-I2IT framework in computer vision and medical image applications. In the neural style transfer task, we add experiments to evaluate the necessity of a high-dimensional latent space and a user case study. In the medical image registration task, we add more experimental analysis of hyper-parameters, the trade-off between performance and inference time, etc.

The remainder of this paper is organized as follows. After introducing the background in Section 2, we describe the RL-I2IT framework for step-wise I2IT in Section 3. In Sections 4 and 5, we demonstrate experimentally the effectiveness and robustness of the RL-I2IT framework on computer vision applications (digits transform 4.1, face inpainting 4.2, realistic image translation 4.3, and neural style transfer 4.4) and medical image applications (deformable medical image registration 5.1), respectively. We conclude the paper in Section 6 with a discussion of the limitations of the framework and future work.

2 BACKGROUND

2.1 Image-to-Image Translation

Image-to-image translation (I2IT) aims to translate input images from a source domain to a target domain, such as generating realistic photos from semantic segmentation labels [4], synthesizing completed visual targets from images with missing regions [21], deformable image registration [22], and neural style transfer [23]. An autoencoder is leveraged in most research works to learn this process by minimizing the reconstruction error between the predicted image and the target. In addition, generative adversarial networks (GANs) have also been vigorously studied in I2IT to synthesize realistic images [4]. Subsequent works enhance I2IT performance by using a coarse-to-fine deep learning framework [24] that recursively sets the output of the previous stage as the input of the next stage. In this way, the I2IT task is transformed into a multi-stage, coarse-to-fine solution. Although the recursion can be applied indefinitely in practice, it is limited by increasing model size and training instability. More I2IT-related works can be found in a recent survey paper [1].

Fig. 2: Our RL-I2IT framework with a Planner-Actor-Critic structure. Left: At time step t, the environment receives the executable action a_t and outputs the state and reward (s_t, r_t). In our meta policy, a latent plan p_t is sampled from the planner to guide the actor to generate the executable action a_t that interacts with the environment. The plan is also evaluated by the critic. The nature of a_t is task-dependent: for tasks like deformable image registration, a_t may be a deformation field applied to the current state, while for tasks aiming at realistic image generation, such as face inpainting or neural style transfer, a_t could directly be the target image. Right: Task-specific auxiliary learning objectives depend on the specific task and serve various purposes, such as stabilizing the training process or improving performance.

More recently, diffusion models have found successful applications in many vision tasks including I2IT [5], [6], [25]. In Palette [5], diffusion models (DM) outperform strong GAN and regression baselines on four I2IT tasks without task-specific hyper-parameter tuning, architecture customization, or any auxiliary loss. This work has inspired several DM-based approaches to I2IT, such as the Brownian Bridge Diffusion Model (BBDM) [25] and score-decomposed diffusion models (SSDM) [6]. Inspired by the success of vision-language models, text-driven I2IT based on plug-and-play diffusion features [7] has shown high fidelity to the input structure and scene layout, while significantly changing the perceived semantic meaning of objects and their appearance.

2.2 Reinforcement Learning with Continuous Action

RL is described by an infinite-horizon Markov decision process (MDP), defined by the tuple (S, A, U, r, γ), where S is a set of states, A is a set of actions, U : S × S × A → [0, ∞) represents the state transition probability density given state s ∈ S and action a ∈ A, r : S × A → R is the reward emitted from each transition, and γ ∈ [0, 1] is the reward discount factor. Standard RL learns to maximize the expected sum of rewards from the episodic environments under the trajectory distribution ρ_π.

Maximum Entropy RL (MERL) incorporates an entropy term with the policy, and the resulting objective is defined as

Σ_{t=1}^{T} E_{(s_t, a_t) ∼ ρ_π} [ r_t(s_t, a_t) + α H(π_ϕ(·|s_t)) ],

where α is a temperature parameter controlling the balance between the entropy H and the reward r_t. The MERL model has proven stable and powerful in low-dimensional continuous action tasks, such as games and robotic control [12]. However, when facing complex visual problems such as I2IT, where observations and actions are high-dimensional, it remains a challenge for MERL models [26]. Soft actor-critic (SAC) [12] has been shown to be a promising framework for learning continuous actions; it is an off-policy actor-critic method that uses the above entropy-based framework to derive soft policy iteration. The advantage of SAC is that it provides sample-efficient learning and stability, and it can improve both the exploration and robustness of the learned model. The original SAC paper reports its performance on continuous control tasks with up to 21 dimensions, which is far from enough for handling I2IT tasks. Recent studies [14], [26] have shown that SAC has limitations when handling high-dimensional states and actions.

More recently, the stochastic latent actor-critic (SLAC) [26] improves SAC by learning representation spaces with a latent variable model, which is more stable and efficient for complex continuous control tasks, and can improve both the exploration and robustness of the learned model. However, the capability of SLAC is limited in a continuous action space: the latent state representation in SLAC is only used to facilitate the training of the critic, so it cannot handle tasks with a high-dimensional action space.

3 REINFORCEMENT LEARNING FOR I2IT

3.1 Problem Formulation

In our study, Image-to-Image Translation (I2IT) is reformulated as a multistep decision-making problem, where the transformation from an input image to a target image is not executed in a single step. Instead, we introduce a lightweight Deep Reinforcement Learning (DRL) model that incrementally performs the transformation, allowing the progressive addition of new details. We conceptualize I2IT as a Markov Decision Process (MDP), where the translation, denoted as T, moves from the current state s to the target y through a defined policy. This approach allows for a more delicate and progressive process of image transformation within the MDP framework, which can be formulated as follows:

T(s) = T_t ∘ T_{t−1} ∘ ··· ∘ T_0(s) = y,

where ∘ is a composition operator and T_t is the t-th translation step, which predicts the image from state s_t. The state s can be defined according to the specific I2IT task.
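The composition above can be read as a simple inference loop. The following minimal sketch illustrates it; `planner`, `actor`, and `apply_action` are hypothetical stand-ins for the task-specific components introduced in the rest of this section.

```python
def translate(source_image, planner, actor, apply_action, num_steps=10):
    """Step-wise I2IT: approximate y = T_t o ... o T_0(s)."""
    state = source_image                     # s_0 is defined by the task
    for t in range(num_steps):
        plan = planner(state)                # low-dimensional latent plan p_t
        action = actor(plan)                 # high-dimensional executable action a_t
        state = apply_action(state, action)  # one small translation step T_t
    return state                             # progressively refined estimate of y
```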

3.2 Stochastic Meta Policy of Planning and Acting

Our RL-I2IT framework is designed to handle high-dimensional continuous states and actions in an infinite-horizon Markov Decision Process (MDP). It incorporates a novel component, a planner, designed specifically for the continuous plan space. This approach diverges from traditional policies that directly map environmental states to actions [27]. Instead, it bifurcates the mapping process into two distinct steps: first state → plan, and then plan → action. We term this new two-step mapping process a "meta policy", which allows for more intricate and layered decision-making compared to standard reinforcement learning models. The new MDP for our RL-I2IT can be represented by the tuple (S, P, A, U, r, γ), where S is a set of states, P is a continuous plan space, A is a continuous action space, and U : S × P × S × A → [0, ∞) represents the state transition probability density of the next state s_{t+1} given state s_t ∈ S, plan p_t ∈ P, and action a_t ∈ A.

Our RL-I2IT framework is shown in Fig. 2. It comprises three core deep neural networks: the planner, the actor, and the critic, with parameters ψ, ϕ, and θ, respectively. The planner aims to generate a high-level plan in a low-dimensional latent space to guide the actor. In some sense, the plan can be considered as action clusters or action templates, which are high-level crude actions. Unlike classic policy models, the input of the actor is a stochastic plan instead of the state. That is, the generated plan is forwarded to the actor to further create the high-dimensional action in our meta-policy model. Meanwhile, this plan is evaluated by the critic. By using the meta policy and the stochastic planner-actor-critic structure, RL-I2IT makes the learning process of a complex I2IT task easier.

Formally, suppose a meta policy is defined as (κ, π). The stochastic plan is modeled as a subspace of the deformation field that gives a low-dimensional vector p_t based on the state s_t, while the action a_t is determined by the plan p_t. Given a parameterized planner κ_ψ and actor π_ϕ, the stochastic plan is sampled as a representation, p_t ∼ κ_ψ(p_t|s_t), and the action is generated by decoding the plan vector p_t into a high-dimensional executable action, a_t = π_ϕ(a_t|p_t). In practice, we reparameterize the planner and the stochastic plan jointly using a neural network approximation p_t = f_ψ(ϵ_t, s_t), known as the reparameterization trick [2], where ϵ_t is an input noise vector sampled from a fixed Gaussian distribution. Moreover, we maximize the entropy of the plan to improve exploration and robustness. The augmented objective function is formulated as follows:

max_{ψ,ϕ} Σ_{t=1}^{T} E_{(s_t, p_t, a_t) ∼ ρ_{(κ,π)}} [ r_t(s_t, p_t, a_t) + α H(κ_ψ(·|s_t)) ],   (1)

where α is the temperature and ρ_{(κ,π)} is a trajectory distribution under κ_ψ(p_t|s_t) and π_ϕ(a_t|p_t).
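A minimal PyTorch-style sketch of the meta policy is given below: the planner outputs a Gaussian over plans and samples via the reparameterization trick p_t = f_ψ(ϵ_t, s_t), and a decoder-style actor maps the plan to an action. The layer sizes and architecture are illustrative assumptions, not the exact networks used in the experiments.

```python
import math
import torch
import torch.nn as nn

class Planner(nn.Module):
    """Stochastic planner kappa_psi: state -> Gaussian over low-dim plans."""
    def __init__(self, state_channels, plan_dim):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(state_channels, 32, 3, stride=2, padding=1), nn.LeakyReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.LeakyReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.mu = nn.Linear(64, plan_dim)
        self.log_std = nn.Linear(64, plan_dim)

    def forward(self, state):
        h = self.encoder(state)
        mu, log_std = self.mu(h), self.log_std(h).clamp(-10, 2)
        eps = torch.randn_like(mu)              # eps_t ~ N(0, I)
        plan = mu + eps * log_std.exp()         # p_t = f_psi(eps_t, s_t)
        log_prob = (-0.5 * ((plan - mu) / log_std.exp()) ** 2
                    - log_std - 0.5 * math.log(2 * math.pi)).sum(-1)
        return plan, log_prob                   # log kappa_psi(p_t | s_t)

# The actor pi_phi is a decoder that upsamples the plan to a full-resolution
# action (e.g., a deformation field or a predicted image):
#   plan, log_prob = planner(state); action = actor(plan)
```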
3.3 Learning Planner and Critic

Unlike conventional RL algorithms, the critic Q_θ in our framework evaluates the plan p_t instead of the action a_t, since learning a low-dimensional plan in an I2IT problem is easier and more effective. Specifically, the low-dimensional plan is concatenated with the downsampled feature vector of the critic, which outputs the soft Q function Q_θ(s_t, p_t), an estimation of the current state-plan value, as shown in Fig. 2.

When the critic is used to evaluate the planner, rewards and soft Q values are used to iteratively guide the stochastic meta-policy improvement. In the evaluation step, following SAC [12], RL-I2IT learns κ_ψ (planner) and fits the parametric Q-function Q_θ(s_t, p_t) (critic) using transitions sampled from the replay pool D by minimizing the soft Bellman residual:

J_Q(θ) = E_{(s_t, p_t) ∼ D} [ (1/2) ( Q_θ(s_t, p_t) − (r_t + γ E_{s_{t+1}}[V_θ̄(s_{t+1})]) )² ],

where V_θ̄(s_t) = E_{p_t ∼ κ_ψ}[ Q_θ̄(s_t, p_t) − α log κ_ψ(p_t|s_t) ]. We use a target network Q_θ̄ to stabilize training, whose parameters θ̄ are obtained by an exponentially moving average of the parameters of the critic network [28]: θ̄ ← τθ + (1 − τ)θ̄, with hyper-parameter τ ∈ [0, 1]. To optimize J_Q(θ), we can perform stochastic gradient descent with respect to the parameters θ as follows:

θ = θ − η_Q ∇_θ Q_θ(s_t, p_t) ( Q_θ(s_t, p_t) − r_t − γ [ Q_θ̄(s_{t+1}, p_{t+1}) − α log κ_ψ(p_{t+1}|s_{t+1}) ] ).   (2)

Since the critic works on the planner, the optimization procedure will also influence the planner's decisions. Following [12], we can use the following objective to minimize the KL divergence between the policy and a Boltzmann distribution induced by the Q-function:

J_κ(ψ) = E_{s_t ∼ D} [ E_{p_t ∼ κ_ψ} [ α log κ_ψ(p_t|s_t) − Q_θ(s_t, p_t) ] ]
       = E_{s_t ∼ D, ϵ_t ∼ N(µ,σ)} [ α log κ_ψ(f_ψ(ϵ_t, s_t)|s_t) − Q_θ(s_t, f_ψ(ϵ_t, s_t)) ].

The last equation holds because p_t can be evaluated by f_ψ(ϵ_t, s_t), as discussed before. It should be mentioned that the hyperparameter α can be automatically adjusted using the method proposed in [12]. Then we can apply the stochastic gradient method to optimize the parameters as follows:

ψ = ψ − η_ψ ( ∇_ψ α log κ_ψ(p_t|s_t) + ( ∇_{p_t} α log κ_ψ(p_t|s_t) − ∇_{p_t} Q_θ(s_t, p_t) ) ∇_ψ f_ψ(ϵ_t, s_t) ).   (3)

The derivation for the case of the critic evaluating the actor can be found in Appendix A. Besides, our experimental results (reported in Table 10) also show that having the critic evaluate the actor's action results in performance inferior to having the critic evaluate the planner.
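The following sketch shows how Eqs. (2)-(3) and the target-network update translate into code. The module and optimizer names are hypothetical, and a single critic is used for brevity (SAC-style implementations often use two).

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def soft_target(reward, next_state, planner, critic_target, alpha, gamma=0.99):
    next_plan, next_log_prob = planner(next_state)
    next_q = critic_target(next_state, next_plan)
    return reward + gamma * (next_q - alpha * next_log_prob)  # soft Bellman target

def update(batch, planner, critic, critic_target, q_opt, p_opt, alpha, tau=0.005):
    state, plan, reward, next_state = batch
    # Critic step: minimize the soft Bellman residual J_Q(theta), Eq. (2)
    q_loss = 0.5 * F.mse_loss(critic(state, plan),
                              soft_target(reward, next_state, planner,
                                          critic_target, alpha))
    q_opt.zero_grad(); q_loss.backward(); q_opt.step()
    # Planner step: minimize J_kappa(psi) through the reparameterized plan, Eq. (3)
    new_plan, log_prob = planner(state)
    p_loss = (alpha * log_prob - critic(state, new_plan)).mean()
    p_opt.zero_grad(); p_loss.backward(); p_opt.step()
    # Target critic: theta_bar <- tau * theta + (1 - tau) * theta_bar
    for p, p_bar in zip(critic.parameters(), critic_target.parameters()):
        p_bar.data.mul_(1 - tau).add_(tau * p.data)
```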

3.4 Task Specific Auxiliary Learning

Following our meta policy (κ_ψ, π_ϕ), the framework derives the executable action a_t. To enhance convergence and performance, we adopt auxiliary learning for the planner and actor, tailored to specific tasks. This approach is highly adaptable and capable of integrating various advanced losses and techniques.

For instance, in face inpainting tasks, we focus on reconstructing the predicted faces to match the original ones while also synthesizing more realistic images. This is achieved by employing a discriminator on predicted images with an adversarial loss. The nature of a_t is also task-dependent: for tasks like deformable image registration, a_t may be a deformation field applied to the current state. In contrast, for tasks aiming at realistic image generation, such as face inpainting or neural style transfer, a_t could directly be the target image y. Detailed explanations and examples of these applications are provided in the experimental sections. We elaborate on the auxiliary learning process using face inpainting tasks as an example. Concretely, the empirical objective of the reconstruction part in our framework is:

L_rec = E_{s_t, y ∼ D} [ ‖T(s_t) − y‖_d ],   (4)

where D is a replay pool and ‖·‖_d denotes some distance measure, such as L1 or L2. By adding a discriminator D to the predicted images, the adversarial loss is defined as:

L_adv = E_{s_t, y ∼ D} [ log(D(y)) + log(1 − D(T(s_t))) ].   (5)

In this example, the final auxiliary learning objective can be expressed as

J_Aux = λ_rec L_rec + λ_adv L_adv,   (6)

where λ_rec and λ_adv are the weight terms for reconstruction and adversarial learning. Finally, we can update ψ and ϕ of the planner and the actor by performing the following steps:

ψ = ψ − η ∇_ψ J_Aux(ψ, ϕ),  ϕ = ϕ − η ∇_ϕ J_Aux(ψ, ϕ).   (7)

Note that the additional auxiliary learning may introduce new parameters to learn, such as the discriminator D in the above example. Since our goal is to learn the planner and the actor, which are the only components used in testing, we omit those additional notations for simplicity. More concrete examples of auxiliary learning are introduced in the experiment sections for different I2IT applications. The pseudo-code for optimizing RL-I2IT is described in Algorithm 1. All parameters of RL-I2IT are optimized based on samples from the replay pool D.

Algorithm 1: Learning Planner-Actor-Critic
  Input: I_F, I_M, U_F, U_M, replay pool D
  Init: ψ, ϕ, θ, θ̄, D and environment E
  for each iteration do
      for each environment step do
          p_t ∼ κ_ψ(p_t|s_t), a_t ∼ π_ϕ(a_t|p_t)
          s_{t+1}, r_t ∼ U(s_{t+1}|s_t, p_t, a_t)
          D = D ∪ {(s_t, p_t, a_t, r_t, s_{t+1})}
      end
      for each gradient step do
          Sample from D
          Update θ, ψ, ϕ with Eq. (2), Eq. (3), Eq. (7)
      end
  end
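As a concrete example, the face-inpainting auxiliary objective of Eqs. (4)-(6) can be sketched as below, with a hypothetical discriminator `disc` and the λ weights from Section 4.2.1; the gradient of this loss flows back into the planner and actor as in Eq. (7).

```python
import torch
import torch.nn.functional as F

def auxiliary_loss(pred, target, disc, lam_rec=1.0, lam_adv=0.02):
    """J_Aux = lam_rec * L_rec + lam_adv * L_adv for face inpainting."""
    rec = F.l1_loss(pred, target)              # L_rec with the L1 distance, Eq. (4)
    logits = disc(pred)                        # D(T(s_t))
    adv = F.binary_cross_entropy_with_logits(  # generator side of Eq. (5)
        logits, torch.ones_like(logits))
    return lam_rec * rec + lam_adv * adv       # Eq. (6)
```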
3.5 Environment Settings in Practice

In the RL-I2IT framework, environment designs are tailored to the various applications, with detailed guidance in each corresponding section. This section outlines general principles for selecting rewards, focusing on the critic's role in evaluating plans rather than actions. The plans, being a subspace of potential actions, serve as high-level instructions for the actor to create specific actions. Evaluation measures of structural or global image information, such as SSIM [29] or the Dice score [30], are proposed as rewards for assessing these plans. However, the choice of reward remains flexible and should be empirically tested in the context of individual applications.

4 APPLICATIONS ON COMPUTER VISION

4.1 Digits Transform

We first evaluate our RL-I2IT framework on MNIST [31], which is a dataset of digits and is regarded as a standard sanity check for a proposed method. The goal is to transform between two different images of handwritten digits of 28 × 28 pixels.

Fig. 3: The network architecture of RL-I2IT for the MNIST dataset. Each rectangle represents a 2D image (or feature map); the number of channels is shown inside the rectangle, and the corresponding resolution is printed underneath. 'STN' stands for spatial transformer network [34], which is used to transform the source image with the predicted displacement field.

4.1.1 RL-I2IT Setting

For digits transform, the state is a concatenation of the predicted image and the target image. The Dice score is used as the reward, and the NCC loss is leveraged for auxiliary learning [32], [33]. A very simple network structure is used to construct the planner, the actor, and the critic, as shown in Fig. 3. More specifically, the plan is a one-channel 7 × 7 feature map (a 49-dimensional plan), and the actor outputs a deformation field, which is used to transform the source image with a spatial transformer network (STN) [34]. All convolution operations use a 3 × 3 kernel with the LeakyReLU activation function. Downsampling is performed by max-pooling, and all upsampling operations are performed with nearest-neighbor interpolation.

4.1.2 Experiment

The following four types of spatial transforms are used in this experiment: (1) Inner-class transform, which transforms digits within the same class; (2) Cross-class transform, which transforms digits across different classes; (3) Random transform, which transforms digits across different classes that are randomly scaled from 0.3 to 1.7 and rotated between 0 and 360 degrees; (4) Continuous and random transform, which randomly selects a set of digits that have been scaled and rotated and then transforms the first digit to the last one in order. In the testing phase, ten digits from 0 to 9 are used as the atlases, and 1000 randomly scaled and rotated digits are used as the moving images, which need to be aligned with the atlas. We use the Dice score as the quantitative measure (the higher, the better).

The left panel of Fig. 4 shows the process of transforming digits using the RL-I2IT framework. The result shows that our method can transform digits step-wise and capture the style and shape accurately. The experiment on random and continuous transform (the bottom left panel of Fig. 4) further shows that our RL-I2IT method is robust to complex transformations.
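The Dice reward used above can be computed as in the following sketch; the binarization threshold is an assumption for illustration.

```python
import numpy as np

def dice_reward(pred, target, thresh=0.5):
    """Dice overlap in [0, 1] between binarized predicted and target digits."""
    p = (pred > thresh).astype(np.float32)
    t = (target > thresh).astype(np.float32)
    inter = (p * t).sum()
    return 2.0 * inter / (p.sum() + t.sum() + 1e-8)
```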

Fig. 4: Top Left: examples of using RL-I2IT to gradually transform the source digits (leftmost) to the target digits (rightmost). Bottom Left: given a random set of digits, the leftmost digit is continuously transformed into the rightmost digit sequentially. Right: comparison between our RL-I2IT and VoxelMorph (VM) [33].

The right panel of Fig. 4 compares our method with VoxelMorph (VM) [33], which is a state-of-the-art DL-based method for transformation tasks. It can be seen clearly that our method is better at recovering the details and the shapes of the target digit than VM.

Fig. 5 shows the average Dice scores on the random transform and some transformed results. In the random transform experiment, the moving images are randomly scaled and rotated, which results in a much larger and more complex deformation field. Our method significantly outperforms VM over all digits, both quantitatively and qualitatively, which indicates that the proposed method has better generalizability and can work well on images with large deformations.

Fig. 5: Left: the box plot of Dice scores on 10 digits. Center: step-wise comparison between our RL-I2IT and VoxelMorph (VM) [33]. Right: visual comparison of our method with VM (columns: Target, Source, VM, Ours, DF). The scaled and rotated digits are transformed to the fixed (target) digits. The Deformation Field (DF) column visualizes the deformable fields estimated by RL-I2IT.

4.2 Face Inpainting

In this section, we apply our RL-I2IT framework to the face inpainting task, which aims to fill a cropped region in the central area of a face with synthesized contents that are both semantically consistent with the original face and visually realistic.

4.2.1 RL-I2IT Setting

For the state, we use the original image with a missing region (center cropped) as the initial state, and the next state is obtained by adding the newly predicted image to the missing region. We use the peak signal-to-noise ratio (PSNR) as the reward. We apply the L1 loss with an adversarial loss for the auxiliary learning, which tries to make the predicted image more realistic and closer to the ground-truth image. The λ_rec and λ_adv in Eq. (6) are set to 1.0 and 0.02, respectively.

The network architecture for face inpainting is shown in Fig. 6. For the planner-actor, we use an architecture similar to the context-encoder [21], except for the skip connections and the stochastic sampling operation in the planner. We use the same network structure for all types of discriminators, with minor changes for different GANs. Specifically, for WGAN-GP, the sigmoid function is removed from the final output layer, and a spectral normalization is added to each layer of the discriminator of SNGAN [35]. Moreover, the convolution layers of the planner, critic, and discriminator use 4 × 4 kernels, and downsampling is performed by convolution with a stride of 2. In this application, the latent action dimension is set to 256.
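A PSNR reward for this setting can be sketched as follows, assuming images scaled to [0, 1]; any per-step shaping (e.g., rewarding the PSNR gain between consecutive steps) would be a design choice not specified here.

```python
import numpy as np

def psnr_reward(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio between the inpainted face and ground truth."""
    mse = np.mean((pred - target) ** 2)
    return 20.0 * np.log10(max_val) - 10.0 * np.log10(mse + 1e-12)
```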

Fig. 6: The network architecture of RL-I2IT for face inpainting. Each rectangle represents a 2D image (or feature map); the number of channels is shown inside the rectangle, and the corresponding resolution is printed underneath (or on the left for the discriminator).

4.2.2 Experiment

We use the Celeba-HQ dataset in this task, which includes 28,000 images for training and 2,000 images for testing. All images have a cropped region of 64 × 64 pixels in the center. We compare our method with several recent face inpainting methods, including CE [21], CA [24], PEN [36], PIC [37], RN [38], Shift-Net [39], ILO [40], and Palette [5]. Following previous work [21], [24], [37], we use PSNR and SSIM as the evaluation metrics.

Results and Analysis. The qualitative results produced by our framework and existing state-of-the-art methods are shown in Fig. 7. We can easily see that RL-I2IT gives an obvious visual improvement for synthesizing realistic faces. The RL-I2IT results are very reasonable, and the generated faces are sharper and more natural. This may be attributed to the high-level latent plan p_t, which focuses on learning the global semantic structure and then directs the actor with auxiliary learning to further improve the local details of a generated image. We can also see that the synthesized images of RL-I2IT can have very different appearances from the ground truth, which indicates that, although our training is based on paired images, RL-I2IT can successfully explore and exploit data to produce diverse results.

Method          PSNR ↑   SSIM ↑   LPIPS ↓   FID ↓
CE [21]         25.764   0.850    0.0955    14.454
CA [24]         24.556   0.840    0.0715    9.950
PIC [37]        26.703   0.870    0.0844    12.470
PEN [36]        23.196   0.634    0.1342    35.422
RN [38]         25.123   0.835    0.0698    7.388
Shift-Net [39]  26.476   0.851    0.0703    7.597
ILO [40]        22.709   0.783    0.0958    13.122
Palette [5]     24.926   0.850    0.0567    4.909
Ours (PSNR)     27.351   0.897    0.0439    4.697
Ours (SSIM)     27.598   0.899    0.0433    4.917

TABLE 1: Quantitative results of all methods on Celeba-HQ. We use SNGAN + PSNR and SNGAN + SSIM as rewards, respectively.

The quantitative comparison is shown in Table 1. We can see that our method achieves the best PSNR and SSIM scores when compared with the existing state-of-the-art methods. As mentioned before, the reward function in our RL framework is very flexible: both PSNR- and SSIM-based rewards are suitable for face inpainting with the RL-I2IT framework.

Ablation Study. To illustrate the stability of training GANs in our framework, we jointly use L1 and several advanced GAN losses, i.e., WGAN-GP [41], RaGAN [42], and SNGAN [35], for auxiliary learning. We also separately train a planner-actor (PA) model by jointly optimizing the L1 and SNGAN losses. The results are shown in Table 2, which indicates that the RL-I2IT framework is stable with different GANs and significantly improves on the performance of training the planner-actor with SNGAN alone, further demonstrating the power of the RL-I2IT framework.

Method             PSNR ↑   SSIM ↑
PA + SNGAN         26.884   0.871
Ours (+ WGAN-GP)   27.091   0.875
Ours (+ RaGAN)     27.080   0.873
Ours (+ SNGAN)     27.176   0.882

TABLE 2: Ablation study of our RL-I2IT framework on the Celeba-HQ testing dataset (all trained with the PSNR reward).

4.3 Realistic Photo Translation

In this section, we evaluate our RL-I2IT framework on the general realistic photo translation task.

4.3.1 RL-I2IT Setting

For realistic photo translation, we directly use the source image as the initial state. The next state is obtained by warping the generated image to the source image. We also let the action be the predicted image directly and use the same auxiliary learning settings and network structure as in the face inpainting experiment, with the PSNR reward and the SNGAN loss (see Section 4.2 for more details).

4.3.2 Experiment

We use three realistic photo translation tasks to evaluate our framework: (1) segmentation labels→images with the CMP Facades dataset [43]; (2) segmentation labels→images and images→labels with the Cityscapes dataset [44]; (3) edges→shoes with the Edges and Shoes dataset [45].

We compare our framework with existing methods, pix2pix [4] and PAN [46], and with methods designed for the high-quality I2IT task, pix2pixHD [17], DRPAN [47], and CHAN [48]. Moreover, we replace MERL with PPO [49], denoted as Ours-PPO. We use PSNR, SSIM, and LPIPS [50] as the evaluation metrics.
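A sketch of this evaluation protocol is shown below, assuming uint8 RGB numpy images; it uses recent scikit-image for PSNR/SSIM and the lpips package for LPIPS, and is not the authors' exact evaluation script.

```python
import lpips                                    # pip install lpips
import torch
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

lpips_fn = lpips.LPIPS(net='alex')              # lower LPIPS means more similar

def evaluate(pred, target):
    """Return (PSNR, SSIM, LPIPS) for a predicted/target image pair."""
    psnr = peak_signal_noise_ratio(target, pred, data_range=255)
    ssim = structural_similarity(target, pred, channel_axis=-1, data_range=255)
    def to_tensor(x):                           # HWC uint8 -> 1x3xHxW in [-1, 1]
        return torch.from_numpy(x).permute(2, 0, 1)[None].float() / 127.5 - 1.0
    lp = lpips_fn(to_tensor(pred), to_tensor(target)).item()
    return psnr, ssim, lp
```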

Fig. 7: Visual comparison of different face inpainting methods. GT means ground truth. Our RL-I2IT uses SNGAN for auxiliary learning. # indicates which reward is used for RL training. Our results have good visual quality even for a large-pose face.

Fig. 8: Visual comparison of our RL-I2IT with pix2pix, PAN, pix2pixHD, and DRPAN over photo translation tasks.

             Facades label→image      Cityscapes image→label   Cityscapes label→image   Edges→shoes
Method       PSNR    SSIM   LPIPS ↓   PSNR    SSIM   LPIPS ↓   PSNR    SSIM   LPIPS ↓   PSNR    SSIM   LPIPS ↓
pix2pix      12.290  0.225  0.438     15.891  0.457  0.287     15.193  0.279  0.379     15.812  0.625  0.279
PAN          12.779  0.249  0.387     16.317  0.566  0.228     16.408  0.391  0.346     16.097  0.658  0.228
pix2pixHD    12.357  0.162  0.336     17.606  0.581  0.204     15.619  0.361  0.319     17.110  0.686  0.220
DRPAN        13.101  0.276  0.354     17.724  0.633  0.214     16.673  0.403  0.343     17.524  0.713  0.221
CHAN         13.137  0.231  0.402     17.459  0.641  0.222     16.739  0.401  0.373     18.065  0.692  0.236
Ours-PPO     13.163  0.308  0.366     17.168  0.616  0.221     16.685  0.410  0.362     16.914  0.695  0.225
Ours         13.178  0.296  0.324     17.969  0.659  0.203     16.848  0.412  0.337     18.178  0.698  0.215

TABLE 3: Quantitative results of our RL-I2IT and other methods over all datasets. ↓ means lower is better; Ours-PPO means our RL-I2IT using PPO.

Method    pix2pixHD   DRPAN     CHAN      Ours
#Params   45.874M     11.378M   59.971M   9.730M
#FLOPs    10.340G     14.208G   19.743G   3.519G

TABLE 4: Comparison of the number of parameters and FLOPs (floating-point operations, which represent the computational complexity of the model).

Results and Analysis. The quantitative results are shown in Table 3. With a similar network structure, the proposed method significantly outperforms the pix2pix and PAN models on PSNR, SSIM, and LPIPS over all the datasets and tasks. Our method even achieves comparable or better performance than the high-quality pix2pixHD and DRPAN models, which have much more complex architectures and training strategies. Moreover, using MERL instead of PPO clearly improves performance on most tasks. These experiments illustrate that the proposed RL-I2IT framework is a robust and effective solution for I2IT.

More importantly, our model is much simpler, with the same architecture as pix2pix. The number of parameters and the computational complexity are shown in Table 4. We can see that RL-I2IT has much fewer parameters and lower computational complexity. We conclude that our model is lightweight, efficient, and effective.

The qualitative results of our RL-I2IT and other I2IT methods on different tasks are shown in Fig. 8. We can observe that pix2pix and PAN sometimes suffer from mode collapse and yield blurry outputs. pix2pixHD is unstable on different datasets, especially on Facades and Cityscapes. DRPAN is more likely to produce blurred artifacts in several parts of the predicted image on Cityscapes. In contrast, RL-I2IT produces more stable and realistic results. Using the stochastic meta-policy and MERL helps explore more possible solutions so as to seek out the best generation strategy by trial-and-error during training, leading to a more robust agent for different datasets and tasks.

Evaluation of RL Algorithms. To demonstrate the effectiveness of the stochastic meta policy and MERL, we substitute the key components of RL-I2IT with other structures or other state-of-the-art RL algorithms to test their importance, using DDPG and PPO, respectively. The learning curves of the different variants on the four tasks are shown in Fig. 9, which indicates that, by using the stochastic meta policy and the maximum entropy framework, the training process is significantly improved.

4.4 Image Style Transfer

Neural Style Transfer (NST) refers to the generation of a pastiche image combining the semantic content of one image (the content image) and the visual style of another (the style image) using a deep neural network. NST can be used to create stylized, non-photorealistic renderings of digital images with enriched expressiveness and artistic flavor.

Fig. 9: Learning curves (reward (PSNR) vs. episode, 0-10000) of Ours, DDPG, and PPO on different I2IT tasks, e.g., facade label→image. RL-I2IT performs consistently better than the other modified RL algorithms.

Fig. 10: Illustration of our step-wise style transfer process using the RL-I2IT framework. The content images are stylized more strongly as the prediction steps proceed. The model tends to preserve more details and structures of the content in the early steps and synthesize more style patterns in the later steps. Our step-wise framework allows a user to control the stylization degree easily.

Fig. 11: Details of the RL-I2IT framework for NST. The state is initialized with the content image. After the first iteration, we use only the moving image as the state. The plan is sampled from a 2D Gaussian distribution and is concatenated with the critic. The predicted moving image is generated by the actor. Note that the VGG networks are pre-trained and fixed for feature extraction during the training process.

The one-step DL approach has an apparent limitation: it is hard to determine a proper level of style for different users, since the ultimate metric of style transfer is too subjective. It has been observed that stylized images generated by current NST methods tend to be under- or over-stylized [51]. A remedy to under-stylization is to apply the DL model multiple times, taking the output of the previous round as the input of the current round. However, this may incur a high computation cost due to the intrinsic complexity of one-step DL models. Other existing methods, like [23] and [52], trade off between content and style by adjusting hyper-parameters, but this approach is inefficient and hard to control.

Our RL-I2IT framework provides a good solution for NST. It can be used to learn a lightweight NST model that is applied iteratively. To preserve the spatial structure of images, the latent plans in our model are sampled from a 2D Gaussian distribution that is estimated by the planner and forwarded to the actor to generate intermediate images. In addition, we develop a Fully Convolutional Network (FCN) based planner-actor structure so that the model can process input images of any size. Fig. 10 shows some examples of our step-wise NST. We can see that our RL-based step-wise method tends to preserve more details and structures of the content image in early steps and synthesize more style patterns in later steps, resulting in more flexible control of the stylization degree. Furthermore, our model is a lightweight and flexible NST model compared to existing methods, making it more computationally efficient. To the best of our knowledge, this is the first work that successfully leverages RL for the NST scenario.

Fig. 12: Qualitative comparison. The first two columns show the content and style images, respectively. The rest of the columns show the stylization results generated with different style transfer methods, including our step-wise results in the rightmost two columns.

Method           Content loss   Style loss   Time (s)          #Params (M)
Johnson et al.   1.597          1.985e-05    0.014 (3.5×)      1.68 (9.33×)
AdaIN            2.222          1.269e-05    0.140 (35×)       7.01 (38.94×)
WCT              2.322          1.626e-05    0.690 (172.5×)    34.24 (190.22×)
SANet            1.941          7.062e-06    0.010 (2.5×)      20.91 (116.17×)
LapStyle         2.292          2.117e-05    0.047 (11.75×)    7.79 (43.28×)
ArtFlow          1.538          1.486e-05    0.127 (31.75×)    6.46 (35.89×)
IEContraAST      1.668          8.863e-06    0.019 (4.75×)     21.12 (117.33×)
AdaAttN          1.447          1.033e-05    0.025 (6.25×)     13.63 (75.72×)
StyTR2           1.510          9.178e-06    0.058 (14.5×)     35.39 (196.61×)
Ours (step=1)    0.868          3.353e-06    0.004             0.18
Ours (step=10)   1.387          1.594e-06    0.089             0.18

TABLE 5: Quantitative comparison of our RL-NST with the baseline methods on the MS-COCO dataset. The speed is obtained with a Pascal Tesla P100 GPU. (·×) represents the ratio between the corresponding baseline and our method (step=1) under the same metric.

Fig. 13: Comparison of the content loss vs. the style loss of different methods on the test dataset (Maxresdefault style), where our loss over 10 steps is plotted as a curve. Closer to (0,0) is better.

4.4.1 RL-I2IT Setting

We set the moving image as the state s_t, which is initialized by the content image. The moving image at time t, i.e., the state image s_{t+1}, is created by the actor from the current state image s_t and plan p_t. The reward is obtained by measuring the difference between the current state s_t and the style image: the higher the difference, the smaller the reward. We use the negative style loss as the reward; the style loss is defined later in this subsection.

The details of the network architecture are shown in Fig. 11. The planner is a neural network consisting of three convolutional layers and a residual layer. After each convolutional layer, there is an instance-norm layer and a ReLU layer. In the residual layer, we use the residual block designed by He et al. [53]. The planner estimates a 2D Gaussian distribution for sampling our latent plan of size 64 × 64, which is forwarded to the actor to generate the moving image. The actor has three up-sampling layers, and we use three skip connections between the planner and the actor. Our planner-actor is an FCN, which can process input images of any size. The critic consists of seven convolutional layers and one fully-connected layer at the end. Since Johnson et al. [54] conclude that using standard zero-padded convolutions in style transfer leads to serious artifacts on the boundary of the generated image, we use reflection padding instead of zero padding for all the networks.

Style Learning. To keep the moving image from deviating from the content image, the model trains the planner and the actor based on training data collected from the agent-environment interaction, which changes dynamically in experience replay. More specifically, the planner and actor form a conditional generative process that translates state s_t into the output moving image m_t at time t. Note that s_t is initialized to the content image c, and s_{t+1} is equivalent to m_t. Inspired by [54], we apply the content loss L_CO, style loss L_ST, and total variation regularization L_TV to optimize the model parameters of the planner and actor. These losses can better measure perceptual and semantic differences between the moving image and the content image.
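The three losses rely on a fixed, pre-trained feature extractor F (VGG in Fig. 11). A sketch is given below; the specific VGG variant and layer indices are assumptions for illustration.

```python
import torch
import torchvision.models as models

class VGGFeatures(torch.nn.Module):
    """Frozen feature extractor F; returns activations F^j for a set of layers J."""
    def __init__(self, layer_ids=(3, 8, 15, 22)):       # |J| = 4 layers
        super().__init__()
        vgg = models.vgg16(weights=models.VGG16_Weights.DEFAULT).features.eval()
        for p in vgg.parameters():
            p.requires_grad_(False)                      # F is pre-trained and fixed
        self.vgg, self.layer_ids = vgg, set(layer_ids)

    def forward(self, x):
        feats = []
        for i, layer in enumerate(self.vgg):
            x = layer(x)
            if i in self.layer_ids:
                feats.append(x)                          # F^j(x), j in J
        return feats
```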

Fig. 14: Comparison of our RL-I2IT framework with the Planner-Actor (PA) model at step 1, and with ours without the RL model run for 10 steps.

Fig. 15: Comparison of Ours with AdaIN and StyTR2 in various hyperparameter settings.

Style           Step   Method               Content Loss   Style Loss
Maxresdefault   1      Planner-Actor (PA)   0.787          5.686e-06
                       Ours                 0.557          1.883e-06
                10     Ours w/o RL          2.093          3.433e-05
                       Ours                 0.945          1.093e-06
Blue Swirls     1      Planner-Actor (PA)   2.265          3.275e-05
                       Ours                 1.016          5.280e-06
                10     Ours w/o RL          3.374          7.747e-05
                       Ours                 1.654          2.178e-06

TABLE 6: Content loss and style loss of several variants of our proposed method.

Content Loss. Following [54], we use a pre-trained neural network F to extract high-level feature representations of m_t and c. The reason for using F is to encourage the moving image m_t to be perceptually similar to the content image c without forcing them to match exactly. Denote F^j(·) as the activations of the j-th layer of F. If the j-th layer is a convolutional layer, the output of F^j(·) is a feature map of size C^j × H^j × W^j, where C^j, H^j, and W^j represent the number of channels, height, and width of the feature map of layer j, respectively. We apply the squared and normalized Euclidean distance to define the content loss as follows:

L_CO(m_t, c) = (1 / (C^j H^j W^j)) ‖F^j(m_t) − F^j(c)‖₂².

Style Loss. To impose penalties on m_t when it deviates in content from c and in style from e, we define the style loss following [55] with a Gram matrix

G^j(x) = F̃^j(x) (F̃^j(x))^⊤ / (C^j H^j W^j) ∈ R^{C^j × C^j},

where F̃^j(·) is obtained by reshaping F^j(·) into the shape C^j × H^j W^j. The style loss can be defined as the squared Frobenius norm of the difference between the Gram matrices of m_t and e. To preserve the spatial structure of images, we use a set of layers J instead of a single layer j. Thus, we define the style loss as the sum of losses over layers j ∈ J (|J| = 4 in our experiments):

L_ST(m_t, e) = Σ_{j=1}^{J} ‖G^j(m_t) − G^j(e)‖_F².

Total Variation Regularization. To ensure spatial smoothness in the moving image m_t, we use a total variation regularizer L_TV. Putting all components together, the final style learning loss is

L = L_CO + λ L_ST + β L_TV,   (8)

where λ and β are hyperparameters that control the sensitivity of each term.
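Given features from the frozen extractor sketched above, the terms of Eq. (8) can be written as follows; this is a minimal sketch that follows the definitions in the text rather than the authors' exact implementation.

```python
import torch

def gram(f):                                    # f: B x C x H x W -> B x C x C
    b, c, h, w = f.shape
    f = f.reshape(b, c, h * w)
    return f @ f.transpose(1, 2) / (c * h * w)  # G^j(x)

def content_loss(f_m, f_c):                     # single layer j
    return ((f_m - f_c) ** 2).mean()            # squared, normalized L2

def style_loss(feats_m, feats_e):               # sum over layers j in J
    return sum(((gram(fm) - gram(fe)) ** 2).sum()
               for fm, fe in zip(feats_m, feats_e))

def tv_loss(img):                               # total variation regularizer
    return ((img[..., 1:, :] - img[..., :-1, :]).abs().mean()
            + (img[..., :, 1:] - img[..., :, :-1]).abs().mean())

def style_learning_loss(m, feats_m, feats_c, feats_e, lam=1e5, beta=1e-7):
    """L = L_CO + lambda * L_ST + beta * L_TV, Eq. (8)."""
    return (content_loss(feats_m[-1], feats_c[-1])
            + lam * style_loss(feats_m, feats_e) + beta * tv_loss(m))
```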
shown in Fig. 12, we compare our method with all baselines
4.4.2 Experiment without caring which type (single or multiple styles) they are. The
Dataset. We select style images from WikiArt [56] and use MS- quantitative results are shown in Table 5. Our RL-NST (step=1)
COCO [57] as content images. For the latter, the training set achieves better performance than the baseline methods in all
includes 80K images, and the test set includes 40K images. All the evaluation metrics. Our method still has low content and style
training images are resized to 256 × 256 pixels. We note that our losses even if the step is equivalent to 10, which means our method
method at the inference stage is applicable for content images and is friendly to the user for choosing the results from specific steps
style images of any size. Following StyTR2 [58], we use content accordingly. To better visualize step-wise results, we also compare
loss, style loss and computing time as the evaluation metrics. the two losses of our model in steps 1-10 with the baseline
Baselines. We choose several state-of-the-art style transfer meth- methods in Fig. 13. It is clear that our model can provide lower
ods as our baselines, including Johnson et al. [54], WCT [59], style and content losses. In addition, our model boasts significantly
AdaIN [52], SANet [60], LapStyle [61], ArtFlow [62], IECon- fewer parameters and operates at a faster speed. For example, the
traAST [63], AdaAttN [64], and StyTR2 [58]. All these methods time cost and the parameter size of our method are 2/7 and 1/9
are performed with their public codes with default settings. of Johnson et al., and 4/47 and 1/43 of LapStyle, respectively.
Implementation Details. In the experiment, we set λ = 1e5, β = Ablation Study. (1) We study the effect of the RL model in
1e − 7 in Eq (8). These settings yield nearly the best performance our framework. As shown in Fig. 14, compared with the method
in our experiments. that only uses Planner-Actor (PA), our method can generate more
Qualitative Comparison. Fig. 12 shows some stylized results of stable and clear stylized images at step 1. At step 10, PA loses the
our model and the baseline methods. For content images with fine content information completely without the help of RL (ours w/o
structures, such as the forest image (Undie style), all the baseline RL), while our method can still produce amazing results. We also

Fig. 17. In general, the results of our algorithm are favored by the
most subjects.

4.5 Video Style Transfer


Similar to image style transfer, video style transfer focuses on
converting the visual style within a video sequence, allowing it
to exhibit different artistic styles, colors, and appearances while
maintaining the fundamental content and structure of objects and
scenes in the video. Video style transfer is a complex task as it
necessitates considering the temporal continuity and stability of
the video, ensuring a smooth transition between frames during the
style conversion process to avoid flickering or disjointed effects.
show the corresponding quantitative comparison in Table 6. We can easily see that at both steps 1 and 10, our model consistently achieves the best performance. This indicates that RL can improve the performance of DL-based NST models.

(2) Since AdaIN and StyTR2 can control the degree of stylization in the final results by adjusting the hyper-parameter α ∈ [0, 1] and the number of repeated stylization rounds, respectively, we compare our method with them accordingly in Fig. 15. From the visualization results, we can see that the results of AdaIN remain under-stylized even when the style control hyper-parameter is changed. Moreover, StyTR2 produces results with only small style changes and low quality after multiple rounds. In contrast, our method not only ensures a gradual change in style but also produces very smooth results.

(3) We evaluate the necessity of a high-dimensional latent space for style transfer in Fig. 16. The results show that a 1D latent vector can only generate a single style pattern and cannot retain the semantic information of the content. In contrast, a high-dimensional latent tensor preserves the structure and semantic information of the content, allowing for simultaneous stylization and content reconstruction.

Fig. 16: The necessity of a high-dimensional latent space.

User Study. We conduct a user survey to collect users' preferences among the results of our method and nine competing methods. Specifically, we use 5 style images and 10 content images in this study and randomly select 5 combinations of content and style. For each combination, we display the stylized images side by side and ask participants to choose their favorite. To reduce the burden on the subjects, our method only shows the results of step=1 and step=10. Finally, we collect 270 votes from 54 users and show the percentage of votes for each method in Fig. 17.

Fig. 17: User preference results of nine competitive methods.

4.5 Video Style Transfer

Recent works on video style transfer [64], [65], [66], [67] have been successful. However, these methods suffer from limited diversity: they can only generate results with a single degree of stylization, without considering the creation of stylized videos tailored to different audiences with varying degrees of stylization.

To achieve a diverse range of stylization levels in video style transfer, we employ the RL-I2IT framework. Specifically, we adjust the neural network for video style transfer: by using a CNN+RNN architecture [68] for the planner and actor, with both frame-wise and step-wise smoothing, our model can perform video NST tasks. Fig. 18 illustrates some examples of our step-wise video NST. We observe that our RL-I2IT framework not only retains the advantages seen in image style transfer, namely the tendency to preserve more details and structure of the frame in the early steps while synthesizing more style patterns in the later steps, but also generates stable outputs across different degrees of stylization.

4.5.1 RL-I2IT Setting

In video style transfer, we maintain most of the settings used in image style transfer. The difference lies in initializing the moving images with video frames. In addition, building upon the image style transfer network, we make slight modifications for video style transfer. Two GRUs are introduced: a step-wise GRU and a frame-wise GRU. The step-wise GRU retains information between steps, ensuring smoother step-wise stylization, while the frame-wise GRU preserves information between frames, enforcing consistency of style patterns across adjacent frames. The network's details are illustrated in Fig. 19.

Compound Temporal Regularization. Inspired by [67], we add a compound temporal regularization for video style transfer. Specifically, we first generate motions M(·) and then synthesize adjacent frames. This approach eliminates the need to estimate optical flow during training, and the synthetic optical flows are guaranteed to be exact. We also include random noise △ to maintain temporal consistency. The compound temporal regularization is defined as

$$\mathcal{L}_{CT} = \left\| \eta_\psi\big(\pi_\phi(M(s_t) + \triangle)\big) - M(m_t) \right\|_1 ,$$

where s_t is the current state (frame) and m_t is the corresponding stylized (moving) image.

The remaining loss functions are consistent with those used in image style transfer. Summing up all the components, the final style learning loss is

$$\mathcal{L} = \mathcal{L}_{CO} + \lambda \mathcal{L}_{ST} + \beta \mathcal{L}_{TV} + \zeta \mathcal{L}_{CT} , \qquad (9)$$

where λ, β, and ζ are hyper-parameters that control the sensitivity of each term.
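To make the compound temporal regularization concrete, here is a minimal PyTorch-style sketch of one evaluation of L_CT. The `stylize` callable stands in for the pipeline η_ψ(π_ϕ(·)) in the equation above, and `warp` applies one pre-sampled random motion M(·); these names and the default noise scale are illustrative assumptions rather than the released implementation.

```python
import torch

def compound_temporal_loss(stylize, warp, frame, stylized_frame, sigma=0.0015):
    """One evaluation of L_CT from Eq. (9).

    stylize:        callable standing in for eta_psi(pi_phi(.)) (assumed name)
    warp:           callable applying one pre-sampled random motion M(.); the
                    SAME motion must be used for both terms below
    frame:          current state s_t, shape (N, 3, H, W)
    stylized_frame: current moving image m_t produced from s_t
    sigma:          noise scale; the paper draws it from U(0.001, 0.002)
    """
    noise = sigma * torch.randn_like(frame)        # random perturbation (the triangle term)
    lhs = stylize(warp(frame) + noise)             # stylize the synthetically moved frame
    rhs = warp(stylized_frame)                     # move the already-stylized frame the same way
    return (lhs - rhs).abs().mean()                # L1 distance
```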
Fig. 18: Illustration of our step-wise video style transfer process using the RL-I2IT framework. The first row and the second row show two different frames from the same video sequence. The frames are stylized more strongly, and smoothly, as the prediction steps increase. The model tends to preserve more details and structures of the content in the early steps and synthesize more style patterns in the later steps. Our model is capable of generating stable stylized results across different degrees of stylization. Meanwhile, at the same level of stylization, the same object exhibits similar stylized characteristics in the two frames.

Fig. 19: Details of the RL-I2IT framework for the video NST. The state is initialized with the frame. After the first iteration, we use
only the moving image as the state. Note that the VGG networks are pre-trained and fixed for feature extraction during the training
process.

| Methods | La muse | Sketch | En campo gris | Brushstrokes | Picasso | Trial | Asheville | Contrast | Average |
|---|---|---|---|---|---|---|---|---|---|
| LinearStyleTransfer | 2.602 | 1.792 | 1.795 | 2.321 | 2.947 | 1.451 | 5.043 | 4.524 | 2.809 |
| ReReVST | 1.450 | 8.155 | 7.050 | 7.026 | 10.772 | 7.888 | 19.493 | 12.886 | 9.340 |
| MCCNet | 4.493 | 2.050 | 2.759 | 2.591 | 2.854 | 2.486 | 6.750 | 4.820 | 3.600 |
| AdaAttN | 3.442 | 1.976 | 2.660 | 2.561 | 2.941 | 1.698 | 5.775 | 3.587 | 3.080 |
| Ours (Step=1) | 0.885 | 1.196 | 0.453 | 0.883 | 1.447 | 0.527 | 1.735 | 1.045 | 1.021 |
| Ours (Step=5) | 1.436 | 1.509 | 0.855 | 1.499 | 1.980 | 0.704 | 2.327 | 1.550 | 1.483 |
| Ours (Step=10) | 1.867 | 1.695 | 1.141 | 1.807 | 2.394 | 0.852 | 2.854 | 1.842 | 1.807 |

TABLE 7: Comparison of the average temporal losses (×10⁻²) over 23 different sequences between our method and the baseline methods on different styles. The last column shows the average score over all styles for each method.

4.5.2 Experiment

Dataset. For video style transfer, we randomly collect 16 videos of different scenes from Pexels [69]. These videos are extracted into more than 2.5K frames, which we regard as the content images of the training set. Note that the style images in the training set are also selected from WikiArt [56]. In addition, following [67], we use the training set of the MPI Sintel dataset [70] as the test set, which contains 23 sequences with a total of 1K frames. As before, all training frames are resized to 256×256, and we use the original frame size in testing.

Baselines. For video style transfer, we compare our method with the following four popular methods: Linear [65], MCCNet [66], ReReVST [67], and AdaAttN [64]. Following [64], we use the temporal loss as the evaluation metric to compare the stability of the stylized results. All these methods are run with their public code and default settings.

Implementation Details. In the experiment, we set λ = 1e5, β = 1e−7, and ζ = 1e2 in Eq. (9). These settings yield nearly the best performance in our experiments. Following [67], in L_CT, M(·) is implemented by warping with a random optical flow. Specifically, for a frame of size H × W, we first generate a Gaussian map (wavy twists) M_wt of shape H/100 × W/100 × 2 with mean 0 and standard deviation 0.001. Second, M_wt is resized to H × W and blurred by a Gaussian filter of kernel size 100. Finally, we add two random values (translation motion) M_tm, drawn from the range [−10, 10], to M_wt and obtain M. In addition, the random noise △ ∼ N(0, σ²I), where σ ∼ U(0.001, 0.002).
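The random flow described above is straightforward to reproduce. Below is a minimal NumPy/OpenCV sketch under the stated constants; the function name is ours, and the 101-pixel blur kernel is an assumption forced by OpenCV's requirement of odd kernel sizes (the paper says "kernel size 100"):

```python
import numpy as np
import cv2

def random_motion_field(h, w):
    """Sample the random flow M: small wavy twists plus a global translation.
    Shapes and constants follow the paper's description."""
    # coarse Gaussian map of wavy twists, later upsampled to full resolution
    m_wt = np.random.normal(0.0, 0.001,
                            (max(h // 100, 1), max(w // 100, 1), 2)).astype(np.float32)
    m_wt = cv2.resize(m_wt, (w, h), interpolation=cv2.INTER_LINEAR)
    m_wt = cv2.GaussianBlur(m_wt, (101, 101), 0)   # kernel ~100; OpenCV needs it odd
    m_tm = np.random.uniform(-10.0, 10.0, size=2).astype(np.float32)  # translation motion
    return m_wt + m_tm                              # H x W x 2 displacement field
```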
Fig. 20: Comparison of video style transfer between our method and the baseline methods. For each method, the top portion shows the stylized video frames, and the bottom portion shows the heatmap of the differences between two adjacent video frames.

Fig. 21: Comparison of our method, our method without RL (the PA method), and our method without the step-wise GRU. RL makes the results of our model more stable, and the step-wise GRU gives the output higher quality.
Qualitative Comparison. We show the visualization results of our method compared with the four latest video style transfer methods in Fig. 20, where, for each method, the top portion shows the stylized results and the bottom portion shows the heatmap of the differences between adjacent frames of the input and stylized videos. Note that the adjacent frame indexes are the same for all methods. We find that our method produces refined stylized results that are closest to the input frames. In particular, our method greatly promotes the stability of video style transfer: the frame differences in our results are closest to those of the input frames, without reducing the effect of stylization. It is clear that MCCNet and ReReVST fail to keep the coherence of the videos. In addition, Linear and AdaAttN also fail to keep coherence in some regions close to the edges of objects, such as the head and shoulder.

Quantitative Results. As shown in Table 7, we choose 23 different sequences from the MPI Sintel dataset [70] and eight different style images to calculate the average temporal losses for comparison. It is clear that our method (step=1 and 5) outperforms the compared methods in all style settings. Our method still has a low temporal error even at step=10.

Ablation Study. We investigate the effect of the individual parts of the network structure on the results, including the RL, the step-wise GRU, and the frame-wise GRU.

(1) As shown in Fig. 21 (first and second rows), our method generates more stable and clearer stylized results than the method using only the Planner-Actor (PA) without RL. After step 5, PA can no longer keep the content and style information, while our method with RL can still produce good results.

(2) We compare our method with the results produced when the step-wise GRU is removed, shown in Fig. 21 (second and third rows). We can clearly see that most of the face details of the protagonist and dragon are lost at step 10 without the step-wise GRU, and the external details of the protagonist and dragon are completely lost at step 20. Our method with the step-wise GRU, on the other hand, obtains very fine results even at step 20.

(3) Table 8 compares the temporal loss of our method with and without the frame-wise GRU (FWG). We find that the temporal loss is much lower when FWG is used, which means the final results are more consistent from frame to frame.

| Step | Methods | La muse | Brushstrokes |
|---|---|---|---|
| Step=1 | Ours w/o FWG | 1.8939 | 1.0933 |
| Step=1 | Ours | 1.1351 | 0.8679 |
| Step=5 | Ours w/o FWG | 2.3991 | 1.4731 |
| Step=5 | Ours | 1.8883 | 1.4331 |
| Step=10 | Ours w/o FWG | 3.1329 | 1.8000 |
| Step=10 | Ours | 2.3836 | 1.7053 |

TABLE 8: Comparison of our method with and without the Frame-wise GRU (FWG). The average temporal losses (×10⁻²) over eight sequences are reported on two styles.

The above experiments show that RL, the step-wise GRU, and the FWG
all greatly improve the performance of the model.

5 APPLICATIONS ON MEDICAL IMAGES

5.1 Deformable Image Registration

In this section, we apply RL-I2IT to deformable image registration (DIR), which is an ill-posed problem formalized as the optimization of a function balancing the similarity between images and the plausibility of the deformation [22], [33], [71]. Given a pair of images (I_F, I_M), both from the image domain X → R^d, where d is the dimension, I_F is the fixed image and I_M is the moving image. Denote by Ω_w a registration model parameterized by w, whose output is a deformation field. The process of aligning the moving image to the fixed image can then be written as I_M ◦ Ω_w(I_F, I_M), and the pairwise registration is formulated as a minimization problem based on the following energy function:

$$\min_w E(w) := G\big(I_F,\, I_M \circ \Omega_w(I_F, I_M)\big) + \lambda R\big(\Omega_w(I_F, I_M)\big), \qquad (10)$$

where G is a distance metric measuring the similarity between the fixed image and the warped image, R is a regularizer constraining the deformation field, and λ is a regularization parameter. G can be any distance metric, such as the sum of squared differences (SSD), the normalized mutual information (NMI), or the negative normalized cross-correlation (NCC) [32], [33].

Fig. 22: Illustration of the deformable registration environment of RL-I2IT. When the environment receives an action a_t, it outputs the next state s_{t+1} and the reward r_t. Specifically, the environment comprises a pair of images (I_F, I_M) and generates the corresponding segmentation maps (U_F, U_M) with K-means clustering. C(a_t, Ω^{t−1}_{ψ,ϕ}) is the function that applies action a_t to the accumulated deformation field. The next state s_{t+1} is obtained by concatenating I_F and the warped moving image I_M ◦ Ω^t_{ψ,ϕ}. The reward r_t is computed by Eq. (14) and evaluates the improvement of the deformation field.

5.1.1 RL-I2IT Setting

Instead of predicting the deformation field in one shot as traditional DL-based DIR methods do, our framework decomposes the registration task into T steps. Suppose action a_t is the current deformation field at time step t, generated by the planner κ_ψ and actor π_ϕ based on the fixed image I_F and the intermediate moving image I_{M_t}. Let Ω^t_{ψ,ϕ} represent the accumulated deformation field composed of a_t and the previous deformation field Ω^{t−1}_{ψ,ϕ}. We can compute Ω^t_{ψ,ϕ} with a recursive composition function:

$$\Omega^t_{\psi,\phi} = \begin{cases} 0 & \text{if } t = 0, \\ C(a_t,\, \Omega^{t-1}_{\psi,\phi}) & \text{otherwise}, \end{cases} \qquad (11)$$

where

$$C(a_t,\, \Omega^{t-1}_{\psi,\phi}) = \Omega^{t-1}_{\psi,\phi} + \big(a_t \circ \Omega^{t-1}_{\psi,\phi}\big). \qquad (12)$$
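For concreteness, a minimal PyTorch sketch of the composition in Eq. (12) is given below, using an STN-style bilinear resampler [34] as the warping operator ◦. The 2D setting, the function names, and the (dx, dy) channel convention are our illustrative assumptions, not the exact released implementation.

```python
import torch
import torch.nn.functional as F

def warp(x, flow):
    """Warp tensor x (N,C,H,W) by a dense displacement field flow (N,2,H,W),
    where flow[:,0] is the x-displacement and flow[:,1] the y-displacement."""
    n, _, h, w = flow.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    base = torch.stack((xs, ys), dim=0).float().to(x.device)   # identity grid (2,H,W)
    coords = base.unsqueeze(0) + flow                          # displaced sampling locations
    # normalize to [-1, 1], the coordinate convention of grid_sample
    gx = 2.0 * coords[:, 0] / (w - 1) - 1.0
    gy = 2.0 * coords[:, 1] / (h - 1) - 1.0
    grid = torch.stack((gx, gy), dim=-1)                       # (N,H,W,2)
    return F.grid_sample(x, grid, align_corners=True)

def compose(a_t, omega_prev):
    """Eq. (12): accumulate the step-t field a_t onto the previous field."""
    return omega_prev + warp(a_t, omega_prev)
```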
To generate the intermediate moving image I_{M_{t+1}} at time step t + 1, we warp the initial moving image I_M with the accumulated deformation field Ω^t_{ψ,ϕ}, so as to eliminate the warping bias in the multi-step recursive registration process [80]. In this way, the registration result can be progressively improved by predicting the deformation field from coarse to fine. Using this step-wise notation, our RL-I2IT framework reformulates the DIR optimization problem (Eq. (10)) as

$$\min_{\psi,\phi} E(\psi, \phi) := \frac{1}{T} \sum_{t=1}^{T} G\big(I_F,\, I_{M_t} \circ \Omega^t_{\psi,\phi}\big) + \lambda R\big(\Omega^t_{\psi,\phi}\big), \qquad (13)$$

where our RL-I2IT framework learns the tuple (ψ, ϕ) instead of the parameter w in Eq. (10).

An overview of the environment of our RL-I2IT framework is shown in Fig. 22. In the beginning, the environment contains only an image pair (I_F, I_M); then K-means [82] with three clustering labels is performed for a voxel-wise segmentation in an unsupervised manner. The obtained segmentation maps (U_F, U_M) assign each voxel to a virtual anatomical structure label. At time step t, the state s_t comprises the fixed image I_F and the moving image I_{M_t}: s_t = (I_F, I_{M_t}). The next state s_{t+1} is obtained by warping I_M with the composed deformation field Ω^t_{ψ,ϕ}: s_{t+1} = (I_F, I_M ◦ Ω^t_{ψ,ϕ}), where the warping operator is the popular spatial transformer network (STN) [34]. The reward r_t is defined based on the Dice score [30]:

$$r_t = \mathrm{Dice}\big(U_F,\, U_M \circ \Omega^t_{\psi,\phi}\big) - \mathrm{Dice}\big(U_F,\, U_M \circ \Omega^{t-1}_{\psi,\phi}\big), \qquad (14)$$

where Dice(U_1, U_2) = 2 · |U_1 ∩ U_2| / (|U_1| + |U_2|). This reward function explicitly evaluates the improvement brought by the predicted deformation field Ω^t_{ψ,ϕ}.
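The step reward of Eq. (14) is likewise simple to compute from the K-means pseudo-labels. The sketch below assumes binary (or one-hot) label maps that have already been warped, e.g. with a nearest-neighbour STN; all function names are illustrative.

```python
import torch

def dice(u1: torch.Tensor, u2: torch.Tensor, eps: float = 1e-6) -> float:
    """Dice overlap of two binary label maps: 2|U1 n U2| / (|U1| + |U2|)."""
    inter = (u1 * u2).sum()
    return float(2.0 * inter / (u1.sum() + u2.sum() + eps))

def step_reward(u_fixed, u_moving_warped_t, u_moving_warped_prev):
    """Eq. (14): the reward is the *improvement* in Dice between the fixed
    segmentation and the moving segmentation warped by the new accumulated
    field versus the previous one."""
    return dice(u_fixed, u_moving_warped_t) - dice(u_fixed, u_moving_warped_prev)
```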

Auxiliary Learning. After obtaining action a_t by following the meta policy (κ_ψ, π_ϕ), we compute Ω^t_{ψ,ϕ} based on Eq. (11). We use the local normalized cross-correlation (NCC) [33] to measure the similarity between the fixed image and the warped moving image: G(I_F, I_{M_t}) = NCC(I_F, I_M ◦ Ω^t_{ψ,ϕ}), where a higher NCC indicates a better alignment. Moreover, in order to generate a realistic warped moving image, we smooth the deformation field with a total variation regularizer [83]: R(Ω^t_{ψ,ϕ}) = ∥∇Ω^t_{ψ,ϕ}∥²₂. The final registration loss for the auxiliary learning is then defined as

$$J_{Aux}(\psi, \phi) = \mathbb{E}_{s_t \sim D}\Big[ -NCC\big(I_F,\, I_M \circ \Omega^t_{\psi,\phi}\big) + \lambda \big\|\nabla \Omega^t_{\psi,\phi}(s_t)\big\|_2^2 \Big].$$
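A minimal sketch of J_Aux follows, with a global NCC in place of the windowed local NCC of [33] for brevity; only the overall loss form (negative similarity plus total-variation smoothness) follows the equation above.

```python
import torch

def ncc(a, b, eps=1e-5):
    """Global normalized cross-correlation (a simplification of the
    local, windowed NCC used in the paper)."""
    a = a - a.mean()
    b = b - b.mean()
    return (a * b).sum() / (a.norm() * b.norm() + eps)

def aux_registration_loss(fixed, warped_moving, omega, lam):
    """-NCC similarity plus total-variation smoothness of the accumulated
    displacement field omega, shape (N, 2, H, W)."""
    dx = omega[..., :, 1:] - omega[..., :, :-1]    # horizontal finite differences
    dy = omega[..., 1:, :] - omega[..., :-1, :]    # vertical finite differences
    tv = (dx ** 2).mean() + (dy ** 2).mean()
    return -ncc(fixed, warped_moving) + lam * tv
```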
5.1.2 Experiment

In this section, we evaluate our RL-I2IT framework on 2D and 3D medical image registration tasks.

Datasets. For the 2D registration, we use 2,302 pre-processed 2D scans from ADNI [84], ABIDE [85], and ADHD [86] for training, and apply K-means to obtain the corresponding voxel-wise segmentation maps, as sketched below.
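As a rough illustration of this pseudo-segmentation step, the following sketch clusters voxel intensities with scikit-learn's K-means. The paper does not specify the exact configuration, so the parameters here are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def kmeans_segmentation(volume, n_labels=3, seed=0):
    """Unsupervised voxel-wise pseudo-segmentation used by the environment:
    cluster raw intensities into n_labels virtual anatomical structures."""
    intensities = volume.reshape(-1, 1).astype(np.float32)
    labels = KMeans(n_clusters=n_labels, n_init=10,
                    random_state=seed).fit_predict(intensities)
    return labels.reshape(volume.shape)
```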
For evaluation, 40 pre-processed slices from the LONI Probabilistic Brain Atlas (LPBA) [87] are used, each of which contains the ground truth of a segmentation map with 56 manually delineated anatomical structures. All images are resampled to 128 × 128 pixels. The first slice of LPBA is used as the atlas, and the remaining images are used as the moving images.

For the 3D registration, we use the Liver Tumor Segmentation (LiTS) [88] challenge data for training, which contains 131 CT scans with annotated segmentation ground truth. The SLIVER [89] dataset has 20 scans with liver segmentation ground truth, half of which are used as the testing data. To further evaluate the generalization capability of our model, we conduct a cross-subject experiment: the model is trained with a human liver dataset but tested with a pig liver dataset. Concretely, we use the same model trained on the LiTS dataset and test it on the challenging Liver Segmentation of Pigs (LSPIG) [80] dataset, which contains 17 paired CT scans from pigs and the corresponding liver segmentation ground truth. All 3D volumes are resampled to 128 × 128 × 128 voxels and pre-affined as a standard pre-processing step in the DIR task.

Baselines. We compare our method with seven state-of-the-art DL-based DIR methods, VoxelMorph (VM) [75], VM-diff [76], SYMNet [81], R2N2 [77], RCN [80], GMFlow [78], and COTR [79], and two top-performing conventional registration algorithms, SyN [72] and Elastix [73] with B-Spline [90]. VM uses a U-Net structure with an NCC loss to learn deformable registration, and VM-diff is its probabilistic diffeomorphic variant. SYMNet is a one-shot 3D registration method. R2N2 and RCN are multi-step methods for 2D and 3D registration, respectively. Both GMFlow and COTR can only deal with 2D image registration: GMFlow is an optical flow estimation method, and COTR is an image-matching method using coordinate queries. For a fair comparison, we use the same network structure as VM in our RL-I2IT framework. The Dice score is used as the reward function and evaluation metric. To evaluate the robustness of our RL-I2IT framework, we also provide registration results using SSIM as the reward function.

| Method | LPBA (2D) | Time(s) | #Params | SLIVER (3D) | LSPIG (3D) | Time(s) | #Params |
|---|---|---|---|---|---|---|---|
| SyN [72] | 55.47±3.96 | 4.57 | - | 89.57±3.34 | 81.83±8.30 | 269 | - |
| Elastix [73] | 53.64±3.97 | 2.20 | - | 90.23±2.39 | 81.19±7.47 | 87.0 | - |
| LDDMM [74] | 52.18±3.48 | 3.27 | - | 83.94±3.44 | 82.33±7.14 | 41.4 | - |
| VM [75] | 55.36±3.94 | 0.02 | 105K | 86.37±4.15 | 81.13±7.28 | 0.13 | 356K |
| VM-diff [76] | 55.88±3.78 | 0.02 | 118K | 87.24±3.26 | 81.38±7.21 | 0.16 | 396K |
| R2N2 [77] | 51.84±3.30 | 0.46 | 3,591K | - | - | - | - |
| GMFlow [78] | 52.52±1.90 | 0.05 | 468K | - | - | - | - |
| COTR [79] | 52.53±1.89 | 2312.29 | 1,838K | - | - | - | - |
| RCN [80] | - | - | - | 89.59±3.18 | 82.87±5.69 | 2.44 | 21,291K |
| SYMNet [81] | - | - | - | 86.97±3.82 | 82.78±7.20 | 0.18 | 1,124K |
| RL-I2IT (t=20, SSIM reward) | 56.43±3.76 | 0.16 | 107K | 90.27±3.85 | 83.69±6.74 | 1.05 | 458K |
| RL-I2IT (t=1, Dice reward) | 55.21±3.55 | 0.02 | 107K | 84.81±4.42 | 80.61±7.94 | 0.07 | 458K |
| RL-I2IT (t=10, Dice reward) | 56.12±3.68 | 0.08 | 107K | 90.01±3.79 | 84.67±6.05 | 0.55 | 458K |
| RL-I2IT (t=20, Dice reward) | 56.57±3.71 | 0.16 | 107K | 90.28±3.66 | 84.40±6.24 | 1.05 | 458K |

TABLE 9: The Dice score (%) results of our RL-I2IT (t indicates the t-th step) and the baseline methods. The execution time for the 3D registration is tested on the SLIVER dataset. Note that R2N2, GMFlow, and COTR work only for the 2D registration, and RCN and SYMNet only for the 3D registration.

Fig. 23: Visual results of our RL-I2IT and the baseline methods on the LSPIG 3D liver dataset (panels: Fixed, Moving, GT, Elastix, SyN, LDDMM, VM-diff, SYMNet, RCN, Ours). The warped moving image obtained by our RL-I2IT is the most similar to the ground truth.

Results. Table 9 summarizes the performance of our method and the baseline methods. We can see that our RL-I2IT outperforms the baseline methods in all cases. The experimental results on the demanding LSPIG dataset best reflect the strength of our method: LSPIG has large deformation fields and is quite different from the training dataset (LiTS) in terms of structure and appearance. The good performance of our method on LSPIG shows that the RL-I2IT framework can handle large deformations and has better generalizability than conventional DL-based methods. Note that RL-I2IT performs registration step by step; it is slower than most one-step methods, such as VM and SYMNet, but still faster than other multi-step methods, such as R2N2 and RCN. Compared with RL-I2IT using the Dice reward, RL-I2IT using the SSIM reward achieves comparable results on SLIVER but slightly worse results on LSPIG. Fig. 23 visualizes a registration result on the LSPIG dataset by overlaying the warped moving segmentation map on the fixed image. This result shows that our model successfully learns 3D registration even when encountering a large deformation field and a large discrepancy between training and testing. Moreover, RL-I2IT outperforms the two step-wise methods, RCN and R2N2, which demonstrates the effectiveness of our framework.

5.1.3 Analysis

Step-wise registration. The key idea of our method is to decompose the monolithic registration process into small steps using a lightweight CNN model and progressively improve the transformed results. Fig. 24 shows an example of the step-wise registration process. The first row visualizes deformation fields that are predicted by our method in a step-wise and coarse-to-fine manner. In Fig. 25, we compare our method with PPO [49]
and DL-based methods, such as VM-diff, RCN, and R2N2, using step-wise registration. As the step increases, the performance of the DL-based methods becomes worse, while the RL-based methods are quite stable. In addition, the Dice score of RL-I2IT keeps increasing on both the LPBA and SLIVER datasets.

Fig. 24: A step-wise registration example of RL-I2IT on the LPBA dataset. The first row is the visualized displacement field, where deep color represents a large deformation. The red number at the top right corner is the Dice score.

Fig. 25: The step-wise registration results. The RL-based methods (our RL-I2IT and PPO) perform more stably than the DL-based methods, and our RL-I2IT achieves the best performance.

Compare with other RL methods. To demonstrate the effectiveness of our method on the reinforcement learning side, we replace the Planner-Critic learning process in our framework with other popular RL models, such as PPO [49] and DDPG [28]. We also compare with a variant that discards the DL-based unsupervised registration loss (RL-I2IT w/o Reg). The quantitative results are shown in Table 10, and the training curves are shown in Fig. 26. We can see that RL-I2IT w/o Reg and DDPG, which uses a deterministic policy, fail to converge. This indicates that the RL agent can hardly deal with the DIR problem without the unsupervised registration loss. In addition, our RL-I2IT achieves better performance than PPO.

Fig. 26: Learning curves of several RL-based methods (DDPG, PPO, Ours w/o Reg, and Ours) on the LPBA dataset; axes: Dice reward (%) versus training episode.

| Method | LPBA | SLIVER | LSPIG |
|---|---|---|---|
| PPO-modified | 55.82±3.49 | 89.30±3.63 | 83.55±6.24 |
| RL-I2IT-action | 55.58±3.70 | 88.75±3.69 | 81.80±7.51 |
| RL-I2IT w/o RL | 54.89±3.80 | 85.43±4.14 | 80.72±7.34 |
| RL-I2IT w/o Reg | 44.67±3.74 | 79.34±4.02 | 72.45±6.25 |
| RL-I2IT | 56.57±3.71 | 90.28±3.66 | 84.40±6.24 |

TABLE 10: The Dice scores (%) of several variants of RL-I2IT. 'RL-I2IT-action' indicates that the critic evaluates the actor's action instead of the planner in RL-I2IT.

Ablation Experiments. We analyze several important components of our framework: reinforcement learning, unsupervised registration learning, and evaluating the plan with the critic. Note that when the registration loss is discarded, the deformation field is practically the only action, and both the planner and actor are trained with the RL objective. As summarized in Table 10, the result is unsatisfactory if we train RL-I2IT without reinforcement learning, and it becomes worse if training discards the unsupervised registration loss. When the critic evaluates the actor's action (RL-I2IT-action), the performance is inferior to that of RL-I2IT, where the critic evaluates the planner.

We also use the Jacobian determinant to assess the regularity of the predicted displacement field. The results are shown in Table 11. A small standard deviation of the Jacobian determinant indicates a smooth displacement. We can see from the table that our deformation fields are plausible and smooth.

| Method | SLIVER mean | SLIVER std | LSPIG mean | LSPIG std |
|---|---|---|---|---|
| VM | 0.9263 | 0.0106 | 0.9204 | 0.0112 |
| RCN | 0.8066 | 0.0906 | 0.7183 | 0.1126 |
| RL-I2IT (t=1) | 0.9545 | 0.0084 | 0.9637 | 0.0110 |
| RL-I2IT (t=10) | 0.9176 | 0.0160 | 0.9306 | 0.0202 |
| RL-I2IT (t=20) | 0.8631 | 0.0376 | 0.8951 | 0.0334 |

TABLE 11: Quantitative results of the Jacobian determinants |J_ϕ| (mean and standard deviation of the absolute value).

Furthermore, we are the first to use SSIM as a reward function to perform registration. The comparison between using SSIM and the Dice score as the reward is shown in Table 12. Although the SSIM reward performs well compared with other methods, its overall performance across all steps is still inferior to that of the Dice reward.
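For reference, the regularity statistics reported in Table 11 can be obtained from per-voxel Jacobian determinants of the predicted displacement. A 2D NumPy sketch, using finite differences (our own formulation, not the released evaluation code), is:

```python
import numpy as np

def jacobian_determinant_2d(omega):
    """Per-pixel determinant of the Jacobian of the mapping x -> x + omega(x),
    for a displacement field omega of shape (H, W, 2)."""
    dudx = np.gradient(omega[..., 0], axis=1) + 1.0   # d(x + u)/dx
    dudy = np.gradient(omega[..., 0], axis=0)
    dvdx = np.gradient(omega[..., 1], axis=1)
    dvdy = np.gradient(omega[..., 1], axis=0) + 1.0   # d(y + v)/dy
    return dudx * dvdy - dudy * dvdx                  # > 0 means locally invertible

# report mean and std of |J| as in Table 11:
# j = jacobian_determinant_2d(omega); print(np.abs(j).mean(), np.abs(j).std())
```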

| Reward | SLIVER t=1 | SLIVER t=10 | SLIVER t=20 | LSPIG t=1 | LSPIG t=10 | LSPIG t=20 |
|---|---|---|---|---|---|---|
| SSIM | 83.69 | 89.71 | 90.27 | 79.14 | 83.80 | 83.69 |
| DICE | 84.81 | 90.01 | 90.28 | 80.61 | 84.67 | 84.40 |

TABLE 12: Comparison of the SSIM and segmentation (Dice) rewards.

The tradeoff between the Dice score and the inference time is shown in Fig. 27. We can see that the proposed RL-I2IT achieves a better tradeoff between registration performance and computational efficiency.

Fig. 27: Trade-off between Dice score and inference time over all datasets (panels: LPBA 2D registration, SLIVER 3D registration, LSPIG 3D registration; axes: Dice reward (%) versus inference time (s)).

6 CONCLUSION

In this paper, we propose a reinforcement learning-based framework, RL-I2IT, to handle the I2IT problem. Our RL-I2IT framework is an off-policy planner-actor-critic model. It can efficiently learn good policies in spaces with high-dimensional continuous states and actions. The core component of RL-I2IT is the proposed meta policy with a new component, the 'plan', which is defined in a latent subspace and can guide the actor to generate high-dimensional executable actions. To the best of our knowledge, we are the first to propose an RL framework for the I2IT problem. Experiments on diverse applications demonstrate that this architecture achieves significant gains over existing state-of-the-art methods.

There are several potential limitations of our proposed framework. One is that our framework can perform only single-style NST tasks. Arbitrary style transfer methods usually use a pre-trained model to extract deep features, while our current RL-based framework directly interacts with the current state instead of using pre-extracted features as input. This difference means our current framework cannot perform arbitrary style transfer. However, the main goal of the NST task in this paper is to show the effectiveness of stylization-level control with our RL-based method and its superiority in achieving the best NST quality; a single-style NST model is sufficient for this purpose. That said, we can extend the current framework to support arbitrary style transfer by observing deep features of the state. Another potential limitation is that the number of steps in the testing process is a predefined hyper-parameter, which could be improved by letting the model learn it automatically.

In the future, we will try to address the aforementioned limitations of our proposed framework. We expect that the proposed architecture can potentially be extended to all I2IT tasks.

APPENDIX A
LEARNING WITH CRITIC ON ACTOR

When the critic is used to evaluate the actor, the rewards and the soft Q values are used to guide the stochastic policy improvement iteratively, where a_t is concatenated with the state s_t as the input of the critic. In the evaluation step, following SAC [12], RL-I2IT learns the actor π_ϕ and fits the parametric Q-function Q_θ(s_t, a_t) (critic) using transitions sampled from the replay pool D by minimizing the soft Bellman residual,

$$J_Q(\theta) = \mathbb{E}_D\left[ \frac{1}{2}\Big( Q_\theta(s_t, a_t) - \big( r_t + \gamma\, \mathbb{E}[V_{\bar\theta}(s_{t+1})] \big) \Big)^2 \right], \qquad (15)$$

where V_θ̄(s_t) = E_{a_t∼π_ϕ}[Q_θ̄(s_t, a_t) − α log π_ϕ(a_t|p_t)] and γ is the discount factor. We use a target network Q_θ̄ to stabilize training, whose parameters θ̄ are obtained by an exponentially moving average of the parameters of the critic network [28]: θ̄ ← τθ + (1 − τ)θ̄, with hyper-parameter τ ∈ [0, 1]. To optimize J_Q(θ), we can perform stochastic gradient descent [12] with respect to the parameters θ as follows,

$$\theta = \theta - \eta_Q \nabla_\theta Q_\theta(s_t, a_t)\Big( Q_\theta(s_t, a_t) - r_t - \gamma\big[ Q_{\bar\theta}(s_{t+1}, a_{t+1}) - \alpha \log \pi_\phi(a_{t+1}|p_{t+1}) \big] \Big). \qquad (16)$$

Since the critic works on the actor, the optimization procedure also influences the planner's decisions. Therefore, the improvement step attempts to optimize both the actor and the planner parameters ϕ, ψ. Following [12], we can use the following objective to minimize the KL divergence between the policy and a Boltzmann distribution induced by the Q-function,

$$J_{\kappa,\pi}(\psi, \phi) = \mathbb{E}_D\big[ \alpha \log(\pi_\phi(a_t|p_t)) - Q_\theta(s_t, a_t) \big] = \mathbb{E}_D\big[ \alpha \log(\pi_\phi(a_t|f_\psi(\epsilon_t, s_t))) - Q_\theta(s_t, a_t) \big]. \qquad (17)$$

The last equality holds because p_t can be replaced by f_ψ(ε_t, s_t), as discussed before. It should be mentioned that the hyper-parameter α can be automatically adjusted using the method proposed in [12]. We can then apply the stochastic gradient method to optimize the parameters as follows,

$$\psi = \psi - \eta_\psi \frac{\alpha \nabla_{p_t} \pi_\phi(a_t|p_t) \cdot \nabla_\psi f_\psi(\epsilon_t, s_t)}{\pi_\phi(a_t|p_t)}, \qquad (18)$$

$$\phi = \phi - \eta_\phi \frac{\alpha \nabla_{a_t} \pi_\phi(a_t|p_t)}{\pi_\phi(a_t|p_t)}. \qquad (19)$$
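A compact PyTorch rendering of the critic update of Eq. (15) and the policy objective of Eq. (17) is sketched below. The planner/actor interfaces (`planner(eps, s)`, `actor.sample(...)`) are assumed names for illustration, and the squared residual is written with a standard MSE, which matches Eq. (15) up to the constant 1/2.

```python
import torch
import torch.nn.functional as F

def critic_loss(critic, target_critic, actor, batch, alpha, gamma):
    """Soft Bellman residual of Eq. (15); batch holds (s, a, r, s')."""
    s, a, r, s_next = batch
    with torch.no_grad():
        a_next, logp_next = actor.sample(s_next)          # a' ~ pi(.|p') with log-prob
        v_next = target_critic(s_next, a_next) - alpha * logp_next
        q_target = r + gamma * v_next                     # soft Bellman backup
    return F.mse_loss(critic(s, a), q_target)

def policy_loss(critic, actor, planner, s, alpha, eps):
    """Eq. (17): maximize soft Q while keeping policy entropy; the plan p_t is
    reparameterized as f_psi(eps, s) so gradients reach the planner."""
    p = planner(eps, s)                                   # p_t = f_psi(eps_t, s_t)
    a, logp = actor.sample_given_plan(p, s)
    return (alpha * logp - critic(s, a)).mean()
```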
APPENDIX B
META POLICY WITH SKIP CONNECTIONS

As in a MERL model, the stochastic meta policy and maximum entropy in our framework improve exploration for more diverse generation possibilities, which helps prevent the agent from producing a single type of plausible output during training (known as mode collapse).

One specific characteristic of our framework is that we also add skip-connections from each down-sampling layer of the planner to the corresponding up-sampling layer of the actor, as shown in Fig. 2. In this way, a natural-looking image is more likely to be reconstructed, since the details of state s_t can be passed to the actor through the skip-connections. Besides, since both p_t and s_t can be used by the actor to generate the executable action a_t, over-exploration of the action space is avoided in our RL-I2IT framework, where the variance is limited by the passed detail information.

Furthermore, the skip-connections facilitate back-propagation of the gradients of the auxiliary learning part to the actor. This is also a key point for accelerating and stabilizing training and avoiding over-exploration, since it helps the actor focus on refined details while bypassing the coarse information from input to target.

ACKNOWLEDGMENTS

This work was supported in part by the National Natural Science Foundation of China under Grants 42375148 and 42130608, and the Sichuan Province Key Technology Research and Development project under Grant 2023YFG0305. Xin Wang is supported by a University at Albany, SUNY Start-up Grant.

REFERENCES

[1] Y. Pang, J. Lin, T. Qin, and Z. Chen, "Image-to-image translation: Methods and applications," arXiv preprint arXiv:2101.08629, 2021.
[2] D. P. Kingma and M. Welling, "Auto-encoding variational bayes," ICLR, 2014.
[3] O. Ronneberger, P. Fischer, and T. Brox, "U-net: Convolutional networks for biomedical image segmentation," in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2015, pp. 234–241.
[4] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, "Image-to-image translation with conditional adversarial networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1125–1134.
[5] C. Saharia, W. Chan, H. Chang, C. Lee, J. Ho, T. Salimans, D. Fleet, and M. Norouzi, "Palette: Image-to-image diffusion models," in ACM SIGGRAPH 2022 Conference Proceedings, 2022, pp. 1–10.
[6] S. Sun, L. Wei, J. Xing, J. Jia, and Q. Tian, "SDDM: Score-decomposed diffusion models on manifolds for unpaired image-to-image translation," in International Conference on Machine Learning. PMLR, 2023, pp. 33115–33134.
[7] N. Tumanyan, M. Geyer, S. Bagon, and T. Dekel, "Plug-and-play diffusion features for text-driven image-to-image translation," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 1921–1930.
[8] B. Neyshabur, S. Bhojanapalli, D. McAllester, and N. Srebro, "Exploring generalization in deep learning," arXiv preprint arXiv:1706.08947, 2017.
[9] J. C. Caicedo and S. Lazebnik, "Active object localization with deep reinforcement learning," in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 2488–2496.
[10] J. Hu, Z. Luo, X. Wang, S. Sun, Y. Yin, K. Cao, Q. Song, S. Lyu, and X. Wu, "End-to-end multimodal image registration via reinforcement learning," Medical Image Analysis, vol. 68, p. 101878, 2021.
[11] Z. Luo, X. Wang, X. Wu, Y. Yin, K. Cao, Q. Song, and J. Hu, "A spatiotemporal agent for robust multimodal registration," IEEE Access, vol. 8, pp. 75347–75358, 2020.
[12] T. Haarnoja, A. Zhou, K. Hartikainen, G. Tucker, S. Ha, J. Tan, V. Kumar, H. Zhu, A. Gupta, P. Abbeel et al., "Soft actor-critic algorithms and applications," arXiv preprint arXiv:1812.05905, 2018.
[13] Y. Xiang, X. Wang, S. Hu, B. Zhu, X. Huang, X. Wu, and S. Lyu, "RMBench: Benchmarking deep reinforcement learning for robotic manipulator control," IEEE/RSJ International Conference on Intelligent Robots (IROS), 2023.
[14] D. Yarats, A. Zhang, I. Kostrikov, B. Amos, J. Pineau, and R. Fergus, "Improving sample efficiency in model-free reinforcement learning from images," arXiv preprint arXiv:1910.01741, 2019.
[15] A. Nair, V. Pong, M. Dalal, S. Bahl, S. Lin, and S. Levine, "Visual reinforcement learning with imagined goals," arXiv preprint arXiv:1807.04742, 2018.
[16] K. Black, M. Janner, Y. Du, I. Kostrikov, and S. Levine, "Training diffusion models with reinforcement learning," arXiv preprint arXiv:2305.13301, 2023.
[17] T.-C. Wang, M.-Y. Liu, J.-Y. Zhu, A. Tao, J. Kautz, and B. Catanzaro, "High-resolution image synthesis and semantic manipulation with conditional GANs," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 8798–8807.
[18] Z. Luo, J. Hu, X. Wang, S. Lyu, B. Kong, Y. Yin, Q. Song, and X. Wu, "Stochastic actor-executor-critic for image-to-image translation," in Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence, IJCAI-21, 2021, pp. 2775–2781.
[19] C. Feng, J. Hu, X. Wang, S. Hu, B. Zhu, X. Wu, H. Zhu, and S. Lyu, "Controlling neural style transfer with deep reinforcement learning," International Joint Conference on Artificial Intelligence (IJCAI), 2023.
[20] Z. Luo, J. Hu, X. Wang, S. Hu, B. Kong, Y. Yin, Q. Song, X. Wu, and S. Lyu, "Stochastic planner-actor-critic for unsupervised deformable image registration," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, no. 2, 2022, pp. 1917–1925.
[21] D. Pathak, P. Krahenbuhl, J. Donahue, T. Darrell, and A. A. Efros, "Context encoders: Feature learning by inpainting," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2536–2544.
[22] B. D. de Vos, F. F. Berendsen, M. A. Viergever, H. Sokooti, M. Staring, and I. Išgum, "A deep learning framework for unsupervised affine and deformable image registration," Medical Image Analysis, vol. 52, pp. 128–143, 2019.
[23] L. A. Gatys, A. S. Ecker, and M. Bethge, "A neural algorithm of artistic style," arXiv preprint arXiv:1508.06576, 2015.
[24] J. Yu, Z. Lin, J. Yang, X. Shen, X. Lu, and T. S. Huang, "Generative image inpainting with contextual attention," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 5505–5514.
[25] B. Li, K. Xue, B. Liu, and Y.-K. Lai, "BBDM: Image-to-image translation with Brownian bridge diffusion models," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 1952–1961.
[26] A. X. Lee, A. Nagabandi, P. Abbeel, and S. Levine, "Stochastic latent actor-critic: Deep reinforcement learning with a latent variable model," in Neural Information Processing Systems (NeurIPS), 2020.
[27] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. MIT Press, 2018.
[28] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, "Continuous control with deep reinforcement learning," arXiv preprint arXiv:1509.02971, 2015.
[29] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, "Image quality assessment: from error visibility to structural similarity," IEEE Transactions on Image Processing, vol. 13, no. 4, pp. 600–612, 2004.
[30] L. R. Dice, "Measures of the amount of ecologic association between species," Ecology, vol. 26, no. 3, pp. 297–302, 1945.
[31] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
[32] G. Haskins, U. Kruger, and P. Yan, "Deep learning in medical image registration: a survey," Machine Vision and Applications, vol. 31, no. 1, pp. 1–18, 2020.
[33] G. Balakrishnan, A. Zhao, M. R. Sabuncu, J. Guttag, and A. V. Dalca, "An unsupervised learning model for deformable medical image registration," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 9252–9260.
[34] M. Jaderberg, K. Simonyan, A. Zisserman, and K. Kavukcuoglu, "Spatial transformer networks," arXiv preprint arXiv:1506.02025, 2015.
[35] T. Miyato, T. Kataoka, M. Koyama, and Y. Yoshida, "Spectral normalization for generative adversarial networks," arXiv preprint arXiv:1802.05957, 2018.
[36] Y. Zeng, J. Fu, H. Chao, and B. Guo, "Learning pyramid-context encoder network for high-quality image inpainting," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 1486–1494.
[37] C. Zheng, T.-J. Cham, and J. Cai, "Pluralistic image completion," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 1438–1447.
[38] T. Yu, Z. Guo, X. Jin, S. Wu, Z. Chen, W. Li, Z. Zhang, and S. Liu, "Region normalization for image inpainting," in AAAI, 2020, pp. 12733–12740.
[39] Z. Yan, X. Li, M. Li, W. Zuo, and S. Shan, "Shift-net: Image inpainting via deep feature rearrangement," in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 1–17.
[40] G. Daras, J. Dean, A. Jalal, and A. G. Dimakis, "Intermediate layer optimization for inverse problems using deep generative models," arXiv preprint arXiv:2102.07364, 2021.
[41] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. Courville, "Improved training of Wasserstein GANs," Advances in Neural Information Processing Systems, 2017.
[42] A. Jolicoeur-Martineau, "The relativistic discriminator: a key element missing from standard GAN," arXiv preprint arXiv:1807.00734, 2018.
[43] R. Tyleček and R. Šára, "Spatial pattern templates for recognition of objects with regular structure," in German Conference on Pattern Recognition. Springer, 2013, pp. 364–374.
[44] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele, "The Cityscapes dataset for semantic urban scene understanding," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[45] A. Yu and K. Grauman, "Fine-grained visual comparisons with local learning," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 192–199.
[46] C. Wang, C. Xu, C. Wang, and D. Tao, "Perceptual adversarial networks for image-to-image transformation," IEEE Transactions on Image Processing, vol. 27, no. 8, pp. 4066–4079, 2018.
[47] C. Wang, W. Niu, Y. Jiang, H. Zheng, Z. Yu, Z. Gu, and B. Zheng, "Discriminative region proposal adversarial network for high-quality image-to-image translation," International Journal of Computer Vision, 2019.
[48] F. Gao, X. Xu, J. Yu, M. Shang, X. Li, and D. Tao, "Complementary, heterogeneous and adversarial networks for image-to-image translation," IEEE Transactions on Image Processing, vol. 30, pp. 3487–3498, 2021.
[49] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, "Proximal policy optimization algorithms," arXiv preprint arXiv:1707.06347, 2017.
[50] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang, "The unreasonable effectiveness of deep features as a perceptual metric," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 586–595.
[51] J. Cheng, A. Jaiswal, Y. Wu, P. Natarajan, and P. Natarajan, "Style-aware normalized loss for improving arbitrary style transfer," in CVPR, 2021.
[52] X. Huang and S. Belongie, "Arbitrary style transfer in real-time with adaptive instance normalization," in CVPR, 2017, pp. 1501–1510.
[53] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in CVPR, 2016, pp. 770–778.
[54] J. Johnson, A. Alahi, and L. Fei-Fei, "Perceptual losses for real-time style transfer and super-resolution," in ECCV. Springer, 2016.
[55] L. Gatys, A. S. Ecker, and M. Bethge, "Texture synthesis using convolutional neural networks," NeurIPS, vol. 28, pp. 262–270, 2015.
[56] F. Phillips and B. Mackintosh, "Wiki Art Gallery, Inc.: A case for critical thinking," Issues in Accounting Education, vol. 26, no. 3, pp. 593–608, 2011.
[57] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, "Microsoft COCO: Common objects in context," in ECCV. Springer, 2014, pp. 740–755.
[58] Y. Deng, F. Tang, X. Pan, W. Dong, C. Ma, and C. Xu, "StyTr2: Unbiased image style transfer with transformers," arXiv preprint arXiv:2105.14576, 2021.
[59] Y. Li, C. Fang, J. Yang, Z. Wang, X. Lu, and M.-H. Yang, "Universal style transfer via feature transforms," Advances in Neural Information Processing Systems, vol. 30, 2017.
[60] D. Y. Park and K. H. Lee, "Arbitrary style transfer with style-attentional networks," in CVPR, 2019, pp. 5880–5888.
[61] T. Lin, Z. Ma, F. Li, D. He, X. Li, E. Ding, N. Wang, J. Li, and X. Gao, "Drafting and revision: Laplacian pyramid network for fast high-quality artistic style transfer," in CVPR, 2021, pp. 5141–5150.
[62] J. An, S. Huang, Y. Song, D. Dou, W. Liu, and J. Luo, "ArtFlow: Unbiased image style transfer via reversible neural flows," in CVPR, 2021, pp. 862–871.
[63] H. Chen, Z. Wang, H. Zhang, Z. Zuo, A. Li, W. Xing, D. Lu et al., "Artistic style transfer with internal-external learning and contrastive learning," NeurIPS, vol. 34, 2021.
[64] S. Liu, T. Lin, D. He, F. Li, M. Wang, X. Li, Z. Sun, Q. Li, and E. Ding, "AdaAttN: Revisit attention mechanism in arbitrary neural style transfer," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 6649–6658.
[65] X. Li, S. Liu, J. Kautz, and M.-H. Yang, "Learning linear transformations for fast image and video style transfer," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 3809–3817.
[66] Y. Deng, F. Tang, W. Dong, H. Huang, C. Ma, and C. Xu, "Arbitrary video style transfer via multi-channel correlation," AAAI, 2021.
[67] W. Wang, S. Yang, J. Xu, and J. Liu, "Consistent video style transfer via relaxation and regularization," IEEE Transactions on Image Processing, vol. 29, pp. 9125–9139, 2020.
[68] P. Mirowski, R. Pascanu, F. Viola, H. Soyer, A. J. Ballard, A. Banino, M. Denil, R. Goroshin, L. Sifre, K. Kavukcuoglu et al., "Learning to navigate in complex environments," ICLR, 2017.
[69] "Pexels," https://www.pexels.com/, 2022, accessed: 2022-03-12.
[70] D. J. Butler, J. Wulff, G. B. Stanley, and M. J. Black, "A naturalistic open source movie for optical flow evaluation," in ECCV. Springer, 2012, pp. 611–625.
[71] X. Yang, R. Kwitt, and M. Niethammer, "Fast predictive image registration," in Deep Learning and Data Labeling for Medical Applications. Springer, 2016, pp. 48–57.
[72] B. B. Avants, C. L. Epstein, M. Grossman, and J. C. Gee, "Symmetric diffeomorphic image registration with cross-correlation: evaluating automated labeling of elderly and neurodegenerative brain," Medical Image Analysis, vol. 12, no. 1, pp. 26–41, 2008.
[73] S. Klein, M. Staring, K. Murphy, M. A. Viergever, and J. P. Pluim, "Elastix: a toolbox for intensity-based medical image registration," IEEE Transactions on Medical Imaging, vol. 29, no. 1, pp. 196–205, 2009.
[74] M. F. Beg, M. I. Miller, A. Trouvé, and L. Younes, "Computing large deformation metric mappings via geodesic flows of diffeomorphisms," International Journal of Computer Vision, vol. 61, no. 2, pp. 139–157, 2005.
[75] G. Balakrishnan, A. Zhao, M. R. Sabuncu, J. Guttag, and A. V. Dalca, "VoxelMorph: a learning framework for deformable medical image registration," IEEE Transactions on Medical Imaging, vol. 38, no. 8, pp. 1788–1800, 2019.
[76] A. V. Dalca, G. Balakrishnan, J. Guttag, and M. R. Sabuncu, "Unsupervised learning of probabilistic diffeomorphic registration for images and surfaces," Medical Image Analysis, vol. 57, pp. 226–236, 2019.
[77] R. Sandkühler, S. Andermatt, G. Bauman, S. Nyilas, C. Jud, and P. C. Cattin, "Recurrent registration neural networks for deformable image registration," in Advances in Neural Information Processing Systems, H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, Eds., vol. 32. Curran Associates, Inc., 2019.
[78] H. Xu, J. Zhang, J. Cai, H. Rezatofighi, and D. Tao, "GMFlow: Learning optical flow via global matching," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 8121–8130.
[79] W. Jiang, E. Trulls, J. Hosang, A. Tagliasacchi, and K. M. Yi, "COTR: Correspondence transformer for matching across images," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 6207–6217.
[80] S. Zhao, Y. Dong, E. I. Chang, Y. Xu et al., "Recursive cascaded networks for unsupervised medical image registration," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 10600–10610.
[81] T. C. Mok and A. Chung, "Fast symmetric diffeomorphic image registration with convolutional neural networks," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 4644–4653.
[82] J. MacQueen et al., "Some methods for classification and analysis of multivariate observations," in Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, vol. 1. Oakland, CA, USA, 1967, pp. 281–297.
[83] L. I. Rudin, S. Osher, and E. Fatemi, "Nonlinear total variation based noise removal algorithms," Physica D: Nonlinear Phenomena, vol. 60, no. 1-4, pp. 259–268, 1992.
[84] S. G. Mueller, M. W. Weiner, L. J. Thal, R. C. Petersen, C. R. Jack, W. Jagust, J. Q. Trojanowski, A. W. Toga, and L. Beckett, "Ways toward an early diagnosis in Alzheimer's disease: the Alzheimer's Disease Neuroimaging Initiative (ADNI)," Alzheimer's & Dementia, vol. 1, no. 1, pp. 55–66, 2005.
[85] A. Di Martino, C.-G. Yan, Q. Li, E. Denio, F. X. Castellanos, K. Alaerts, J. S. Anderson, M. Assaf, S. Y. Bookheimer, M. Dapretto et al., "The autism brain imaging data exchange: towards a large-scale evaluation of the intrinsic brain architecture in autism," Molecular Psychiatry, vol. 19, no. 6, pp. 659–667, 2014.
[86] P. Bellec, C. Chu, F. Chouinard-Decorte, Y. Benhajali, D. S. Margulies, and R. C. Craddock, "The Neuro Bureau ADHD-200 preprocessed repository," NeuroImage, vol. 144, pp. 275–286, 2017.
[87] D. W. Shattuck, M. Mirza, V. Adisetiyo, C. Hojatkashani, G. Salamon, K. L. Narr, R. A. Poldrack, R. M. Bilder, and A. W. Toga, "Construction of a 3D probabilistic atlas of human cortical structures," NeuroImage, vol. 39, no. 3, pp. 1064–1080, 2008.
[88] P. Bilic, P. F. Christ, E. Vorontsov, G. Chlebus, H. Chen, Q. Dou, C.-W. Fu, X. Han, P.-A. Heng, J. Hesser et al., "The liver tumor segmentation benchmark (LiTS)," arXiv preprint arXiv:1901.04056, 2019.
[89] T. Heimann, B. Van Ginneken, M. A. Styner, Y. Arzhaeva, V. Aurich, C. Bauer, A. Beck, C. Becker, R. Beichel, G. Bekes et al., "Comparison and evaluation of methods for liver segmentation from CT datasets," IEEE Transactions on Medical Imaging, vol. 28, no. 8, pp. 1251–1265, 2009.
[90] D. Rueckert, L. I. Sonoda, C. Hayes, D. L. Hill, M. O. Leach, and D. J. Hawkes, "Nonrigid registration using free-form deformations: application to breast MR images," IEEE Transactions on Medical Imaging, vol. 18, no. 8, pp. 712–721, 1999.