Hetero-Modal Variational Encoder-Decoder for Joint Modality Completion and Segmentation
Reuben Dorent, Samuel Joutard, Marc Modat, Sébastien Ourselin and Tom
Vercauteren
1 Introduction
2 Method
2.1 Multi-modal Variational Auto-Encoders (MVAE)
The MVAE [13] aims at identifying a model in which the n modalities x = (x1, ..., xn)
are conditionally independent given a hidden latent variable z. We consider the
directed latent-variable model parameterised by θ (typically the weights of a
decoding network fθ (·) going from the latent space to the image space):
\[ p_\theta(z, x_1, \ldots, x_n) = p(z) \prod_{i=1}^{n} p_\theta(x_i \mid z) \tag{1} \]
where p(z) is a prior on the latent space, which we classically choose as a standard
normal distribution z ∼ N (0, I). The goal is then to maximise the marginal log-
likelihood L(x; θ) = log pθ(x1, ..., xn) with respect to θ. However, the integral
\[ p_\theta(x_1, \ldots, x_n) = \int p_\theta(x \mid z)\, p(z)\, dz \]
is computationally intractable. [5] proposed to optimise, with respect to (φ, θ), the evidence lower bound (ELBO):
\[ \mathcal{L}(x; \theta, \phi) = \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big] - \mathrm{KL}\big[q_\phi(z \mid x)\,\|\,p(z)\big] \tag{2} \]
where the variational posterior qφ(z|x) is modelled as a Gaussian after an encoding of x into a mean and a diagonal covariance by a neural network, hφ(x) = (µφ(x), Σφ(x)):
\[ q_\phi(z \mid x) = \mathcal{N}\big(z;\, \mu_\phi(x), \Sigma_\phi(x)\big) \tag{3} \]
The KL divergence between the two Gaussians qφ(z|x) and p(z) can be computed in closed form from their means and covariances. In contrast, estimating E_{qφ(z|x)}[log pθ(x|z)] is done by sampling the hidden variable z according to the Gaussian qφ(·|x) and then decoding it as fθ(z) in image space to evaluate pθ(x|z). To make sampling from z|x amenable to back-propagation, the reparametrisation trick is used [5]: z = µφ(x) + Σφ(x)^{1/2} ε, where ε ∼ N(0, I).
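Concretely, both ingredients take only a few lines. The following is a minimal PyTorch-style sketch, not the paper's implementation; variable names and shapes are illustrative assumptions, with `mu` and `logvar` standing for the outputs of the encoding network hφ(x):

```python
# Minimal sketch: reparametrisation trick and closed-form KL to N(0, I)
# for a diagonal Gaussian posterior.
import torch

def reparametrise(mu: torch.Tensor, logvar: torch.Tensor) -> torch.Tensor:
    """Draw z = mu + sigma * eps, eps ~ N(0, I), keeping gradients w.r.t. mu and logvar."""
    eps = torch.randn_like(mu)
    return mu + torch.exp(0.5 * logvar) * eps

def kl_to_standard_normal(mu: torch.Tensor, logvar: torch.Tensor) -> torch.Tensor:
    """KL[N(mu, diag(exp(logvar))) || N(0, I)], summed over the latent dimensions."""
    return 0.5 * torch.sum(torch.exp(logvar) + mu ** 2 - 1.0 - logvar, dim=-1)
```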
Wu et al. [13] extended this variational formulation to a multi-modal setting. The authors remarked that
\[ p_\theta(z \mid x) \propto p(z) \prod_{i=1}^{n} \frac{p_\theta(z \mid x_i)}{p(z)} \]
This expression shows that pθ(z|x) can be decomposed into n modality-specific terms. For this reason, the authors approximate each pθ(z|xi)/p(z) with a modality-specific variational posterior qφi(z|xi). Similarly to (3), qφi(z|xi) is modelled as a Gaussian distribution after an encoding of xi into a mean and a diagonal covariance by a neural network, hφi(xi) = (µφi(xi), Σφi(xi)), such that qφi(z|xi) = N(z; µφi(xi), Σφi(xi)).
Finally, [1] demonstrates that
\[ q_\phi(z \mid x) \propto p(z) \prod_{i=1}^{n} q_{\phi_i}(z \mid x_i) \]
is Gaussian with mean µφ and covariance Σφ defined by:
\[ \Sigma_\phi = \Big(I + \sum_i \Sigma_{\phi_i}^{-1}\Big)^{-1} \quad \text{and} \quad \mu_\phi = \Sigma_\phi \Big(\sum_i \Sigma_{\phi_i}^{-1}\, \mu_{\phi_i}\Big) \tag{4} \]
This formulation allows for encoding each modality independently and fusing the encodings with a closed-form formula.
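As an illustration, here is a minimal sketch of the fusion (4) for diagonal covariances stored as element-wise variance tensors; the function and variable names are our own, not from the paper. The N(0, I) prior contributes the identity term to the precision and nothing to the mean:

```python
# Sketch of the closed-form product-of-Gaussians fusion of (4),
# assuming diagonal covariances represented as variance tensors.
import torch

def fuse_gaussians(means, variances):
    """Fuse per-modality posteriors N(mu_i, Sigma_i) with the N(0, I) prior.

    Sigma_phi = (I + sum_i Sigma_i^{-1})^{-1}
    mu_phi    = Sigma_phi (sum_i Sigma_i^{-1} mu_i)
    """
    precision = 1.0 + sum(1.0 / v for v in variances)  # prior precision I plus the experts
    fused_var = 1.0 / precision
    fused_mu = fused_var * sum(m / v for m, v in zip(means, variances))
    return fused_mu, fused_var
```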
However, starting from this well-posed multi-modal extension of the ELBO, [13] resorts to an ad hoc training sampling procedure: at each training iteration, the extreme cases (a single modality and the full set of modalities) and random modality subsets are used concurrently. This option is highly memory-consuming, unsuitable for 3D images, and not adapted to clinical scenarios in which some imaging subsets are more frequent than others. The next section proposes to include this prior information in our principled training procedure via ancestral sampling.
Fig. 1. MVAE architecture. Each imaging modality is encoded independently, the mean
and covariance of each q(z|xi ) are fused using the closed-form formula (4). A sample z
is randomly drawn and is decoded into imaging modalities and the segmentation map.
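To make the data flow of Fig. 1 explicit, the following schematic forward pass reuses the `fuse_gaussians` and `reparametrise` sketches above; the encoder/decoder module dictionaries are hypothetical placeholders for the per-modality networks:

```python
import torch

def hvae_forward(inputs, encoders, decoders, seg_decoder):
    """inputs: dict modality name -> image tensor, containing only the observed modalities."""
    means, variances = [], []
    for name, x in inputs.items():
        mu, var = encoders[name](x)              # h_{phi_i}(x_i)
        means.append(mu)
        variances.append(var)
    mu, var = fuse_gaussians(means, variances)   # closed-form fusion (4)
    z = reparametrise(mu, torch.log(var))        # z ~ N(mu_phi, Sigma_phi)
    recons = {name: dec(z) for name, dec in decoders.items()}
    return recons, seg_decoder(z), mu, var
```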
We choose qφπ(z|xπ) as Gaussian. Given the convexity of the KL divergence and the fact that Σ_{π∈P} απ = 1, we obtain:
\[ \mathrm{KL}\big[q_\phi(z \mid x)\,\|\,p(z)\big] \leq \sum_{\pi} \alpha_\pi\, \mathrm{KL}\big[q_{\phi_\pi}(z \mid x_\pi)\,\|\,p(z)\big] \]
The single Gaussian prior model for p(z) promotes consistency of the embedding z across the subsets of modalities π (qφπ(z|xπ)) and, in turn, across the full set of modalities (qφ(z|x)). In our optimisation procedure, at each iteration we propose to randomly draw a subset π with probability απ as the model input and to optimise the corresponding term of the bound.
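For illustration, drawing the training subset amounts to one categorical sample per iteration. The subsets and probabilities below are hypothetical examples; in practice the α values would reflect how frequently each imaging subset occurs clinically:

```python
# Sketch of ancestral sampling of the observed modality subset pi.
import random

MODALITIES = ("T1", "T1c", "T2", "FLAIR")

# Hypothetical subset distribution; the alphas must sum to 1.
SUBSETS = [("T1",), ("T1", "T1c"), ("FLAIR",), MODALITIES]
ALPHA = [0.2, 0.3, 0.2, 0.3]

def draw_subset():
    """Draw one subset pi with probability alpha_pi for the current iteration."""
    return random.choices(SUBSETS, weights=ALPHA, k=1)[0]
```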
[Fig. 2 diagram: multi-scale encoder-decoder with feature channels 8, 16, 32 and 64. Legend: product of Gaussians (4); concatenation; 1×1×1 convolutions (1 output channel per imaging modality, 4 for the segmentation); 2×2×2 max-pooling; hidden variable; 3×3×3 convolutions with instance normalisation and leaky ReLU; trilinear ×2 upsampling.]
Fig. 2. Our 3D variational encoder-decoder (U-HVED). Only two encoders and one decoder are shown. The product of Gaussians is defined in (4).
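A hedged sketch of the convolutional block suggested by the figure legend is given below: two 3×3×3 convolutions with instance normalisation and leaky ReLU, followed by 2×2×2 max-pooling, with the 8-16-32-64 channel progression from the figure. The exact arrangement is our assumption, not the paper's code:

```python
# Sketch of one modality-specific 3D encoder following the Fig. 2 legend.
import torch.nn as nn

def conv_block(in_ch: int, out_ch: int) -> nn.Sequential:
    """Two 3x3x3 convolutions, each with instance normalisation and leaky ReLU."""
    return nn.Sequential(
        nn.Conv3d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.InstanceNorm3d(out_ch),
        nn.LeakyReLU(inplace=True),
        nn.Conv3d(out_ch, out_ch, kernel_size=3, padding=1),
        nn.InstanceNorm3d(out_ch),
        nn.LeakyReLU(inplace=True),
    )

class Encoder(nn.Module):
    """One modality-specific encoder: 8 -> 16 -> 32 -> 64 feature channels."""
    def __init__(self, in_ch: int = 1):
        super().__init__()
        blocks, prev = [], in_ch
        for c in (8, 16, 32, 64):
            blocks += [conv_block(prev, c), nn.MaxPool3d(2)]
            prev = c
        self.body = nn.Sequential(*blocks)

    def forward(self, x):
        return self.body(x)
```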
Choice of the losses. The reconstruction loss follows from pθ(xi|z). For the segmentation, we use the sum of the cross-entropy loss Lcross and the Dice loss Ldice [4]. For the image reconstruction loss, we use the classic L2 loss. Additionally, given a drawn subset π, our loss includes the closed-form KL divergence between the Gaussians qφ(z|xπ) and p(z). For weighting the regularisation losses (KL divergence and reconstruction loss), we performed a grid search over the weights {0, 0.1, 1}. Finally, the loss associated with maximising the ELBO (5) combines the segmentation, reconstruction, and KL terms for the drawn subset π.
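A sketch of how these terms could be assembled for one iteration is shown below; the function names and the Dice implementation are illustrative, not the paper's code, and `w_rec` / `w_kl` stand for the grid-searched weights:

```python
# Sketch of the combined loss for a drawn subset pi: L2 reconstruction,
# cross-entropy + Dice segmentation, and the closed-form KL term.
import torch
import torch.nn.functional as F

def dice_loss(logits, target_onehot, eps=1e-5):
    """Soft multi-class Dice loss over 3D volumes of shape (B, C, D, H, W)."""
    probs = torch.softmax(logits, dim=1)
    inter = torch.sum(probs * target_onehot, dim=(2, 3, 4))
    denom = torch.sum(probs + target_onehot, dim=(2, 3, 4))
    return 1.0 - torch.mean((2.0 * inter + eps) / (denom + eps))

def total_loss(recons, images, seg_logits, seg_labels, seg_onehot, kl,
               w_rec=0.1, w_kl=0.1):
    """recons/images: lists of reconstructed and ground-truth modality volumes."""
    l_rec = sum(F.mse_loss(r, x) for r, x in zip(recons, images))
    l_seg = F.cross_entropy(seg_logits, seg_labels) + dice_loss(seg_logits, seg_onehot)
    return l_seg + w_rec * l_rec + w_kl * kl.mean()
```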
Fig. 3. Example of FLAIR and T1 completion and tumour segmentation given a subset
of modalities as input. Green: edema; Red: non-enhancing core; Blue: enhancing core.
modalities are encoded as in U-HVED and the skip-connections are the first and second moments of the modality-specific feature maps, as in HeMIS. U-HeMIS has only one decoder, used for tumour segmentation. The third approach, Single, is the "brute-force" method in which, for each possible subset of modalities, we train a U-Net network with the observed modalities concatenated as input. The encoder and decoder are those of our model. Given the 3-fold validation, we consequently trained 45 Single networks.
Table 1. Comparison of the different models (Dice %) for the different combinations of available modalities. Modalities that are present are denoted by •, missing ones by ◦. ∗ denotes a statistically significant improvement according to a Wilcoxon test (p < 0.05).
References
1. Cao, Y., Fleet, D.J.: Generalized product of experts for automatic and principled
fusion of Gaussian Process Predictions. CoRR abs/1410.7827 (2014)
2. Gibson, E., Li, W., et al.: NiftyNet: a deep-learning platform for medical imaging. Computer Methods and Programs in Biomedicine 158, 113–122 (2018)
3. Havaei, M., Guizard, N., Chapados, N., Bengio, Y.: HeMIS: Hetero-modal image segmentation. In: MICCAI 2016. pp. 469–477. Springer, Cham (2016)
4. Isensee, F., et al.: No new-net. In: Brainlesion: Glioma, Multiple Sclerosis, Stroke
and Traumatic Brain Injuries. pp. 234–244. Springer, Cham (2019)
5. Kingma, D.P., Welling, M.: Auto-encoding variational bayes. In: ICLR (2014)
6. Li, R., Zhang, W., Suk, H.I., Wang, L., Li, J., Shen, D., Ji, S.: Deep learning based
imaging data completion for improved brain disease diagnosis. In: MICCAI 2014.
pp. 305–312. Springer, Cham (2014)
7. Menze, B.H., et al.: The multimodal brain tumor image segmentation benchmark (BRATS). IEEE Transactions on Medical Imaging 34, 1993–2024 (2015)
8. Milletari, F., Navab, N., Ahmadi, S.A.: V-Net: Fully convolutional neural networks for volumetric medical image segmentation. In: International Conference on 3D Vision (3DV). pp. 565–571 (2016)
9. Myronenko, A.: 3D MRI brain tumor segmentation using autoencoder regularization. In: Brainlesion: Glioma, Multiple Sclerosis, Stroke and Traumatic Brain Injuries. pp. 311–320. Springer, Cham (2019)
10. Ronneberger, O., Fischer, P., Brox, T.: U-net: Convolutional networks for biomed-
ical image segmentation. In: MICCAI 2015. pp. 234–241. Springer, Cham (2015)
11. Sønderby, C.K., Raiko, T., Maaløe, L., Sønderby, S.K., Winther, O.: Ladder variational autoencoders. In: NeurIPS. pp. 3738–3746 (2016)
12. Varsavsky, T., et al.: PIMMS: permutation invariant multi-modal segmentation.
In: DLMIA 2018, MICCAI 2018. pp. 201–209 (2018)
13. Wu, M., Goodman, N.: Multimodal generative models for scalable weakly-supervised learning. In: NeurIPS. pp. 5580–5590 (2018)
14. Zhao, S., Song, J., Ermon, S.: Learning hierarchical features from deep generative
models. In: ICML. pp. 4091–4099 (2017)