mmFormer: Multimodal Medical Transformer for Incomplete Multimodal Learning of Brain Tumor Segmentation

Yao Zhang1,2⋆, Nanjun He3⋆, Jiawei Yang4, Yuexiang Li3, Dong Wei3, Yawen Huang3, Yang Zhang5, Zhiqiang He5, and Yefeng Zheng3

1 Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
2 University of Chinese Academy of Sciences, Beijing, China
3 Jarvis Lab, Tencent, Shenzhen, China
4 Electrical and Computer Engineering, University of California, Los Angeles, USA
5 Lenovo Research, Beijing, China
[email protected]

arXiv:2206.02425v2 [eess.IV] 4 Aug 2022
1 Introduction
Automated and accurate segmentation of brain tumors plays an essential role
in clinical assessment and diagnosis. Magnetic Resonance Imaging (MRI) is a
common neuroimaging technique for the quantitative evaluation of brain tu-
mors in clinical practice, where multiple imaging modalities, i.e., T1-weighted
(T1), contrast-enhanced T1-weighted (T1c), T2-weighted (T2), and Fluid Atten-
uated Inversion Recovery (FLAIR) images, are provided. Each imaging modality
provides a distinctive contrast of the brain structure and pathology. The joint
learning of multimodal images for brain tumor segmentation is essential and can
significantly boost the segmentation performance. Many methods have been explored to effectively fuse multimodal MRI for brain tumor segmentation, for example, by concatenating multimodal images along the channel dimension as the input or by fusing features in the latent space [23,17]. However, in clinical
practice, it is not always possible to acquire a complete set of MRIs due to
data corruption, various scanning protocols, and unsuitable conditions of pa-
tients. In this situation, most existing multimodal methods may fail to deal with
incomplete imaging modalities and face a severe degradation in segmentation
performance. Consequently, a robust multimodal method is highly desired for a
flexible and practical clinical application with one or more missing modalities.
Incomplete multimodal learning, also known as hetero-modal learning [8],
aims at designing methods that are robust to any subset of available modalities
at inference. A straightforward strategy for incomplete multimodal learning of
brain tumor segmentation is synthesizing the missing modalities by generative
models [18]. Another stream of methods explores knowledge distillation from
complete modalities to incomplete ones [2,10,21]. Although promising results
are obtained, such methods have to train and deploy a specific model for each
subset of missing modalities, which is complicated and burdensome in clinical
application. Zhang et al. [22] proposed an ensemble of single-modal models with adaptive fusion to achieve multimodal segmentation. However, it
only works when one or all modalities are available. Meanwhile, all these methods
require complete modalities during the training process.
Recent methods focused on learning a unified model, instead of a set of distilled networks, for incomplete multimodal segmentation [8,16]. For exam-
ple, HeMIS [8] learns an embedding of multimodal information by computing
mean and variance across features from any number of available modalities. U-HVED [4] further introduces a multimodal variational auto-encoder to benefit incomplete multimodal segmentation with the generation of missing modalities. More
recent methods also proposed to exploit feature disentanglement [1] and attention mechanisms [3] for robust multimodal brain tumor segmentation. Fully Convolutional Networks (FCNs) [11,15] have achieved great success in medical image segmentation and are widely used for feature extraction in the methods mentioned above. Despite their excellent performance, the inductive bias of convolution, i.e., locality, makes it difficult for an FCN to explicitly build long-range dependencies. In incomplete multimodal learning of brain tumor segmentation, the features extracted with limited receptive fields tend to be biased when dealing with varying combinations of available modalities.
[Fig. 1: Overview of the proposed mmFormer. Each modality (FLAIR, T1c, T1, T2) is processed by a hybrid modality-specific encoder consisting of a convolutional encoder and an intra-modal Transformer (linear projection, position encoding (PE), layer normalization, multi-head self-attention over Q, K, and V, and an FFN). The resulting modality features, gated by Bernoulli indicators δ, are fused by an inter-modal Transformer, reshaped, and passed to a convolutional decoder with upsampling; auxiliary regularizers are attached to the encoders and decoder.]
2 Method
The hybrid modality-specific encoder aims to extract both local and global con-
text information within a specific modality by bridging a convolutional encoder
and an intra-modal Transformer. We denote the complete set of modalities by
M = {FLAIR, T1c, T1, T2}. Given an input of Xm ∈ R1×D×H×W with a size
of D × H × W , m ∈ M , we first utilize the convolutional encoder to generate
compact feature maps with the local context and then leverage the intra-modal
Transformer to model the long-range dependency in a global space.
Convolutional Encoder. The convolutional encoder is constructed by stacking
convolutional blocks, similar to the encoder part of U-Net [15]. The feature
maps with the local context within each modality produced by the convolutional encoder $F^{conv}_m$ can be formulated as
$$F^{local}_m = F^{conv}_m(X_m; \theta^{conv}_m), \quad (1)$$
where $F^{local}_m \in \mathbb{R}^{C \times \frac{D}{2^{l-1}} \times \frac{H}{2^{l-1}} \times \frac{W}{2^{l-1}}}$, $C$ is the channel dimension, and $l$ is the
number of the stages in the encoder. Concretely, we build a five-stage encoder,
and each stage consists of two convolutional blocks. Each block contains cascaded
group normalization, ReLU, and convolutional layers with kernel size of 3, while
the first convolutional block in the first stage only contains a convolutional layer.
Between two consecutive stages, a convolutional layer with a stride of 2 is employed to downsample the feature maps. The number of filters at each stage of the
encoder is 16, 32, 64, 128, and 256, respectively.
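To make this configuration concrete, the following is a minimal PyTorch sketch of one modality-specific convolutional encoder under the settings stated above (five stages, two pre-activation blocks per stage, stride-2 convolutions between stages, and 16/32/64/128/256 filters). The class names and the GroupNorm group count are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn


class ConvBlock(nn.Module):
    """Pre-activation block: GroupNorm -> ReLU -> 3x3x3 Conv.
    Norm/ReLU are skipped for the very first block of the network."""

    def __init__(self, in_ch, out_ch, first=False):
        super().__init__()
        layers = []
        if not first:
            layers += [nn.GroupNorm(8, in_ch), nn.ReLU(inplace=True)]  # group count assumed
        layers.append(nn.Conv3d(in_ch, out_ch, kernel_size=3, padding=1))
        self.block = nn.Sequential(*layers)

    def forward(self, x):
        return self.block(x)


class ConvEncoder(nn.Module):
    """Five-stage encoder with two blocks per stage and stride-2
    convolutions between consecutive stages (16, 32, 64, 128, 256 filters)."""

    def __init__(self, in_ch=1, channels=(16, 32, 64, 128, 256)):
        super().__init__()
        stages, prev = [], in_ch
        for i, ch in enumerate(channels):
            blocks = [ConvBlock(prev, ch, first=(i == 0)), ConvBlock(ch, ch)]
            if i < len(channels) - 1:
                # downsampling between consecutive stages
                blocks.append(nn.Conv3d(ch, channels[i + 1], 3, stride=2, padding=1))
                prev = channels[i + 1]
            stages.append(nn.Sequential(*blocks))
        self.stages = nn.ModuleList(stages)

    def forward(self, x):
        for stage in self.stages:
            x = stage(x)
        return x


# e.g., ConvEncoder()(torch.randn(1, 1, 128, 128, 128)).shape == (1, 256, 8, 8, 8)
```

With four downsampling steps, a 128³ input yields an 8³ feature map, matching the $2^{l-1}$ reduction in Eq. (1) for $l = 5$.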
$$F^{token}_m = F^{local}_m W_m + P_m, \quad (2)$$
where $F^{token}_m \in \mathbb{R}^{C' \times \frac{DHW}{2^{3(l-1)}}}$ denotes the tokens, $W_m$ denotes the weights of the linear projection, and $P_m$ denotes the position encoding. The MSA builds the relationship within each modality by looking over all possible locations in the feature map, which is formulated as
$$\text{head}^i_m = \text{Attention}(Q^i_m, K^i_m, V^i_m) = \text{softmax}\!\left(\frac{Q^i_m {K^i_m}^{T}}{\sqrt{d_k}}\right) V^i_m, \quad (3)$$
$$F^{global}_m = FFN_m(LN(z)) + z, \quad z = MSA_m(LN(F^{token}_m)) + F^{token}_m, \quad (5)$$
where $F^{global}_m \in \mathbb{R}^{C' \times \frac{DHW}{2^{3(l-1)}}}$.
where $F^{global} \in \mathbb{R}^{C' \times \frac{DHW}{2^{3(l-1)}}}$.
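For illustration, below is a minimal sketch of the intra-modal Transformer defined by Eqs. (2), (3), and (5): the convolutional features are flattened into tokens via a linear projection plus a learnable position embedding, then passed through a pre-norm multi-head self-attention/FFN block with residual connections. The embedding dimension, number of heads, and token count are placeholder assumptions, and PyTorch's built-in multi-head attention is used in place of Eq. (3).

```python
import torch
import torch.nn as nn


class IntraModalTransformer(nn.Module):
    """Tokenization (Eq. 2) followed by one pre-norm MSA/FFN block (Eqs. 3 and 5)."""

    def __init__(self, in_dim=256, embed_dim=512, num_heads=8, num_tokens=8 * 8 * 8):
        super().__init__()
        self.proj = nn.Linear(in_dim, embed_dim)                         # W_m in Eq. (2)
        self.pos = nn.Parameter(torch.zeros(1, num_tokens, embed_dim))   # P_m in Eq. (2)
        self.norm1 = nn.LayerNorm(embed_dim)
        self.msa = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(embed_dim)
        self.ffn = nn.Sequential(
            nn.Linear(embed_dim, 4 * embed_dim),
            nn.GELU(),
            nn.Linear(4 * embed_dim, embed_dim),
        )

    def forward(self, f_local):
        # f_local: (B, C, D', H', W') feature map from the convolutional encoder
        tokens = f_local.flatten(2).transpose(1, 2)                      # (B, D'H'W', C)
        tokens = self.proj(tokens) + self.pos                            # F_token, Eq. (2)
        q = self.norm1(tokens)
        z = tokens + self.msa(q, q, q, need_weights=False)[0]            # MSA branch of Eq. (5)
        return z + self.ffn(self.norm2(z))                               # F_global, Eq. (5)
```

With the five-stage encoder sketched earlier and a 128³ input, `f_local` has shape (B, 256, 8, 8, 8), so `num_tokens` would be 512.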
$$\mathcal{L}_{total} = \sum_{i \in M} \mathcal{L}^{encoder}_i + \sum_{i=1}^{l-1} \mathcal{L}^{decoder}_i + \mathcal{L}_{output}, \quad (9)$$
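A hedged sketch of how Eq. (9) could be assembled during training is given below: one auxiliary loss per modality-specific encoder, one deep-supervision loss per decoder stage, and the loss on the final output. Here `seg_loss` stands for whatever segmentation criterion is applied to each term (e.g., a Dice loss); the exact criterion, weighting, and argument structure are assumptions for illustration only.

```python
def total_loss(seg_loss, encoder_preds, decoder_preds, output_pred, target):
    """encoder_preds: dict {modality: prediction} from the auxiliary regularizers,
    decoder_preds: list of intermediate decoder predictions (stages 1..l-1),
    output_pred: final segmentation, target: ground-truth mask (all assumed to be
    resampled to matching resolutions beforehand)."""
    loss = sum(seg_loss(p, target) for p in encoder_preds.values())  # sum over i in M
    loss += sum(seg_loss(p, target) for p in decoder_preds)          # sum over i = 1..l-1
    loss += seg_loss(output_pred, target)                            # L_output
    return loss
```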
Table 1. Results of the proposed method and state-of-the-art unified models, i.e., HeMIS [8] and U-HVED [4], on the BraTS 2018 dataset [12]. Dice similarity coefficient (DSC) [%] is employed for evaluation under every combination of available modalities. • and ◦ denote available and missing modalities, respectively.
https://www.med.upenn.edu/sbia/brats2018/data.html
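For clarity, the "every combination of available modalities" setting in Table 1 covers the 15 non-empty subsets of the four modalities; a small, purely illustrative helper to enumerate them:

```python
from itertools import combinations

MODALITIES = ("FLAIR", "T1c", "T1", "T2")
subsets = [c for r in range(1, len(MODALITIES) + 1)
           for c in combinations(MODALITIES, r)]
assert len(subsets) == 15  # 2^4 - 1 non-empty modality combinations
```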
[Figure: Qualitative results on a FLAIR image with the ground truth and segmentations obtained under the available modality sets {T1, T1c, T2, FLAIR}, {T1, T1c, T2}, {T1, T1c}, and {T1}.]
4 Conclusion
We proposed a Transformer-based method for incomplete multimodal learning
of brain tumor segmentation. The proposed mmFormer bridges Transformer
and CNN to build the long-range dependencies both within and across different
modalities of MR images for a modality-invariant representation. We validated
our method on brain tumor segmentation under various combinations of missing
modalities, and it outperformed state-of-the-art methods on the BraTS bench-
mark. Our method yields larger improvements when more modalities are missing and/or the target regions are more difficult to segment.
References
1. Chen, C., Dou, Q., Jin, Y., Chen, H., Qin, J., Heng, P.A.: Robust multimodal brain
tumor segmentation via feature disentanglement and gated fusion. In: International
Conference on Medical Image Computing and Computer Assisted Intervention. pp.
447–456. Springer (2019)
2. Chen, C., Dou, Q., Jin, Y., Liu, Q., Heng, P.A.: Learning with privileged multi-
modal knowledge for unimodal segmentation. IEEE Transactions on Medical Imag-
ing (2021)
3. Ding, Y., Yu, X., Yang, Y.: RFNet: Region-aware fusion network for incomplete
multi-modal brain tumor segmentation. In: Proceedings of the IEEE/CVF Inter-
national Conference on Computer Vision. pp. 3975–3984 (2021)
4. Dorent, R., Joutard, S., Modat, M., Ourselin, S., Vercauteren, T.: Hetero-modal
variational encoder-decoder for joint modality completion and segmentation. In:
International Conference on Medical Image Computing and Computer Assisted
Intervention. pp. 74–82. Springer (2019)
5. Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner,
T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is
worth 16x16 words: Transformers for image recognition at scale. arXiv preprint
arXiv:2010.11929 (2020)
6. Dou, Q., Yu, L., Chen, H., Jin, Y., Yang, X., Qin, J., Heng, P.A.: 3D deeply super-
vised network for automated segmentation of volumetric medical images. Medical
Image Analysis 41, 40–54 (2017)
7. Hatamizadeh, A., Tang, Y., Nath, V., Yang, D., Myronenko, A., Landman, B.,
Roth, H.R., Xu, D.: UNETR: Transformers for 3D medical image segmentation.
In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer
Vision. pp. 574–584 (2022)
8. Havaei, M., Guizard, N., Chapados, N., Bengio, Y.: HeMIS: Hetero-modal im-
age segmentation. In: International Conference on Medical Image Computing and
Computer Assisted Intervention. pp. 469–477. Springer (2016)
9. Hendrycks, D., Gimpel, K.: Gaussian error linear units (GELUs). arXiv preprint
arXiv:1606.08415 (2016)
10. Hu, M., Maillard, M., Zhang, Y., Ciceri, T., La Barbera, G., Bloch, I., Gori, P.:
Knowledge distillation from multi-modal to mono-modal segmentation networks.
In: International Conference on Medical Image Computing and Computer Assisted
Intervention. pp. 772–781. Springer (2020)
11. Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic
segmentation. In: Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition. pp. 3431–3440 (2015)
12. Menze, B.H., Jakab, A., Bauer, S., Kalpathy-Cramer, J., Farahani, K., Kirby, J.,
Burren, Y., Porz, N., Slotboom, J., Wiest, R., et al.: The multimodal brain tumor
image segmentation benchmark (BRATS). IEEE Transactions on Medical Imaging
34(10), 1993–2024 (2014)
13. Milletari, F., Navab, N., Ahmadi, S.A.: V-net: Fully convolutional neural networks
for volumetric medical image segmentation. In: Fourth International Conference
on 3D Vision. pp. 565–571. IEEE (2016)
14. Peiris, H., Hayat, M., Chen, Z., Egan, G., Harandi, M.: A volumetric transformer
for accurate 3D tumor segmentation. arXiv preprint arXiv:2111.13300 (2021)
15. Ronneberger, O., Fischer, P., Brox, T.: U-net: Convolutional networks for biomedi-
cal image segmentation. In: International Conference on Medical Image Computing
and Computer Assisted Intervention. pp. 234–241. Springer (2015)
16. Shen, Y., Gao, M.: Brain tumor segmentation on MRI with missing modalities.
In: International Conference on Information Processing in Medical Imaging. pp.
417–428. Springer (2019)
17. Tseng, K.L., Lin, Y.L., Hsu, W., Huang, C.Y.: Joint sequence learning and cross-
modality convolution for 3D biomedical segmentation. In: Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition. pp. 6393–6400 (2017)
18. Tulder, G.v., Bruijne, M.d.: Why does synthesized data improve multi-sequence
classification? In: International Conference on Medical Image Computing and Com-
puter Assisted Intervention. pp. 531–538. Springer (2015)
19. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser,
Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Pro-
cessing Systems 30 (2017)
20. Wang, W., Chen, C., Ding, M., Yu, H., Zha, S., Li, J.: TransBTS: Multimodal
brain tumor segmentation using Transformer. In: International Conference on Med-
ical Image Computing and Computer Assisted Intervention. pp. 109–119. Springer
(2021)
21. Wang, Y., Zhang, Y., Liu, Y., Lin, Z., Tian, J., Zhong, C., Shi, Z., Fan, J., He, Z.:
ACN: Adversarial co-training network for brain tumor segmentation with missing
modalities. In: International Conference on Medical Image Computing and Com-
puter Assisted Intervention. pp. 410–420. Springer (2021)
22. Zhang, Y., Yang, J., Tian, J., Shi, Z., Zhong, C., Zhang, Y., He, Z.: Modality-aware
mutual learning for multi-modal medical image segmentation. In: International
Conference on Medical Image Computing and Computer Assisted Intervention.
pp. 589–599. Springer (2021)
23. Zhou, C., Ding, C., Lu, Z., Wang, X., Tao, D.: One-pass multi-task convolutional
neural networks for efficient brain tumor segmentation. In: International Confer-
ence on Medical Image Computing and Computer Assisted Intervention. pp. 637–
645. Springer (2018)