Automatic Radiology Report Generation Based On Multi-View Image Fusion and Medical Concept Enrichment
1 Introduction
Medical images are widely used in clinical decision-making. For example, chest
x-ray images are used for diagnosing pneumonia and pleural effusion. The inter-
pretation of medical images requires extensive expertise and is prone to human
errors. Considering the demand for accurately interpreting large volumes of medical images within a short time, an automatic medical imaging report generation model can help alleviate the labor intensity involved in the task. In
this work, we aim to propose a novel medical imaging report generation model
focusing on radiology. To be more specific, the inputs of the proposed frame-
work are chest x-ray images under different views (frontal and lateral) based
on which radiology reports are generated accordingly. Radiology reports contain
information summarized by radiologists and are important for further diagnosis
and follow-up recommendations.
The problem setting is similar to image captioning, where the objective is
to generate descriptions for natural images. Most existing studies apply similar
structures including an encoder based on convolutional neural networks (CNN),
and a decoder based on recurrent neural networks (RNN) [11] which captures the
temporal information and is widely used in natural language processing (NLP).
Attention models have been applied in captioning to connect the visual contents
and semantics selectively [13]. More recently, studies on radiology report genera-
tion have shown promising results. To handle paragraph-level generation, a hier-
archical LSTM decoder has been applied to generate medical imaging reports [6]
incorporating visual and tag attentions. Xue et al. build an iterative decoder
with visual attentions to enforce the coherence between sentences [14]. Li et al.
propose a retrieval model based on extracted disease graphs for medical report
generation [7]. Medical report generation is different from image captioning in
that: (1) data in medical and clinical domains is often limited in scale, and thus it is difficult to obtain robust models for reasoning; (2) medical reports are paragraphs rather than single sentences as in image captioning, and conventional RNN decoders such as long short-term memory (LSTM) suffer from gradient vanishing over such long sequences; and (3) generating medical reports requires higher precision when used in practice, especially for medical content such as disease diagnoses.
We choose the widely used Indiana University Chest X-ray radiology report
dataset (IU-RR) [1] for this task. In most cases, radiology reports contain de-
scriptive findings in the form of paragraphs, and conclusive impressions in one or
a few sentences. To address the challenges mentioned above, we aim to improve
both the encoder and decoder in the following aspects:
First, we construct a multi-task scheme consisting of chest x-ray image classi-
fication and report generation. This strategy has been shown to be successful because the encoder is forced to learn radiology-related features for decoding [6].
Since the data scale of IU-RR is small, encoder pretraining is important in order
to obtain robust performance. Different from previous studies using ImageNet, which is collected for general-purpose object recognition, we pretrain with large-scale chest x-ray images from the same domain, namely CheXpert [5], to better capture domain-specific image features for decoding. Second, most previous
studies using chest x-ray images for disease classification and report generation
consider the frontal and lateral images from the same patient as two independent
cases [6,12]. We argue that lateral images contain complementary information to frontal images in the process of interpreting medical images. Such multi-view features should be synthesized selectively rather than contributing equally (via concatenation, mean, or sum) to the final results. Moreover, treating the views independently is likely to generate inconsistent results for the same patient based on images from different views.
Fig. 1. Overall framework of the proposed encoder and decoder with attentions. E, D, and D′ denote the encoder, sentence decoder, and word decoder, respectively.
2 Methodology
The encoder classifies the chest x-ray observations from both the frontal (f) and lateral (l) views and is trained with the multi-label classification loss in Equation 1, whose last term encourages consistent predictions between the two views of the same patient:

$$\mathcal{L}_I = -\sum_{v\in\{f,l\}}\sum_{i,j}\Big[y^{v}_{i,j}\log\hat{y}^{v}_{i,j} + \big(1-y^{v}_{i,j}\big)\log\big(1-\hat{y}^{v}_{i,j}\big)\Big] + \lambda\sum_i\big\|\hat{y}^{f}_{i}-\hat{y}^{l}_{i}\big\|^{2} \qquad (1)$$
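For illustration, a minimal PyTorch-style sketch of this loss is given below. The function name, tensor shapes, use of sigmoid probabilities, and summed reduction are assumptions of this sketch, not the authors' released code.

```python
import torch
import torch.nn.functional as F

def multi_view_classification_loss(logits_f, logits_l, labels, lam=1.0):
    """Sketch of Equation 1: multi-label cross-entropy over both views plus a
    consistency penalty between frontal and lateral predictions.

    logits_f, logits_l: (batch, num_labels) raw scores for the frontal/lateral views
    labels:             (batch, num_labels) multi-hot ground-truth observations (float)
    lam:                weight of the cross-view consistency term (lambda in Eq. 1)
    """
    probs_f = torch.sigmoid(logits_f)
    probs_l = torch.sigmoid(logits_l)

    # Multi-label cross-entropy for each view, summed as in Equation 1.
    bce = F.binary_cross_entropy(probs_f, labels, reduction="sum") + \
          F.binary_cross_entropy(probs_l, labels, reduction="sum")

    # Encourage consistent predictions for the same patient across the two views.
    consistency = ((probs_f - probs_l) ** 2).sum()

    return bce + lam * consistency
```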
The decoder then generates radiology reports from the encoded features. It contains two layers: a sentence LSTM decoder that outputs sentence hidden states, and a word LSTM decoder that decodes the sentence hidden states into natural language. In this way, reports are generated sentence by sentence.
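To make the two-level decoding concrete, a simplified skeleton is sketched below. The module names, greedy argmax decoding, and fixed sentence/word limits are assumptions of this illustration; the multi-view fusion, attentions, and medical concepts described next are omitted here.

```python
import torch
import torch.nn as nn

class HierarchicalDecoder(nn.Module):
    """Illustrative sketch of a sentence-level LSTM driving a word-level LSTM."""

    def __init__(self, feat_dim, hidden_dim, vocab_size, max_sents=8, max_words=20):
        super().__init__()
        self.sent_lstm = nn.LSTMCell(feat_dim, hidden_dim)    # sentence decoder D
        self.word_lstm = nn.LSTMCell(hidden_dim, hidden_dim)  # word decoder D'
        self.word_out = nn.Linear(hidden_dim, vocab_size)
        self.max_sents, self.max_words = max_sents, max_words

    def forward(self, visual_context):
        # visual_context: (batch, feat_dim) fused multi-view image feature
        batch = visual_context.size(0)
        h_s = visual_context.new_zeros(batch, self.sent_lstm.hidden_size)
        c_s = visual_context.new_zeros(batch, self.sent_lstm.hidden_size)
        report = []
        for _ in range(self.max_sents):                  # sentence-level steps
            h_s, c_s = self.sent_lstm(visual_context, (h_s, c_s))
            h_w, c_w = h_s, torch.zeros_like(h_s)        # seed the word decoder with the sentence state
            sentence = []
            for _ in range(self.max_words):              # word-level steps
                h_w, c_w = self.word_lstm(h_s, (h_w, c_w))
                sentence.append(self.word_out(h_w).argmax(dim=-1))
            report.append(torch.stack(sentence, dim=1))  # (batch, max_words) token ids per sentence
        return report
```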
Sentence Decoder with Attentions: The sentence decoder is fed with visual
features extracted from the encoder, and generates sentence hidden states. Since
we have both frontal and lateral features, the selection of fusion schemes is
important. As shown in Figure 2, we propose and compare three fusion schemes: an intuitive solution that directly concatenates the features from both views; early fusion, where the concatenated features are selectively attended by the previous hidden state; and late fusion, which fuses the hidden states produced by the two decoders after visual-sentence attention. To generate the sentence hidden state $h_{t_s}$ at time step $t_s \in (1, N_s)$, we compute the visual attention weights $\alpha_i$ with Equation 2, where $v_m$ is the $m$-th local feature and $W_a$, $W_v$, and $W_s$ are weight matrices. Leveraging all $k$ local regions, the attended local feature is calculated as $v_{att} = \sum_{m=1}^{k}\alpha_{i,m} v_m$; it is concatenated with the previous hidden state and fed into the sentence LSTM to compute the current hidden state $h_{t_s}$.
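A sketch of this visual-sentence attention is given below. Since Equation 2 itself is not reproduced in this excerpt, the additive score $W_a\tanh(W_v v_m + W_s h_{t_s-1})$ used here is assumed by analogy with Equation 3, and the class and parameter names are illustrative.

```python
import torch
import torch.nn as nn

class VisualSentenceAttention(nn.Module):
    """Attends over k local visual features given the previous sentence hidden state.

    The additive score a_m = W_a tanh(W_v v_m + W_s h_prev) is an assumed form of
    Equation 2, mirroring the concept attention in Equation 3.
    """

    def __init__(self, feat_dim, hidden_dim, attn_dim):
        super().__init__()
        self.W_v = nn.Linear(feat_dim, attn_dim, bias=False)
        self.W_s = nn.Linear(hidden_dim, attn_dim, bias=False)
        self.W_a = nn.Linear(attn_dim, 1, bias=False)

    def forward(self, local_feats, h_prev):
        # local_feats: (batch, k, feat_dim); h_prev: (batch, hidden_dim)
        scores = self.W_a(torch.tanh(self.W_v(local_feats) +
                                     self.W_s(h_prev).unsqueeze(1)))  # (batch, k, 1)
        alpha = torch.softmax(scores, dim=1)                          # attention weights alpha
        v_att = (alpha * local_feats).sum(dim=1)                      # attended local feature v_att
        return v_att, alpha.squeeze(-1)
```

Under early fusion, one such module would attend over the concatenated frontal and lateral features; under late fusion, the frontal and lateral features would each be attended by their own sentence decoder and the resulting hidden states fused afterwards.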
Word Decoder with Attentions: Incorporated with the obtained medical concepts, the sentence hidden states are used as inputs to the word LSTM decoder. For each word decoding step, the previous word hidden state $\hat{h}_{t_w}$ for time step $t_w \in (1, N_w^{t_s})$ is used to generate the word distribution over the vocabulary and output the word with the highest score. The embedding $w_{t_w}$ of the predicted word $\hat{w}_{t_w}$ is then fused with the medical concepts to generate the next word hidden state. Given the medical concept embeddings $c \in \mathbb{R}^{n \times d_c}$ for the $n$ medical concepts of the $i$-th sample, and the predicted concept distribution $\hat{y}^c_i$, the attention weights over all medical concepts at time step $t_w$ are defined in Equation 3, where $W_{ac}$, $W_c$, and $W_w$ are weight matrices to be learned.
$$a^{c}_{i} = W_{ac}\tanh\big(\hat{y}^{c}_{i} W_c\, c + W_w \hat{h}_{t_w-1}\big), \qquad \alpha^{c}_{i} = \mathrm{softmax}(a^{c}_{i}) \qquad (3)$$
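One possible reading of Equation 3 is sketched below; here each projected concept embedding is weighted by its predicted score before the attention scoring, which is an interpretive assumption rather than the paper's exact formulation, and the returned attended context is likewise an assumption about how the weights are used downstream.

```python
import torch
import torch.nn as nn

class MedicalConceptAttention(nn.Module):
    """Attends over medical concept embeddings given the previous word hidden state.

    Weighting W_c(c) by y_hat is one reading of the term \hat{y}_i^c W_c c in
    Equation 3; the exact composition is an assumption of this sketch.
    """

    def __init__(self, concept_dim, hidden_dim, attn_dim):
        super().__init__()
        self.W_c = nn.Linear(concept_dim, attn_dim, bias=False)
        self.W_w = nn.Linear(hidden_dim, attn_dim, bias=False)
        self.W_ac = nn.Linear(attn_dim, 1, bias=False)

    def forward(self, concept_emb, y_hat, h_prev):
        # concept_emb: (batch, n, concept_dim); y_hat: (batch, n); h_prev: (batch, hidden_dim)
        weighted = self.W_c(concept_emb) * y_hat.unsqueeze(-1)          # scale concepts by predicted scores
        scores = self.W_ac(torch.tanh(weighted +
                                      self.W_w(h_prev).unsqueeze(1)))   # (batch, n, 1)
        alpha = torch.softmax(scores, dim=1)                            # concept attention weights
        context = (alpha * concept_emb).sum(dim=1)                      # attended concept context
        return context, alpha.squeeze(-1)
```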
The word-level loss $\mathcal{L}_W$ is computed between the predicted word distribution $\hat{y}^{w}_{t_w}$ and the ground truth $y^{w}_{t_w}$ using Equation 4.

$$\mathcal{L}_W = -\sum_{t_s=1}^{N_s}\sum_{t_w=1}^{N^{t_s}_{w}} y^{w}_{t_w}\log\hat{y}^{w}_{t_w} \qquad (4)$$
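A minimal sketch of this word-level loss follows; the padded tensor layout and the dedicated padding index are assumptions of the illustration.

```python
import torch
import torch.nn.functional as F

def word_level_loss(word_logits, target_ids, pad_id=0):
    """Summed negative log-likelihood over all sentences and words (Equation 4).

    word_logits: (batch, n_sents, n_words, vocab_size) word decoder outputs
    target_ids:  (batch, n_sents, n_words) ground-truth token indices, padded with pad_id
    """
    vocab_size = word_logits.size(-1)
    log_probs = F.log_softmax(word_logits, dim=-1)
    # Padding positions are ignored so they do not contribute to the loss.
    return F.nll_loss(log_probs.reshape(-1, vocab_size),
                      target_ids.reshape(-1),
                      ignore_index=pad_id,
                      reduction="sum")
```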
3 Experiment
Data Collection: CheXpert [5] contains 224,316 multi-view chest x-ray images from 65,240 patients, labeled for 14 common radiographic observations. The observation labels (positive, negative, or uncertain) are extracted from the associated radiology reports using NLP tools. We inherit the uncertain predictions and visualize them to draw additional expert attention in practical use. An alternative dataset is ChestX-ray14 [12]; we chose CheXpert because its labeler is reported to be more reliable than that of ChestX-ray14 [5].
Since neither of the aforementioned datasets released radiology reports, we use IU-RR [1] for evaluating radiology report generation. For preprocessing, we first removed samples without multi-view images, and concatenated the “findings” and “impression” sections, because in some reports all content appears in only one of the two sections with the other left blank. We then filtered out reports with fewer than three sentences. In the end, we obtained 3,074 samples with multi-view images, of which 20% (615 samples/1,330 images) are used for testing and the remaining 80% (2,459 samples/4,918 images) are used for training and validation. For encoder fine-tuning, we extract the same 14 labels as [5] on IU-RR. For report parsing, we converted the texts to tokens, and added “⟨start⟩” and “⟨end⟩” to the beginning and end of each sentence, respectively. Low-frequency words (fewer than 3 occurrences) were dropped, and textual errors, caused by text being falsely recognized as confidential information during the original de-identification of IU-RR, were replaced with “⟨unk⟩”.
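The preprocessing above can be sketched as follows; the sentence-splitting heuristic, the dictionary keys, and the plain-ASCII special-token spellings are illustrative assumptions, and the replacement of de-identification artifacts with the unknown token is handled separately.

```python
import re
from collections import Counter

MIN_SENTS, MIN_FREQ = 3, 3

def preprocess_reports(reports):
    """reports: list of dicts with 'findings' and 'impression' text fields."""
    tokenized = []
    for r in reports:
        # Concatenate the findings and impression sections (either may be blank).
        text = " ".join(filter(None, [r.get("findings"), r.get("impression")]))
        # Naive sentence split; the exact tokenizer is not specified in the excerpt.
        sents = [s.strip().lower() for s in re.split(r"\.\s+", text) if s.strip()]
        if len(sents) < MIN_SENTS:          # drop reports with fewer than 3 sentences
            continue
        tokenized.append([["<start>"] + s.split() + ["<end>"] for s in sents])

    # Drop low-frequency words (fewer than 3 occurrences in the corpus).
    counts = Counter(w for rep in tokenized for sent in rep for w in sent)
    keep = {w for w, c in counts.items() if c >= MIN_FREQ}
    return [[[w for w in sent if w in keep] for sent in rep] for rep in tokenized]
```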
Radiology Report Generation: The evaluation metrics we use are BLEU [9],
METEOR [2], and ROUGE [8] scores, all of which are widely used in image cap-
tioning and machine translation tasks. We compared the proposed model with
several state-of-the-art baselines: (1) a visual-attention-based image captioning model (Vis-Att) [13]; (2) radiology report generation models, including a hierarchical decoder with co-attention (Co-Att) [6], a multimodal generative model with visual attention (MM-Att) [14], and knowledge-driven retrieval-based report generation (KERP) [7]; and (3) the proposed multi-view encoder with hierarchical decoder (MvH) model, the base model with visual attentions and early fusion (MvH+AttE), MvH with the late fusion scheme (MvH+AttL), and the combination of late fusion with medical concepts (MvH+AttL+MC). MvH+AttL+MC* is an oracle run based on ground-truth medical concepts and is considered the upper bound of the improvement from applying medical concepts. As shown in
Table 2, our proposed models generally outperform the state-of-the-art base-
lines. Compared with MvH, multi-view feature fusions by attentions (AttE and
AttL) yield better results. Applying medical concepts significantly improves the performance, especially on METEOR, because recall rises when more semantic information is provided directly to the word decoder, and METEOR weights recall more heavily than precision. However, the improvement is limited by prediction errors on the medical concepts, indicating that a better encoder would benefit the whole model by a large margin, as shown by MvH+AttL+MC*.
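For reference, a small example of computing BLEU with NLTK is shown below; the example sentences, tokenization, and smoothing choice are illustrative, and METEOR and ROUGE have analogous toolkit implementations.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "the heart size is within normal limits .".split()
hypothesis = "heart size is normal .".split()

# Illustrative per-sentence BLEU-1..4; papers typically report corpus-level scores.
smooth = SmoothingFunction().method1
for n in range(1, 5):
    weights = tuple([1.0 / n] * n)
    score = sentence_bleu([reference], hypothesis, weights=weights,
                          smoothing_function=smooth)
    print(f"BLEU-{n}: {score:.3f}")
```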
Discussion: As Figure 3 shows, AttL (and the other baseline models) have difficulty generating abnormality descriptions and locations because, unlike our proposed model, no explicit abnormality information is involved in word-level decoding.
Not all the predicted medical concepts would necessarily appear in the generated
reports. On the other hand, the prediction errors from the encoder propagate,
such as predicting “right” instead of “right lung”, and affect the generated re-
ports, suggesting that a more accurate encoder is beneficial. Moreover, since there are no constraints on the sentence decoder during training, our model is likely to generate similar sentence hidden states. In this case, a stacked attention mechanism would help force the decoder to focus on different image
sub-regions. In addition, we observe that it is difficult for our model to generate
unseen sentences, and sometimes there are syntax errors. Such errors are due to the limited corpus scale of IU-RR, and we expect that exploring unpaired textual data to pretrain the decoder would address such limitations [3].
Fig. 3. An example report generated by the proposed model. The medical concepts marked red are false (positive/negative) predictions. The underlined sentences are abnormality descriptions. Uncertain predictions are visualized using Grad-CAM [10].
4 Conclusions
References
1. Demner-Fushman, D., Kohli, M.D., Rosenman, M.B., Shooshan, S.E., Rodriguez,
L., Antani, S.K., Thoma, G.R., McDonald, C.J.: Preparing a collection of radiology
examinations for distribution and retrieval. JAMIA 23(2), 304–310 (2016)
2. Denkowski, M., Lavie, A.: Meteor universal: Language specific translation evalua-
tion for any target language. In: Proceedings of the ninth workshop on statistical
machine translation. pp. 376–380 (2014)
3. Feng, Y., Ma, L., Liu, W., Luo, J.: Unsupervised image captioning. In: Proceedings
of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 4125–
4134 (2019)
4. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In:
Proceedings of the IEEE conference on Computer Vision and Pattern Recognition.
pp. 770–778 (2016)
5. Irvin, J., Rajpurkar, P., Ko, M., Yu, Y., Ciurea-Ilcus, S., Chute, C., Marklund, H.,
Haghgoo, B., Ball, R., Shpanskaya, K., et al.: Chexpert: A large chest radiograph
dataset with uncertainty labels and expert comparison. arXiv:1901.07031 (2019)
6. Jing, B., Xie, P., Xing, E.P.: On the automatic generation of medical imaging
reports. In: Proceedings of the 56th Annual Meeting of the Association for Com-
putational Linguistics, ACL 2018, Melbourne, Australia. pp. 2577–2586 (2018)
7. Li, C.Y., Liang, X., Hu, Z., Xing, E.P.: Knowledge-driven encode, retrieve, para-
phrase for medical image report generation. arXiv:1903.10122 (2019)
8. Lin, C.Y.: Rouge: A package for automatic evaluation of summaries. In: Proceed-
ings of the ACL-04 Workshop. pp. 74–81. Association for Computational Linguis-
tics, Barcelona, Spain (July 2004)
9. Papineni, K., Roukos, S., Ward, T., Zhu, W.J.: Bleu: a method for automatic
evaluation of machine translation. In: Proceedings of the 40th annual meeting on
association for computational linguistics. pp. 311–318 (2002)
10. Selvaraju, R.R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., Batra, D.: Grad-
cam: Visual explanations from deep networks via gradient-based localization. In:
Proceedings of the IEEE International Conference on Computer Vision. pp. 618–
626 (2017)
11. Vinyals, O., Toshev, A., Bengio, S., Erhan, D.: Show and tell: A neural image
caption generator. arXiv:1411.4555 (2015)
12. Wang, X., Peng, Y., Lu, L., Lu, Z., Bagheri, M., Summers, R.M.: Chestx-ray8:
Hospital-scale chest x-ray database and benchmarks on weakly-supervised classi-
fication and localization of common thorax diseases. In: Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition. pp. 2097–2106 (2017)
13. Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhudinov, R., Zemel, R.,
Bengio, Y.: Show, attend and tell: Neural image caption generation with visual
attention. arXiv:1502.03044 (2015)
14. Xue, Y., Xu, T., Long, L.R., Xue, Z., Antani, S.K., Thoma, G.R., Huang, X.: Mul-
timodal recurrent model with attention for automated radiology report generation.
In: Medical Image Computing and Computer Assisted Intervention 2018, Granada,
Spain, Proceedings, Part I. pp. 457–466 (2018)