
Automatic Radiology Report Generation based on Multi-view Image Fusion and Medical Concept Enrichment

Jianbo Yuan1, Haofu Liao1, Rui Luo2, and Jiebo Luo1

1 Department of Computer Science, University of Rochester
  {jyuan10, hliao6, jluo}@cs.rochester.edu
2 Futurewei Technologies, Inc., Bellevue, WA 98004, USA
  [email protected]

arXiv:1907.09085v2 [eess.IV] 23 Jul 2019

Abstract. Generating radiology reports is time-consuming and requires extensive expertise in practice. Therefore, reliable automatic radiology report generation is highly desired to alleviate the workload. Although deep learning techniques have been successfully applied to image classification and image captioning tasks, radiology report generation remains challenging in terms of understanding and linking complicated medical visual contents with accurate natural language descriptions. In addition, the data scales of open-access datasets that contain paired medical images and reports remain very limited. To cope with these practical challenges, we propose a generative encoder-decoder model and focus on chest x-ray images and reports, with the following improvements. First, we pretrain the encoder with a large number of chest x-ray images to accurately recognize 14 common radiographic observations, while taking advantage of the multi-view images by enforcing cross-view consistency. Second, we synthesize multi-view visual features based on a sentence-level attention mechanism in a late fusion fashion. In addition, to enrich the decoder with descriptive semantics and enforce the correctness of the deterministic medical-related contents, such as mentions of organs or diagnoses, we extract medical concepts from the radiology reports in the training data and fine-tune the encoder to extract the most frequent medical concepts from the x-ray images. Such concepts are fused into each decoding step by a word-level attention model. Experimental results on the Indiana University Chest X-Ray dataset demonstrate that the proposed model achieves state-of-the-art performance compared with other baseline approaches.

1 Introduction
Medical images are widely used in clinical decision-making. For example, chest x-ray images are used for diagnosing pneumonia and pleural effusion. The interpretation of medical images requires extensive expertise and is prone to human error. Considering the demand to accurately interpret large volumes of medical images within a short time, an automatic medical imaging report generation model can help alleviate the labor intensity involved in this task. In this work, we aim to propose a novel medical imaging report generation model focusing on radiology. More specifically, the inputs of the proposed framework are chest x-ray images under different views (frontal and lateral), based on which radiology reports are generated accordingly. Radiology reports contain information summarized by radiologists and are important for further diagnosis and follow-up recommendations.
The problem setting is similar to image captioning, where the objective is to generate descriptions for natural images. Most existing studies apply similar structures, including an encoder based on convolutional neural networks (CNN) and a decoder based on recurrent neural networks (RNN) [11], which capture temporal information and are widely used in natural language processing (NLP). Attention models have been applied in captioning to connect the visual contents and semantics selectively [13]. More recently, studies on radiology report generation have shown promising results. To handle paragraph-level generation, a hierarchical LSTM decoder has been applied to generate medical imaging reports [6], incorporating visual and tag attentions. Xue et al. build an iterative decoder with visual attentions to enforce coherence between sentences [14]. Li et al. propose a retrieval model based on extracted disease graphs for medical report generation [7]. Medical report generation differs from image captioning in that: (1) data in medical and clinical domains is often limited in scale, which makes it difficult to obtain robust models for reasoning; (2) medical reports are paragraphs rather than single sentences as in image captioning, and conventional RNN decoders such as long short-term memory (LSTM) suffer from vanishing gradients; and (3) generating medical reports requires higher precision when used in practice, especially on medical-related contents such as disease diagnoses.
We choose the widely used Indiana University Chest X-ray radiology report dataset (IU-RR) [1] for this task. In most cases, radiology reports contain descriptive findings in the form of paragraphs and conclusive impressions in one or a few sentences. To address the challenges mentioned above, we aim to improve both the encoder and the decoder in the following aspects.

First, we construct a multi-task scheme consisting of chest x-ray image classification and report generation. This strategy has been shown to be successful because the encoder is enforced to learn radiology-related features for decoding [6]. Since the data scale of IU-RR is small, encoder pretraining is important for robust performance. Different from previous studies using ImageNet, which is collected for general-purpose object recognition, we pretrain with large-scale chest x-ray images from the same domain, namely CheXpert [5], to better capture domain-specific image features for decoding. Second, most previous studies using chest x-ray images for disease classification and report generation consider the frontal and lateral images from the same patient as two independent cases [6,12]. We argue that lateral images contain information complementary to frontal images in the process of interpreting medical images. Such multi-view features should be synthesized selectively rather than contributing equally (via concatenation, mean, or sum) to the final results. Moreover, treating views independently is likely to produce inconsistent results for the same patient based on images from different views. We propose to synthesize multi-view information by applying a sentence-level attention model, and to enforce the encoder to extract consistent features with a cross-view consistency (CVC) loss.

Fig. 1. Overall framework of the proposed encoder and decoder with attentions. E, D, and D' denote the encoder, sentence decoder, and word decoder, respectively.
From the decoder side, we use a hierarchical LSTM (sentence-level and word-level LSTMs) to generate radiology reports. RNN decoders tend to memorize word distributions and patterns that frequently occur in the training data, and thus may produce inaccurate results when the target contents have not been observed, or when multiple patterns share similar distributions given the previous contents. Such limitations significantly hinder the credibility of machine-generated results in practical use, since incorrect medical-related contents can be very misleading, for example, generating "left-sided pleural effusion" while the ground truth is "right-sided pleural effusion". In addition, the source visual contents are far removed from the target word decoder in a hierarchical LSTM, which makes the generation process more difficult. To address such issues, we explore the semantics conveyed in the textual contents and apply them directly to the word decoder. We first extract frequent medical concepts from the radiology reports and fine-tune the encoder to recognize such concepts. The obtained medical concepts contain explicit information to accurately generate deterministic medical-related contents, such as diagnoses, locations, and observations.
The main contributions of our work are summarized as follows: (1) to the best of our knowledge, we are the first to employ the latest CheXpert dataset to obtain a more robust encoder for radiology report generation; (2) we selectively incorporate multi-view visual contents with sentence-level attentions and enforce consistency between different views; (3) we extract and apply medical concepts to the decoder with word-level attentions to enhance the correctness of the medical-related contents; (4) our integrated framework outperforms the state-of-the-art baselines in the experiments; and (5) we visualize uncertain radiographic observations predicted by the encoder, as an added benefit, to direct more expert attention to such uncertainties for further analysis in practice.

2 Methodology

The proposed framework consists of a multi-view CNN encoder and a concept-enriched hierarchical LSTM decoder, as in Figure 1. We apply a multi-task scheme including: (1) radiographic observation classification, which pretrains and fine-tunes the encoder with large-scale images; (2) medical concept extraction; and (3) report generation that fuses all the information. Therefore, two datasets are used in this work: CheXpert [5], a large collection of chest x-ray images labeled with 14 common chest radiographic observations, used to pretrain the image encoder; and Indiana University Chest X-ray [1], which contains full radiology reports but at a considerably smaller scale, used for training and evaluating the report generation task.

2.1 Image Encoder


The encoder uses ResNet-152 [4] as the backbone and extracts visual features for predicting chest radiographic observations and for radiology report generation.
Chest Radiographic Observations: The task is formulated as multi-label classification over 14 common radiographic observations following [5]: enlarged cardiomediastinum, cardiomegaly, lung opacity, lung lesion, edema, consolidation, pneumonia, atelectasis, pneumothorax, pleural effusion, pleural other, fracture, support devices, and no finding. Compared with previous studies using encoders pretrained on ImageNet [6,14], pretraining with images from the same domain yields better results. We add one fully connected layer as the classifier and compute the binary cross entropy (BCE) loss. Additionally, we consider the frontal and lateral images of one patient as one input pair and enforce prediction consistency across views with a mean square error (MSE) loss over the multi-view encoder outputs. The loss function is defined in Equation 1, where $y_{i,j}$ is the $j$-th ground-truth entry ($j \in [1, 14]$) of the $i$-th sample ($i \in [1, N]$), and the frontal-view and lateral-view predictions are denoted $\hat{y}_i^f$ and $\hat{y}_i^l$. The encoder outputs global features after average pooling, and local features $v \in \mathbb{R}^{k \times d_v}$ from the last CNN block, where $k$ denotes the number of local regions and $d_v$ the feature dimension.

$$L_I = -\sum_{v \in \{f,l\}} \sum_{i,j} \left[ y_{i,j} \log \hat{y}_{i,j}^v + (1 - y_{i,j}) \log\left(1 - \hat{y}_{i,j}^v\right) \right] + \lambda \sum_i \left( \hat{y}_i^f - \hat{y}_i^l \right)^2 \quad (1)$$
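Below is a minimal PyTorch-style sketch of this loss, assuming the setup above but with hypothetical tensor names (`logits_f` and `logits_l` for the frontal and lateral encoder outputs, `labels` for the shared 14-dimensional ground truth); the reduction and weighting details are illustrative rather than the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def encoder_loss(logits_f, logits_l, labels, lam=1.0):
    """BCE over both views plus a cross-view consistency (MSE) term, as in Eq. 1."""
    probs_f = torch.sigmoid(logits_f)   # frontal-view predictions, shape (N, 14)
    probs_l = torch.sigmoid(logits_l)   # lateral-view predictions, shape (N, 14)
    # Binary cross entropy summed over the two views (the double sum in Eq. 1).
    bce = F.binary_cross_entropy(probs_f, labels, reduction="sum") + \
          F.binary_cross_entropy(probs_l, labels, reduction="sum")
    # Cross-view consistency: squared difference between frontal and lateral predictions.
    cvc = ((probs_f - probs_l) ** 2).sum()
    return bce + lam * cvc
```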

Medical Concepts: The textual reports contain descriptive information related to the visual contents that has not yet been exploited by existing models. In IU-RR, Medical Text Indexer (MTI) terms could potentially be used in a similar manner [6]. However, MTI terms are sometimes noisy and not normalized. Therefore, we use SemRep (https://semrep.nlm.nih.gov/) to extract normalized medical concepts that frequently occur in the training reports. We empirically set the minimal number of occurrences to 80 and obtained 69 medical concepts, which allows a decent detection accuracy. We fix the pretrained image encoder and add another fully connected layer on top as the concept classifier.
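A possible sketch of this concept classifier, assuming a PyTorch ResNet-152 encoder whose pooled feature dimension is 2048; the module and variable names are illustrative, not taken from the paper.

```python
import torch
import torch.nn as nn

NUM_CONCEPTS = 69  # medical concepts occurring at least 80 times in the training reports

class ConceptClassifier(nn.Module):
    def __init__(self, encoder, feat_dim=2048, num_concepts=NUM_CONCEPTS):
        super().__init__()
        self.encoder = encoder
        for p in self.encoder.parameters():
            p.requires_grad = False          # the pretrained encoder stays fixed
        self.fc = nn.Linear(feat_dim, num_concepts)

    def forward(self, x):
        with torch.no_grad():
            feats = self.encoder(x)          # global features after average pooling
        return self.fc(feats)                # multi-label concept logits
```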

2.2 Hierarchical Decoder


Fig. 2. Different fusion schemes for multi-view image features.

Since a conventional RNN decoder is not suitable for paragraph generation, we apply a hierarchical decoder, which has been widely used in paragraph encoding and decoding, to generate radiology reports. The decoder contains two layers: a sentence LSTM decoder that outputs sentence hidden states, and a word LSTM decoder that decodes the sentence hidden states into natural language. In this way, reports are generated sentence by sentence.
Sentence Decoder with Attentions: The sentence decoder is fed with the visual features extracted by the encoder and generates sentence hidden states. Since we have both frontal and lateral features, the choice of fusion scheme is important. As shown in Figure 2, we propose and compare three fusion schemes: an intuitive solution that directly concatenates the features from both views; early fusion, where the concatenated features are selectively attended by the previous hidden state; and late fusion, which fuses the hidden states produced by two decoders after visual-sentence attention. To generate the sentence hidden state $h_{t_s}$ at time step $t_s \in [1, N^s]$, we compute the visual attention weights $\alpha_i$ with Equation 2, where $v_m$ is the $m$-th local feature, and $W_a$, $W_v$, and $W_s$ are weight matrices.

$$a_i = W_a \tanh\left(W_v v_i + W_s h_{t_s - 1}\right), \qquad \alpha_i = \mathrm{softmax}(a_i) \quad (2)$$

By leveraging all the local regions, the attended local feature is calculated as $v_{att} = \sum_{m=1}^{k} \alpha_{i,m} v_m$, and is concatenated with the previous hidden state and fed into the sentence LSTM to compute the current hidden state $h_{t_s}$.
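A sketch of the sentence-level visual attention in Equation 2, written in PyTorch; the dimensions and module names are assumptions for illustration. Under the late fusion scheme described above, this attention would be applied per view before fusing the resulting hidden states.

```python
import torch
import torch.nn as nn

class VisualAttention(nn.Module):
    """Computes a_i = W_a tanh(W_v v_i + W_s h_{t_s-1}) and the attended feature v_att."""
    def __init__(self, d_v, d_h, d_a):
        super().__init__()
        self.W_v = nn.Linear(d_v, d_a, bias=False)  # projects local features v_i
        self.W_s = nn.Linear(d_h, d_a, bias=False)  # projects the previous sentence hidden state
        self.W_a = nn.Linear(d_a, 1, bias=False)    # scores each local region

    def forward(self, v, h_prev):
        # v: (k, d_v) local features; h_prev: (d_h,) previous sentence hidden state
        a = self.W_a(torch.tanh(self.W_v(v) + self.W_s(h_prev))).squeeze(-1)  # (k,)
        alpha = torch.softmax(a, dim=-1)
        v_att = (alpha.unsqueeze(-1) * v).sum(dim=0)  # attended local feature v_att
        return v_att, alpha
```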
Word Decoder with Attentions: Incorporating the obtained medical concepts, the sentence hidden states are used as inputs to the word LSTM decoder. At each word decoding step, the previous word hidden state $\hat{h}_{t_w}$ at time step $t_w \in [1, N_w^{t_s}]$ is used to generate the word distribution over the vocabulary, and the word with the highest score is output. The embedding $w_{t_w}$ of the predicted word $\hat{w}_{t_w}$ is then fused with the medical concepts in order to generate the next word hidden state. Given the medical concept embeddings $c \in \mathbb{R}^{n \times d_c}$ for the $p$ medical concepts of the $i$-th sample, and the predicted concept distribution $\hat{y}_i^c$, the attention weights over all medical concepts at time step $t_w$ are defined in Equation 3, where $W_{ac}$, $W_c$, and $W_w$ are the weight matrices to be learned.
$$a_i^c = W_{ac} \tanh\left(\hat{y}_i^c W_c c + W_w \hat{h}_{t_w - 1}\right), \qquad \alpha_i^c = \mathrm{softmax}(a_i^c) \quad (3)$$

Similar to the visual attention model, the attended medical concept feature is calculated as $c_{att} = \sum_{n=1}^{p} \alpha^c_{j,n} c_n$, and is concatenated with the previous word embedding to generate the next word. We use the cross entropy loss $L_W$ in Equation 4, given the predicted word distribution $\hat{y}^w_{t_w}$ and the ground truth $y^w_{t_w}$:

$$L_W = -\sum_{t_s=1}^{N^s} \sum_{t_w=1}^{N_w^{t_s}} y^w_{t_w} \log \hat{y}^w_{t_w} \quad (4)$$
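A companion sketch for the word-level concept attention in Equation 3, again in PyTorch with illustrative names; whether the predicted concept probabilities scale the embeddings before or after the projection $W_c$ is an implementation assumption.

```python
import torch
import torch.nn as nn

class ConceptAttention(nn.Module):
    """Attends over medical concept embeddings, weighted by the predicted concept scores."""
    def __init__(self, d_c, d_h, d_a):
        super().__init__()
        self.W_c = nn.Linear(d_c, d_a, bias=False)   # projects concept embeddings
        self.W_w = nn.Linear(d_h, d_a, bias=False)   # projects the previous word hidden state
        self.W_ac = nn.Linear(d_a, 1, bias=False)    # scores each concept

    def forward(self, concepts, concept_probs, h_prev):
        # concepts: (p, d_c); concept_probs: (p,) predicted concept distribution; h_prev: (d_h,)
        weighted = concept_probs.unsqueeze(-1) * self.W_c(concepts)   # \hat{y}^c scaling W_c c
        a = self.W_ac(torch.tanh(weighted + self.W_w(h_prev))).squeeze(-1)
        alpha = torch.softmax(a, dim=-1)
        c_att = (alpha.unsqueeze(-1) * concepts).sum(dim=0)           # attended concept feature c_att
        return c_att, alpha
```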

3 Experiment

Data Collection: CheXpert [5] contains 224,316 multi-view chest x-ray images from 65,240 patients, labeled with 14 common radiographic observations. The observation labels are generated from the radiology reports using NLP tools and marked as positive, negative, or uncertain. We inherit and visualize the uncertain predictions to direct more expert attention in practical use. An alternative dataset is ChestX-ray14 [12]. We chose CheXpert because its labeler is reported to be more reliable than that of ChestX-ray14 [5].
Since neither of the aforementioned datasets released radiology reports, we use IU-RR [1] for evaluating radiology report generation. For preprocessing, we first removed samples without multi-view images, and concatenated the "findings" and "impression" sections because in some reports all contents are in either the "findings" or the "impression" section with the other left blank. We filtered out reports with fewer than 3 sentences. In the end, we obtained 3,074 samples with multi-view images, of which 20% (615 samples / 1,330 images) are used for testing, and 80% (2,459 samples / 4,918 images) are used for training and validation. For encoder fine-tuning, we extract the same 14 labels as [5] on IU-RR. For report parsing, we converted the texts to tokens and added "⟨start⟩" and "⟨end⟩" to the beginning and end of each sentence, respectively. Low-frequency words (fewer than 3 occurrences) were dropped, and textual errors, caused by text being falsely recognized as confidential information during the original de-identification of IU-RR, were replaced with "⟨unk⟩".
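The preprocessing steps above could be sketched as follows; the field names, tokenization, and thresholds mirror the description, but the helper itself is illustrative rather than the authors' actual pipeline, and it simply maps low-frequency tokens to the ⟨unk⟩ token as a simplification.

```python
from collections import Counter

def preprocess_reports(reports, min_word_freq=3, min_sentences=3):
    # reports: list of dicts with "findings" and "impression" strings
    kept = []
    for r in reports:
        text = " ".join(filter(None, [r.get("findings", ""), r.get("impression", "")]))
        sentences = [s.strip() for s in text.lower().split(".") if s.strip()]
        if len(sentences) < min_sentences:
            continue  # drop reports with fewer than 3 sentences
        kept.append([["<start>"] + s.split() + ["<end>"] for s in sentences])

    # Map rare tokens to <unk>
    counts = Counter(tok for rep in kept for sent in rep for tok in sent)
    return [[[t if counts[t] >= min_word_freq else "<unk>" for t in sent] for sent in rep]
            for rep in kept]
```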

Table 1. Average ROC-AUC (avg-AUC) on radiographic observation classification.

Metrics   Base   ImgNet  CX     CX+CVC  CX+CVC+F
avg-AUC   0.727  0.731   0.747  0.751   0.764

Chest Radiographic Observations: We conducted extensive experiments on the encoder regarding two factors: how to properly pretrain and fine-tune the encoder, and how to leverage the multi-view information. The classification results on radiographic observations are shown in Table 1. In general, pretraining with ImageNet (ImgNet) performs marginally better than models without pretraining (Base), and encoders pretrained on CheXpert (CX) perform the best, indicating that pretraining with large-scale data in the same domain helps. Enforcing cross-view consistency (CX+CVC) further improves the results. We obtained the best result by fusing the multi-view predictions with a max operation (CX+CVC+F).
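A small sketch of the multi-view fusion used in CX+CVC+F, assuming PyTorch tensors of per-view sigmoid probabilities; the element-wise max keeps, for each observation, the more confident of the two views.

```python
import torch

def fuse_multiview(probs_frontal, probs_lateral):
    # probs_*: (N, 14) sigmoid outputs for the 14 radiographic observations
    return torch.maximum(probs_frontal, probs_lateral)
```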
Table 2. Evaluations of generated radiology reports.

Methods              BLEU-1  BLEU-2  BLEU-3  BLEU-4  METEOR  ROUGE
Vis-Att [13]         0.399   0.251   0.168   0.118   0.167   0.323
MM-Att [14]          0.464   0.358   0.270   0.195   0.274   0.366
KERP [7]             0.482   0.325   0.226   0.162   -       0.339
Co-Att [6]           0.517   0.386   0.306   0.247   0.217   0.447
(Ours) MvH           0.478   0.334   0.277   0.191   0.265   0.318
(Ours) MvH+AttE      0.483   0.337   0.285   0.228   0.282   0.335
(Ours) MvH+AttL      0.488   0.357   0.296   0.246   0.313   0.351
(Ours) MvH+AttL+MC   0.529   0.372   0.315   0.255   0.343   0.453
(Ours) MvH+AttL+MC*  0.649   0.500   0.413   0.303   0.418   0.496

Radiology Report Generation: The evaluation metrics we use are BLEU [9], METEOR [2], and ROUGE [8] scores, all of which are widely used in image captioning and machine translation tasks. We compared the proposed model with several state-of-the-art baselines: (1) a visual attention based image captioning model (Vis-Att) [13]; (2) radiology report generation models, including a hierarchical decoder with co-attention (Co-Att) [6], a multimodal generative model with visual attention (MM-Att) [14], and knowledge-driven retrieval based report generation (KERP) [7]; and (3) the proposed multi-view encoder with hierarchical decoder (MvH), the base model with visual attentions and early fusion (MvH+AttE), MvH with late fusion (MvH+AttL), and the combination of late fusion with medical concepts (MvH+AttL+MC). MvH+AttL+MC* is an oracle run based on ground-truth medical concepts and is considered the upper bound of the improvement attainable by applying medical concepts. As shown in Table 2, our proposed models generally outperform the state-of-the-art baselines. Compared with MvH, multi-view feature fusion by attention (AttE and AttL) yields better results. Applying medical concepts significantly improves the performance, especially on METEOR: recall rises when more semantic information is provided directly to the word decoder, and METEOR weights recall more heavily than precision. However, the improvement is subject to prediction errors on the medical concepts, indicating that a better encoder would benefit the whole model by a large margin, as shown by MvH+AttL+MC*.
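For reference, these metrics could be computed with common open-source implementations along the following lines (nltk for BLEU and METEOR, the rouge-score package for ROUGE-L); the paper does not state which toolkits were used, so exact scores may differ.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from nltk.translate.meteor_score import meteor_score
from rouge_score import rouge_scorer

def evaluate(reference, hypothesis):
    ref_toks, hyp_toks = reference.split(), hypothesis.split()
    smooth = SmoothingFunction().method1
    bleu = {f"BLEU-{n}": sentence_bleu([ref_toks], hyp_toks,
                                       weights=tuple([1.0 / n] * n),
                                       smoothing_function=smooth)
            for n in range(1, 5)}
    meteor = meteor_score([ref_toks], hyp_toks)  # recent nltk versions expect tokenized input
    rouge_l = rouge_scorer.RougeScorer(["rougeL"]).score(reference, hypothesis)["rougeL"].fmeasure
    return {**bleu, "METEOR": meteor, "ROUGE-L": rouge_l}
```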
Fig. 3. An example report generated by the proposed model. The medical concepts marked in red are false (positive/negative) predictions. The underlined sentences are abnormality descriptions. Uncertain predictions are visualized using Grad-CAM [10].

Discussion: As Figure 3 shows, AttL (and the other baseline models) have difficulty generating abnormalities and locations because, unlike in our proposed model, no explicit abnormality information is involved in word-level decoding. Not all the predicted medical concepts necessarily appear in the generated reports. On the other hand, prediction errors from the encoder, such as predicting "right" instead of "right lung", propagate and affect the generated reports, suggesting that a more accurate encoder would be beneficial. Moreover, since there are no constraints on the sentence decoder during training, our model is likely to generate similar sentence hidden states. In this case, a stacked attention mechanism would help force the decoder to focus on different image sub-regions. In addition, we observe that it is difficult for our model to generate unseen sentences, and there are sometimes syntax errors. Such errors are due to the limited corpus scale of IU-RR, and we expect that exploring unpaired textual data to pretrain the decoder would address such limitations [3].
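The uncertainty visualizations in Figure 3 follow Grad-CAM [10]; a minimal hook-based sketch is shown below, assuming a PyTorch classifier whose last convolutional block (e.g., `layer4` of a torchvision ResNet-152) is passed as `target_layer`. All names are illustrative.

```python
import torch
import torch.nn.functional as F

def grad_cam(model, image, target_idx, target_layer):
    """Heatmap for one observation class, normalized to [0, 1] over the input size."""
    feats, grads = [], []
    h1 = target_layer.register_forward_hook(lambda m, i, o: feats.append(o))
    h2 = target_layer.register_full_backward_hook(lambda m, gi, go: grads.append(go[0]))
    try:
        logits = model(image.unsqueeze(0))                   # (1, num_observations)
        model.zero_grad()
        logits[0, target_idx].backward()
        weights = grads[0].mean(dim=(2, 3), keepdim=True)    # global-average-pooled gradients
        cam = F.relu((weights * feats[0]).sum(dim=1))        # weighted sum of feature maps
        cam = F.interpolate(cam.unsqueeze(0), size=image.shape[-2:],
                            mode="bilinear", align_corners=False)[0, 0]
        return (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)
    finally:
        h1.remove()
        h2.remove()
```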

4 Conclusions

In this paper, we present a novel encoder-decoder model for radiology report generation. The proposed model takes advantage of the multi-view information in radiology by applying visual attentions in a late fusion fashion, and enriches the semantics involved in the hierarchical LSTM decoder with medical concepts. Consequently, both the visual and textual contents are better exploited to achieve state-of-the-art performance. The automatic interpretation approach will simplify and expedite the conventional process of generating radiology reports and better assist human experts in decision making. As a valuable added benefit, uncertain radiographic observations are extracted and visualized by our model, because it is important to direct more expert attention to such uncertainties for further analysis in practice.

Acknowledgment: This work is supported in part by NSF through award IIS-1722847, NIH through the Morris K. Udall Center of Excellence in Parkinson's Disease Research, and our corporate sponsor.

References
1. Demner-Fushman, D., Kohli, M.D., Rosenman, M.B., Shooshan, S.E., Rodriguez, L., Antani, S.K., Thoma, G.R., McDonald, C.J.: Preparing a collection of radiology examinations for distribution and retrieval. JAMIA 23(2), 304-310 (2016)
2. Denkowski, M., Lavie, A.: Meteor universal: Language specific translation evaluation for any target language. In: Proceedings of the Ninth Workshop on Statistical Machine Translation. pp. 376-380 (2014)
3. Feng, Y., Ma, L., Liu, W., Luo, J.: Unsupervised image captioning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 4125-4134 (2019)
4. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 770-778 (2016)
5. Irvin, J., Rajpurkar, P., Ko, M., Yu, Y., Ciurea-Ilcus, S., Chute, C., Marklund, H., Haghgoo, B., Ball, R., Shpanskaya, K., et al.: CheXpert: A large chest radiograph dataset with uncertainty labels and expert comparison. arXiv:1901.07031 (2019)
6. Jing, B., Xie, P., Xing, E.P.: On the automatic generation of medical imaging reports. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, ACL 2018, Melbourne, Australia. pp. 2577-2586 (2018)
7. Li, C.Y., Liang, X., Hu, Z., Xing, E.P.: Knowledge-driven encode, retrieve, paraphrase for medical image report generation. arXiv:1903.10122 (2019)
8. Lin, C.Y.: ROUGE: A package for automatic evaluation of summaries. In: Proceedings of the ACL-04 Workshop. pp. 74-81. Association for Computational Linguistics, Barcelona, Spain (2004)
9. Papineni, K., Roukos, S., Ward, T., Zhu, W.J.: BLEU: a method for automatic evaluation of machine translation. In: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. pp. 311-318 (2002)
10. Selvaraju, R.R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., Batra, D.: Grad-CAM: Visual explanations from deep networks via gradient-based localization. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 618-626 (2017)
11. Vinyals, O., Toshev, A., Bengio, S., Erhan, D.: Show and tell: A neural image caption generator. arXiv:1411.4555 (2015)
12. Wang, X., Peng, Y., Lu, L., Lu, Z., Bagheri, M., Summers, R.M.: ChestX-ray8: Hospital-scale chest x-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 2097-2106 (2017)
13. Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhudinov, R., Zemel, R., Bengio, Y.: Show, attend and tell: Neural image caption generation with visual attention. arXiv:1502.03044 (2015)
14. Xue, Y., Xu, T., Long, L.R., Xue, Z., Antani, S.K., Thoma, G.R., Huang, X.: Multimodal recurrent model with attention for automated radiology report generation. In: Medical Image Computing and Computer Assisted Intervention 2018, Granada, Spain, Proceedings, Part I. pp. 457-466 (2018)
