Medical Paper
Abstract
Chest diseases are especially deadly because of their effects on the lungs. The lungs are vital organs in the human body, so any damage to them can be harmful. The main method used to diagnose conditions such as pneumonia and emphysema is the chest X-ray. Traditionally, radiologists thoroughly examine the X-ray image and document the patient's condition in a report. Applying the image captioning task in the medical field helps to reduce human error. Medical image captioning is the process of producing natural language descriptions for medical images. It can help doctors identify illnesses and comprehend medical situations. The target is to find abnormalities in a chest X-ray image. In this study, a medical image captioning model that builds on the advantages of EfficientNet and GRU is proposed, using an encoder-decoder architecture. This work applies a gamma correction technique to enhance the quality of the medical images, and Bidirectional Encoder Representations from Transformers (BERT) is used to extract textual features from the reports. The model is evaluated on a publicly available medical image dataset, the IU X-Ray dataset, which is associated with various chest and lung related diseases and contains 3955 radiology reports from 3955 patients along with 7470 X-ray images. The proposed model performs well on the BLEU and ROUGE-L performance metrics.
Introduction
In our routine life, evolving Computer Vision (CV) and Natural Language Processing (NLP) techniques play a crucial role in solving various real-time problems. One research area that is receiving a lot of attention is image captioning [25]. Humans can accurately describe the information in an image. Image captioning aims to recognise the objects, activities, and their relationships in an image in order to provide a syntactically and semantically correct visual description [26]. A family of convolutional neural networks is used to extract the visual features of an image, and a language generation model such as a Long Short-Term Memory network or a transformer-based model is used to generate text for the extracted visual features.
Medical image captioning
Chest diseases are especially deadly because of their effects on the lungs. The lungs are vital organs in the human body, so any damage to them can be harmful. The main method used to diagnose conditions such as pneumonia and emphysema is the chest X-ray. Traditionally, radiologists thoroughly examine the X-ray image and document the patient's condition in a report. To correctly read a chest X-ray, a radiologist needs the following abilities: a thorough understanding of the basic anatomy of the thorax as well as the physiology of various chest diseases; the capacity to analyze the radiograph by recognizing different patterns; the capacity to analyze and evaluate the evolution of chest X-rays over time and recognize any changes that might occur; knowledge of clinical presentation and history; and an understanding of the correlation with diagnostic outcomes. If a report is written by a physician with little experience, this complicated effort may lead to errors, and it can also be time-consuming and tiresome for a more experienced practitioner. Applying image captioning in the medical field to diagnose chest X-rays can reduce human error and save radiologists' time.
A state-of-the-art use of computer vision and natural language processing in the healthcare industry is medical image captioning. It entails the analysis and interpretation of medical images, including X-rays, MRIs, CT scans, and histopathological images, using sophisticated algorithms to produce captions that are both descriptive and clinically relevant [21]. Medical imaging is the process of creating visual representations of the body's interior for clinical analysis, as well as representations of how specific organs or tissues function; such images are frequently used in clinics and hospitals to diagnose illnesses and fractures. Medical professionals with specialized training read and interpret the medical images, and then communicate their findings about each body area examined through written medical reports. The objective of medical image captioning is to help medical professionals comprehend and interpret complex medical imagery more easily, which ultimately improves treatment planning and diagnostic accuracy. Medical image captioning supports diagnosis of the medical condition by providing a textual description of the medical image [32].
Conventional approaches to medical image analysis frequently depend on trained radiologists and clinicians to manually interpret the images. However, the need for more precise and effective diagnosis, along with the growing amount of medical imaging data, has prompted research into AI-driven solutions such as medical image captioning [22].
To generate diagnostic reports for a chest X-ray, a family of Convolutional Neural Networks is used to retrieve the visual features by recognizing patterns, structures, and abnormalities within the images, and a Recurrent Neural Network or transformer-based architecture is used to generate the description of the image. This enables the model to produce captions that are understandable by humans and provide clinically relevant details about the conditions or anomalies that are observed.
Medical image captioning has the potential to improve workflow for radiologists,
promote better professional communication among healthcare providers, and make diagnostic
information more easily accessible [29]. With its comprehensive explanations of various medical
conditions and their visual representations, it can also be a useful tool in medical education.
A key challenge for medical image captioning is the diversity and complexity of medical images, because the task requires an accurate, contextually relevant description that is true to the image. The capabilities of medical image captioning are being improved and expanded by ongoing research and AI advancements, adding to its increasing importance in the field of healthcare technology. In this research work, EfficientNetV2B0 is used as a feature extractor for detecting abnormalities in chest X-ray images and GRU is used to generate a description for each image.
Literature Survey
Image Captioning
The authors of [1] utilize an attention strategy for captioning images. They implement stimulus-driven attention to extract the color, position, and dimensions of objects in an image and concept-driven attention for classical question answering. For extracting feature vectors from an image they use VGG19, and an LSTM [30] is utilized for generating a description of the image. For this research work they used the Flickr30k and MSCOCO datasets. This method achieved a BLEU score of 0.6. The authors of [2] introduced a captioning model that extracts higher-level features in an image; the Flickr8k, Flickr30k, MSCOCO, PASCAL, and SBU datasets are utilized for this research work. They combine lower-level image features (color, shape of an object) with higher-level image features (e.g., a human in an image). VGG16 is utilized to extract feature vectors from an image and an LSTM generates a caption using the extracted VGG16 feature vector. This method achieved a BLEU score of 0.764. The authors of [3] proposed a Human-Centric Captioning Model (HCCM) that describes the relationship between humans and objects in an image. This model gives additional weight to human body parts. For this work they created a Human-Centric COCO (HC-COCO) dataset of 16,125 images and 78,462 captions, more than 70% of which describe human behavior. Faster R-CNN is used to extract human-activity-related features in an image and an LSTM is used for caption generation; the model gives more importance to humans in an image. An attention-based encoder-decoder architecture is proposed in [4]. In order to extract features from an image, a CNN (Xception) is integrated with YOLOv4 to extract object features, and a GRU is employed as the language generation model serving as the decoder. This study uses the Bahdanau soft attention mechanism; the deterministic attention mechanism is responsible for the model's overall smoothness and differentiability, and the machine translation quality was improved by utilizing the attention mechanism. They also proposed an importance factor, which prioritizes large objects in the foreground of a picture over small ones in the background. The authors of [5] introduced a Hybrid Attention Network that integrates both bottom-up attention and top-down attention to enhance the generated caption. An adaptive fusion technique is used for combining the attention mechanisms. Bottom-up attention detects the semantic objects in an image and reassigns the weights, which helps to avoid object hallucination, while top-down attention improves the quality of the generated caption. For extracting visual information from an image Faster R-CNN is utilized and an LSTM is used as the language generation model. This model gives more importance to human features in an image.
Medical image captioning
The authors of [9] present a complete Transformer-based report generation model called TrMRG (Transformer Medical Report Generator). A Vision Transformer is used to detect abnormalities found in chest X-ray images, and causal language modelling is utilized for generating the reports. For this work, the publicly available Indiana University Chest X-Ray dataset (IU X-Ray) is utilized and a BLEU score of 0.532 is achieved. The main limitation is that for normal cases the reports produced by TrMRG resemble the reference report, but for diseased cases the model frequently omits some crucial medical terminology; the primary cause is the IU dataset's unique medical vocabulary and very limited corpus size. To address the data-bias issue, the authors of [10] suggest a contrastive triplet network (CTN) based on the Transformer architecture for automated chest X-ray reporting. By comparing visual and semantic data between normal and abnormal cases using a triplet network, CTN effectively accentuates abnormalities. Enhancing the triplet comparison involves contrasting the visual and semantic embeddings between triplets in two distinct stages; in order to convert the report to a fixed-size vector in an embedded space for semantic comparison, a textual encoder based on the BERT architecture is pre-trained. The study in [11] presents a novel architecture for an X-ray report generator that uses a Multi-Head Attention (MHA) mechanism and enhances images during pre-processing. In the pre-processing stage, gamma correction is performed to improve the quality of the X-ray images and address noise introduced during the acquisition procedure. ChexNet is used to detect the defects in an image and BERT embeddings are used to preprocess the text; finally, the report is generated using a Long Short-Term Memory network. The authors of [12] present a complete deep neural network for CXR report generation that produces clinically useful radiological reports using contextual word representations. VGG16 is used to extract the visual features of an image. The BERT model is fed the text corpus taken from the dataset's ground-truth reports in order to produce word embeddings, and the extracted visual features and corresponding word embeddings are passed to an LSTM for report generation. Finally, sentiment analysis using DistilBERT is used to determine whether the sentences in the generated reports are positive or negative. The first encoder-to-decoder model for CXR report generation was used by Wang et al. [15]. The authors utilised multi-label abnormality classification (ChestX-ray14 abnormality labels) and multi-task learning of CXR report generation (IU X-ray). ResNet-50 and an LSTM with an attention mechanism are utilized for diagnosing a chest X-ray image. When compared to a general-domain image captioning method, evaluated using multiple NLG metrics, the suggested method fared better. The authors of [13] propose a semantic fusion network with a lesion-area detection model that detects the visual and pathological information in a medical image to overcome the drawbacks of existing research. The model learns to fuse pathological information from ResNet-50 into an LSTM for generating diagnostic reports; it gives accurate pathological information and higher metric scores. The research work in [16] utilizes a hybrid retrieval and reinforcement learning strategy aligned with the report's template. The sentence decoder generated a topic for reinforcement learning and training using a Template Database or Generation Module, resulting in a BLEU-4 value of 0.15. The work in [17] suggested using Clinical Finding Scores to evaluate report quality, taking into account medically abnormal and negative phrases.
EFFICIENTNETV1
EfficientNet is a convolutional neural network architecture and scaling method that uniformly scales all dimensions of depth, width, and resolution using a compound coefficient [19]. Neural architecture search is used to design a baseline network, which is then scaled up to obtain a family of models, called EfficientNets, that achieve much better accuracy and efficiency than previous ConvNets. Resolution scaling refers to increasing the input image size; larger input images contain more fine-grained details. By increasing the depth, EfficientNet can capture more complex patterns and features in the data. Width scaling involves adjusting the number of channels (or neurons) in each layer of the network; wider networks have more channels, which allows them to capture more information at each layer. A compound scaling method is proposed to combine all three kinds of scaling. It is justified by the intuition that if the input image is bigger, then the network needs more layers to increase the receptive field and more channels to capture more fine-grained patterns in the bigger image:
depth: d = α^φ, width: w = β^φ, resolution: r = γ^φ (1)
subject to α · β² · γ² ≈ 2 and α ≥ 1, β ≥ 1, γ ≥ 1, where φ is the compound coefficient and α, β, γ are constants determined by a small grid search.
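To make the compound scaling rule in Eq. (1) concrete, the following minimal Python sketch (not part of the original work) computes the depth, width, and resolution multipliers for a few values of the compound coefficient φ, using the α, β, γ constants reported for the EfficientNet baseline [19]:

```python
# Illustrative only: compound scaling of EfficientNet as in Eq. (1).
alpha, beta, gamma = 1.2, 1.1, 1.15   # grid-search constants from the EfficientNet paper

def compound_scale(phi: float):
    depth_mult = alpha ** phi          # more layers -> larger receptive field
    width_mult = beta ** phi           # more channels per layer
    resolution_mult = gamma ** phi     # larger input image
    return depth_mult, width_mult, resolution_mult

if __name__ == "__main__":
    for phi in range(4):
        d, w, r = compound_scale(phi)
        print(f"phi={phi}: depth x{d:.2f}, width x{w:.2f}, resolution x{r:.2f}")
```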
EFFICIENTNETV2
EfficientNetV2 [18] improves on the original EfficientNet family by combining training-aware neural architecture search and scaling with progressive learning, and by replacing the early MBConv blocks with Fused-MBConv blocks, which yields smaller models and faster training. In this work, EfficientNetV2B0, the smallest member of the family, is used as the visual encoder.
PROPOSED METHODOLOGY
Let X ∈ ℝ^(C×H×W) represent the input image and Y the output feature map after passing through the convolutional layers, where C is the number of channels and H and W represent the image height and width.
I is the feature vector extracted from an input image X by EfficientNetV2B0, where I = {I_1, I_2, ..., I_k} and k is the number of elements in I. I_F is the feature vector extracted from the frontal view X_F of a chest X-ray, I_l is the feature vector extracted from the lateral view X_l, and Eff represents EfficientNetV2B0.
I_F = Eff(X_F) (2)
I_l = Eff(X_l) (3)
The output features from both views of the input image are combined:
Y = concat(I_F, I_l) (4)
The two feature vectors I_F and I_l, taken from the global pooling layer of the backbone, are concatenated.
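As a minimal sketch of this two-view encoder, the snippet below uses the Keras implementation of EfficientNetV2B0 with global average pooling to obtain a 1280-dimensional vector per view and concatenates them into the 2560-dimensional representation Y. The 224x224 input size and the use of ImageNet weights are assumptions for illustration, not details confirmed by the paper.

```python
import tensorflow as tf

IMG_SIZE = 224  # assumed input resolution

# Shared backbone: include_top=False with pooling="avg" yields a 1280-d vector per image.
backbone = tf.keras.applications.EfficientNetV2B0(
    include_top=False, weights="imagenet", pooling="avg"
)

frontal_in = tf.keras.Input(shape=(IMG_SIZE, IMG_SIZE, 3), name="frontal_view")
lateral_in = tf.keras.Input(shape=(IMG_SIZE, IMG_SIZE, 3), name="lateral_view")

i_f = backbone(frontal_in)                              # I_F: 1280-d frontal features
i_l = backbone(lateral_in)                              # I_l: 1280-d lateral features
y = tf.keras.layers.Concatenate()([i_f, i_l])           # Y: 2560-d joint representation

encoder = tf.keras.Model([frontal_in, lateral_in], y, name="xray_encoder")
encoder.summary()
```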
Text Preprocessing
Text preprocessing is a crucial step in medical image captioning. Indiana Chest X ray
dataset, one report is associated with more than one image and it has 8 feature which includes
image_id, indication, problem, MeSH, comparison, findings, impression, image view. Among 8
feature, finding feature is utilized for this work remaining features are dropped. Finding feature
contains useful information about X-ray for both normal and abnormal case. Preprocessing starts
with removing null values in finding features as it contains 512 null fields. Second step is to
remove additional spaces, punctuation and number. After that decontraction, that replaces word
from “won’t” to “will not” and removing stop words. Finally, tokenization that helps to
breakdown sentences into sub words and allocate a unique integer id as shown in Fig 5.
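The short Python sketch below illustrates these cleaning steps on a couple of made-up findings sentences. The contraction map, stop-word list, and example texts are assumptions, and word-level integer ids are shown here for simplicity, whereas the BERT embedding described next uses sub-word (WordPiece) tokenization.

```python
import re
import tensorflow as tf

CONTRACTIONS = {"won't": "will not", "can't": "can not"}   # assumed small map
STOP_WORDS = {"the", "a", "an", "is", "are", "of", "and"}  # assumed small list

def clean_finding(text: str) -> str:
    text = text.lower()
    for short, full in CONTRACTIONS.items():       # de-contraction
        text = text.replace(short, full)
    text = re.sub(r"[^a-z\s]", " ", text)           # drop punctuation and numbers
    text = re.sub(r"\s+", " ", text).strip()        # drop extra spaces
    return " ".join(w for w in text.split() if w not in STOP_WORDS)

findings = [                                         # stand-in findings, nulls already dropped
    "The lungs are clear. No pleural effusion.",
    "Heart size is normal, no acute abnormality.",
]
cleaned = [clean_finding(t) for t in findings]

# Tokenization: map each word to a unique integer id.
vectorizer = tf.keras.layers.TextVectorization(standardize=None, output_mode="int")
vectorizer.adapt(cleaned)
print(vectorizer(cleaned))
```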
BERT is utilized for word embedding. BERT generates contextualized embeddings for each token; contextualized embeddings capture the meaning of a word based on its surrounding context in the text. For each token, BERT outputs a vector representation in a high-dimensional space. The GRU is a type of recurrent neural network designed to address the vanishing gradient problem of traditional RNNs, and it captures long-term dependencies in sequential data more effectively than traditional RNNs. It has two gates: the update gate controls how much past information should be retained, and the reset gate determines how much past information to forget.
The GRU is used as the main structure of the report generation model. The gates of the GRU control the flow of information, and the calculation procedure is as follows, where Y is the input visual feature vector, o_t is the output word at time step t, e_t is the word embedding of the input token at time step t, h_t is the hidden state at time step t, and ⊙ denotes element-wise multiplication:
z_t = sigmoid(W_z · [h_(t−1), e_t] + b_z) (5)
r_t = sigmoid(W_r · [h_(t−1), e_t] + b_r) (6)
h'_t = tanh(W · [r_t ⊙ h_(t−1), e_t] + b) (7)
h_t = (1 − z_t) ⊙ h_(t−1) + z_t ⊙ h'_t (8)
o_t = softmax(V · h_t) (9)
W_z, W_r, W are the weight matrices and b_z, b_r, b are the bias vectors, which are the learnable parameters of the GRU, and V projects the hidden state onto the vocabulary. A cross-entropy loss function is used to maximize the probability of generating the correct report.
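A minimal NumPy sketch of the update in Eqs. (5)-(9) is given below. The hidden and embedding sizes follow the implementation details (512 and 768); the vocabulary size, random weights, and the way the word embedding enters the recurrence are illustrative assumptions rather than the exact implementation used in this work.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

hidden, embed, vocab = 512, 768, 1000                   # vocab size is assumed
rng = np.random.default_rng(0)
W_z = rng.normal(size=(hidden, hidden + embed)); b_z = np.zeros(hidden)
W_r = rng.normal(size=(hidden, hidden + embed)); b_r = np.zeros(hidden)
W   = rng.normal(size=(hidden, hidden + embed)); b   = np.zeros(hidden)
V   = rng.normal(size=(vocab, hidden))

def gru_step(h_prev, e_t):
    x = np.concatenate([h_prev, e_t])
    z_t = sigmoid(W_z @ x + b_z)                                        # update gate, Eq. (5)
    r_t = sigmoid(W_r @ x + b_r)                                        # reset gate, Eq. (6)
    h_tilde = np.tanh(W @ np.concatenate([r_t * h_prev, e_t]) + b)      # candidate state, Eq. (7)
    h_t = (1 - z_t) * h_prev + z_t * h_tilde                            # new hidden state, Eq. (8)
    o_t = softmax(V @ h_t)                                              # word distribution, Eq. (9)
    return h_t, o_t

h, e = np.zeros(hidden), rng.normal(size=embed)
h, o = gru_step(h, e)                                                   # one decoding step
```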
Implementation Details
The captioning model is mainly implemented with PyTorch and TensorFlow and optimized via Adam [20]. EfficientNetV2B0 [18] is used to detect the visual features in the chest X-ray images. A 1280-dimensional feature vector is obtained from the frontal view of the chest X-ray and a 1280-dimensional feature vector is obtained from the lateral view. By concatenating the visual features from the two views, we obtain a 2560-dimensional feature vector I from the pooling layer before the classification layer. EfficientNetV2B0 is used for the proposed work because it gives better accuracy and faster training speed. For generating the findings for a chest X-ray image, a Gated Recurrent Unit, a type of recurrent neural network, is utilized. The dimensions of the hidden state and the word embedding in the GRU are 512 and 768, respectively. The dimension of the input in the Transformer is set to 512, and the number of heads is set to 8. We set the dropout probability to 0.1. The Adam optimizer is used to train the captioning model with a learning rate of 5e-4 under a cross-entropy loss. The batch size is 10, training lasts for 30 epochs, and the scheduled sampling probability is increased by 0.05 every 5 epochs.
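The optimizer, loss, batch size, and epoch settings above can be expressed in Keras as in the hedged sketch below. The tiny dense model and random tensors are placeholders so that the snippet runs on its own; in the actual work the model is the EfficientNetV2B0 + GRU captioner and the inputs come from the IU X-Ray pipeline.

```python
import tensorflow as tf

# Placeholder model standing in for the encoder-decoder captioner.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(2560,)),                     # concatenated visual features
    tf.keras.layers.Dense(512, activation="relu"),
    tf.keras.layers.Dense(1000, activation="softmax"), # assumed vocabulary size
])
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=5e-4),  # Adam, lr 5e-4
    loss=tf.keras.losses.SparseCategoricalCrossentropy(),    # cross-entropy loss
)

x = tf.random.normal((100, 2560))                            # stand-in visual features
y = tf.random.uniform((100,), maxval=1000, dtype=tf.int32)   # stand-in next-word ids
model.fit(x, y, batch_size=10, epochs=30, verbose=0)         # batch 10, 30 epochs
```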
Experiments and analysis
Dataset Used
Medical image captioning datasets usually consist of medical images and corresponding reports. For this work, the IU Chest X-ray dataset is used [27]. It contains 7470 X-ray images and 3955 associated radiology reports from 3955 patients. The dataset has frontal and lateral views of the chest X-ray images, and more than one image can be associated with one report. Each report contains several features. The problem section indicates the abnormality found in the chest X-ray image. The indication section shows the symptoms of a disease reported by the patient. The comparison section provides information about a patient's previous medical treatment. The findings section describes the problems found in the image, as shown in Fig 6. The MeSH section is written by the radiology expert. The findings feature is utilized for this work; it contains 512 null fields. After eliminating the 512 null fields, 80% of the data is utilized for training and the remaining data is utilized for testing and validation.
Fig 6: Chest X-ray images and the findings feature in the IU chest X-ray dataset
Table 1: Dataset Description
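As a small illustration of the split described above, the sketch below drops rows with null findings and performs an 80% / 10% / 10% split. The tiny DataFrame stands in for the IU report table, and the even validation/test division is an assumption, since the paper only states that 80% is used for training and the remainder for testing and validation.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in report table (real data: 3955 reports, 512 of which have null findings).
reports = pd.DataFrame({
    "image_id": [f"CXR{i}" for i in range(12)],
    "findings": ["lungs are clear"] * 5 + [None, None] + ["no acute abnormality"] * 5,
})
reports = reports.dropna(subset=["findings"])            # remove reports with null findings

train_df, rest_df = train_test_split(reports, train_size=0.8, random_state=42)
val_df, test_df = train_test_split(rest_df, test_size=0.5, random_state=42)
print(len(train_df), len(val_df), len(test_df))
```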
Fig 7 shows reports generated by the proposed method on the IU chest X-ray dataset. In the figure, the generated report is the finding predicted by the proposed EfficientNetV2B0+GRU model and the actual report is the finding in the dataset.
Evaluation Metrics
BLEU: Bilingual Evaluation Understudy is one of the most popular and inexpensive methods used to measure the similarity between a generated sentence and the expected sentence [7]. It uses n-gram precision to compare the candidate (generated) sentence to one or more reference sentences and counts the number of matches, with n ranging from 1 to 4. Its score ranges from 0 to 1, and a high BLEU score indicates a better-quality caption. The advantage of BLEU is that it is easy to compute; its main disadvantage is that it measures the similarity between the generated and expected sentence rather than fluency, and each n-gram is weighted equally.
Let p_n be the modified n-gram precision score over the whole text corpus, A the candidate (generated) sentence, and B the reference sentence:
p_n = Σ_{C ∈ candidates} Σ_{gram_n ∈ C} Count_clip(gram_n) / Σ_{C' ∈ candidates} Σ_{gram_n ∈ C'} Count(gram_n) (10)
To compute the brevity penalty BP, let a be the candidate translation length, b the reference corpus length, and W_n positive weights summing to one:
BP = 1, if a > b
BP = e^(1 − b/a), if a ≤ b (11)
BLEU = BP · exp( Σ_{n=1}^{N} W_n log p_n ) (12)
log BLEU = min(1 − b/a, 0) + Σ_{n=1}^{N} W_n log p_n (13)
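The following minimal example computes BLEU@1 through BLEU@4 for a single generated finding with NLTK. The reference and candidate sentences are made up, and a smoothing function is used because short sentences often have zero higher-order n-gram matches.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = ["the lungs are clear no pleural effusion".split()]   # made-up reference
candidate = "lungs are clear without pleural effusion".split()    # made-up candidate

smooth = SmoothingFunction().method1
for n in range(1, 5):
    weights = tuple([1.0 / n] * n)          # uniform W_n over 1..n-grams
    score = sentence_bleu(reference, candidate, weights=weights,
                          smoothing_function=smooth)
    print(f"BLEU@{n}: {score:.3f}")
```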
CIDEr: Consensus-based Image Description Evaluation. It measures the similarity between a generated caption and the reference captions of an image [8]. CIDEr measures the generated caption's quality based on how closely it matches the typical descriptions provided by human annotators. It is highly correlated with human consensus scores and gives more weight to important n-grams. CIDEr_n is computed using the average cosine similarity between the candidate sentence and the reference sentences, which accounts for both precision and recall:
CIDEr_n(A, B) = (1/M) Σ_{i=1}^{M} [ g^n(A) · g^n(B_i) ] / ( ‖g^n(A)‖ ‖g^n(B_i)‖ ) (14)
where A is the candidate caption, B = {B_1, ..., B_M} is the set of reference captions, g^n(A) and g^n(B_i) are vectors formed by the Term Frequency-Inverse Document Frequency (TF-IDF) weights of all n-grams in A and B_i, ‖g^n(A)‖ and ‖g^n(B_i)‖ are the magnitudes of these vectors, and M is the number of reference captions.
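A simplified sketch of the per-n computation in Eq. (14) is shown below: TF-IDF n-gram vectors are built for the candidate and the references, and their cosine similarities are averaged. The full CIDEr metric additionally computes the TF-IDF weights over the whole reference corpus, clips counts, and averages over n = 1..4; the sentences here are made up.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def cider_n(candidate: str, references: list, n: int = 1) -> float:
    # TF-IDF n-gram vectors fitted on this small corpus (a simplification).
    vec = TfidfVectorizer(ngram_range=(n, n)).fit(references + [candidate])
    g_refs = vec.transform(references).toarray()
    g_cand = vec.transform([candidate]).toarray()[0]
    sims = []
    for g_b in g_refs:                                   # cosine similarity per reference
        denom = np.linalg.norm(g_cand) * np.linalg.norm(g_b)
        sims.append(g_cand @ g_b / denom if denom > 0 else 0.0)
    return float(np.mean(sims))                          # average over M references

print(cider_n("no acute cardiopulmonary abnormality",
              ["no acute cardiopulmonary findings", "heart size is normal"], n=1))
```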
ROUGE: Recall-Oriented Understudy for Gisting Evaluation [31]. ROUGE-N is a recall-oriented metric that counts the overlapping n-grams between the generated sentence and the reference sentences:
ROUGE-N = [ Σ_{B ∈ reference sentences} Σ_{gram_n ∈ B} Count_match(gram_n) ] / [ Σ_{B ∈ reference sentences} Σ_{gram_n ∈ B} Count(gram_n) ] (15)
where Count_match(gram_n) is the maximum number of n-grams co-occurring in the candidate sentence and a reference sentence, and Count(gram_n) is the number of n-grams in the reference sentence.
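Eq. (15) can be implemented directly as below. Note that this computes ROUGE-N n-gram recall, whereas the ROUGE-L score reported in the results is based on the longest common subsequence; the example sentences are made up.

```python
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def rouge_n(candidate: str, references: list, n: int = 1) -> float:
    cand_counts = Counter(ngrams(candidate.split(), n))
    match, total = 0, 0
    for ref in references:
        ref_counts = Counter(ngrams(ref.split(), n))
        match += sum(min(c, cand_counts[g]) for g, c in ref_counts.items())  # Count_match
        total += sum(ref_counts.values())                                    # Count
    return match / total if total else 0.0

print(rouge_n("the lungs are clear", ["lungs are clear bilaterally"], n=1))
```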
Table 2 presents a performance analysis of the proposed model against existing works in terms of BLEU@N (where N = 1, 2, 3, 4), CIDEr, and ROUGE-L scores. The IU chest X-ray dataset is used, and findings are generated for the chest X-rays in the dataset. From this analysis, the proposed method achieves better scores than the existing research works, with about 0.458 for BLEU@1, 0.310 for BLEU@2, 0.221 for BLEU@3, 0.159 for BLEU@4, 0.348 for CIDEr, and 0.368 for ROUGE-L.
[Figure: bar chart of the proposed model's BLEU@1, BLEU@2, BLEU@3, BLEU@4, CIDEr, and ROUGE-L scores]
Conclusion
Thus, including image enhancement methods and integrating EfficientNetV2 with GRU provides a strong solution for medical image captioning that improves interpretability and accuracy. Combining the sequential learning skills of the GRU with the powerful feature extraction capabilities of EfficientNetV2 makes it possible for the model to extract detailed information from medical images and produce coherent textual descriptions. The model shows improved performance in producing meaningful and informative captions by combining EfficientNetV2 for feature extraction with GRU for sequential learning. More precise captioning is made possible by combining these architectures with gamma-corrected images. In future work, a transformer-based model will be used in order to generate better captions for the images.
References
[1] Songtao Ding, Shiru Qu, Yuling Xi, Shaohua Wan, Stimulus-driven and concept-driven analysis for image caption generation, Neurocomputing, https://doi.org/10.1016/j.neucom.2019.04.095, vol. 398, 2020, pp. 520-530.
[2] Songtao Ding, Shiru Qu, Yuling Xi, Arun Kumar Sangaiah, Shaohua Wan, Image caption generation with high-level image features, Elsevier: Pattern Recognition Letters, https://doi.org/10.1016/j.patrec.2019.03.021, vol. 123, 2019.
[3] Zuopeng Yang, Pengbo Wang, Tianshu Chu, Jie Yang, Human-Centric Image Captioning, Elsevier: Pattern Recognition, https://doi.org/10.1016/j.patcog.2022.108545, vol. 126, 2022.
[4] Muhammad Abdelhadie Al‐Malla, Assef Jafar and Nada Ghneim, Image captioning model using
attention and object features to mimic human image understanding, Springer: Journal of Big Data,
https://doi.org/10.1186/s40537-022-00571-w, 2022.
[5] Wenhui Jiang, Qin Li, Kun Zhan, Yumung Fang, Fei Shen, Hybrid attention network for image
captioning, Elsevier: Displays, https://doi.org/10.1016/j.displa.2022.102238, Vol 73, July 2022.
[6] Mohsan, M. M., Akram, M. U., Rasool, G., Alghamdi, N. S., Baqai, M. A. A., & Abbas, M. (2022).
Vision Transformer and Language Model Based Radiology Report Generation. IEEE Access, 11,
1814-1824.
[7] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, "BLEU: A method for automatic evaluation of machine translation," in Proc. 40th Annu. Meeting Assoc. Comput. Linguistics, 2002, pp. 311–318.
[8] R. Vedantam, C. L. Zitnick, and D. Parikh, "CIDEr: Consensus-based image description evaluation," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2015, pp. 4566–4575.
[9] Mohsan, M. M., Akram, M. U., Rasool, G., Alghamdi, N. S., Baqai, M. A. A., & Abbas, M. (2022).
Vision transformer and language model based radiology report generation. IEEE Access, 11, 1814-
1824.
[10] Yang, Y., Yu, J., Jiang, H., Han, W., Zhang, J., & Jiang, W. (2022). A contrastive triplet network
for automatic chest X-ray reporting. Neurocomputing, 502, 71-83.
[11] Tsaniya, H., Fatichah, C., & Suciati, N. (2024). Automatic radiology report generator using
transformer with contrast-based image enhancement. IEEE Access.
[12] Kaur, N., & Mittal, A. (2022). RadioBERT: A deep learning-based system for medical report
generation from chest X-ray images using contextual embeddings. Journal of Biomedical
Informatics, 135, 104220.
[13] Zeng, X., Wen, L., Xu, Y., & Ji, C. (2020). Generating diagnostic report for medical image by
high-middle-level visual information incorporation on double deep learning models. Computer
methods and programs in biomedicine, 197, 105700.
[14] Tharsanee, R. M., Soundariya, R. S., Kumar, A. S., Karthiga, M., & Sountharrajan, S. (2021).
Deep convolutional neural network–based image classification for COVID-19 diagnosis. In Data
Science for COVID-19 (pp. 117-145). Academic Press.
[15] Wang X, Peng Y, Lu L, Lu Z, Summers RM. TieNet: Text-image embedding network for
common thorax disease classification and reporting in chest X-Rays. In: 2018 IEEE/CVF conference
on computer vision and pattern recognition. IEEE; 2018, p. 9049–58.
http://dx.doi.org/10.1109/cvpr.2018.00943.
[16] Y. Li, X. Liang, Z. Hu, and E. P. Xing, ‘‘Hybrid retrieval-generation reinforced agent for medical
image report generation,’’ in Proc. Adv. Neural Inf. Process. Syst., 2018, pp. 1530–1540.
[17] G. Liu, T.-M. H. Hsu, M. McDermott, W. Boag, W.-H. Weng, P. Szolovits, and M. Ghassemi,
‘‘Clinically accurate chest X-ray report generation,’’ 2019, arXiv:1904.02633. [Online]. Available:
https://arxiv.org/abs/1904.02633.
[18] Tan, M., & Le, Q. (2021, July). Efficientnetv2: Smaller models and faster training.
In International conference on machine learning (pp. 10096-10106). PMLR.
[19] Tan, M., & Le, Q. (2019, May). Efficientnet: Rethinking model scaling for convolutional neural
networks. In International conference on machine learning (pp. 6105-6114). PMLR.
[20] Kingma, D. P., & Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint
arXiv:1412.6980.
[21] Ayesha, H., Iqbal, S., Tariq, M., Abrar, M., Sanaullah, M., Abbas, I., ... & Hussain, S. (2021).
Automatic medical image interpretation: State of the art and future directions. Pattern
Recognition, 114, 107856.
[22] Yang, S., Wu, X., Ge, S., Zhou, S. K., & Xiao, L. (2022). Knowledge matters: Chest radiology
report generation with general and specific knowledge. Medical image analysis, 80, 102510.
[23] Li, Y., Liang, X., Hu, Z., & Xing, E. P. (2018). Hybrid retrieval-generation reinforced agent for
medical image report generation. Advances in neural information processing systems, 31.
[24] Zhang, Y., Wang, X., Xu, Z., Yu, Q., Yuille, A., & Xu, D. (2020, April). When radiology report
generation meets knowledge graph. In Proceedings of the AAAI conference on artificial
intelligence (Vol. 34, No. 07, pp. 12910-12917).
[25] Jiang, W., Li, Q., Zhan, K., Fang, Y., & Shen, F. (2022). Hybrid attention network for image
captioning. Displays, 73, 102238.
[26] Sasibhooshan, R., Kumaraswamy, S., & Sasidharan, S. (2023). Image caption generation using
visual attention prediction and contextual spatial relation extraction. Journal of Big Data, 10(1), 18.
[27] Demner-Fushman, D., Kohli, M. D., Rosenman, M. B., Shooshan, S. E., Rodriguez, L., Antani,
S., ... & McDonald, C. J. (2016). Preparing a collection of radiology examinations for distribution and
retrieval. Journal of the American Medical Informatics Association, 23(2), 304-310.
[28] Cho, K., Van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., &
Bengio, Y. (2014). Learning phrase representations using RNN encoder-decoder for statistical
machine translation. arXiv preprint arXiv:1406.1078.
[29] Nicolson, A., Dowling, J., & Koopman, B. (2023). Improving chest X-ray report generation by
leveraging warm starting. Artificial intelligence in medicine, 144, 102633.
[30] J. Donahue et al., "Long-Term Recurrent Convolutional Networks for Visual Recognition and
Description," in IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 4, pp.
677-691, 1 April 2017, doi: 10.1109/TPAMI.2016.2599174.
[31] C. Lin, ‘‘Rouge: A package for automatic evaluation of summaries,’’ in Proc. Text
Summarization Branches Out, 2004, pp. 74–81.
[32] Wang, E. K., Zhang, X., Wang, F., Wu, T. Y., & Chen, C. M. (2019). Multilayer dense
attention model for image caption. IEEE Access, 7, 66358-66368.