
RadTextAid: A CNN-Guided Framework Utilizing Lightweight Vision-Language Models for Assistive Radiology Reporting

Mahmud Wasif Nafee, Tasmia Rahman Aanika, Taufiq Hasan
mHealth Lab, Department of Biomedical Engineering
Bangladesh University of Engineering and Technology (BUET), Dhaka-1205, Bangladesh.
Email: [email protected]

Copyright © 2025, GenAI4Health Workshop @ Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Abstract

Deciphering chest X-rays is crucial for diagnosing thoracic diseases such as pneumonia, lung cancer, and cardiomegaly. Radiologists often work under significant workloads and handle large volumes of data, which can lead to exhaustion and burnout. Advanced deep learning models can effectively generate draft radiology reports, potentially alleviating the radiologist's workload. However, many current systems create reports that include clinically irrelevant or redundant information. To address these limitations, we propose RadTextAid, a novel multi-modal framework for generating high-quality, clinically relevant radiology reports. Our approach integrates VLMs for natural language generation, augmented by disease-specific tags derived from a CNN analyzing chest X-ray images to identify key pathological features. A key feature within our framework is the pre-processing of the radiology report training dataset. This removes routine, repetitive, or non-informative phrases commonly found in chest X-ray reports and ensures that the model focuses its learning on clinically meaningful content, which expert radiologists qualitatively validated. Experimental results show that our system yields an absolute improvement of 4.8% in terms of BERTScore and 3.16% in terms of the F1-cheXbert metric compared to a state-of-the-art model. The results thus demonstrate that the proposed RadTextAid framework not only improves the detection of abnormalities from chest X-ray images but also enhances the overall quality and coherence of generated reports, paving the way toward more efficient and effective radiology reporting.

Introduction

Interpreting chest X-ray images is a critical, time-sensitive task essential for the timely diagnosis and treatment of numerous medical conditions. Radiologists typically invest considerable time meticulously analyzing each chest X-ray image to prevent misdiagnoses that could adversely affect patient outcomes. However, this thorough approach often leads to inefficiencies, particularly in settings characterized by high patient volume or a shortage of qualified radiologists, which is common in low- and middle-income countries (LMICs) (Parag and Hardcastle 2022).

Deep learning and natural language processing (NLP)-based algorithms can automatically generate draft radiology reports, potentially improving reporting efficiency while retaining accuracy. However, a central issue in report generation is the accurate detection and precise documentation of abnormalities while differentiating them from routine physiological and pathological observations. Publicly available datasets predominantly contain information on routine observations, with limited references to abnormalities or comprehensive medical histories. This abundance of routine findings in the training data poses significant obstacles for state-of-the-art Vision-Language Models (VLMs) employed in automated report generation, as these models rely on the token loss during training. The similarity of captions in these datasets diminishes the model's ability to generate differential annotations concerning clinical abnormalities in their reports. Consequently, the models lack penalties for either overlooking or inaccurately predicting pathological conditions, leading them to generate text primarily related to routine observations.

To address the issue of inadequately generating pathological findings, one viable approach is to train the VLM exclusively on texts pertaining to clinical abnormalities rather than on the entirety of the report. This targeted training would enhance the model's capacity to generate detailed pathological findings, necessitating greater attention and precision. Alternatively, implementing a Convolutional Neural Network (CNN)-based classifier to first identify chest X-ray abnormalities could serve as a mechanism to guide the VLM through targeted prompts, thereby enhancing overall reporting accuracy. However, to our knowledge, lightweight VLMs guided by CNN models have not been adequately studied in the literature for radiology report generation.

In this study, we propose RadTextAid, a novel framework designed to train a lightweight multi-modal model to assist radiologists in automatic report generation by focusing specifically on clinical abnormalities. The proposed pipeline consists of three distinct models:
(i) the PaliGemma multi-modal vision-language model, which is trained to generate relevant findings; (ii) the Llama 3.1 8B model, which extracts only the pathological findings to inform the training corpus; and (iii) a CNN-based chest X-ray image classifier (CheXNet) to produce pathological tags for prompt guidance of the vision-language model. Our main contributions can be summarized as follows.

• We propose a novel, efficient pipeline to train vision-language models for domain-specific tasks such as X-ray report generation.
• We investigate state-of-the-art large language models (LLMs) to extract clinically relevant information from medical reports with few-shot prompting.
• We explore the potential of CNN-generated disease labels from chest X-ray images to construct prompts to guide the vision-language model.

Related Works

Chest radiography holds the position of being the most prevalent imaging examination worldwide. In contrast to general image captioning tasks that prioritize coherence (Chen et al. 2018), medical image captioning necessitates a greater emphasis on accuracy in identifying anomalies and extracting information while still maintaining coherence (Singh et al. 2022). This means the generated report should be comprehensible and convey precise medical information effectively (Srinivasan et al. 2020).

Recent developments in computational machine translation (Sirshar et al. 2022) have shown that by using a robust sequence model and an LSTM with an attention head, performance improves significantly. The Contrastive Attention (CA) approach proposed by Liu et al. (Liu et al. 2021) distills the contrastive information by comparing the current input image with normal X-ray images (from healthy subjects) rather than concentrating on it alone. Srinivasan et al. (Srinivasan et al. 2021) introduce a deep neural network that utilizes a set of chest X-ray images to predict the medical tags and provide a comprehensible radiology report.

Recently, transformer architectures have been used more commonly for radiology report generation to overcome the limitations of Recurrent Neural Networks (RNNs). The CNX-B2 model (Alqahtani et al. 2024) is a CNN combined with a transformer network to generate medical reports. The work of (Alfarghaly et al. 2021) involves fine-tuning a pre-trained CheXNet model (Rajpurkar et al. 2017) to predict specific identifiers from the image, generating weighted semantic features from the pre-trained embeddings of the predicted tags, and producing complete medical reports by conditioning a pre-trained GPT-2 model on the visual and semantic features. In (Mondal et al. 2023), the authors introduce EfficienTransNet, an automatic chest X-ray report generation approach based on CNN-Transformers. EfficienTransNet incorporates clinical history or indications to enhance the report generation process and align with radiologists' workflow, which is mostly overlooked in recent research.

To enhance the generation of chest X-ray reports, (Nicolson, Dowling, and Koopman 2023) propose warm starting the encoder and decoder using up-to-date computer vision and natural language processing checkpoints. TrMRG (Mohsan et al. 2023), also known as the Transformer Medical Report Generator, is a comprehensive model that utilizes the Transformer architecture to generate reports. This model incorporates pre-trained computer vision and language models, making it a powerful tool for report generation. (Wang et al. 2023) introduce METransformer, which incorporates a transformer framework and includes "expert tokens", representing many experts, in both the transformer encoder and decoder.

The challenge of large vision-language models in medical imaging is addressed by (Kim et al. 2024) through a Llava-based framework, though its size poses limitations. Their findings showed that lightweight models perform better, prompting us to suggest models like PaliGemma and Florence-2 as viable alternatives. While (Alfarghaly et al. 2021) and (Srinivasan et al. 2021) explore confidence scores of tags for report generation, the potential of structured prompts from tags remains unexplored. Our proposed model aims to bridge these gaps in the literature.

Methodology

Proposed Framework

An overview of the proposed radiology report generation framework is shown in Fig. 1. We begin with a chest X-ray image as input, which flows through a multi-label classifier based on CheXNet-121 (Rajpurkar et al. 2017). This classifier scans the input image for relevant visual features and produces 105 diagnostic tags that indicate the identified disease conditions or abnormalities. These tags provide semantic information about the X-ray findings and are visualized in the system diagram as individual elements, offering clarity and interpretability of the multi-label classification results.

Figure 1: Our proposed model architecture for RadTextAid.

Next, the initial chest X-ray image and the produced tags (disease labels), which are text outputs, are fed into a multi-modal Vision-Language Model (VLM). In our study, we have considered the Florence-2 (Xiao et al. 2023) and PaliGemma (Beyer et al. 2024) VLM models. The VLM fuses the visual characteristics learned from the image with the semantic information obtained from the tags (disease labels), creating a detailed diagnostic report with observations of inferred conditions and abnormalities along with additional clinical insights. The proposed framework can be divided into two components:

1. The multi-label classifier based on CheXNet-121 to produce 105 disease labels.
2. The multi-modal Vision-Language Model (VLM), i.e., Florence-2 or PaliGemma.

CNN-based multi-label classifier

The input chest X-ray image is first passed through a CNN model to produce the tags' predictions.
Our base model is based on a CheXNet model (Rajpurkar et al. 2017), which is essentially a DenseNet-121 model (Huang, Liu, and Weinberger 2016) pre-trained on the ChestX-ray14 dataset to identify and localize 14 thoracic diseases. However, these 14 tags were insufficient for conditioning the report generation model. Thus, we chose a fine-tuned model to classify the manual tags from the IU-Xray dataset (Demner-Fushman et al. 2015) by removing the final layer and adding a new final layer containing 105 nodes for the most frequently occurring manual tags from the dataset, as implemented in (Alfarghaly et al. 2021).

The positive tags represent the physiological conditions identified in the chest X-ray and serve as valuable inputs for constructing structured prompts. These prompts have the potential to enhance the performance of vision-language models in generating accurate and contextually relevant reports.

The model uses the binary cross-entropy (BCE) loss to handle multiple diagnostic labels,

\[
\mathcal{L}_{\mathrm{BCE}} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{M} \left[ y_{ij} \log(\hat{y}_{ij}) + (1 - y_{ij}) \log(1 - \hat{y}_{ij}) \right]
\]

where $N$ is the number of samples (batch size), $M$ is the number of labels (e.g., $M = 105$ here), $y_{ij}$ is the ground-truth binary label for the $j$-th class of the $i$-th sample, $\hat{y}_{ij}$ is the predicted probability for the $j$-th class of the $i$-th sample, and $\log$ is the natural logarithm.

The multi-modal Vision-Language Model (VLM)

In this step, we use the positive disease tags obtained from the multi-label CNN-based classifier to construct textual prompts. The input image and prompt are passed through a multi-modal VLM. We consider the pre-trained Florence-2 and PaliGemma models due to their lightweight architecture compared to LLaVA 1.6 (34B parameters) (Liu et al. 2024), GPT-4 (200B parameters) (OpenAI 2024), GPT-4o Mini (8B parameters) (OpenAI 2024), and Qwen2-VL (72B parameters) (Wang et al. 2024). Our method is partly inspired by (Alfarghaly et al. 2021), where a conditioned transformer model is used to generate radiology reports.

Florence-2

Florence-2 is a recent VLM (Xiao et al. 2023) (0.7B parameters) developed for various vision and vision-language tasks with a unified sequence-to-sequence architecture. It uses a Dual Attention Vision Transformer (DaViT) (Ding et al. 2022) as its vision encoder to process images into token embeddings $V \in \mathbb{R}^{N_v \times D_v}$, where $N_v$ represents the number of visual tokens and $D_v$ their dimensionality. An extended version of the tokenizer combines these embeddings with task-related text prompts $T_{\mathrm{prompt}} \in \mathbb{R}^{N_t \times D}$.

The model uses a multi-modal transformer-based encoder-decoder with input $X = [V', T_{\mathrm{prompt}}]$, where $V'$ is the dimensionally aligned projection of $V$. It uses a standard cross-entropy language modeling objective for training:

\[
\mathcal{L} = -\sum_{i=1}^{|y|} \log P_{\theta}(y_i \mid y_{<i}, x),
\]

where $y$ represents the target sequence and $x$ combines visual and textual inputs.
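To make this objective concrete, the following is a minimal sketch, not the authors' code, of how such a token-level cross-entropy loss can be computed in PyTorch from decoder logits and target report tokens, assuming teacher forcing and a padding id to ignore; in practice the VLM libraries compute this internally.

```python
import torch
import torch.nn.functional as F

def lm_loss(logits: torch.Tensor, target_ids: torch.Tensor, pad_id: int) -> torch.Tensor:
    """Token-level cross-entropy: L = -sum_i log P_theta(y_i | y_<i, x).

    logits:     [batch, seq_len, vocab] decoder scores conditioned on image + prompt
    target_ids: [batch, seq_len] ground-truth report tokens (teacher forcing)
    pad_id:     padding token id, excluded from the loss
    """
    # Shift so that the logits at position t predict token t+1.
    shift_logits = logits[:, :-1, :].contiguous()
    shift_labels = target_ids[:, 1:].contiguous()
    return F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
        ignore_index=pad_id,  # prompt/pad positions can also be masked this way
    )
```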
Florence-2 is trained on FLD-5B, a large-scale dataset with 126M images and more than 5 billion annotations (text, region-text pairs, text-phrase-region triplets), which empowers the model to learn multiple levels of spatial and semantic granularity (PLN).

PaliGemma

PaliGemma is a unified VLM (Beyer et al. 2024) for general-use multi-modal functionalities based on the SigLIP ViT-So400m image encoder (Zhai et al. 2023) and the Gemma-2B decoder-only language model (Team et al. 2024), with fewer than 3 billion parameters. The SigLIP encoder uses a contrastive pretraining method with sigmoid loss to produce image embeddings, achieving state-of-the-art CLIP-level visual representation quality. The textual inputs to the Gemma-2B model are processed using a SentencePiece tokenizer, and then, by autoregressive decoding, image tokens and text prompts can be combined into a single sequence. A linear projection aligns the SigLIP output with Gemma-2B's input space, facilitating seamless integration. The model is then further trained on higher-resolution images and domain-focused data in the final stages, which leads to better accuracy. For PaliGemma, an input sequence takes the following form:

tokens = [image tokens ..., BOS, prefix tokens ..., SEP, suffix tokens ..., EOS, PAD ...]

Here, BOS = beginning-of-sentence token, EOS = end-of-sentence token, PAD = padding token, and SEP = separator token.

Experiments

Datasets

MIMIC-CXR Database v2.0.0. This publicly available, large-scale database is intended to facilitate medical imaging research, especially in the area of chest radiography. The dataset contains 377,110 JPG-format images and structured labels derived from the 227,827 free-text radiology reports associated with these images (Johnson et al. 2024).

Indiana University Chest X-rays and Reports (OpenI). Accessible via OpenI, the Indiana University Chest X-rays dataset is an extensive collection of chest radiographs with their corresponding diagnostic reports, intended to facilitate medical imaging research and education. The dataset contains 7,470 pairs of images and reports covering a broad spectrum of both common and uncommon thoracic disorders (Demner-Fushman et al. 2015).

Chest X-ray images from both datasets were utilized as inputs, with the corresponding findings from each image serving as outputs to fine-tune the proposed model. The manual tags for fine-tuning the CNN classifier were provided by (Alfarghaly et al. 2021).

Radiology Report Training Dataset. To ensure the accuracy and relevance of the generated diagnostic reports, we pre-process the radiology reports by eliminating the routine or repetitive words and phrases generally contained in chest X-ray reports. These words and phrases are common in normal (healthy) cases and can cause skewed results, where the VLM generates many general terms to reduce the training loss. To address this, the dataset is adjusted through a filtering process using the pre-trained Llama 3.1-8B model (Sam and Vavekanand 2024), as shown in Fig. 2.

Figure 2: Radiology report pre-processing for training.

Evaluation Metrics

To evaluate the performance of our fine-tuned VLM, we have used two different performance metrics: BERTScore and F1-cheXbert. We opt for BERTScore as it is an automated assessment measure for text generation. Similar to conventional metrics, BERTScore calculates a similarity score for every token in the candidate sentence in relation to each token in the reference phrase. Instead of relying on exact matches, it calculates token similarity using contextual embeddings (Zhang et al. 2020). In contrast, the F1-cheXbert score (Smit et al. 2020) utilizes the CheXbert (Smit et al. 2020) transformer to output selected labels on both original and generated reports and then calculates the F1 score between these two sets of labels.

Implementation Details

The multi-label CNN-based classification model is fine-tuned using TensorFlow, following the methodology outlined in (Alfarghaly et al. 2021). With 32 images per batch and the Adam optimizer, the model is trained end-to-end using mini-batch gradient descent. The binary cross-entropy loss given earlier was used as the training objective, and all model parameters were left trainable during fine-tuning.

The VLMs, on the other hand, are implemented using PyTorch. Both models are trained and tested on Intel(R) Xeon(R) CPUs and L4 GPUs provided by Google Colab notebooks. The loss functions provided by the VLM packages are used for training the Vision-Language Models. The Adam optimizer, with an initial learning rate of 5 × 10−5 and a batch size of 16 for the multi-label classifier and 4 for the VLMs, is employed for training or fine-tuning. The VLMs were fine-tuned for 10 epochs.
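As a rough illustration of the classifier fine-tuning described above, the sketch below builds a DenseNet-121 backbone with a new 105-node sigmoid head and compiles it with the binary cross-entropy objective in Keras. It is a simplified stand-in rather than the authors' code: the ImageNet weights, input size, and AUC metric are assumptions (the paper starts from a CheXNet checkpoint and the manual tags of (Alfarghaly et al. 2021)).

```python
import tensorflow as tf

NUM_TAGS = 105  # most frequently occurring manual tags from the IU-Xray annotations

# DenseNet-121 backbone; the paper starts from CheXNet weights, here ImageNet
# weights are used as a stand-in since the CheXNet checkpoint is external.
backbone = tf.keras.applications.DenseNet121(
    include_top=False, weights="imagenet", input_shape=(224, 224, 3), pooling="avg"
)

# Replace the final layer with a 105-node sigmoid head for multi-label tags.
outputs = tf.keras.layers.Dense(NUM_TAGS, activation="sigmoid")(backbone.output)
model = tf.keras.Model(backbone.input, outputs)

# All parameters are left trainable, matching the implementation details above.
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=5e-5),
    loss="binary_crossentropy",  # the multi-label BCE objective given earlier
    metrics=[tf.keras.metrics.AUC(multi_label=True)],
)
# model.fit(train_ds, validation_data=val_ds, epochs=...) with the batching
# described above for the classifier.
```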
Results

Qualitative analysis: Extracting abnormal findings

A Llama 3.1 8B instruct model was used to extract sentences or phrases in the findings that indicated clinical abnormalities. Zero-shot prompting the Llama model with the instruction "Extract the sentences or phrases that seem to indicate clinical abnormalities" did not generate fruitful results. Thus, we consulted with experienced radiologists to create a few examples of clinical abnormal findings extracted from the rest of the report. With 12 examples, few-shot prompting the Llama model with the same instruction showed much better results, as shown in Fig. 2. Following this adjustment, two trained radiologists observed 30 reports each and affirmed that the Llama model efficiently eliminated the redundant information and effectively extracted the abnormal findings.

Vision-language model comparison and necessity of prompt guidance

While comparing the two VLMs, we also tested their performances with a generic prompt ("Write a Chest X-ray report") and a tag-specific prompt (such as "Write a Chest X-ray report mentioning cardiomegaly", "Write a Chest X-ray report mentioning pleural effusion, consolidation", etc.). Table 1 shows that PaliGemma outperforms Florence-2 in all the BERTScore and F1-cheXbert metrics. For both VLMs, we observe that tag-specific prompts enhance the performance of the model. For Florence-2, tag-specific prompts increase the BERTScore (by at least 4%) and the F1-cheXbert score from 0.5516 to 0.7000; that is, it generates almost 14.84% more accurate reports. For PaliGemma, both types of prompts result in similar BERTScore metrics, but the F1-cheXbert score shows that the model can generate 2.5% more accurate reports with specific prompt guidance. With the help of this comparative analysis, we can determine that PaliGemma with tag-specific prompt guidance should be utilized in our pipeline.

Comparison with captioning baselines

We compared our final methods to three types of models: (1) CNN encoders with RNN decoders; (2) CNN encoders with Transformer decoders; (3) Vision Transformer encoders with LLM decoders. For type 1, we look at the attention-based CNN-LSTM architectures of (Sirshar et al. 2022) and (Liu et al. 2021); for comparison, we follow the architecture of (Sirshar et al. 2022). Type 2 entails CNNs as encoders and transformers as decoders; a similar approach was taken in (Alqahtani et al. 2024), (Alfarghaly et al. 2021), and (Mondal et al. 2023), and we used the CNX-B2 architecture described in (Alqahtani et al. 2024) for comparative evaluation. Transformers were used end-to-end for encoding and decoding in the type-3 captioning baselines, as seen in (Wang et al. 2023) and (Nicolson, Dowling, and Koopman 2023); for our quantitative analysis, we explored the CvT2DistilGPT2 architecture described in (Nicolson, Dowling, and Koopman 2023).
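To make the tag-to-prompt step evaluated above concrete, a small sketch is given below; the 0.5 threshold and any wording beyond the paper's own prompt examples are assumptions rather than reported design choices.

```python
def build_prompt(tag_probs: dict[str, float], threshold: float = 0.5) -> str:
    """Turn the CNN classifier's tag probabilities into a tag-specific VLM prompt."""
    positive = [
        tag for tag, p in sorted(tag_probs.items(), key=lambda kv: -kv[1])
        if p >= threshold
    ]
    if not positive:
        return "Write a Chest X-ray report"  # fall back to the generic prompt
    return "Write a Chest X-ray report mentioning " + ", ".join(positive)

# Example:
# build_prompt({"cardiomegaly": 0.91, "pleural effusion": 0.74, "nodule": 0.08})
# -> "Write a Chest X-ray report mentioning cardiomegaly, pleural effusion"
```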

Figure 3: Example reports generated using the proposed RadTextAid model: (a) Report Aid Example 1, (b) Report Aid Example 2, (c) Report Aid Example 3.


OpenI Dataset

VLM          Prompts        BERTScore Precision   BERTScore Recall   BERTScore F1   F1-cheXbert
Florence-2   Generic        0.2467                0.2721             0.2507         0.5516
Florence-2   Tag-specific   0.2753                0.2874             0.2608         0.7000
PaliGemma    Generic        0.3181                0.3334             0.3173         0.7666
PaliGemma    Tag-specific   0.3273                0.3322             0.3156         0.7916

Table 1: Comparative Analysis of the Two Vision-Language Models.

OpenI Dataset

Pipeline                    BERTScore Precision   BERTScore Recall   BERTScore F1   F1-cheXbert
Attention-based CNN-LSTM    0.2174                0.2018             0.2123         0.5451
CNX-B2                      0.2194                0.1608             0.2268         0.6484
CvT2-Distil-GPT2            0.3089                0.2996             0.3009         0.7600
RadTextAid (Proposed)       0.3273                0.3322             0.3156         0.7916

MIMIC-CXR Dataset

Pipeline                    BERTScore Precision   BERTScore Recall   BERTScore F1   F1-cheXbert
Attention-based CNN-LSTM    0.1871                0.1926             0.1852         0.5347
CNX-B2                      0.1886                0.1606             0.2265         0.5333
CvT2-Distil-GPT2            0.2939                0.2927             0.2902         0.6813
RadTextAid (Proposed)       0.2813                0.3034             0.2975         0.6956

Table 2: Comparative Analysis Against Literature Captioning Baselines.
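As a reference for how the metrics reported in Tables 1 and 2 can be obtained, the following is a hedged sketch: the BERTScore call uses the public bert-score package, while `chexbert_labels` is a hypothetical placeholder for a CheXbert labeler, and micro-averaging over the CheXbert classes is an assumption rather than a detail reported by the authors.

```python
from bert_score import score
from sklearn.metrics import f1_score

def evaluate(generated: list[str], references: list[str]):
    # BERTScore: contextual-embedding similarity between candidate and reference tokens.
    precision, recall, f1 = score(generated, references, lang="en")

    # F1-cheXbert: label both report sets with a CheXbert labeler and compare labels.
    # `chexbert_labels` is a placeholder (not a real API) that should return a binary
    # label matrix of shape [num_reports, num_conditions].
    y_pred = chexbert_labels(generated)
    y_true = chexbert_labels(references)
    f1_chexbert = f1_score(y_true, y_pred, average="micro")  # averaging is an assumption

    return precision.mean().item(), recall.mean().item(), f1.mean().item(), f1_chexbert
```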

Table 2 shows the results for both MIMIC-CXR and IU X-Ray. These results show that the proposed method considerably improves the performance over the baseline methods. When compared with the best existing method, CvT2DistilGPT2, RadTextAid shows an absolute improvement of 4.8% in the NLG metric (BERTScore) and generates 3.16% more accurate reports according to the F1-cheXbert score. Similar trends are observed in the results for MIMIC-CXR. However, the overall performance of all models drops, likely due to the significantly higher average findings length in the MIMIC-CXR dataset compared to the IU X-Ray dataset.

Qualitative results

We present a few qualitative examples of RadTextAid in Fig. 3 to demonstrate its superiority. The model, trained using our pipeline, has successfully detected and described all the clinical abnormalities seen in the original report, as shown in Fig. 3(a) and 3(c). However, in the second example, shown in Fig. 3(b), the model mistakenly detects and describes a clinical abnormality not mentioned in the reference report. This error may have occurred due to the image quality; a more robust image pre-processing step could help resolve this issue.

Conclusion

In this study, we demonstrate the effectiveness of a novel Vision-Language Model (VLM)-based framework for developing automated diagnostic reports aimed at assisting chest X-ray image analysis and reporting. Initially, the model uses a CheXNet-121-based multi-label classifier to identify 105 diagnostic tags from a chest X-ray image associated with different physiological conditions and pathologies. These tags, supplemented with the original image, are input to a multi-modal VLM (e.g., Florence-2, PaliGemma), combining image and semantic information to create a coherent, relevant diagnostic radiology report.

A key innovation of the approach is pre-processing the training corpus using the pre-trained Llama 3.1 model, eliminating all the redundant and overused vocabulary from the text, so that the generated reports are based only on the abnormalities carrying diagnostic information. This helps reduce duplication, improve clinical value, and provide a strong base for automatic report creation.

The experimental results indicate that our model achieves state-of-the-art performance based on the BERTScore and F1-cheXbert metrics. These results demonstrate the model's capability to generate relevant and accurate diagnostic reports. Our comparative analysis reveals that the proposed framework surpasses existing architectures, especially in integrating multi-modal learning techniques, leading to improved outcomes. Additionally, the lightweight models have a unique advantage in terms of scalability in low-resource healthcare settings, where access to a cloud server may not always be available. Overall, the proposed method and the experimental results highlight the effectiveness of our approach in optimizing automatic radiological reporting, which could significantly enhance patient care.
References

Alfarghaly, O.; Khaled, R.; Elkorany, A.; Helal, M.; and Fahmy, A. 2021. Automated radiology report generation using conditioned transformers. Informatics in Medicine Unlocked 24:100557.

Alqahtani, F. F.; Mohsan, M. M.; Alshamrani, K.; Zeb, J.; Alhamami, S.; and Alqarni, D. 2024. CNX-B2: A novel CNN-transformer approach for chest X-ray medical report generation. IEEE Access 12:26626–26635.

Beyer, L.; Steiner, A.; Pinto, A. S.; Kolesnikov, A.; Wang, X.; Salz, D.; Neumann, M.; Alabdulmohsin, I.; Tschannen, M.; Bugliarello, E.; Unterthiner, T.; Keysers, D.; Koppula, S.; Liu, F.; Grycner, A.; Gritsenko, A.; Houlsby, N.; Kumar, M.; Rong, K.; Eisenschlos, J.; Kabra, R.; Bauer, M.; Bošnjak, M.; Chen, X.; Minderer, M.; Voigtlaender, P.; Bica, I.; Balazevic, I.; Puigcerver, J.; Papalampidi, P.; Henaff, O.; Xiong, X.; Soricut, R.; Harmsen, J.; and Zhai, X. 2024. PaliGemma: A versatile 3B VLM for transfer.

Chen, H.; Zhang, H.; Chen, P.-Y.; Yi, J.; and Hsieh, C.-J. 2018. Attacking visual language grounding with adversarial examples: A case study on neural image captioning.

Demner-Fushman, D.; Kohli, M.; Rosenman, M.; Shooshan, S.; Rodriguez, L.; Antani, S.; Thoma, G.; and McDonald, C. 2015. Preparing a collection of radiology examinations for distribution and retrieval. Journal of the American Medical Informatics Association (JAMIA) 23.

Ding, M.; Xiao, B.; Codella, N.; Luo, P.; Wang, J.; and Yuan, L. 2022. DaViT: Dual attention vision transformers.

Huang, G.; Liu, Z.; and Weinberger, K. Q. 2016. Densely connected convolutional networks. CoRR abs/1608.06993.

Johnson, A.; Lungren, M.; Peng, Y.; Lu, Z.; Mark, R.; Berkowitz, S.; and Horng, S. 2024. MIMIC-CXR-JPG: Chest radiographs with structured labels.

Kim, Y.; Wu, J.; Abdulle, Y.; Gao, Y.; and Wu, H. 2024. Enhancing human-computer interaction in chest X-ray analysis using vision and language model with eye gaze patterns.

Liu, F.; Yin, C.; Wu, X.; Ge, S.; Zhang, P.; and Sun, X. 2021. Contrastive attention for automatic chest X-ray report generation. In Zong, C.; Xia, F.; Li, W.; and Navigli, R., eds., Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, 269–280. Online: Association for Computational Linguistics.

Liu, F.; Yin, C.; Wu, X.; Ge, S.; Zou, Y.; Zhang, P.; Zou, Y.; and Sun, X. 2023. Contrastive attention for automatic chest X-ray report generation.

Liu, H.; Li, C.; Li, Y.; Li, B.; Zhang, Y.; Shen, S.; and Lee, Y. J. 2024. LLaVA-NeXT: Improved reasoning, OCR, and world knowledge.

Mohsan, M. M.; Akram, M. U.; Rasool, G.; Alghamdi, N. S.; Baqai, M. A. A.; and Abbas, M. 2023. Vision transformer and language model based radiology report generation. IEEE Access 11:1814–1824.

Mondal, C.; Pham, D.-S.; Gupta, A.; Ghosh, S.; Tan, T.; and Gedeon, T. 2023. EfficienTransNet: An automated chest X-ray report generation paradigm. In Proceedings of the 1st International Workshop on Multimodal and Responsible Affective Computing. New York, NY, USA: ACM.

Nicolson, A.; Dowling, J.; and Koopman, B. 2023. Improving chest X-ray report generation by leveraging warm starting. Artificial Intelligence in Medicine 144:102633.

OpenAI. 2024. GPT-4o system card.

Parag, P., and Hardcastle, T. C. 2022. Shortage of radiologists in low to middle income countries in the interpretation of CT scans in trauma. Banglad. J. Med. Sci. 21(3):489–491.

Rajpurkar, P.; Irvin, J.; Zhu, K.; Yang, B.; Mehta, H.; Duan, T.; Ding, D. Y.; Bagul, A.; Langlotz, C. P.; Shpanskaya, K. S.; Lungren, M. P.; and Ng, A. Y. 2017. CheXNet: Radiologist-level pneumonia detection on chest X-rays with deep learning. CoRR abs/1711.05225.

Sam, K., and Vavekanand, R. 2024. Llama 3.1: An in-depth analysis of the next generation large language model.

Singh, A.; Raguru, J. K.; Prasad, G.; Chauhan, S.; Tiwari, P. K.; Zaguia, A.; and Ullah, M. A. 2022. Medical image captioning using optimized deep learning model. Computational Intelligence and Neuroscience 2022.

Sirshar, M.; Paracha, M. F. K.; Akram, M. U.; Alghamdi, N. S.; Zaidi, S. Z. Y.; and Fatima, T. 2022. Attention based automated radiology report generation using CNN and LSTM. PLoS One 17(1):e0262209.

Smit, A.; Jain, S.; Rajpurkar, P.; Pareek, A.; Ng, A. Y.; and Lungren, M. P. 2020. CheXbert: Combining automatic labelers and expert annotations for accurate radiology report labeling using BERT.

Srinivasan, P.; Thapar, D.; Bhavsar, A.; and Nigam, A. 2020. Hierarchical X-ray report generation via pathology tags and multi head attention. In Proceedings of the Asian Conference on Computer Vision (ACCV).

Srinivasan, P.; Thapar, D.; Bhavsar, A.; and Nigam, A. 2021. Hierarchical X-ray report generation via pathology tags and multi head attention. In Computer Vision – ACCV 2020, Lecture Notes in Computer Science. Cham: Springer International Publishing. 600–616.

Sutskever, I.; Vinyals, O.; and Le, Q. V. 2014. Sequence to sequence learning with neural networks. CoRR abs/1409.3215.

Team, G.; Mesnard, T.; Hardin, C.; Dadashi, R.; Bhupatiraju, S.; Pathak, S.; Sifre, L.; Rivière, M.; Kale, M. S.; Love, J.; et al. 2024. Gemma: Open models based on Gemini research and technology. arXiv preprint arXiv:2403.08295.

Wang, Z.; Liu, L.; Wang, L.; and Zhou, L. 2023. METransformer: Radiology report generation by transformer with multiple learnable expert tokens.

Wang, P.; Bai, S.; Tan, S.; Wang, S.; Fan, Z.; Bai, J.; Chen, K.; Liu, X.; Wang, J.; Ge, W.; Fan, Y.; Dang, K.; Du, M.; Ren, X.; Men, R.; Liu, D.; Zhou, C.; Zhou, J.; and Lin, J. 2024. Qwen2-VL: Enhancing vision-language model's perception of the world at any resolution.

Xiao, B.; Wu, H.; Xu, W.; Dai, X.; Hu, H.; Lu, Y.; Zeng, M.; Liu, C.; and Yuan, L. 2023. Florence-2: Advancing a unified representation for a variety of vision tasks. arXiv preprint arXiv:2311.06242.

Zhai, X.; Mustafa, B.; Kolesnikov, A.; and Beyer, L. 2023. Sigmoid loss for language image pre-training.

Zhang, T.; Kishore, V.; Wu, F.; Weinberger, K. Q.; and Artzi, Y. 2020. BERTScore: Evaluating text generation with BERT.
