is essentially a DenseNet-121 model (Huang, Liu, and Weinberger 2016) pre-trained on the ChestX-ray14 dataset to identify and localize 14 thoracic diseases. However, these 14 tags were insufficient for conditioning the report generation model. Thus, we fine-tune the model to classify the manual tags from the IU-Xray dataset (Demner-Fushman et al. 2015) by removing the final layer and adding a new final layer with 105 nodes for the most frequently occurring manual tags in the dataset, as implemented in (Alfarghaly et al. 2021).

The positive tags represent the physiological conditions identified in the chest X-ray and serve as valuable inputs for constructing structured prompts. These prompts have the potential to enhance the performance of Vision-Language Models in generating accurate and contextually relevant reports.

The model uses the binary cross-entropy (BCE) loss to handle multiple diagnostic labels,

\[
L_{\mathrm{BCE}} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{M}\left[ y_{ij}\log(\hat{y}_{ij}) + (1 - y_{ij})\log(1 - \hat{y}_{ij}) \right],
\]

where $N$ is the number of samples (batch size), $M$ is the number of labels ($M = 105$ here), $y_{ij}$ is the ground-truth binary label for the $j$-th class of the $i$-th sample, $\hat{y}_{ij}$ is the predicted probability for the $j$-th class of the $i$-th sample, and $\log$ is the natural logarithm.
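To illustrate this step, the following is a minimal TensorFlow/Keras sketch of replacing the classification head with a 105-way sigmoid output trained with BCE. The ImageNet backbone weights and the input resolution are illustrative stand-ins; the actual setup follows (Alfarghaly et al. 2021) and starts from CheXNet-style ChestX-ray14 weights.

```python
import tensorflow as tf

NUM_TAGS = 105  # most frequently occurring manual tags in the IU-Xray dataset

# Backbone: DenseNet-121 without its original classification head. The paper
# starts from CheXNet-style ChestX-ray14 weights; ImageNet weights are used
# here only as an illustrative stand-in.
backbone = tf.keras.applications.DenseNet121(
    include_top=False, weights="imagenet", pooling="avg",
    input_shape=(224, 224, 3))

# New final layer: 105 sigmoid units, one per manual tag (multi-label setup).
outputs = tf.keras.layers.Dense(NUM_TAGS, activation="sigmoid")(backbone.output)
model = tf.keras.Model(inputs=backbone.input, outputs=outputs)

# Binary cross-entropy over all 105 labels, matching the loss defined above.
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=5e-5),
    loss=tf.keras.losses.BinaryCrossentropy(),
    metrics=[tf.keras.metrics.AUC(multi_label=True, num_labels=NUM_TAGS)])
```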
The multi-modal Vision-Language Model (VLM)

In this step, we use the positive disease tags obtained from the multi-label CNN-based classifier to construct textual prompts. The input image and prompt are passed through a multi-modal VLM. We consider the pre-trained Florence-2 and PaliGemma models due to their lightweight architectures compared to LLaVA 1.6 (34B parameters) (Liu et al. 2024), GPT-4 (200B parameters) (OpenAI 2024), GPT-4o Mini (8B parameters) (OpenAI 2024), and Qwen2-VL (72B parameters) (Wang et al. 2024). Our method is partly inspired by (Alfarghaly et al. 2021), where a conditioned transformer model is used to generate radiology reports.

Florence-2

Florence-2 is a recent VLM (Xiao et al. 2023) (0.7B parameters) developed for various vision and vision-language tasks with a unified sequence-to-sequence architecture. It uses a Dual Attention Vision Transformer (DaViT) (Ding et al. 2022) as its vision encoder to process images into token embeddings $V \in \mathbb{R}^{N_v \times D_v}$, where $N_v$ is the number of visual tokens and $D_v$ their dimensionality. An extended version of the tokenizer combines these embeddings with task-related text prompts $T_{\mathrm{prompt}} \in \mathbb{R}^{N_t \times D}$.

The model uses a multi-modal transformer-based encoder-decoder with input $X = [V', T_{\mathrm{prompt}}]$, where $V'$ is the dimensionally aligned projection of $V$. It is trained with a standard cross-entropy language modeling objective:

\[
L = -\sum_{i=1}^{|y|} \log P_{\theta}(y_i \mid y_{<i}, x),
\]

where $y$ represents the target sequence and $x$ combines the visual and textual inputs.
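To make the objective concrete, here is a small PyTorch-style sketch of the token-level cross-entropy computation under teacher forcing; `decoder_logits` and `target_ids` are hypothetical placeholders for the decoder outputs and report tokens, and the multimodal encoding itself is abstracted away.

```python
import torch
import torch.nn.functional as F

def language_modeling_loss(decoder_logits: torch.Tensor,
                           target_ids: torch.Tensor,
                           pad_id: int) -> torch.Tensor:
    """Token-level cross-entropy L = -sum_i log P(y_i | y_<i, x).

    decoder_logits: (batch, seq_len, vocab) scores produced by the decoder
                    conditioned on the visual tokens and the text prompt.
    target_ids:     (batch, seq_len) ground-truth report tokens.
    """
    # Shift so that position t predicts token t+1 (standard teacher forcing).
    logits = decoder_logits[:, :-1, :].reshape(-1, decoder_logits.size(-1))
    targets = target_ids[:, 1:].reshape(-1)
    # Padding tokens do not contribute to the loss.
    return F.cross_entropy(logits, targets, ignore_index=pad_id)
```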
Florence-2 is trained on FLD-5B, a large-scale dataset with 126M images and more than 5 billion annotations (text, region-text pairs, and text-phrase-region triplets), which empowers the model to learn multiple levels of spatial and semantic granularity.

PaliGemma

PaliGemma is a unified VLM (Beyer et al. 2024) for general-purpose multi-modal use, based on the SigLIP ViT-So400m image encoder (Zhai et al. 2023) and the Gemma-2B decoder-only language model (Team et al. 2024), with fewer than 3 billion parameters in total. The SigLIP encoder uses a contrastive pre-training method with a sigmoid loss to produce image embeddings, achieving state-of-the-art CLIP-level visual representation quality. The textual inputs to the Gemma-2B model are processed with a SentencePiece tokenizer, and the image tokens and text prompt tokens are combined into a single sequence that is decoded autoregressively. A linear projection aligns the SigLIP output with Gemma-2B's input space, facilitating seamless integration. The model is further trained on higher-resolution images and domain-focused data in the final stages, which leads to better accuracy. For PaliGemma, an input sequence takes the following form:

tokens = [image tokens ..., BOS, prefix tokens ..., SEP, suffix tokens ..., EOS, PAD ...]

Here, BOS = beginning-of-sentence token, EOS = end-of-sentence token, PAD = padding token, and SEP = separator token.
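The following is a simplified sketch of how such a sequence could be laid out in code; the token ids, lengths, and helper name are illustrative, not PaliGemma's actual vocabulary or API.

```python
from typing import List

def build_input_sequence(image_tokens: List[int],
                         prefix_tokens: List[int],
                         suffix_tokens: List[int],
                         bos_id: int, sep_id: int,
                         eos_id: int, pad_id: int,
                         max_len: int) -> List[int]:
    """Lay out [image..., BOS, prefix..., SEP, suffix..., EOS, PAD...].

    The prefix carries the task prompt (e.g. the tag-conditioned instruction),
    and the suffix carries the target report during fine-tuning.
    """
    seq = (image_tokens + [bos_id] + prefix_tokens + [sep_id]
           + suffix_tokens + [eos_id])
    seq = seq[:max_len]
    # Right-pad to a fixed length so sequences can be batched.
    return seq + [pad_id] * (max_len - len(seq))
```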
Experiments

Datasets

MIMIC-CXR Database v2.0.0 This publicly available, large-scale database is intended to facilitate medical imaging research, especially in the area of chest radiography. The dataset contains 377,110 JPG-format images and structured labels derived from the 227,827 free-text radiology reports associated with these images (Johnson et al. 2024).

Indiana University Chest X-rays and Reports (OpenI) Accessible via OpenI, the Indiana University Chest X-rays dataset is an extensive collection of chest radiographs with their corresponding diagnostic reports, intended to facilitate medical imaging research and education. The dataset contains 7,470 pairs of images and reports covering a broad spectrum of both common and uncommon thoracic disorders (Demner-Fushman et al. 2015).

Chest X-ray images from both datasets were used as inputs, with the corresponding findings from each image serving as outputs to fine-tune the proposed model. The manual tags for fine-tuning the CNN classifier were provided by (Alfarghaly et al. 2021).

Radiology Report Training Dataset To ensure the accuracy and relevance of the generated diagnostic reports, we pre-process the radiology reports by eliminating the routine or repetitive words and phrases generally contained in chest X-ray reports. These words and phrases are common in normal (healthy) cases and can cause skewed results, where the VLM generates many generic terms to reduce the training loss. To address this, the dataset is adjusted by a filtering process using the pre-trained Llama 3.1-8B model (Sam and Vavekanand 2024), as shown in Fig. 2.

Figure 2: Radiology report pre-processing for training.

Evaluation Metrics

To evaluate the performance of our fine-tuned VLMs, we use two performance metrics: BERTScore and F1-CheXbert. We opt for BERTScore as it is an automated assessment measure for text generation. Similar to conventional metrics, BERTScore computes a similarity score for every token in the candidate sentence with respect to each token in the reference sentence; instead of relying on exact matches, it calculates token similarity using contextual embeddings (Zhang et al. 2020).

In contrast, the F1-CheXbert score (Smit et al. 2020) uses the CheXbert transformer (Smit et al. 2020) to output selected labels for both the original and generated reports, and then calculates the F1 score between these two sets of labels.
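For concreteness, the sketch below shows how these two metrics can be computed. The example reports and the CheXbert label vectors are hypothetical placeholders, and running the CheXbert labeler itself is omitted.

```python
from bert_score import score as bert_score
from sklearn.metrics import f1_score

# Hypothetical candidate/reference pair; real evaluation runs over the test set.
generated = ["heart size is enlarged with a small left pleural effusion"]
reference = ["cardiomegaly with a small left-sided pleural effusion"]

# BERTScore: contextual-embedding similarity between candidate and reference tokens.
P, R, F1 = bert_score(generated, reference, lang="en")
print("BERTScore P/R/F1:",
      P.mean().item(), R.mean().item(), F1.mean().item())

# F1-CheXbert: run the CheXbert labeler on both reports (omitted here) to obtain
# binary label vectors, then compute the F1 score between the two label sets.
labels_reference = [1, 0, 1, 0]  # hypothetical CheXbert labels for the reference
labels_generated = [1, 0, 0, 0]  # hypothetical CheXbert labels for the candidate
print("F1-CheXbert:", f1_score(labels_reference, labels_generated))
```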
Implementation Details

The multi-label CNN-based classification model is fine-tuned using TensorFlow, following the methodology outlined in (Alfarghaly et al. 2021). With 32 images per batch and the Adam optimizer, the model is trained end-to-end using mini-batch gradient descent, with the binary cross-entropy loss defined above as the training objective. All model parameters were left open for fine-tuning.

The VLMs, on the other hand, are implemented in PyTorch. Both models are trained and tested on Intel(R) Xeon(R) CPUs and L4 GPUs provided by Google Colab notebooks. The loss functions provided by the VLM packages are used for training the Vision-Language Models. The Adam optimizer, with an initial learning rate of $5 \times 10^{-5}$ and a batch size of 16 for the multi-label classifier and 4 for the VLMs, is employed for training or fine-tuning. The VLMs were fine-tuned for 10 epochs.
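A minimal sketch of the VLM fine-tuning loop implied by these settings is shown below, assuming a Hugging Face-style model whose forward pass returns a packaged loss; the dataset and collate function are left abstract.

```python
import torch
from torch.utils.data import DataLoader

# Hyper-parameters from the setup described above.
LEARNING_RATE = 5e-5
BATCH_SIZE = 4        # for the VLMs (16 for the multi-label classifier)
NUM_EPOCHS = 10

def finetune_vlm(model, train_dataset, collate_fn, device="cuda"):
    """Generic fine-tuning loop for a VLM whose forward() returns .loss,
    as the Hugging Face implementations of Florence-2 / PaliGemma do."""
    loader = DataLoader(train_dataset, batch_size=BATCH_SIZE,
                        shuffle=True, collate_fn=collate_fn)
    optimizer = torch.optim.Adam(model.parameters(), lr=LEARNING_RATE)
    model.to(device).train()
    for epoch in range(NUM_EPOCHS):
        for batch in loader:
            batch = {k: v.to(device) for k, v in batch.items()}
            outputs = model(**batch)   # packaged loss from the VLM library
            outputs.loss.backward()
            optimizer.step()
            optimizer.zero_grad()
```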
Results

Qualitative analysis: Extracting abnormal findings

A Llama 3.1-8B Instruct model was used to extract the sentences or phrases in the findings that indicated clinical abnormalities. Zero-shot prompting the Llama model with the instruction "Extract the sentences or phrases that seem to indicate clinical abnormalities" did not generate fruitful results. Thus, we consulted experienced radiologists to create a few examples of clinical abnormal findings extracted from the rest of the report. With 12 examples, few-shot prompting the Llama model with the same instruction produced much better results, as shown in Fig. 2. Following this adjustment, two trained radiologists reviewed 30 reports each and affirmed that the Llama model efficiently eliminated the redundant information and effectively extracted the abnormal findings.
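Below is a hedged sketch of this few-shot setup; the two in-context examples are invented placeholders rather than the 12 radiologist-curated examples, and the model identifier refers to the public Hugging Face release of Llama 3.1-8B Instruct.

```python
from transformers import pipeline

INSTRUCTION = ("Extract the sentences or phrases that seem to indicate "
               "clinical abnormalities")

# Two illustrative in-context examples; the actual 12 examples were curated
# with radiologists and are not reproduced here.
FEW_SHOT = [
    ("The heart is mildly enlarged. Lungs are clear. No pneumothorax.",
     "The heart is mildly enlarged."),
    ("Stable small right pleural effusion. No acute osseous abnormality.",
     "Stable small right pleural effusion."),
]

def build_messages(report: str) -> list:
    """Assemble a few-shot chat prompt for the extraction task."""
    messages = [{"role": "system", "content": INSTRUCTION}]
    for findings, abnormal in FEW_SHOT:
        messages.append({"role": "user", "content": findings})
        messages.append({"role": "assistant", "content": abnormal})
    messages.append({"role": "user", "content": report})
    return messages

# Recent transformers versions apply the model's chat template to message lists.
extractor = pipeline("text-generation",
                     model="meta-llama/Meta-Llama-3.1-8B-Instruct")
result = extractor(build_messages("Findings: the cardiac silhouette is enlarged. "
                                  "No focal consolidation."),
                   max_new_tokens=128)
```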
Vision-Language model comparison and necessity of prompt guidance

While comparing the two VLMs, we also tested their performance with a generic prompt ("Write a Chest X-ray report") and a tag-specific prompt (such as "Write a Chest X-ray report mentioning cardiomegaly" or "Write a Chest X-ray report mentioning pleural effusion, consolidation"). Table 1 shows that PaliGemma outperforms Florence-2 on all the BERTScore and F1-CheXbert metrics. For both VLMs, we observe that tag-specific prompts enhance the performance of the model. For Florence-2, tag-specific prompts increase the BERTScore (by at least 4%) and the F1-CheXbert score from 0.5516 to 0.700; that is, the model generates almost 14.84% more accurate reports. For PaliGemma, both types of prompts result in similar BERTScore values, but the F1-CheXbert score shows that the model generates 2.5% more accurate reports with specific prompt guidance. Based on this comparative analysis, we determine that PaliGemma with tag-specific prompt guidance should be used in our pipeline.
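As a small illustration, the tag-specific prompts quoted above can be assembled directly from the classifier's positive tags; this helper is a sketch, not the exact prompt-construction code used in our pipeline.

```python
def build_prompt(positive_tags):
    """Generic prompt when no tags are predicted, tag-specific prompt otherwise."""
    if not positive_tags:
        return "Write a Chest X-ray report"
    return "Write a Chest X-ray report mentioning " + ", ".join(positive_tags)

assert build_prompt([]) == "Write a Chest X-ray report"
assert build_prompt(["pleural effusion", "consolidation"]) == (
    "Write a Chest X-ray report mentioning pleural effusion, consolidation")
```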
Comparison with captioning baselines

We compared our final method to three types of models: (1) CNN encoders with RNN decoders, (2) CNN encoders with Transformer decoders, and (3) Vision Transformer encoders with LLM decoders. For type 1, we consider the attention-based CNN-LSTM architectures of (Sirshar et al. 2022) and (Liu et al. 2021), and for comparison we follow the architecture of (Sirshar et al. 2022). Type 2 uses CNNs as encoders and transformers as decoders; a similar approach was taken in (Alqahtani et al. 2024), (Alfarghaly et al. 2021), and (Mondal et al. 2023), and we use the CNX-B2 architecture described in (Alqahtani et al. 2024) for comparative evaluation. Transformers were used end-to-end for encoding and decoding in the type-3 captioning baselines, as in (Wang et al. 2023) and (Nicolson, Dowling, and Koopman 2023); for our quantitative analysis, we use the CvT2DistilGPT2 architecture described in (Nicolson, Dowling, and Koopman 2023).
[Figure 3: (a) Report Aid Example 1; (b) Report Aid Example 2; (c) Report Aid Example 3.]
Table 2: Comparison with captioning baselines on the OpenI and MIMIC-CXR datasets.

OpenI Dataset
Pipeline                    BERTScore Precision   BERTScore Recall   BERTScore F1   F1-CheXbert
Attention-based CNN-LSTM    0.2174                0.2018             0.2123         0.5451
CNX-B2                      0.2194                0.1608             0.2268         0.6484
CvT2DistilGPT2              0.3089                0.2996             0.3009         0.7600
RadTextAid (Proposed)       0.3273                0.3322             0.3156         0.7916

MIMIC-CXR Dataset
Pipeline                    BERTScore Precision   BERTScore Recall   BERTScore F1   F1-CheXbert
Attention-based CNN-LSTM    0.1871                0.1926             0.1852         0.5347
CNX-B2                      0.1886                0.1606             0.2265         0.5333
CvT2DistilGPT2              0.2939                0.2927             0.2902         0.6813
RadTextAid (Proposed)       0.2813                0.3034             0.2975         0.6956
Table 2 shows the results for both MIMIC-CXR and IU X-Ray. These results show that the proposed method considerably improves performance over the baseline methods. When compared with the best existing method, CvT2DistilGPT2, RadTextAid shows an absolute improvement of 4.8% in the NLG metric (BERTScore) and generates 3.16% more accurate reports according to the F1-CheXbert score. Similar trends are observed in the results for MIMIC-CXR; however, the overall performance of all models drops, likely due to the significantly higher average findings length in the MIMIC-CXR dataset compared to the IU X-Ray dataset.

Qualitative results

We present a few qualitative examples of RadTextAid in Fig. 3 to demonstrate its superiority. The model, trained using our pipeline, successfully detected and described all the clinical abnormalities present in the original report, as shown in Fig. 3(a) and 3(c). However, in the second example, shown in Fig. 3(b), the model mistakenly detects and describes a clinical abnormality not mentioned in the reference report. This error may have occurred due to the image quality; a more robust image pre-processing step could help resolve this issue.

Conclusion

In this study, we demonstrate the effectiveness of a novel Vision-Language Model (VLM)-based framework for generating automated diagnostic reports to assist chest X-ray image analysis and reporting. First, the model uses a CheXNet-121-based multi-label classifier to identify 105 diagnostic tags from a chest X-ray image associated with different physiological conditions and pathologies. These tags, together with the original image, are input to a multi-modal VLM (e.g., Florence-2, PaliGemma), combining image and semantic information to create a coherent, relevant diagnostic radiology report.

A key innovation of the approach is pre-processing the training corpus with the pre-trained Llama 3.1 model, eliminating redundant and overused vocabulary from the text so that the generated reports are based only on the abnormalities carrying diagnostic information. This helps reduce duplication, improve clinical value, and provide a strong base for automatic report creation.

The experimental results indicate that our model achieves state-of-the-art performance based on the BERTScore and F1-CheXbert metrics. These results demonstrate the model's capability to generate relevant and accurate diagnostic reports. Our comparative analysis reveals that the proposed framework surpasses existing architectures, especially in integrating multi-modal learning techniques, leading to improved outcomes. Additionally, the lightweight models have a unique advantage in terms of scalability in low-resource healthcare settings, where access to a cloud server may not always be available. Overall, the proposed method and the experimental results highlight the effectiveness of our approach in optimizing automatic radiological reporting, which could significantly enhance patient care.
References

Alfarghaly, O.; Khaled, R.; Elkorany, A.; Helal, M.; and Fahmy, A. 2021. Automated radiology report generation using conditioned transformers. Informatics in Medicine Unlocked 24:100557.

Alqahtani, F. F.; Mohsan, M. M.; Alshamrani, K.; Zeb, J.; Alhamami, S.; and Alqarni, D. 2024. CNX-B2: A novel CNN-transformer approach for chest X-ray medical report generation. IEEE Access 12:26626–26635.

Beyer, L.; Steiner, A.; Pinto, A. S.; Kolesnikov, A.; Wang, X.; Salz, D.; Neumann, M.; Alabdulmohsin, I.; Tschannen, M.; Bugliarello, E.; Unterthiner, T.; Keysers, D.; Koppula, S.; Liu, F.; Grycner, A.; Gritsenko, A.; Houlsby, N.; Kumar, M.; Rong, K.; Eisenschlos, J.; Kabra, R.; Bauer, M.; Bošnjak, M.; Chen, X.; Minderer, M.; Voigtlaender, P.; Bica, I.; Balazevic, I.; Puigcerver, J.; Papalampidi, P.; Henaff, O.; Xiong, X.; Soricut, R.; Harmsen, J.; and Zhai, X. 2024. PaliGemma: A versatile 3B VLM for transfer.

Chen, H.; Zhang, H.; Chen, P.-Y.; Yi, J.; and Hsieh, C.-J. 2018. Attacking visual language grounding with adversarial examples: A case study on neural image captioning.

Demner-Fushman, D.; Kohli, M.; Rosenman, M.; Shooshan, S.; Rodriguez, L.; Antani, S.; Thoma, G.; and McDonald, C. 2015. Preparing a collection of radiology examinations for distribution and retrieval. Journal of the American Medical Informatics Association (JAMIA) 23.

Ding, M.; Xiao, B.; Codella, N.; Luo, P.; Wang, J.; and Yuan, L. 2022. DaViT: Dual attention vision transformers.

Huang, G.; Liu, Z.; and Weinberger, K. Q. 2016. Densely connected convolutional networks. CoRR abs/1608.06993.

Johnson, A.; Lungren, M.; Peng, Y.; Lu, Z.; Mark, R.; Berkowitz, S.; and Horng, S. 2024. MIMIC-CXR-JPG - chest radiographs with structured labels.

Kim, Y.; Wu, J.; Abdulle, Y.; Gao, Y.; and Wu, H. 2024. Enhancing human-computer interaction in chest X-ray analysis using vision and language model with eye gaze patterns.

Liu, F.; Yin, C.; Wu, X.; Ge, S.; Zhang, P.; and Sun, X. 2021. Contrastive attention for automatic chest X-ray report generation. In Zong, C.; Xia, F.; Li, W.; and Navigli, R., eds., Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, 269–280. Online: Association for Computational Linguistics.

Liu, F.; Yin, C.; Wu, X.; Ge, S.; Zou, Y.; Zhang, P.; Zou, Y.; and Sun, X. 2023. Contrastive attention for automatic chest X-ray report generation.

Liu, H.; Li, C.; Li, Y.; Li, B.; Zhang, Y.; Shen, S.; and Lee, Y. J. 2024. LLaVA-NeXT: Improved reasoning, OCR, and world knowledge.

Mohsan, M. M.; Akram, M. U.; Rasool, G.; Alghamdi, N. S.; Baqai, M. A. A.; and Abbas, M. 2023. Vision transformer and language model based radiology report generation. IEEE Access 11:1814–1824.

Mondal, C.; Pham, D.-S.; Gupta, A.; Ghosh, S.; Tan, T.; and Gedeon, T. 2023. EfficienTransNet: An automated chest X-ray report generation paradigm. In Proceedings of the 1st International Workshop on Multimodal and Responsible Affective Computing. New York, NY, USA: ACM.

Nicolson, A.; Dowling, J.; and Koopman, B. 2023. Improving chest X-ray report generation by leveraging warm starting. Artificial Intelligence in Medicine 144:102633.

OpenAI. 2024. GPT-4o system card.

Parag, P., and Hardcastle, T. C. 2022. Shortage of radiologists in low to middle income countries in the interpretation of CT scans in trauma. Bangladesh Journal of Medical Science 21(3):489–491.

Rajpurkar, P.; Irvin, J.; Zhu, K.; Yang, B.; Mehta, H.; Duan, T.; Ding, D. Y.; Bagul, A.; Langlotz, C. P.; Shpanskaya, K. S.; Lungren, M. P.; and Ng, A. Y. 2017. CheXNet: Radiologist-level pneumonia detection on chest X-rays with deep learning. CoRR abs/1711.05225.

Sam, K., and Vavekanand, R. 2024. Llama 3.1: An in-depth analysis of the next generation large language model.

Singh, A.; Raguru, J. K.; Prasad, G.; Chauhan, S.; Tiwari, P. K.; Zaguia, A.; and Ullah, M. A. 2022. Medical image captioning using optimized deep learning model. Computational Intelligence and Neuroscience 2022.

Sirshar, M.; Paracha, M. F. K.; Akram, M. U.; Alghamdi, N. S.; Zaidi, S. Z. Y.; and Fatima, T. 2022. Attention based automated radiology report generation using CNN and LSTM. PLoS One 17(1):e0262209.

Smit, A.; Jain, S.; Rajpurkar, P.; Pareek, A.; Ng, A. Y.; and Lungren, M. P. 2020. CheXbert: Combining automatic labelers and expert annotations for accurate radiology report labeling using BERT.

Srinivasan, P.; Thapar, D.; Bhavsar, A.; and Nigam, A. 2020. Hierarchical X-ray report generation via pathology tags and multi head attention. In Proceedings of the Asian Conference on Computer Vision (ACCV).

Srinivasan, P.; Thapar, D.; Bhavsar, A.; and Nigam, A. 2021. Hierarchical X-ray report generation via pathology tags and multi head attention. In Computer Vision – ACCV 2020, Lecture Notes in Computer Science. Cham: Springer International Publishing. 600–616.

Sutskever, I.; Vinyals, O.; and Le, Q. V. 2014. Sequence to sequence learning with neural networks. CoRR abs/1409.3215.

Team, G.; Mesnard, T.; Hardin, C.; Dadashi, R.; Bhupatiraju, S.; Pathak, S.; Sifre, L.; Rivière, M.; Kale, M. S.; Love, J.; et al. 2024. Gemma: Open models based on Gemini research and technology. arXiv preprint arXiv:2403.08295.

Wang, Z.; Liu, L.; Wang, L.; and Zhou, L. 2023. METransformer: Radiology report generation by transformer with multiple learnable expert tokens.

Wang, P.; Bai, S.; Tan, S.; Wang, S.; Fan, Z.; Bai, J.; Chen, K.; Liu, X.; Wang, J.; Ge, W.; Fan, Y.; Dang, K.; Du, M.; Ren, X.; Men, R.; Liu, D.; Zhou, C.; Zhou, J.; and Lin, J. 2024. Qwen2-VL: Enhancing vision-language model's perception of the world at any resolution.

Xiao, B.; Wu, H.; Xu, W.; Dai, X.; Hu, H.; Lu, Y.; Zeng, M.; Liu, C.; and Yuan, L. 2023. Florence-2: Advancing a unified representation for a variety of vision tasks. arXiv preprint arXiv:2311.06242.

Zhai, X.; Mustafa, B.; Kolesnikov, A.; and Beyer, L. 2023. Sigmoid loss for language image pre-training.

Zhang, T.; Kishore, V.; Wu, F.; Weinberger, K. Q.; and Artzi, Y. 2020. BERTScore: Evaluating text generation with BERT.