Vision-Language Models For Medical Report Generation and Visual Question Answering: A Review
Abstract
Medical vision-language models (VLMs) combine computer vision (CV) and natural lan-
guage processing (NLP) to analyze visual and textual medical data. Our paper reviews recent
advancements in developing VLMs specialized for healthcare, focusing on models designed for
medical report generation and visual question answering (VQA). We provide background on
NLP and CV, explaining how techniques from both fields are integrated into VLMs to enable
learning from multimodal data. Key areas we address include the exploration of medical vision-
language datasets, in-depth analyses of architectures and pre-training strategies employed in
recent noteworthy medical VLMs, and comprehensive discussion on evaluation metrics for as-
sessing VLMs’ performance in medical report generation and VQA. We also highlight current
challenges and propose future directions, including enhancing clinical validity and addressing
patient privacy concerns. Overall, our review summarizes recent progress in developing VLMs
to harness multimodal medical data for improved healthcare applications.
1 Introduction
The last decade has witnessed enormous progress in artificial intelligence (AI) and machine learn-
ing (ML), including the development of foundation models (FMs), large language models (LLMs),
and vision-language models (VLMs). These AI/ML developments have started transforming sev-
eral aspects of our daily lives, including healthcare. AI/ML can potentially transform the whole
healthcare continuum by significantly optimizing and improving disease screening and diagnostic
procedures, treatment planning, and post-treatment surveillance and care [Baj+21]. Various com-
puter vision (CV) and natural language processing (NLP) models, more recently LLMs, have been
instrumental in driving this transformative trend [He+23; Zho+23a]. CV models have been trained
and validated for various screening and diagnosis use cases leveraging radiology data from X-rays,
mammograms, magnetic resonance imaging (MRI), computed tomography (CT), and others. Re-
cently, AI models focused on digital pathology using histopathology and immunohistochemistry
data have also shown significant advances in accurate disease diagnosis, prognosis, and biomarker
identification [Waq+23b]. On the other hand, by training models using large datasets of med-
ical literature, clinical notes, and other healthcare-related text, LLMs can extract insights from
electronic health records (EHR) efficiently, assist healthcare professionals in generating concise
summary reports, and facilitate the interpretation of patient information. Noteworthy examples
of such LLMs include GatorTron [Yan+22], ChatDoctor [Li+23c], Med-PaLM (Medical Pathways
Language Model) [Sin+23] and Med-Alpaca [Han+23].
Healthcare data is inherently multimodal, and consequently, AI/ML models often need
to be trained using multiple data modalities, including text (e.g., clinical notes, radiology reports,
surgical pathology reports, etc.), imaging (e.g., radiological scans, digitized histopathology slides,
etc.), and tabular data (e.g., numerical data such as vitals or labs and categorical data such as
race, gender, and others) [Aco+22; Shr+23; Waq+23a; Tri+23; Moh23]. In routine clinical prac-
tice, healthcare professionals utilize a combination of these data modalities for diagnosing and
treating various conditions. Integrating information from diverse data modalities enhances the pre-
cision and thoroughness of disease assessments, diagnoses, treatment planning, and post-treatment
surveillance. The need for AI/ML models to ingest, integrate, and learn from information stemming
from varied data sources is the driving force for multimodal learning [Hua+21; Waq+23a].
The recent progress in multimodal learning has been driven by the development of vision-
language models (VLMs) [Gan+22; Che+23; Moh23]. These cutting-edge models can analyze,
interpret, and derive insights from both visual and textual data. In the medical domain, these
models contribute to developing a more holistic understanding of patient information and improv-
ing the performance of ML models in various clinical tasks. Many general-purpose models, such as CLIP (Contrastive Language–Image Pre-training) [Rad+21], LLaVa (Large Language and Vision Assistant) [Liu+23b], and Flamingo [Ala+22], have been tailored to the healthcare domain through training on extensive medical datasets. Adapting VLMs for medical visual question-answering [Lin+23b] is
particularly noteworthy, empowering healthcare professionals to pose queries regarding medical
images such as CT scans, MRIs, mammograms, ultrasound, X-rays, and more. The question-
answering capability elevates the interactive nature of the AI/ML models in healthcare, facilitating
dynamic and informative exchanges between healthcare providers and the AI system. Furthermore,
adapting VLMs for medical report generation enables them to produce detailed and contextually
relevant reports by amalgamating information from both visual and textual sources. This not only
streamlines the documentation process but also ensures that the generated reports are comprehen-
sive and accurately reflect the subtleties present in the data, further enhancing healthcare workflow
efficiency.
In contrast to previous related surveys [Lin+23b; TLZ23; Shr+23], this review focuses on the
latest advancements in VLMs tailored for medical report generation and visual question-answering.
The overall structure of this review is shown in Fig. 1 and is outlined as follows. In Section 2,
we provide essential background on neural networks, CV, and NLP. In Section 3, we delve into
the exploration of VLMs’ architectures, training strategies, and downstream tasks. The goal of
Section 2 and Section 3 is to ensure the accessibility of this review for readers, irrespective of their
ML background. We split Section 4 into three key sub-sections. In Section 4.1, we describe 17
publicly available vision-language datasets. These datasets encompass medical image-text pairs or
question-answer pairs related to medical images. Next, in Section 4.2, we meticulously outline the
metrics and their formulas, where applicable, employed for evaluating VLMs in the context of report
generation and visual question-answering tasks. In Section 4.3, we conduct a thorough review of 15
recent medical VLMs, with 14 of them being publicly available. To the best of our knowledge, most
of these models have not been reviewed in any previous surveys. Finally, in Section 5, we discuss
the current challenges within the field of medical VLMs, offering insights into potential research
directions that could profoundly influence their future development. The list of medical VLMs and
datasets can also be found on GitHub.
[Fig. 1: The overall structure of this review: ML background (neural networks, NLP, CV); VLM architectures, training strategies, and downstream tasks; medical datasets for VLMs; evaluation metrics for report generation and VQA; recent medical VLMs; and challenges and potential future directions.]
A neural network (NN) is typically organized into an input layer, an output layer, and multiple intermediate layers called hidden layers. The
basic NN is a “feedforward NN”, where neurons can be numbered in such a way that a connection
from neuron i to neuron j can exist if and only if i < j [Bal21]. In any NN, the connections
between nodes carry weight, and neurons utilize “activation functions” on their inputs. Activation
functions play a crucial role by introducing non-linearity to the model, enabling it to learn com-
plex nonlinear mappings between inputs and outputs. Common activation functions include the
sigmoid, hyperbolic tangent (tanh), and Rectified Linear Unit (ReLU). NNs utilize a loss function
to quantify the difference between the predicted outputs and the actual targets. The loss function
produces a scalar value, and the goal during training is to minimize this loss value.
Backpropagation, short for backward propagation of errors, is a key algorithm for training deep
neural networks. During the forward pass, input data is fed through the network, predictions are
generated, and a scalar loss value is calculated using the loss function. During backpropagation, we
calculate the gradient of the loss function with respect to the weights of the network. This gradient
information is then used to update the weights in an effort to minimize the difference between
predicted and target values. Backpropagation is an application of the chain rule for computing
derivatives [Bal21]. After the backward pass, the optimization algorithm takes these gradients and
adjusts the learnable parameters (weights and biases) of the NN, which, in turn, will result in the
minimization of the loss value in the next batch. Common optimization methods include gradient
descent, stochastic gradient descent (SGD) [Rob51], Adam (Adaptive Moment Estimation) [KB14], and many others.
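To make these steps concrete, the following is a minimal illustrative sketch (not from the original paper) of one training loop for a small feedforward NN in PyTorch, assuming the library is available: the forward pass produces predictions and a scalar loss, loss.backward() performs backpropagation, and an SGD optimizer updates the weights.

```python
import torch
import torch.nn as nn

# Toy regression data: 64 samples with 4 features each.
x = torch.randn(64, 4)
y = torch.randn(64, 1)

# A small feedforward NN: linear -> ReLU -> linear.
model = nn.Sequential(nn.Linear(4, 16), nn.ReLU(), nn.Linear(16, 1))
loss_fn = nn.MSELoss()                                    # loss function
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)  # SGD optimizer

for step in range(100):
    optimizer.zero_grad()           # clear gradients from the previous step
    predictions = model(x)          # forward pass
    loss = loss_fn(predictions, y)  # scalar loss value
    loss.backward()                 # backpropagation: gradients via the chain rule
    optimizer.step()                # update weights to reduce the loss
```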
2.2 Natural Language Processing (NLP)
NLP is a branch of AI aimed at understanding, processing, and generating human language. Named entity recognition (NER),
a prominent NLP task, focuses on identifying and classifying entities within the text, such as
names of individuals, medical conditions, etc. For instance, in medical literature, NER can assist
in extracting crucial information from documents. Text summarization in NLP is widely used for
generating coherent summaries of lengthy texts. Sentiment analysis is a task that determines the
emotional tone expressed in a given text, providing valuable insights for applications like social
media monitoring or customer feedback analysis. Machine translation is a fundamental NLP task
in breaking down language barriers by automatically translating text from one language to another.
Question answering in NLP is directed at comprehending and responding to user queries, propelling
advancements in virtual assistants and information retrieval.
2.2.1 Tokenization
The first step in NLP is tokenization, which is the mechanism of splitting or fragmenting the
sentences and words to their possible smallest morpheme, called a token. A morpheme is the
smallest possible word after which it cannot be broken further [RB21]. One example of a word-
level tokenization method is whitespace tokenization, which segments text based on whitespace
characters. In many NLP applications, subword tokenization methods are preferred due to their
effectiveness in handling out-of-vocabulary words. WordPiece [Wu+16] begins by treating each
character as a token, creating an initial vocabulary. Employing a flexible and adaptive merging
strategy, WordPiece considers any pair of adjacent characters or subword units that enhance the
overall likelihood of the training data. This likelihood reflects the model’s probability of accurately
representing the training data given its current state. In contrast, Byte-Pair Encoding (BPE)
[SHB16] shares similarities with WordPiece but adheres to a more deterministic merging strategy.
In each iteration, BPE merges the most frequent pair of adjacent characters or subword units,
progressing toward a predefined vocabulary size. Byte-level BPE [WCG20] operates at an even
finer granularity, considering individual bytes rather than characters. Byte-level BPE extends the
concept of subword tokenization to bytes, allowing it to capture more nuanced patterns at the byte
level.
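To make the BPE merging strategy concrete, here is a minimal, illustrative Python sketch (not tied to any specific tokenizer library) that repeatedly merges the most frequent pair of adjacent symbols in a tiny corpus until a target number of merges is reached.

```python
from collections import Counter

def bpe_merges(words, num_merges):
    """Toy BPE: words is a list of strings; returns the learned merge rules."""
    corpus = [list(w) for w in words]   # start from individual characters
    merges = []
    for _ in range(num_merges):
        # Count all adjacent symbol pairs in the corpus.
        pairs = Counter()
        for symbols in corpus:
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        best = max(pairs, key=pairs.get)   # most frequent adjacent pair
        merges.append(best)
        # Merge every occurrence of the best pair into a single symbol.
        merged_corpus = []
        for symbols in corpus:
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged_corpus.append(out)
        corpus = merged_corpus
    return merges

print(bpe_merges(["low", "lower", "lowest", "low"], num_merges=3))
```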
fastText [Boj+17] decomposes words into character n-grams and employs a skip-gram training method similar to Word2Vec [Mik+13a] to learn embeddings for these sub-word units. In tasks where the order of words is not essential, other feature extraction techniques can be effective, for example, bag-of-words (BoW) in text classification or term frequency-inverse document frequency (tf-idf) for information retrieval.
In addition to general-purpose word embeddings, there are ones designed for biomedical and
clinical terms. BioWordVec [Zha+19] incorporates MeSH (Medical Subject Headings) terms along
with text from PubMed abstracts and employs the fastText [Boj+17] algorithm to learn improved
biomedical word embeddings. Another prominent approach is Cui2vec [Bea+20], which leverages
diverse multi-modal data derived from medical publications and clinical notes. Cui2vec system-
atically maps medical terms onto a common Concept Unique Identifier (CUI) space, followed by
the construction of a co-occurrence matrix. This matrix captures instances where different CUIs
appear together, which is a foundation for generating word embeddings using techniques such as
GloVe [PSM14] or Word2Vec [Mik+13a]. In most cases, it is common to add positional encodings
to capture the order of tokens in a sequence. Positional encoding vectors, often based on sinusoidal
functions, systematically encode token positions, enriching embeddings with positional information
for utilization in ML models tailored to specific NLP tasks [Ahm+23].
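A minimal NumPy sketch of the sinusoidal positional encodings mentioned above (following the widely used Transformer formulation; the dimensions here are purely illustrative):

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """Return a (seq_len, d_model) matrix of sinusoidal positional encodings."""
    positions = np.arange(seq_len)[:, None]     # (seq_len, 1)
    dims = np.arange(d_model)[None, :]          # (1, d_model)
    angle_rates = 1.0 / np.power(10000.0, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates
    encoding = np.zeros((seq_len, d_model))
    encoding[:, 0::2] = np.sin(angles[:, 0::2])  # even dimensions use sine
    encoding[:, 1::2] = np.cos(angles[:, 1::2])  # odd dimensions use cosine
    return encoding

# Token embeddings are typically summed with the positional encodings:
token_embeddings = np.random.randn(16, 64)       # 16 tokens, 64-dim embeddings
inputs = token_embeddings + sinusoidal_positional_encoding(16, 64)
```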
2.2.4 Transformers
In recent years, there has been a remarkable advancement in NLP mainly due to the development of
the Transformer models [Vas+17]. Beyond incorporating embeddings and positional encodings, the
Transformer architecture consists of an encoder that processes input data, represented by vectors
obtained from embedded and positionally encoded tokens. The encoder-generated representation
then serves as the input for the subsequent decoder, which transforms these vector representations
into a relevant output tailored to the specific task at hand. A defining characteristic of the Trans-
former lies in its self-attention mechanism, notably the scaled dot-product attention, which proves
instrumental in capturing intricate dependencies within sequences. This mechanism, employed in
both the encoder and decoder, utilizes queries and keys during the attention process. Queries
serve as projections of the input sequence, encapsulating information for attending to other posi-
tions, while keys represent the positions within the sequence. Enhanced by multi-head attention
for parallelization, the self-attention mechanism enables the model to dynamically weigh different
parts of the input sequence, fostering a nuanced understanding of contextual relationships. Each
layer in both the encoder and decoder encompasses sub-layers, including a feedforward NN, fur-
ther augmenting the model’s capacity to capture intricate patterns within the data. In practice,
Transformers face limitations in effectively processing long sequences and exhibit less selectivity
about relevant information when considering all positions in the sequence. Various techniques
have been proposed to address these issues. One such approach, known as hierarchical attention
[Yan+16], strategically reduces computational complexity and enhances contextual sensitivity by
initially computing attention at the word level and then at the sentence level. Another notable ad-
vancement in attention algorithms is the FlashAttention [Dao+22] and FlashAttention-2 [Dao23],
designed to accelerate attention computations significantly.
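The scaled dot-product attention at the heart of the Transformer can be sketched in a few lines of NumPy; this simplified version omits the multi-head projections and masking for clarity.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q, K, V: (seq_len, d_k) matrices of queries, keys, and values."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # pairwise query-key similarity, scaled
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax over keys
    return weights @ V                # weighted sum of values

seq_len, d_k = 10, 32
Q = np.random.randn(seq_len, d_k)
K = np.random.randn(seq_len, d_k)
V = np.random.randn(seq_len, d_k)
context = scaled_dot_product_attention(Q, K, V)   # (seq_len, d_k)
```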
The synergy between the enhanced computational power provided by Graphical Processing
Units (GPUs) and the advancements in attention mechanisms has played a pivotal role in the
development of large language models (LLMs). These models are meticulously trained on vast
datasets with a very large number of parameters. Initial LLMs include, but are not limited to, BERT (Bidirectional Encoder Representations from Transformers) [Dev+19] (the largest version comprising roughly 340 M parameters), ALBERT (A Lite BERT) [Lan+19] (with the base variant at only 12 M parameters), and Megatron-LM [Sho+19] (the largest version featuring 8.3 B parameters). The era of even larger LLMs began in 2020, introducing models like GPT-3 (the third-generation Generative Pre-trained Transformer) [Bro+20] (175 B parameters) and PaLM (Pathways Language Model)
[Cho+22] (540 B parameters). Some of the most recent LLMs are LLaMA (Large Language Model
Meta AI) [Tou+23b], Vicuna [Chi+23], Llama 2 [Tou+23a], and Mistral [Jia+23]. Note that
encoder-only LLMs can be used to generate token embeddings (e.g., BERT [Dev+19] or GatorTron
[Yan+22]).
Vision Transformers (ViTs) [Dos+21] adapt the Transformer architecture to visual data by treating images as sequences of smaller patches. Each image patch undergoes a process of
flattening into a vector, followed by passage through an embedding layer. The embedding layer
enriches the flattened image patches, providing a more expressive and continuous representation.
Next, positional encodings are incorporated into the embeddings, conveying information about
the spatial arrangement of the image patches. A distinctive feature of ViTs is the introduction
of a special token designed to capture global information about the entire image. This special
token has an associated learnable token embedding, represented by a vector with its unique set
of parameters. ViTs have achieved notable success in semantic segmentation [RBK21], anomaly
detection [Mis+21], medical image classification [Man+23] and even outperformed CNNs in some
cases [Tya+21; Xin+22].
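The patch-embedding step of a ViT can be sketched as follows (illustrative shapes and random stand-in weights; real implementations typically use a strided convolution): the image is split into patches, each patch is flattened and linearly projected, a learnable classification token is prepended, and positional embeddings are added.

```python
import numpy as np

def vit_patch_embed(image, patch_size, d_model, rng):
    """image: (H, W, C) array. Returns (1 + num_patches, d_model) token embeddings."""
    H, W, C = image.shape
    patches = []
    for i in range(0, H, patch_size):
        for j in range(0, W, patch_size):
            patches.append(image[i:i + patch_size, j:j + patch_size, :].reshape(-1))
    patches = np.stack(patches)                          # (num_patches, p*p*C), flattened

    W_proj = rng.standard_normal((patches.shape[1], d_model)) * 0.02  # embedding layer
    tokens = patches @ W_proj                            # (num_patches, d_model)

    cls_token = rng.standard_normal((1, d_model)) * 0.02  # special global token
    tokens = np.concatenate([cls_token, tokens], axis=0)

    pos_embed = rng.standard_normal(tokens.shape) * 0.02  # positional embeddings
    return tokens + pos_embed

rng = np.random.default_rng(0)
image = rng.standard_normal((224, 224, 3))
tokens = vit_patch_embed(image, patch_size=16, d_model=768, rng=rng)  # (197, 768)
```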
Single-Stream Models A single-stream VLM adopts an efficient architecture for processing both
visual and textual information within a unified module. This architecture incorporates an early
fusion of distinct data modalities, where feature vectors from various data sources are concatenated
into a single vector (e.g., MedViLL [Moo+22]). Subsequently, this combined representation is fed
into a single stream. One notable advantage of the single-stream design is its parameter efficiency,
achieved by employing the same set of parameters for all modalities. This not only simplifies the
model but also contributes to computational efficiency during both training and inference phases
[Che+23].
Dual-Stream Models A dual-stream VLM extracts visual and textual representations sepa-
rately in parallel streams that do not share parameters. This architecture usually has higher
computational complexity than single-stream architectures. Visual features are generated from
pre-trained vision encoders, such as CNNs or ViTs, and textual features are obtained from pre-
trained text encoders, usually based on a Transformer architecture (e.g., PubMedCLIP [EMD23]).
Both features are then fed into a multimodal fusion module, often leveraging attention mechanisms,
to integrate information from both data modalities and to learn cross-modal representations. This
late fusion approach allows for more intricate interactions between visual and textual information,
enabling the model to capture complex cross-modal dependencies. However, it comes at the cost
of increased computational complexity compared to single-stream architecture.
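The contrast between early (single-stream) and late (dual-stream) fusion can be summarized with a small illustrative PyTorch sketch; the module sizes and encoders are stand-ins, not taken from any particular model.

```python
import torch
import torch.nn as nn

d = 256
image_feats = torch.randn(1, 49, d)   # e.g., 49 visual patch features
text_feats = torch.randn(1, 32, d)    # e.g., 32 token features

# Single-stream (early fusion): concatenate modalities and process them jointly.
single_stream = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d, nhead=8, batch_first=True), num_layers=2)
joint_input = torch.cat([image_feats, text_feats], dim=1)   # (1, 81, d)
joint_repr = single_stream(joint_input)

# Dual-stream (late fusion): separate encoders, then a cross-modal fusion step.
vision_stream = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d, nhead=8, batch_first=True), num_layers=2)
text_stream = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d, nhead=8, batch_first=True), num_layers=2)
cross_attention = nn.MultiheadAttention(embed_dim=d, num_heads=8, batch_first=True)

v = vision_stream(image_feats)
t = text_stream(text_feats)
fused, _ = cross_attention(query=t, key=v, value=v)   # text attends to vision
```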
Encoder-only Models These models are advantageous in scenarios where the primary objec-
tive is efficient representation learning. They often exhibit streamlined processing and reduced
computational complexity, making them suitable for tasks requiring compact and informative rep-
resentations. However, these models might lack the capability to generate intricate and detailed
outputs, limiting their use in tasks demanding nuanced responses or creative generation.
Encoder-Decoder Models These models offer the flexibility to generate complex and diverse
outputs, making them well-suited for tasks like image captioning, translation, or any application re-
quiring creative responses. The decoding step allows for the transformation of joint representations
into meaningful outputs. However, this versatility comes at the cost of increased computational
load and complexity.
3.2.3 Self-Supervised Learning (SSL)
SSL is a fundamental paradigm in training VLMs, offering a powerful alternative to traditional
supervised learning by allowing models to generate their own labels from the data [Ran+23a]. This
is particularly beneficial when obtaining large amounts of labeled data is challenging or expensive.
In self-supervised learning for VLMs, the models formulate tasks that leverage inherent structures
within the data, enabling them to learn meaningful representations across modalities without ex-
plicit external labels. Contrastive learning, masked language modeling, and masked image modeling
(described in the following sub-section) are examples of self-supervised learning tasks.
Masked Language Modeling (MLM) MLM is a widely used task in NLP [Tay53]. It was
first introduced and applied in the BERT model [Dev+19]. MLM involves randomly selecting a
percentage of tokens within textual data and replacing them with a special token, often denoted as
MASK. The model predicts these masked tokens by taking into account the context on both sides
of them, allowing the model to grasp nuanced contextual information. VLMs such as UNITER
[Che+20b] and VisualBERT [Li+19] leverage MLM for pre-training.
Masked Image Modeling (MIM) Extending the idea of MLM to images gave rise to MIM
[Xie+22]. In MIM, certain patches are masked, prompting the model to predict the contents of
masked regions. This process enables the model to draw context from the entirety of the image, en-
couraging the integration of both local and global visual features. VLMs like UNITER [Che+20b]
and ViLBERT [Lu+19] leverage MIM for enhanced performance. The cross-entropy loss is em-
ployed in MLM and MIM tasks to measure the difference between predicted and actual probability
distributions for the masked elements. Additionally, MLM can be combined with MIM, allowing the
reconstruction of the masked signal in one modality with support from another modality [Kwo+23].
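A minimal sketch of the MLM masking and loss computation (a generic token-ID tensor, an arbitrary vocabulary size, and a stand-in encoder are assumed): roughly 15% of tokens are replaced with a [MASK] id, and cross-entropy is computed only at the masked positions.

```python
import torch
import torch.nn as nn

vocab_size, mask_id, d = 1000, 999, 128
token_ids = torch.randint(0, vocab_size - 1, (4, 32))   # batch of 4 sequences

# Randomly choose ~15% of positions to mask.
mask = torch.rand(token_ids.shape) < 0.15
inputs = token_ids.clone()
inputs[mask] = mask_id

# A stand-in "model": embedding -> encoder layer -> vocabulary logits.
embed = nn.Embedding(vocab_size, d)
encoder = nn.TransformerEncoderLayer(d_model=d, nhead=8, batch_first=True)
to_vocab = nn.Linear(d, vocab_size)
logits = to_vocab(encoder(embed(inputs)))               # (4, 32, vocab_size)

# Cross-entropy over the masked positions only; targets are the original tokens.
loss = nn.functional.cross_entropy(logits[mask], token_ids[mask])
```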
Image-Text Matching (ITM) ITM is another common vision-language pre-training task. Through-
out the training, the model learns to map images and corresponding textual descriptions into a
shared semantic space, where closely aligned vectors represent similar content in both modali-
ties. In single-stream VLMs, the special token [CLS] represents the joint representation for both
modalities. In contrast, in dual-stream VLMs, the visual and textual representations of [CLS]V
and [CLS]T are concatenated. This joint representation is fed into a fully-connected layer followed
by the sigmoid function, predicting a score indicating match or mismatch [Che+23]. Models like
CLIP [Rad+21], ALBEF (ALign the image and text representations BEfore Fusing) [Li+21], and
METER [Dou+22] leverage ITM during pre-training.
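A sketch of an ITM head for a dual-stream model (illustrative dimensions, not a specific model's implementation): the visual and textual [CLS] representations are concatenated, passed through a fully connected layer, and a sigmoid produces the match probability.

```python
import torch
import torch.nn as nn

d = 256
cls_visual = torch.randn(8, d)    # [CLS]_V embeddings for a batch of 8 pairs
cls_text = torch.randn(8, d)      # [CLS]_T embeddings for the same batch
labels = torch.randint(0, 2, (8,)).float()   # 1 = matched pair, 0 = mismatched

itm_head = nn.Linear(2 * d, 1)                       # fully connected layer
joint = torch.cat([cls_visual, cls_text], dim=-1)    # concatenated joint representation
match_prob = torch.sigmoid(itm_head(joint)).squeeze(-1)

# Binary cross-entropy between predicted match scores and ground-truth labels.
itm_loss = nn.functional.binary_cross_entropy(match_prob, labels)
```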
Combining Multiple Tasks In VLM pre-training, multiple tasks are often combined in a uni-
fied framework, allowing models to grasp nuanced contextual information across modalities. The
final loss function can combine contrastive loss, cross-entropy loss for masked token prediction, and
other task-specific losses. This comprehensive pre-training approach equips VLMs with versatile
representations for diverse downstream tasks. For example, ALBEF [Li+21] adopts a comprehen-
sive pre-training objective encompassing three tasks: contrastive learning (CL), MLM, and ITM. The overall loss is then
computed as the sum of these individual components.
Supervised Fine-Tuning (SFT) Before employing SFT, the VLM is pre-trained on an ex-
tensive image-text dataset, establishing a foundational understanding of the complex relationship
between visual and textual representations. SFT involves meticulous fine-tuning on a more fo-
cused dataset, curated to match the nuances of the targeted application. This dual-phase strategy,
encompassing broad pre-training and task-specific fine-tuning, enables the model to benefit from
large-scale generalization while seamlessly adapting to the intricacies of particular applications
[Ouy+22].
Instruction Fine-Tuning (IFT) IFT refers to the process of refining a pre-trained language
model by providing specific instructions or guidance tailored to a particular task or application
[Ren+24]. This process typically involves exposing the model to examples or prompts related to
the desired instructions and updating its parameters based on the feedback received during this
task-specific training phase. The medical VLM RaDialog [Pel+23] employs this fine-tuning technique.
3.3 Parameter-Efficient Fine-Tuning (PEFT)
In this section, we explore strategies for adapting VLMs while keeping the model’s parameters frozen
and only updating newly added layers. In recent years, PEFT has gained prominence, encompassing
various techniques and strategies that aim to make the most effective use of parameters during the
fine-tuning process, particularly in scenarios with limited labeled data for the target task. The
main strategy of PEFT is incorporating task-specific parameters, known as adapters, into a pre-
trained model while retaining its original parameters. The architecture of adapter modules typically
incorporates a bottleneck structure, projecting original features into a reduced dimension, applying
non-linearity, and then projecting back to the original dimension. This thoughtful design ensures
parameter efficiency by limiting the number of added parameters per task. Integrated after each
layer of the pre-trained model, adapter modules capture task-specific details while preserving shared
parameters, enabling the model’s seamless extension to new tasks without significant interference
with previously acquired knowledge.
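A sketch of the bottleneck adapter module described above, in PyTorch (dimensions illustrative): features are down-projected, passed through a non-linearity, up-projected, and added back to the input through a residual connection; during PEFT only these adapter weights are updated while the pre-trained model stays frozen.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: down-project -> non-linearity -> up-project + residual."""
    def __init__(self, d_model, bottleneck):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)
        self.up = nn.Linear(bottleneck, d_model)
        self.act = nn.ReLU()

    def forward(self, x):
        return x + self.up(self.act(self.down(x)))   # residual preserves pre-trained behavior

d_model = 768
adapter = Adapter(d_model, bottleneck=64)             # few added parameters per layer
hidden_states = torch.randn(2, 32, d_model)           # output of a frozen pre-trained layer
adapted = adapter(hidden_states)

# In PEFT, the base model's parameters would be frozen and only adapters trained, e.g.:
# for p in base_model.parameters(): p.requires_grad = False   # base_model is hypothetical
```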
3.4.1 Prompt Engineering
Prompt engineering is a technique that involves enhancing a large pre-trained model with task-
specific instructions, referred to as prompts, to tailor the model’s output for specific tasks [Gu+23].
Examples include instructing the model to generate a radiology report for a specific image (e.g.,
RaDialog [Pel+23]). Prompt engineering can also expose the VLM to a sequence of interconnected
examples or prompts, guiding it to a desired output. Another approach incorporates progressively
structured instructions or questions, refining focus and enhancing the model’s ability to generate
coherent and contextually relevant responses [Gu+23].
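As a toy illustration (the prompt wording and placeholder names below are hypothetical, not taken from any cited model), a task-specific prompt for report generation might be assembled programmatically, optionally with a few in-context examples, before being passed to the language model.

```python
def build_report_prompt(findings_list, few_shot_examples=None):
    """Assemble a hypothetical prompt for radiology report generation."""
    prompt = "You are a radiology assistant.\n"
    if few_shot_examples:  # optional in-context examples to guide the output format
        for image_desc, report in few_shot_examples:
            prompt += f"Image findings: {image_desc}\nReport: {report}\n\n"
    prompt += f"Image findings: {', '.join(findings_list)}\n"
    prompt += "Write a concise radiology report for this image.\nReport:"
    return prompt

print(build_report_prompt(["cardiomegaly", "pleural effusion"]))
```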
3.5.2 Visual Question Answering (VQA)
VQA is another important visual-language understanding task, where the model needs to com-
prehend images or videos and the posed question to provide a relevant and accurate response
[Ant+15]. The spectrum of questions encountered in VQA is broad, encompassing inquiries about
the presence of specific objects, their locations, or distinctive properties within the image. In the
medical context [Lin+23b], this may involve questions regarding the presence of medical conditions
or abnormalities, such as “What abnormality is seen in the image?” [Ion+21] or “Is there gastric
fullness?” [Lau+18]. Other queries may delve into details like the imaging method used [Aba+19],
the organ system involved [Lau+18], or the presence of specific anatomical structures [Liu+21a].
Questions in VQA fall into two categories. Open-ended questions elicit responses in the form of
phrases or sentences, fostering detailed and nuanced answers [Tha+23]. On the other hand, closed-
ended questions are designed to prompt limited responses, often with predetermined options, such
as a short list of multiple choices, a yes/no response, or a numeric rating [Baz+23]. The task of
VQA is commonly approached as either a classification task, a generation task, or both [Lin+23b].
In the classification approach, models select the correct answer from a predefined set, while in the
generation task, models produce free-form textual responses unconstrained by predefined options.
4 Medical VLMs
4.1 Medical Datasets for VLMs
The adaptation of VLMs to various medical tasks is achieved through their pre-training and fine-
tuning using specialized task-specific datasets. Below is the list of vision-language datasets available
in the public domain that contain medical image-text pairs or question-answer (QA) pairs. Most
of them are employed by medical VLMs described in Section 4.3 for pre-training, fine-tuning, and
evaluating VQA and report generation (RG) tasks. A comparative analysis of these datasets is presented in Table 1.
The last column in Table 1 provides a link to the source of the data on the web with the following
abbreviations: GH - GitHub, PN - PhysioNet, and HF - Hugging Face.
Table 1: A list of datasets used for developing medical VLMs.

Dataset | # image-text pairs | # QA pairs | Other components | Link
ROCO [Pel+18] | 81,825 | – | – | GH
MIMIC-CXR [Joh+19b] | 377,110 | – | – | PN
MIMIC-CXR-JPG [Joh+19a] | 377,110 | – | pathology labels | PN
MS-CXR [Boe+22] | 1,162 | – | bounding box annotations | PN
IU-Xray or Open-I [Dem+15] | 7,470 | – | labels | Site
PMC-OA [Lin+23a] | 1,650,000 | – | – | HF
VQA-RAD [Lau+18] | – | 3,515 | 315 radiology images | Site
PathVQA [He+20] | – | 32,799 | 4,998 pathology images | GH
VQA-Med 2019 [Aba+19] | – | 15,292 | 4,200 radiology images | GH
4.1.1 Radiology Objects in COntext (ROCO)
The ROCO dataset [Pel+18] contains 81,825 radiology images spanning several modalities, including computed tomography (CT), ultrasound, X-ray, fluoroscopy, positron emission tomography (PET), mammography, magnetic resonance imaging (MRI), angiography, and PET-CT. The out-of-class group has 6,127 images, including synthetic radiology images, clinical photos, portraits, compound radiology images, and digital art. Each image is accompanied by a corresponding caption, keywords, Unified Medical Language System (UMLS) semantic types (SemTypes), UMLS concept unique identifiers (CUIs), and a download link. To facilitate model training, the dataset is randomly split in an 80/10/10 ratio into a training set (65,460 radiology and 4,902 out-of-class images), a validation set (8,183 radiology and 612 out-of-class images), and a test set (8,182 radiology and 613 out-of-class images).
4.1.2 Medical Information Mart for Intensive Care - Chest X-Ray (MIMIC-CXR)
The MIMIC-CXR collection encompasses 377,110 chest X-rays paired with 227,835 associated free-
text radiology reports [Joh+19b]. The dataset is derived from de-identified radiographic studies
conducted at the Beth Israel Deaconess Medical Center in Boston, MA. Each imaging study within
the MIMIC-CXR dataset consists of one or more images, typically featuring lateral and from back-
to-front (posteroanterior, PA) views in Digital Imaging and Communications in Medicine (DICOM)
format.
4.1.3 MIMIC-CXR-JPG
MIMIC-CXR-JPG [Joh+19a] is a pre-processed variant of the MIMIC-CXR dataset [Joh+19b]. In
this version, the original 377,110 images are converted into compressed JPG format. The 227,827
reports associated with these images are enriched with labels for various common pathologies. The
labels are derived from the analysis of the impression, findings, or final sections of the radiology
reports, facilitated by the use of NegBio [Pen+17] and CheXpert (Chest eXpert) [Irv+19] tools.
4.1.4 MIMIC-NLE
MIMIC-NLE dataset is specifically designed for the task of generating natural language explanations
(NLEs) to justify predictions made on medical images, particularly in the context of thoracic
pathologies and chest X-ray findings [Kay+22]. The dataset consists of 38,003 image-NLE pairs or 44,935 image-diagnosis-NLE triplets, acknowledging instances where a single NLE may explain
multiple diagnoses. NLEs are extracted from MIMIC-CXR [Joh+19b] radiology reports. The
dataset exclusively considers X-ray views from front-to-back (anteroposterior, AP) and back-to-
front (posteroanterior, PA). All NLEs come with diagnosis and evidence (for a diagnosis) labels.
The dataset is split into the training set with 37,016 images, a test set with 273 images, and a
validation set with 714 images.
4.1.6 Indiana University chest X-rays (IU-Xray)
IU-Xray dataset, also known as the Open-I dataset, is accessible through the National Library
of Medicine’s Open-i service [Dem+15]. The dataset originates from two hospital systems within
the Indiana Network for Patient Care database. This dataset comprises 7,470 DICOM chest X-rays paired with 3,955 associated radiology reports. The reports typically include sections such as
indications, findings, and impressions, and they are manually annotated using MeSH and RadLex
(Radiology Lexicon) codes to represent clinical findings and diagnoses. Throughout this review, we
will refer to the dataset interchangeably as IU-Xray and Open-I, maintaining consistency with the
nomenclature used in related literature.
4.1.9 MS-CXR
MS-CXR dataset contains image bounding box labels paired with radiology findings, annotated
and verified by two board-certified radiologists [Boe+22]. The dataset consists of 1,162 image-text
pairs of bounding boxes and corresponding text descriptions. The annotations cover 8 different car-
diopulmonary radiological findings and are extracted from MIMIC-CXR [Joh+19b] and REFLACX
(Reports and Eye-tracking data For Localization of Abnormalities in Chest X-rays) [Big+22] (based
on MIMIC-CXR) datasets. The findings include atelectasis, cardiomegaly, consolidation, edema,
lung opacity, pleural effusion, pneumonia, and pneumothorax.
4.1.11 VQA-RAD
VQA-RAD dataset contains 104 head axial single-slice CTs or MRIs, 107 chest x-rays, and 104
abdominal axial CTs [Lau+18]. The images are meticulously chosen from MedPix, an open-access
online medical image database, ensuring each image corresponds to a unique patient. Furthermore,
every selected image has an associated caption and is deliberately devoid of any radiology markings.
Every caption provides details about the imaging plane, modality, and findings generated and
reviewed by expert radiologists. VQA-RAD also contains 3,515 QA pairs, with an average of 10 questions per image. Among them, 1,515 are free-form questions and answers, allowing for unrestricted inquiry; 733 pairs involve rephrased questions and answers, introducing linguistic diversity; and another 1,267 pairs are framed, featuring questions presented in a structured format, offering consistency and systematic evaluation. Additionally, QA pairs are split into 637
open-ended and 878 closed-ended types. Within the closed-ended group, a predominant focus is on
yes/no questions.
4.1.12 PathVQA
PathVQA is a dataset that encompasses 4,998 pathology images accompanied by a total of 32,799
QA pairs derived from these images [He+20]. The images are sourced from pathology books:
“Textbook of Pathology” and “Basic Pathology”, and the digital library “Pathology Education
Informational Resource”. Out of all QA pairs, 16,465 are of the open-ended type, while the
remaining pairs are of the closed-ended yes/no type. On average, each image is associated with 6.6
questions, which cover a broad spectrum of visual contents, encompassing aspects such as color,
location, appearance, shape, etc.
The VQA-Med 2020 dataset focuses on questions about abnormalities present in the images. Similar to VQA-Med 2019, the dataset also contains radiology images and
questions for the VQG task. The validation set comprises 85 images with 200 questions, and the
test set includes 100 images.
Bilingual Evaluation Understudy (BLEU) The BLEU score was originally designed for ma-
chine translation evaluation, but it has been adapted for RG and even VQA in a modified form.
BLEU provides a quantitative measure of how well the machine-generated text aligns with human-
generated reference text [Pap+02]. First, the precision of different n-grams, which are consecutive
sequences of n words, is calculated using the formula:
$$\text{Precision}(n) = \frac{\#\,\text{overlapping } n\text{-grams}}{\#\,\text{all } n\text{-grams in a model-generated text}}, \tag{1}$$
where ‘overlapping n-grams’ refer to n-grams in the model-generated text that share common
elements with at least one n-gram in the reference text. To ensure the precision score remains
robust and is not disproportionately affected by repeated n-grams in the model-generated text, a
modification known as clipping is often introduced. This process involves capping the count of each
n-gram in the model-generated text to a maximum count. This maximum count is determined by
the highest count observed in any single reference text for the same n-gram. The final BLEU-n
score is defined as:
$$\text{BLEU-}n = \text{BP} \times \exp\!\left(\frac{1}{n}\sum_{k=1}^{n}\log\left[\text{Precision}(k)\right]\right). \tag{2}$$
In eq. 2, BP is referred to as the brevity penalty and is calculated as:
$$\text{BP} = \begin{cases} 1 & \text{if } c \geq r \\ e^{(1-r/c)} & \text{if } c < r, \end{cases} \tag{3}$$
where c is the length of the model-generated text, and r is the length of the reference text. It is
common to use n = 4. The BLEU score ranges from 0 to 1, where a higher score suggests better
agreement with the reference text. The overall BLEU score of the model is the average of BLEU
scores for each pair of reports.
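A simplified BLEU-n implementation following eqs. 1-3 is sketched below (it clips n-gram counts against a single reference and omits the smoothing used by production toolkits such as sacreBLEU, so the numbers are illustrative only).

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu_n(candidate, reference, n=4):
    """candidate and reference are token lists; returns BLEU-n for one pair."""
    log_precisions = []
    for k in range(1, n + 1):
        cand_counts = Counter(ngrams(candidate, k))
        ref_counts = Counter(ngrams(reference, k))
        # Clipped overlap: each candidate n-gram counts at most as often as in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = max(sum(cand_counts.values()), 1)
        precision = overlap / total
        if precision == 0:
            return 0.0                      # no smoothing in this simplified version
        log_precisions.append(math.log(precision))
    c, r = len(candidate), len(reference)
    bp = 1.0 if c >= r else math.exp(1 - r / c)   # brevity penalty
    return bp * math.exp(sum(log_precisions) / n)

print(bleu_n("no acute cardiopulmonary process".split(),
             "no acute cardiopulmonary abnormality".split(), n=2))  # ~0.707
```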
Metric for Evaluation of Translation with Explicit ORdering (METEOR) The METEOR score combines unigram precision and recall with a penalty for fragmented matches:

$$\text{METEOR} = (1 - \text{Penalty}) \times \frac{10 \times P \times R}{R + 9 \times P}, \tag{6}$$

where

$$R = \frac{\#\,\text{overlapping 1-grams}}{\#\,\text{1-grams in a reference text}}, \tag{7}$$

$$P = \frac{\#\,\text{overlapping 1-grams}}{\#\,\text{1-grams in a model-generated text}}, \tag{8}$$

$$\text{Penalty} = \frac{1}{2} \times \left(\frac{\#\,\text{chunks}}{\#\,\text{overlapping 1-grams}}\right)^{3}, \tag{9}$$
and chunks are groups of adjacent 1-grams in the model-generated text that overlap with adjacent
1-grams in the reference text. The METEOR score ranges from 0 to 1, with higher scores indicating
better alignment between the model-generated text and the reference text. The overall METEOR
score of a model is the average of scores for each instance.
Perplexity Perplexity measures the average uncertainty of a model in predicting each word in a
text [Hao+20]. The formula for perplexity is defined as:
$$\text{Perplexity} = \exp\!\left(-\frac{1}{n}\sum_{k=1}^{n}\ln P(w_k \mid w_1, w_2, \ldots, w_{k-1})\right), \tag{10}$$
where n is the total number of words in the text. The value of the perplexity metric can range from
1 to +∞, and lower values signify a more accurate and confident model in capturing the language
patterns within the given text.
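Perplexity can be computed directly from the per-token probabilities that a language model assigns to a text, as in this small sketch of eq. 10 (the probabilities below are made-up numbers standing in for a model's outputs).

```python
import math

def perplexity(token_probs):
    """token_probs: list of P(w_k | w_1..w_{k-1}) produced by a language model."""
    n = len(token_probs)
    avg_neg_log_likelihood = -sum(math.log(p) for p in token_probs) / n
    return math.exp(avg_neg_log_likelihood)

# A confident model assigns high probabilities to the observed tokens -> low perplexity.
print(perplexity([0.9, 0.8, 0.95, 0.7]))    # ~1.20
print(perplexity([0.2, 0.1, 0.3, 0.25]))    # ~5.08
```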
BERTScore BERTScore metric was initially designed for evaluating models that use BERT
[Dev+19] embeddings [Zha+20b]. However, it can also leverage other word embeddings to evaluate
the similarity between model-generated and reference text. The BERTScore of a single text pair is
calculated according to the relationship:
$$\text{BERTScore} = \frac{2 \times P \times R}{P + R}, \tag{11}$$

where P (precision) is the average, over tokens in the model-generated text, of the maximum cosine similarity between each token's embedding and the embeddings of tokens in the reference text, and R (recall) is the analogous average taken over tokens in the reference text. The BERTScore of the model is the average of BERTScores across all text pairs.
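A sketch of the greedy-matching computation behind BERTScore, assuming token embeddings have already been produced by some encoder (random vectors stand in for them here):

```python
import numpy as np

def bertscore(cand_embeds, ref_embeds):
    """cand_embeds: (m, d), ref_embeds: (k, d) token embeddings from any encoder."""
    # Cosine similarity matrix between every candidate and reference token.
    c = cand_embeds / np.linalg.norm(cand_embeds, axis=1, keepdims=True)
    r = ref_embeds / np.linalg.norm(ref_embeds, axis=1, keepdims=True)
    sim = c @ r.T                                  # (m, k)
    precision = sim.max(axis=1).mean()             # best match for each candidate token
    recall = sim.max(axis=0).mean()                # best match for each reference token
    return 2 * precision * recall / (precision + recall)

rng = np.random.default_rng(0)
print(bertscore(rng.standard_normal((7, 64)), rng.standard_normal((9, 64))))
```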
RadGraph F1 RadGraph F1 is a novel metric that measures overlap in clinical entities and
relations extracted from radiology reports [Yu+23]. The RadGraph F1 score is computed in the
following way. First, the RadGraph model maps model-generated and reference reports into graph
representations with clinical entities represented as nodes and their relations as edges between
them. Second, the number of nodes that match between the two graphs based on clinical entity
text and labels (entity type) is determined. Third, the number of edges that match between the
two graphs based on their start and end entities and labels (relation type) is calculated. Lastly, the
F1 score is separately computed for clinical entities and relations, and then the RadGraph F1 score
for a report pair is the average of these two scores. The overall model performance is determined
by averaging RadGraph F1 scores across all report pairs.
Human evaluation Human evaluation plays a crucial role in assessing VLMs' quality in medical
RG. Human evaluation can be performed for RG in various ways. For instance, in [Jeo+23], expert
radiologists evaluate the performance of the X-REM model in the RG task as follows. Initially,
each report is segmented into lines, and radiologists assign scores to each line based on five error
categories. These scores reflect the severity of errors, with higher values indicating more severe
errors. Two metrics are utilized to obtain a comprehensive measure of the overall severity of errors
in a report. Maximum Error Severity (MES) represents the highest score across all lines in the
report. In contrast, Average Error Severity (AES) is calculated by averaging the scores across all
lines in the report. According to radiologists, 18% of model-generated reports received an MES
score of 0, while 24% received an AES score of 0.
Additional Evaluation Metrics for Report Generation The next few metrics are designed
for classification evaluation, and RG can be viewed as such a task. In [Moo+22], [Lee+23], and
[Pel+23], these metrics are computed based on the 14 labels obtained from applying the CheXpert
[Irv+19] or CheXbert [Smi+20] labeler to the reference reports as well as the model-generated
reports. In this context, reports bearing accurate diagnosis labels are categorized as positive, while
those with inaccurate labels are regarded as negative. The following metrics are also called clinical
efficacy metrics; a minimal sketch computing them follows the list below.
• Accuracy measures the ratio of correct predictions to the total number of predictions.
• Precision measures the accuracy of positive predictions made by a model. The precision
score is calculated by considering the ratio of true positive predictions to the total number of
instances that the model predicted as positive:
$$\text{Precision} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}}. \tag{12}$$
High precision indicates that the model has a low rate of false positives.
• Recall is a metric that assesses the ability of a model to predict all positive classes. Recall is
defined as the ratio of correctly predicted positive observations to the total actual positives:
$$\text{Recall} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}}. \tag{13}$$
The high recall means that the model effectively identifies most of the actual positive in-
stances.
• F1 Score assesses the overall model’s performance by balancing precision and recall into a
single value. The F1 score is defined as:
$$\text{F1} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}. \tag{14}$$
The F1 scores range from 0 to 1, with higher values indicating better performance. In multi-
class classification, it is common to compute the macro-F1 score by averaging the F1 scores
calculated independently for each class. This method ensures an unbiased evaluation of the
model’s performance across all classes, assigning equal importance to each class, irrespective
of its size or prevalence in the dataset.
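The promised minimal sketch of these clinical efficacy metrics is shown below; it assumes hypothetical 14-dimensional binary label vectors (such as those produced by a CheXbert-style labeler applied to reference and generated reports) and computes macro-averaged precision, recall, and F1 alongside label-level accuracy. Libraries such as scikit-learn are typically used in practice.

```python
import numpy as np

def clinical_efficacy(ref_labels, gen_labels):
    """ref_labels, gen_labels: (num_reports, 14) binary arrays of pathology labels."""
    tp = np.sum((gen_labels == 1) & (ref_labels == 1), axis=0).astype(float)
    fp = np.sum((gen_labels == 1) & (ref_labels == 0), axis=0).astype(float)
    fn = np.sum((gen_labels == 0) & (ref_labels == 1), axis=0).astype(float)
    precision = np.divide(tp, tp + fp, out=np.zeros_like(tp), where=(tp + fp) > 0)
    recall = np.divide(tp, tp + fn, out=np.zeros_like(tp), where=(tp + fn) > 0)
    f1 = np.divide(2 * precision * recall, precision + recall,
                   out=np.zeros_like(tp), where=(precision + recall) > 0)
    accuracy = np.mean(gen_labels == ref_labels)        # fraction of correct label predictions
    return accuracy, precision.mean(), recall.mean(), f1.mean()   # macro-averaged

rng = np.random.default_rng(0)
ref = rng.integers(0, 2, size=(100, 14))    # stand-in labels from reference reports
gen = rng.integers(0, 2, size=(100, 14))    # stand-in labels from generated reports
print(clinical_efficacy(ref, gen))
```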
Accuracy Accuracy is a fundamental metric for gauging overall model correctness in VQA eval-
uation. It is determined by calculating the proportion of correctly predicted answers to the total
number of questions. Sometimes, the average accuracy is computed by applying the model to
various testing datasets, providing a comprehensive assessment of its performance across diverse
scenarios. For a detailed comparison of accuracies among different medical VLMs discussed in
Section 4.3, refer to Table 3.
Exact Match The exact match metric computes the ratio of generated answers that match exactly
(excluding punctuation) the correct answer. However, this measure is rather strict, as it may not
give credit to valuable answers that, despite being semantically correct, diverge from an exact lexical
match with the correct answer. This metric is more suitable for evaluating answers to close-ended
questions than open-ended ones.
Human Evaluation Human evaluation is valuable for assessing a model’s performance and
applies not only to tasks such as VQA but also to RG. Human evaluation can be performed for
VQA in various ways. For instance, in [Moo+23], the human evaluation process of the Med-Flamingo
model employs an application featuring a user-friendly interface. Within this interface, medical
experts are empowered to evaluate each VQA problem individually, assigning scores ranging from
0 to 10. The final scores of the few-shot performance are 5.61 on VQA-RAD, 1.81 on PathVQA,
and 4.33 on the specifically curated Visual USMLE dataset. In contrast, the scores for zero-shot
performance are lower, with 3.82 on VQA-RAD, 1.72 on PathVQA, and 4.18 on Visual USMLE.
Table 2: A list of medical VLMs developed for VQA and RG.

Model | Stream | Decoder | Architecture | VQA | RG | Datasets | Code
MedViLL [Moo+22] | single | No | RN50 + BERT | + | + | MIMIC-CXR, Open-I, VQA-RAD | GH
PubMedCLIP [EMD23] | dual | No | ViT-B/32 or RN50 or RN50×4 + Transformer + BAN | + | – | ROCO, SLAKE, VQA-RAD | GH
RepsNet [TBF22] | dual | No | ResNeXt-101 + BERT + BAN + GPT-2 | + | + | VQA-RAD, IU-Xray | Site
BiomedCLIP [Zha+23b] | dual | No | ViT-B/16 + PubMedBERT + METER | + | – | PMC-15, SLAKE, VQA-RAD | HF
UniXGen [Lee+23] | single | No | VQGAN + Transformer | – | + | MIMIC-CXR | GH
RAMM [Yua+23] | dual | No | Swin Transformer + PubMedBERT + multimodal encoder w/ retrieval-atten. module | + | – | PMCPM, ROCO, MIMIC-CXR, SLAKE, VQA-RAD, VQA-Med 2019, VQA-Med 2021 | GH
X-REM [Jeo+23] | dual | No | ALBEF (ViT-B/16 + BERT + multimodal encoder) | – | + | MIMIC-CXR, MedNLI, RadNLI | GH
Visual Med-Alpaca [Shu+23] | single | No | DePlot or Med-GIT + prompt manager + LLaMa-7B + GPT-3.5-Turbo | + | – | ROCO; MedDialog, MEDIQA QA, MEDIQA RQE, MedQA, PubMedQA | GH
CXR-RePaiR-Gen [Ran+23b] | dual | No | ALBEF + FAISS retriever + prompt manager + text-davinci-003 or GPT-3.5-Turbo or GPT-4 | – | + | CXR-PRO, MS-CXR | –
LLaVa-Med [Li+23a] | single | No | ViT-L/14 + projection layer + LLaMa-7B | + | – | PMC-15 + GPT-4, VQA-RAD, SLAKE, PathVQA | GH
XrayGPT [Tha+23] | single | No | MedCLIP + linear transformation layer + Vicuna-7B | + | + | MIMIC-CXR, Open-I | GH
CAT-ViL DeiT [BIR23] | single | No | RN18 + tokenizer + CAT-ViL fusion module + DeiT | + | – | EndoVis 2017, EndoVis 2018 | GH
MUMC [Li+23b] | dual | Yes | ViT-B/12 + BERT + multimodal encoder + answer decoder | + | – | ROCO, MedICaT, ImageCLEF Caption, VQA-RAD, SLAKE, PathVQA | GH
Med-Flamingo [Moo+23] | single | No | ViT-L/14 + perceiver resampler + LLaMa-7B | + | – | MTB, PMC-OA, VQA-RAD, PathVQA, Visual USMLE | GH
RaDialog [Pel+23] | single | No | BioViL-T + BERT + prompt manager + Vicuna-7B | + | + | MIMIC-CXR, Instruct | GH
MedViLL achieves a BLEU-4 score of 0.049, a perplexity value of 5.637, an accuracy of 73.4%, a precision
value of 0.512, a recall value of 0.594, and an F1 score of 0.550 on Open-I.
4.3.2 PubMedCLIP
PubMedCLIP [EMD23] is a CLIP-based [Rad+21] model pre-trained on ROCO [Pel+18] dataset,
consisting of over 80K image-caption pairs sourced from PMC articles. The model utilizes a CLIP
text encoder, which is based on the Transformer [Vas+17] architecture, and three distinct CLIP
visual encoders: ViT-B/32 [Dos+21], ResNet-50, and ResNet-50×4 [He+16]. Following the con-
trastive learning approach in CLIP, the model generates joint representations by computing cosine
similarity between textual and visual features. The pre-training objective involves the computa-
tion of cross-entropy loss values for both vision and language. These losses are then averaged to
derive an overall loss value. Following pre-training, the model is repurposed as a pre-trained visual
encoder for VQA. The visual feature in VQA is the concatenation of the model’s output with a
convolutional denoising autoencoder (CDAE) [Mas+11] output, an image denoising module. The
question is encoded using a GloVe [PSM14] word embedding followed by an LSTM [HS97]. The im-
age and question features are combined using bilinear attention networks (BAN) [KJZ18], and the
resulting representations are passed through an answer classifier, which is a two-layer feedforward
NN. The VQA loss is determined by combining the classification and image reconstruction losses.
During the VQA fine-tuning, the SLAKE (English) [Liu+21a] and VQA-RAD [Lau+18] datasets,
comprising both open- and close-ended questions, are employed. The model’s effectiveness is evalu-
ated in the context of two existing Medical VQA (MedVQA) methods: Mixture of Enhanced Visual
Features (MEVF) [Zha+20a] and question-conditioned reasoning (QCR) [Liu+23a]. The assess-
ment involved replacing the visual encoder component in MEVF and QCR with PubMedCLIP and
subsequently evaluating the model’s performance. PubMedCLIP in the QCR framework achieves
better accuracies on VQA-RAD and SLAKE datasets than in the MEVF framework. The highest
accuracies of PubMedCLIP in the QCR framework on both datasets are shown in Table 3.
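The symmetric image-text contrastive objective used in CLIP-style pre-training (cross-entropy computed in both the image-to-text and text-to-image directions and then averaged) can be sketched as follows; the random embeddings are stand-ins, not the actual PubMedCLIP encoder outputs.

```python
import torch
import torch.nn.functional as F

def clip_style_loss(image_embeds, text_embeds, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired image/text embeddings."""
    image_embeds = F.normalize(image_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)
    logits = image_embeds @ text_embeds.t() / temperature   # cosine similarities
    targets = torch.arange(logits.size(0))                  # matching pairs lie on the diagonal
    loss_i2t = F.cross_entropy(logits, targets)             # vision (image -> text) loss
    loss_t2i = F.cross_entropy(logits.t(), targets)         # language (text -> image) loss
    return (loss_i2t + loss_t2i) / 2                        # averaged overall loss

batch = 16
loss = clip_style_loss(torch.randn(batch, 512), torch.randn(batch, 512))
```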
4.3.3 RepsNet
RepsNet [TBF22] is designed for VQA tasks and can also generate automated medical reports by interpreting medical images. The model employs a modified version of the pre-trained ResNeXt-101 [Xie+16] as its
image encoder and utilizes pre-trained BERT [Dev+19] as the text encoder, with text tokenization
done through WordPiece [Wu+16]. Fusion of image and question features is achieved using BAN
[KJZ18]. To align images with textual descriptions, the model employs bidirectional contrastive
learning [Che+20a]. For VQA tasks, the model is fine-tuned and evaluated on VQA-RAD [Lau+18]
(see Table 3). In contrast, for RG, fine-tuning and evaluation are done using IU-Xray [Dem+15]
dataset. The model categorizes answers through classification for close-ended questions and gener-
ates answers using the modified version of GPT-2 language decoder based on image features and
prior context. The BLEU-2 and BLEU-4 scores of RepsNet on the IU-Xray dataset are 0.44 and
0.27, respectively.
4.3.4 BiomedCLIP
BiomedCLIP is pre-trained on the specifically curated PMC-15 dataset that consists of 15 M
figure-caption pairs derived from the PMC articles [Zha+23b]. However, this dataset is not publicly
available. The model architecture is similar to CLIP [Rad+21], except that the text encoder is a
pre-trained PubMedBERT [Gu+21] model with WordPiece tokenizer [Wu+16]. The model uses
ViT-B/16 [Dos+21] as the visual data encoder. During pre-training, the model adopts a contrastive
learning approach, and to mitigate memory usage, it utilizes the sharding contrastive loss [Che+22].
For adaptation to VQA, the model incorporates the METER [Dou+22] framework. This involves
deploying a Transformer-based co-attention multimodal fusion module that produces cross-modal
representations. These representations are then fed into a classifier for the final prediction of
answers. The model is evaluated on VQA-RAD [Lau+18] and SLAKE (English) [Liu+21a] datasets
(see Table 3).
4.3.7 Contrastive X-Ray REport Match (X-REM)
X-REM is a retrieval-based radiology RG model that uses an ITM score to measure the similarity
of a chest X-ray image and radiology report for report retrieval [Jeo+23]. The VLM backbone
of the model is ALBEF [Li+21]. ALBEF utilizes ViT-B/16 [Dos+21] as its image encoder and
initializes the text encoder with the first 6 layers of the BERT [Dev+19] base model. The multi-
modal encoder in ALBEF, responsible for combining visual and textual features to generate ITM
scores, is initialized using the final six layers of the BERT base model. X-REM leverages ALBEF’s
pre-trained weights and performs further pre-training on X-rays paired with extracted impression
sections (2,192 pairs), findings sections (1,597 pairs), or both (2,192 pairs) from the MIMIC-CXR
[Joh+19b] dataset. Subsequently, the model is fine-tuned on the ITM task, where the scoring mech-
anism involves using the logit value for the positive class as the similarity score for image-text pairs.
To address the positive skewness in medical datasets, 14 clinical labels obtained from the CheXbert
[Smi+20] labeler are utilized. The model efficiently manages the computational burden associated
with ITM scores by employing ALBEF’s pre-aligned unimodal embeddings. This involves narrowing
down the candidate reports based on high cosine similarity with the input image before computing
ITM scores. Additionally, the text encoder undergoes fine-tuning on natural language inference
(NLI) task, utilizing datasets such as MedNLI [RS18] and RadNLI [Miu+21]. This step is crucial
for preventing the retrieval of multiple reports with overlapping or conflicting information. X-REM
achieves a BLEU-2 score of 0.186 on the MIMIC-CXR (Findings only) dataset. The BERTScore
of the model is 0.386 on MIMIC-CXR (Findings only) and is 0.287 on MIMIC-CXR (Impressions
and Findings). The human evaluation of X-REM is described in Section 4.2.
In CXR-RePaiR-Gen [Ran+23b], the reports or sentences in the corpus with the highest dot-product similarity to the image embedding are
retrieved. The CXR-PRO [RCR22] dataset is employed for text retrieval to gather relevant impres-
sions for generating the radiology report. The retrieved impression sections from the CXR-PRO
dataset serve as the context for the prompt to an LLM, along with instructions to generate the
radiology report. Two distinct prompts are employed for generating free-text reports: one for the
text-davinci-003 model and another for RG in a conversational setting with the GPT-3.5-Turbo
and GPT-4 models. The model is evaluated on MS-CXR [Boe+22] and CXR-PRO datasets. There
is no code provided for this model yet. CXR-RePaiR-Gen reaches a BERTScore of 0.2865 on the CXR-PRO dataset when based on GPT-4. Additionally, CXR-RePaiR-Gen achieves a BERTScore of 0.1970 on MS-CXR when based on text-davinci-003. The model attains a RadGraph F1 score of
0.1061 on the CXR-PRO dataset when based on GPT-4 and 0.0617 on the MS-CXR dataset when
it is based on text-davinci-003. In these instances, the CXR-RePaiR-Gen utilizes three retrieval
samples per input during the RAG process.
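A rough sketch of the retrieval-augmented prompting pattern described above, using FAISS for dot-product retrieval and a hypothetical generate() call standing in for the LLM API; the corpus, embeddings, and prompt wording are illustrative, not taken from CXR-RePaiR-Gen.

```python
import numpy as np
import faiss  # assumes faiss-cpu is installed

d = 512
corpus = ["Mild cardiomegaly without pleural effusion.",
          "No acute cardiopulmonary process.",
          "Right lower lobe opacity concerning for pneumonia."]
corpus_embeds = np.random.rand(len(corpus), d).astype("float32")   # stand-in text embeddings

index = faiss.IndexFlatIP(d)          # inner-product (dot-product) index
index.add(corpus_embeds)

image_embed = np.random.rand(1, d).astype("float32")   # stand-in image embedding
_, idx = index.search(image_embed, 3)                  # top-3 retrieved impressions
retrieved = [corpus[i] for i in idx[0]]

prompt = ("Context impressions from similar studies:\n- "
          + "\n- ".join(retrieved)
          + "\n\nUsing the context above, write the impression section of a "
            "radiology report for the new chest X-ray.")
# report = generate(prompt)   # hypothetical call to an LLM such as GPT-3.5-Turbo
print(prompt)
```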
4.3.11 XrayGPT
XrayGPT is a conversational medical VLM specifically developed for analyzing chest radiographs
[Tha+23]. The VLM uses MedCLIP [Wan+22b] as a vision encoder to generate visual features.
These features undergo a meticulous transformation process: initially, they are mapped to a lower-
dimensional space through a linear projection head and subsequently translated into tokens via a
linear transformation layer. At its core, the model incorporates two text queries: (1) the assistant
query plays a role in contextualizing the model’s behavior and defining its purpose as “You are a
helpful healthcare virtual assistant”, (2) the doctor’s query serves as a prompt that guides the model
in providing information relevant to chest X-ray analysis. Tokens generated from a visual input are
concatenated with the tokenized queries and then fed into the medical LLM, which generates the
summary of the chest x-ray. The LLM employed in this architecture is Vicuna-7B [Chi+23], fine-
tuned on a rich dataset consisting of 100,000 real conversations between patients and doctors, along with 20,000 radiology conversations sourced from ShareGPT.com. During training, the weights of
both the vision encoder and the LLM remain frozen while the weights in the linear transformation
layer undergo updates. The model is first trained on 213,514 image-text pairs from the pre-processed MIMIC-CXR [Joh+19b] dataset and then on 3,000 image-text pairs from the Open-I [Dem+15] dataset.
XrayGPT achieves ROUGE-1 = 0.3213, ROUGE-2 = 0.0912, and ROUGE-L = 0.1997 on MIMIC-
CXR dataset.
4.3.13 Masked image and text modeling with Unimodal and Multimodal Contrastive
losses (MUMC)
MUMC utilizes a ViT-B/12 [Dos+21] as its image encoder, the first 6 layers of BERT [Dev+19]
as its text encoder, and the last 6 layers of BERT as its multimodal encoder [Li+23b]. The
multimodal encoder incorporates cross-attention layers to align visual and textual features. For
pre-training, the model employs a combination of contrastive learning, MLM, and ITM objectives.
Also, the model utilizes a newly introduced masked image strategy, randomly masking 25% of image
patches as a data augmentation technique. This exposes the model to a greater variety of visual
contexts and enables learning representations that are more robust to partially occluded inputs.
The pre-training is performed on the ROCO [Pel+18], MedICaT [Sub+20], and Image Retrieval
in Cross-Language Evaluation Forum (ImageCLEF) caption [Rüc+22] datasets. For downstream
VQA tasks, an answering decoder is added on top of the multimodal encoder to generate answer
text tokens. The encoder weights are initialized from pre-training, and the model is fine-tuned and
evaluated on VQA-RAD [Lau+18], SLAKE [Liu+21a], and PathVQA [He+20] (see Table 3).
4.3.14 Med-Flamingo
Med-Flamingo is a multimodal few-shot learner model based on the Flamingo [Ala+22] architec-
ture, adapted to the medical domain [Moo+23]. The model is pre-trained on the MTB [Moo+23]
dataset, a newly curated collection comprising 4,721 segments from various Medical TextBooks,
encompassing both textual content and images. Each segment is designed to contain at least one
image and up to 10 images, with a specified maximum length. Also, it is pre-trained on 1.3 M
image-caption pairs from the PMC-OA [Lin+23a] dataset. The model's few-shot capabilities are
achieved through training on these mixed text and image datasets, enabling it to generalize and
perform diverse multimodal tasks with only a few examples. The model utilizes a pre-trained frozen
CLIP vision encoder ViT-L/14 for visual feature generation. To convert these visual features into
a fixed number of tokens, the model employs a module known as the perceiver resampler, which
is trained from scratch. Subsequently, these tokens, along with tokenized text inputs, undergo
further processing in a pre-trained frozen LLM LLaMA-7B [Tou+23b], enhanced with strategically
inserted gated cross-attention layers that are also trained from scratch. This augmentation
facilitates the learning of new cross-modal relationships and improves training stability. The
model is evaluated on the established medical VQA benchmarks VQA-RAD [Lau+18] and PathVQA
[He+20]. Med-Flamingo achieves few-shot exact match scores of 0.200 on VQA-RAD and 0.303 on
PathVQA. In contrast, its zero-shot performance yields exact match scores of 0.000 on VQA-
RAD and 0.120 on PathVQA. Additionally, it is evaluated on a specifically created Visual United
States Medical Licensing Examination (USMLE) dataset, comprising 618 challenging open-ended
USMLE-style questions augmented with images, case vignettes, and tables of laboratory measure-
ments, covering a diverse range of medical specialties. The human evaluation of the Med-Flamingo
model on VQA-RAD, PathVQA, and Visual USMLE datasets is described in Section 4.2.
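For intuition, here is a minimal PyTorch sketch of a tanh-gated cross-attention block of the kind
inserted between frozen LLM layers (simplified; the layer structure and names are illustrative
rather than the released Med-Flamingo code). The zero-initialized gates make each inserted block
an identity mapping at the start of training, which preserves the frozen LLM's behavior and helps
stability.

import torch
import torch.nn as nn

class GatedCrossAttentionBlock(nn.Module):
    """Flamingo-style gated cross-attention: text tokens attend to visual tokens
    produced by the perceiver resampler, modulated by learnable tanh gates."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffw = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.attn_gate = nn.Parameter(torch.zeros(1))  # tanh(0) = 0, so the block starts as identity
        self.ffw_gate = nn.Parameter(torch.zeros(1))

    def forward(self, text_tokens: torch.Tensor, visual_tokens: torch.Tensor) -> torch.Tensor:
        # text_tokens: (B, T_text, dim); visual_tokens: (B, T_vis, dim)
        attn_out, _ = self.attn(query=text_tokens, key=visual_tokens, value=visual_tokens)
        x = text_tokens + torch.tanh(self.attn_gate) * attn_out
        x = x + torch.tanh(self.ffw_gate) * self.ffw(x)
        return x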
4.3.15 RaDialog
RaDialog is a VLM that integrates automated radiology RG with conversational assistance [Pel+23].
The model incorporates BioViL-T [Ban+23], a hybrid vision encoder that combines ResNet-50
[He+16] and Transformer [Vas+17] architectures and is pre-trained on radiology images and reports;
it generates patch-wise visual features. The extracted features are aligned through a BERT [Dev+19]
model, which condenses them into a representation of 32 tokens. The model also incorporates a
CheXpert classifier that provides structured findings for each image; this classifier is trained
independently on labels predicted by the CheXbert [Smi+20] model from the findings sections of
radiology reports. The model integrates the visual features, the structured findings, and the
directive “Write a radiology report” into a single prompt, which serves as input to the LLM, a
Vicuna-7B [Chi+23] model fine-tuned using LoRA [Hu+22]. The training is performed on X-ray
image-report pairs from the MIMIC-CXR [Joh+19b] dataset. RaDialog achieves a BLEU-4 score of
0.095, a ROUGE-L score of 0.271, a METEOR score of 0.14, and a BERTScore of 0.400 on the
MIMIC-CXR dataset. To address catastrophic forgetting during training and to ensure the model’s
capability across diverse downstream tasks, it is additionally trained on the newly created
Instruct [Pel+23] dataset. This dataset is curated to encompass a spectrum
Table 3: The comparison of medical VLMs’ accuracies on VQA tasks. The underlined accuracies
are the highest for a specific dataset.
MedViLL [Moo+22]: – – 59.50% 77.70% – – – –
PubMedCLIP [EMD23]: 78.40% 82.50% 60.10% 80.00% – – – –
RepsNet [TBF22]: – – – 87.05% – – – –
BioMedCLIP [Zha+23b]: 82.50% 89.70% 67.60% 79.80% – – – –
RAMM [Yua+23]: 82.48% 91.59% 67.60% 85.29% – – 82.13% 39.20%
LLaVA-Med [Li+23a]: – 84.19% – 85.34% – 91.21% – –
MUMC [Li+23b]: – – 71.50% 84.20% 39.00% 90.40% – –
of 8 diverse tasks: RG, NLE, complete CheXpert QA, binary CheXpert QA, region QA, summarization,
report correction, and report reformulation in simple language. Each task is accompanied by
carefully formulated prompts tailored to elicit specific responses from the model; for instance,
some prompts ask questions about particular X-ray regions. RaDialog trained on the Instruct dataset
achieves an F1 score of 0.397 on the binary CheXpert QA task and 0.403 on the complete CheXpert
QA task. In contrast, RaDialog not trained on Instruct achieves lower F1 scores of 0.018 and 0.098,
respectively.
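To illustrate how these components come together, the following is a minimal sketch of assembling
a single LLM prompt from aligned visual tokens, structured classifier findings, and a task
instruction (the function and placeholder names are hypothetical and do not reproduce RaDialog's
exact prompt format):

def build_radiology_prompt(visual_token_placeholder: str,
                           structured_findings: list[str],
                           instruction: str = "Write a radiology report.") -> str:
    """Combine image tokens, structured findings, and an instruction into one prompt.
    In a RaDialog-style pipeline, the placeholder would be replaced by the 32 aligned
    visual tokens before the prompt is fed to the LoRA-fine-tuned LLM."""
    findings_text = ", ".join(structured_findings) if structured_findings else "no positive findings"
    return (f"Image: {visual_token_placeholder}\n"
            f"Predicted findings: {findings_text}\n"
            f"{instruction}")

# Example with illustrative findings from a CheXpert-style classifier.
print(build_radiology_prompt("<IMG_TOKENS>", ["cardiomegaly", "pleural effusion"]))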
In the federated learning method, models are trained across multiple institutions, and only model
weights are shared, not the data. This approach effectively addresses major concerns about patient
privacy while enabling collaborative model training across diverse datasets.
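As a concrete illustration, the sketch below performs an unweighted FedAvg-style aggregation of
the model weights returned by participating institutions (a simplification: practical federated
learning typically weights each site by its local dataset size and adds secure aggregation), without
any patient data leaving a site:

from typing import Dict, List
import torch

def federated_average(client_states: List[Dict[str, torch.Tensor]]) -> Dict[str, torch.Tensor]:
    """Average the state dicts shared by each institution; only weights are exchanged."""
    avg_state: Dict[str, torch.Tensor] = {}
    for key in client_states[0]:
        # Assumes floating-point parameters; integer buffers would need special handling.
        avg_state[key] = torch.stack([state[key].float() for state in client_states]).mean(dim=0)
    return avg_state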
Traditional metrics may fall short in capturing the nuanced complexities of clinical language,
posing a barrier to reliable evaluations of VLM performance [Yu+23]. This issue becomes partic-
ularly evident when evaluating the accuracy of medical reports or addressing open-ended medical
queries, where metrics need to discern clinically relevant distinctions. Therefore, the development
and adoption of specialized metrics tailored for medical RG and VQA are imperative. Such metrics
are pivotal not only for evaluating model performance but also for assessing aspects like generaliza-
tion, efficiency, and robustness. Establishing these metrics will significantly contribute to fostering
precise evaluations and continual advancements in the capabilities of medical VLMs.
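As a simple illustration of the problem, a toy unigram-overlap score (a stand-in for BLEU- or
ROUGE-style metrics) still assigns a high score to a generated report that reverses a clinically
critical negation:

def unigram_f1(candidate: str, reference: str) -> float:
    """Toy word-overlap F1, used only to illustrate the limits of surface-level metrics."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    overlap = sum(min(cand.count(w), ref.count(w)) for w in set(cand))
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(cand), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

reference = "there is no evidence of pneumothorax"
print(unigram_f1("there is no evidence of pneumothorax", reference))  # 1.00
print(unigram_f1("there is evidence of pneumothorax", reference))     # ~0.91, despite the opposite meaning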
The issue of hallucinations in generative VLMs poses a significant challenge to their reliability
and practical application [Liu+24]. Hallucinations refer to instances where VLMs generate outputs
that are not grounded in the provided images or are inconsistent with established knowledge.
In medical contexts, these hallucinations can have serious consequences, leading to inaccurate
diagnostic information or treatment recommendations. One identified root cause of hallucinations
is the lack of alignment between visual and textual information [Sun+23]. Training VLMs to
effectively align these data modalities is crucial in mitigating the risk of hallucinations. For instance,
LLaVA-RLHF [Sun+23] reduced hallucinations by incorporating RLHF to align the different
modalities. Further research is needed into building medical VLMs that ground their generation in
factual medical knowledge with minimal hallucination.
Overcoming catastrophic forgetting poses an additional challenge in the development of medical
VLMs. Catastrophic forgetting occurs when a model learning new information inadvertently erases
or distorts previously acquired knowledge, potentially compromising its overall competence. Striking
a balance during fine-tuning is crucial: moderate fine-tuning helps adapt the model to a specific
task, while excessive fine-tuning can lead to catastrophic forgetting [Zha+23a;
KBR23]. Leveraging methodologies from continual learning [Wan+23a; Zho+23b; CR24; KBR23;
KBR24] might be useful in the context of medical VLMs, where the ability to adapt and accumulate
knowledge across diverse clinical tasks is paramount. Continual learning focuses on training models
to sequentially learn from and adapt to new data over time while retaining knowledge from previ-
ously encountered tasks [KBR24]. Also, incorporating adapters within the framework of continual
learning can be a valuable tool in mitigating catastrophic forgetting [Zha+23c].
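As one concrete example of this adapter-based strategy, the sketch below shows a residual
bottleneck adapter (a generic design, not tied to any specific medical VLM) that can be inserted
into a frozen backbone so that new tasks are learned through a small number of added parameters
rather than by overwriting pre-trained weights:

import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Residual bottleneck adapter: down-project, non-linearity, up-project, skip connection."""

    def __init__(self, dim: int, bottleneck_dim: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck_dim)
        self.up = nn.Linear(bottleneck_dim, dim)
        self.act = nn.GELU()
        nn.init.zeros_(self.up.weight)  # start as a near-identity module
        nn.init.zeros_(self.up.bias)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        return hidden_states + self.up(self.act(self.down(hidden_states)))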
Finally, clinical validation and adoption of VLMs necessitate a collaborative bridge between
medical experts and AI/ML researchers. Trust, alignment with clinical needs, and ethical deploy-
ment are critical components for successfully integrating these models into healthcare workflows.
Establishing robust collaborations ensures a dynamic synergy, combining domain expertise with
technological advancements. This synergy is essential for the responsible and effective deployment
of medical VLMs in healthcare.
Acknowledgements
This work was partly supported by NSF awards 1903466, 2234836, and 2234468.
References
[Aba+20] Asma Ben Abacha et al. “Overview of the VQA-Med Task at ImageCLEF 2020:
Visual Question Answering and Generation in the Medical Domain”. In: CLEF 2020
Working Notes. CEUR Workshop Proceedings. 2020.
[Aba+19] Asma Ben Abacha et al. “VQA-Med: Overview of the Medical Visual Question An-
swering Task at ImageCLEF 2019”. In: Conference and Labs of the Evaluation Forum.
2019.
[Aco+22] Julián N Acosta et al. “Multimodal Biomedical AI”. In: Nature Medicine 28.9 (2022),
pp. 1773–1784. doi: 10.1038/s41591-022-01981-2.
[Ahm+23] Sabeen Ahmed et al. “Transformers in time-series analysis: A tutorial”. In: Circuits,
Systems, and Signal Processing 42.12 (2023), pp. 7433–7466.
[Ala+22] Jean-Baptiste Alayrac et al. “Flamingo: A Visual Language Model for Few-Shot
Learning”. In: Advances in Neural Information Processing Systems. Vol. 35. 2022,
pp. 23716–23736.
[All+19] Max Allan et al. 2017 Robotic Instrument Segmentation Challenge. 2019. arXiv: 1902.
06426.
[All+20] Max Allan et al. 2018 Robotic Scene Segmentation Challenge. 2020. arXiv: 2001.11190.
[Ant+15] Stanislaw Antol et al. “VQA: Visual Question Answering”. In: IEEE International
Conference on Computer Vision (ICCV). 2015, pp. 2425–2433. doi: 10.1109/ICCV.
2015.279.
[Bai+23] Jinze Bai et al. Qwen-VL: A Versatile Vision-Language Model for Understanding,
Localization, Text Reading, and Beyond. 2023. arXiv: 2308.12966.
[BIR23] Long Bai, Mobarakol Islam, and Hongliang Ren. “CAT-ViL: Co-attention Gated
Vision-Language Embedding for Visual Question Localized-Answering in Robotic
Surgery”. In: Medical Image Computing and Computer Assisted Intervention – MIC-
CAI. 2023, pp. 397–407. doi: 10.1007/978-3-031-43996-4_38.
[Baj+21] Junaid Bajwa et al. “Artificial Intelligence in Healthcare: Transforming the Practice
of Medicine”. In: Future Healthcare Journal 8.2 (2021), e188–e194. doi: 10.7861/
fhj.2021-0095.
[Bal21] Pierre Baldi. Deep Learning in Science. Cambridge University Press, 2021. doi: 10.
1017/9781108955652.
[BL05] Satanjeev Banerjee and Alon Lavie. “METEOR: An Automatic Metric for MT Eval-
uation with Improved Correlation with Human Judgments”. In: ACL Workshop on
Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Sum-
marization. 2005, pp. 65–72.
[Ban+23] Shruthi Bannur et al. Learning to Exploit Temporal Structure for Biomedical Vision-
Language Processing. 2023. arXiv: 2301.04558.
[Baz+23] Yakoub Bazi et al. “Vision–Language Model for Visual Question Answering in Medical
Imagery”. In: Bioengineering 10.3 (2023), p. 380. doi: 10.3390/bioengineering10030380.
[Bea+20] Andrew Beam et al. “Clinical Concept Embeddings Learned from Massive Sources
of Multimodal Medical Data”. In: Pacific Symposium on Biocomputing 25 (2020),
pp. 295–306. doi: 10.1142/9789811215636_0027.
[BSD19] Asma Ben Abacha, Chaitanya Shivade, and Dina Demner-Fushman. “Overview of the
MEDIQA 2019 Shared Task on Textual Inference, Question Entailment and Question
Answering”. In: BioNLP Workshop and Shared Task. 2019, pp. 370–379.
[Big+22] Ricardo Bigolin Lanfredi et al. “REFLACX, a Dataset of Reports and Eye-tracking
Data for Localization of Abnormalities in Chest X-rays”. In: Scientific Data 9.1
(2022). doi: 10.1038/s41597-022-01441-z.
[Boe+22] Benedikt Boecking et al. “Making the Most of Text Semantics to Improve Biomedical
Vision–Language Processing”. In: Computer Vision – ECCV. 2022, pp. 1–21. doi:
10.1007/978-3-031-20059-5_1.
[Boe+21] Kevin Michael Boehm et al. “Harnessing Multimodal Data Integration to Advance
Precision Oncology”. In: Nature Reviews Cancer 22 (2021), pp. 114–126. doi: 10.
1038/s41568-021-00408-3.
[Boj+17] Piotr Bojanowski et al. “Enriching Word Vectors with Subword Information”. In:
Transactions of the Association for Computational Linguistics 5 (2017), pp. 135–146.
issn: 2307-387X. doi: 10.1162/tacl_a_00051.
[Bom+22] Rishi Bommasani et al. On the Opportunities and Risks of Foundation Models. 2022.
arXiv: 2108.07258 [cs.LG].
[Bro+20] Tom B. Brown et al. “Language Models are Few-Shot Learners”. In: Advances in
Neural Information Processing Systems. Vol. 33. 2020, pp. 1877–1901.
[CR24] Yuliang Cai and Mohammad Rostami. Dynamic Transformer Architecture for Con-
tinual Learning of Multimodal Tasks. 2024. arXiv: 2401.15275.
[Car+20] Nicolas Carion et al. “End-to-End Object Detection with Transformers”. In: European
conference on computer vision. 2020, pp. 213–229. doi: 10.1007/978-3-030-58452-
8_13.
[Che+23] Feilong Chen et al. “VLP: A Survey on Vision-Language Pre-Training”. In: Machine
Intelligence Research 20 (2023), pp. 38–56. doi: 10.1007/s11633-022-1369-5.
[Che+20a] Ting Chen et al. A Simple Framework for Contrastive Learning of Visual Represen-
tations. 2020. arXiv: 2002.05709.
[Che+20b] Yen-Chun Chen et al. “UNITER: UNiversal Image-TExt Representation Learning”.
In: European Conference on Computer Vision. 2020, pp. 104–120. doi: 10.1007/978-
3-030-58577-8_7.
[Che+22] Mehdi Cherti et al. Reproducible Scaling Laws for Contrastive Language-Image Learn-
ing. 2022. arXiv: 2212.07143.
[Chi+23] Wei-Lin Chiang et al. Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%*
ChatGPT Quality. 2023. url: https://lmsys.org/blog/2023-03-30-vicuna/.
[Cho+21] Jaemin Cho et al. “Unifying Vision-and-Language Tasks via Text Generation”. In:
International Conference on Machine Learning. Vol. 139. 2021, pp. 1931–1942.
[Cho+14] Kyunghyun Cho et al. “Learning Phrase Representations using RNN Encoder–Decoder
for Statistical Machine Translation”. In: Conference on Empirical Methods in Natural
Language Processing. 2014, pp. 1724–1734. doi: 10.3115/v1/D14-1179.
[Cho+22] Aakanksha Chowdhery et al. “PaLM: Scaling Language Modeling with Pathways”.
In: Journal of Machine Learning Research 24.240 (2022), pp. 1–113.
[Cor+20] Antonio Coronato et al. “Reinforcement Learning for Intelligent Healthcare Applica-
tions: A Survey”. In: Artificial Intelligence in Medicine 109 (2020), p. 101964. issn:
0933-3657. doi: 10.1016/j.artmed.2020.101964.
[Dai+23] Wenliang Dai et al. InstructBLIP: Towards General-purpose Vision-Language Models
with Instruction Tuning. 2023. arXiv: 2305.06500.
[Dao23] Tri Dao. FlashAttention-2: Faster Attention with Better Parallelism and Work Par-
titioning. 2023. arXiv: 2307.08691.
[Dao+22] Tri Dao et al. “FlashAttention: Fast and Memory-Efficient Exact Attention with IO-
Awareness”. In: Advances in Neural Information Processing Systems. 2022.
[Dem+15] Dina Demner-Fushman et al. “Preparing a Collection of Radiology Examinations
for Distribution and Retrieval”. In: Journal of the American Medical Informatics
Association (JAMIA) 23.2 (2015), pp. 304–310. doi: 10.1093/jamia/ocv080.
[Den+09] Jia Deng et al. “ImageNet: A Large-Scale Hierarchical Image Database”. In: 2009
IEEE Conference on Computer Vision and Pattern Recognition. 2009, pp. 248–255.
doi: 10.1109/CVPR.2009.5206848.
[Dev+19] Jacob Devlin et al. “BERT: Pre-Training of Deep Bidirectional Transformers for Lan-
guage Understanding”. In: Conference of the North American Chapter of the Associ-
ation for Computational Linguistics. Vol. 1. 2019, pp. 4171–4186. doi: 10.18653/v1/
N19-1423.
[Dos+21] Alexey Dosovitskiy et al. “An Image is Worth 16x16 Words: Transformers for Im-
age Recognition at Scale”. In: International Conference on Learning Representations.
2021.
[Dou+22] Zi-Yi Dou et al. “An Empirical Study of Training End-to-End Vision-and-Language
Transformers”. In: IEEE/CVF Conference on Computer Vision and Pattern Recog-
nition (CVPR). 2022, pp. 18145–18155. doi: 10.1109/CVPR52688.2022.01763.
[EMD23] Sedigheh Eslami, Christoph Meinel, and Gerard De Melo. “PubMedCLIP: How Much
Does CLIP Benefit Visual Question Answering in the Medical Domain?” In: Find-
ings of the Association for Computational Linguistics. 2023, pp. 1181–1193. doi:
10.18653/v1/2023.findings-eacl.88.
[ERO21] Patrick Esser, Robin Rombach, and Björn Ommer. “Taming Transformers for High-
Resolution Image Synthesis”. In: 2021 IEEE/CVF Conference on Computer Vision
and Pattern Recognition (CVPR). 2021, pp. 12868–12878. doi: 10.1109/CVPR46437.
2021.01268.
[Gan+22] Zhe Gan et al. Vision-Language Pre-training: Basics, Recent Advances, and Future
Trends. 2022. arXiv: 2210.09263.
[Gu+23] Jindong Gu et al. A Systematic Survey of Prompt Engineering on Vision-Language
Foundation Models. 2023. arXiv: 2307.12980.
[Gu+21] Yu Gu et al. “Domain-Specific Language Model Pretraining for Biomedical Natu-
ral Language Processing”. In: ACM Transactions on Computing for Healthcare 3.1
(2021), p. 23. doi: 10.1145/3458754.
[Han+23] Tianyu Han et al. MedAlpaca – An Open-Source Collection of Medical Conversational
AI Models and Training Data. 2023. arXiv: 2304.08247.
[Hao+20] Yiding Hao et al. Probabilistic Predictions of People Perusing: Evaluating Metrics
of Language Model Performance for Psycholinguistic Modeling. 2020. arXiv: 2009.
03954.
[He+23] Kai He et al. A Survey of Large Language Models for Healthcare: from Data, Tech-
nology, and Applications to Accountability and Ethics. 2023. arXiv: 2310.05694.
[He+16] Kaiming He et al. “Deep Residual Learning for Image Recognition”. In: IEEE Con-
ference on Computer Vision and Pattern Recognition. 2016, pp. 770–778. doi: 10.
1109/CVPR.2016.90.
[He+20] Xuehai He et al. PathVQA: 30000+ Questions for Medical Visual Question Answer-
ing. 2020. arXiv: 2003.10286.
[HS97] Sepp Hochreiter and Jürgen Schmidhuber. “Long Short-Term Memory”. In: Neural
Computation 9.8 (1997), pp. 1735–1780. doi: 10.1162/neco.1997.9.8.1735.
[Hu+22] Edward J Hu et al. “LoRA: Low-Rank Adaptation of Large Language Models”. In:
International Conference on Learning Representations. 2022.
[Hua+22] Gao Huang et al. “Convolutional Networks with Dense Connectivity”. In: IEEE
Transactions on Pattern Analysis and Machine Intelligence 44.12 (2022), pp. 8704–
8716. doi: 10.1109/TPAMI.2019.2918284.
[Hua+21] Yu Huang et al. “What Makes Multimodal Learning Better than Single (Provably)”.
In: Advances in Neural Information Processing Systems. 2021.
[Ion+21] Bogdan Ionescu et al. “Overview of the ImageCLEF 2021: Multimedia Retrieval in
Medical, Nature, Internet and Social Media Applications”. In: Experimental IR Meets
Multilinguality, Multimodality, and Interaction. 2021, pp. 345–370. doi: 10.1007/978-3-030-85251-1_23.
[Irv+19] Jeremy A. Irvin et al. “CheXpert: A Large Chest Radiograph Dataset with Uncer-
tainty Labels and Expert Comparison”. In: AAAI Conference on Artificial Intelli-
gence. Vol. 33. 2019, pp. 590–597. doi: 10.1609/aaai.v33i01.3301590.
[Jeo+23] Jaehwan Jeong et al. Multimodal Image-Text Matching Improves Retrieval-based Chest
X-Ray Report Generation. 2023. arXiv: 2303.17579.
[Ji20] Qiang Ji. “5 - Computer vision applications”. In: Probabilistic Graphical Models for
Computer Vision. Computer Vision and Pattern Recognition. Academic Press, 2020,
pp. 191–297. doi: 10.1016/B978-0-12-803467-5.00010-1.
[Jia+21] Chao Jia et al. “Scaling Up Visual and Vision-Language Representation Learning
with Noisy Text Supervision”. In: International Conference on Machine Learning.
Vol. 139. 2021, pp. 4904–4916.
[Jia+23] Albert Q. Jiang et al. Mistral 7B. 2023. arXiv: 2310.06825.
[Jin+21] Di Jin et al. “What Disease does This Patient Have? A Large-Scale Open Domain
Question Answering Dataset from Medical Exams”. In: Applied Sciences 11.14 (2021),
p. 6421. doi: 10.3390/app11146421.
[Jin+19] Qiao Jin et al. “PubMedQA: A Dataset for Biomedical Research Question Answer-
ing”. In: Conference on Empirical Methods in Natural Language Processing. 2019,
pp. 2567–2577. doi: 10.18653/v1/D19-1259.
[Joh+19a] Alistair E. W. Johnson et al. MIMIC-CXR-JPG, a Large Publicly Available Database
of Labeled Chest Radiographs. 2019. arXiv: 1901.07042.
[Joh+19b] Alistair EW Johnson et al. “MIMIC-CXR, a De-Identified Publicly Available Database
of Chest Radiographs with Free-Text Reports”. In: Scientific Data 6.317 (2019). doi:
10.1038/s41597-019-0322-0.
[Kay+22] Maxime Kayser et al. “Explaining Chest X-ray Pathologies in Natural Language”.
In: International Conference on Medical Image Computing and Computer-Assisted
Intervention (MICCAI). Vol. 13435. 2022, pp. 701–713. doi: 10.1007/978-3-031-
16443-9_67.
[KBR23] Hikmat Khan, Nidhal C Bouaynaya, and Ghulam Rasool. “The Importance of Ro-
bust Features in Mitigating Catastrophic Forgetting”. In: 2023 IEEE Symposium on
Computers and Communications (ISCC). IEEE. 2023, pp. 752–757.
[KBR24] Hikmat Khan, Nidhal Carla Bouaynaya, and Ghulam Rasool. “Brain-Inspired Contin-
ual Learning: Robust Feature Distillation and Re-consolidation for Class Incremental
Learning”. In: IEEE Access (2024).
[KJZ18] Jin-Hwa Kim, Jaehyun Jun, and Byoung-Tak Zhang. “Bilinear Attention Networks”.
In: Advances in Neural Information Processing Systems 31. Vol. 31. 2018, pp. 1564–
1574.
[KB14] Diederik P. Kingma and Jimmy Ba. “Adam: A Method for Stochastic Optimization”.
In: International Conference on Learning Representations (2014).
[Kwo+23] Gukyeong Kwon et al. Masked Vision and Language Modeling for Multi-modal Rep-
resentation Learning. 2023. arXiv: 2208.02131.
[Lam+22] Nathan Lambert et al. “Illustrating Reinforcement Learning from Human Feedback
(RLHF)”. In: Hugging Face Blog (2022). https://huggingface.co/blog/rlhf.
[Lan+19] Zhenzhong Lan et al. “ALBERT: A Lite BERT for Self-Supervised Learning of Lan-
guage Representations”. In: International Conference on Learning Representations.
2019.
[Lau+18] Jason J Lau et al. “A Dataset of Clinically Generated Visual Questions and Answers
about Radiology Images”. In: Scientific Data 5 (2018), p. 180251. doi: 10.1038/sdata.2018.251.
[Lee+23] Hyungyung Lee et al. UniXGen: A Unified Vision-Language Model for Multi-View
Chest X-ray Generation and Report Generation. 2023. arXiv: 2302.12172.
[LAC21] Brian Lester, Rami Al-Rfou, and Noah Constant. The Power of Scale for Parameter-
Efficient Prompt Tuning. 2021. arXiv: 2104.08691.
[Lew+20] Patrick Lewis et al. “Retrieval-Augmented Generation for Knowledge-Intensive NLP
Tasks”. In: Neural Information Processing Systems. Vol. 33. 2020, pp. 9459–9474.
[Li+23a] Chunyuan Li et al. LLaVA-Med: Training a Large Language-and-Vision Assistant for
Biomedicine in One Day. 2023. arXiv: 2306.00890.
[Li+21] Junnan Li et al. “Align Before Fuse: Vision and Language Representation Learning
with Momentum Distillation”. In: Advances in Neural Information Processing Sys-
tems. 2021.
[Li+19] Liunian Harold Li et al. VisualBERT: A Simple and Performant Baseline for Vision
and Language. 2019. arXiv: 1908.03557.
[Li+22] Mingjie Li et al. “Cross-modal Clinical Graph Transformer for Ophthalmic Report
Generation”. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition
(CVPR). 2022, pp. 20624–20633. doi: 10.1109/CVPR52688.2022.02000.
[Li+23b] Pengfei Li et al. “Masked Vision and Language Pre-Training with Unimodal and
Multimodal Contrastive Losses for Medical Visual Question Answering”. In: Medical
Image Computing and Computer Assisted Intervention (MICCAI). 2023, pp. 374–383.
doi: 10.1007/978-3-031-43907-0_36.
[LL21] Xiang Lisa Li and Percy Liang. Prefix-Tuning: Optimizing Continuous Prompts for
Generation. 2021. arXiv: 2101.00190.
[Li+23c] Yunxiang Li et al. “ChatDoctor: A Medical Chat Model Fine-Tuned on a Large Lan-
guage Model Meta-AI (LLaMA) Using Medical Domain Knowledge”. In: Cureus 15.6
(2023). doi: 10.7759/cureus.40895.
[Lin04] Chin-Yew Lin. “ROUGE: A Package for Automatic Evaluation of Summaries”. In:
Text Summarization Branches Out. 2004, pp. 74–81.
[Lin+23a] Weixiong Lin et al. PMC-CLIP: Contrastive Language-Image Pre-Training using
Biomedical Documents. 2023. arXiv: 2303.07240.
[Lin+23b] Zhihong Lin et al. “Medical Visual Question Answering: A Survey”. In: Artificial
Intelligence in Medicine 143 (2023), p. 102611. doi: 10.1016/j.artmed.2023.102611.
[Liu+23a] Bo Liu et al. “Medical Visual Question Answering via Conditional Reasoning and Con-
trastive Learning”. In: IEEE Transactions on Medical Imaging 42.5 (2023), pp. 1532–
1545. doi: 10.1109/TMI.2022.3232411.
[Liu+21a] Bo Liu et al. “Slake: A Semantically-Labeled Knowledge-Enhanced Dataset for Medi-
cal Visual Question Answering”. In: IEEE 18th International Symposium on Biomed-
ical Imaging (ISBI) (2021), pp. 1650–1654. doi: 10.1109/ISBI48211.2021.9434010.
[LTS23] Chang Liu, Yuanhe Tian, and Yan Song. A Systematic Review of Deep Learning-based
Research on Radiology Report Generation. 2023. arXiv: 2311.14199.
[Liu+22] Fangyu Liu et al. DePlot: One-Shot Visual Language Reasoning by Plot-to-Table
Translation. 2022. arXiv: 2212.10505.
[Liu+24] Hanchao Liu et al. A Survey on Hallucination in Large Vision-Language Models. 2024.
arXiv: 2402.00253.
[Liu+23b] Haotian Liu et al. Visual Instruction Tuning. 2023. arXiv: 2304.08485.
[Liu+21b] Ze Liu et al. “Swin Transformer: Hierarchical Vision Transformer using Shifted Win-
dows”. In: International Conference on Computer Vision (ICCV). 2021, pp. 9992–
10002. doi: 10.1109/ICCV48922.2021.00986.
[Lo+20] Kyle Lo et al. “S2ORC: The Semantic Scholar Open Research Corpus”. In: Annual
Meeting of the Association for Computational Linguistics. 2020, pp. 4969–4983. doi:
10.18653/v1/2020.acl-main.447.
[Lu+19] Jiasen Lu et al. “ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representa-
tions for Vision-and-Language Tasks”. In: Advances in Neural Information Processing
Systems. 2019, pp. 13–23.
[MHC20] Thusitha Mabotuwana, Christopher S Hall, and Nathan Cross. “Framework for Ex-
tracting Critical Findings in Radiology Reports”. In: Journal of Digital Imaging 33.4
(2020), pp. 988–995. doi: 10.1007/s10278-020-00349-7.
[Mah+22] Supriya Mahadevkar et al. “A Review on Machine Learning Styles in Computer Vision
- Techniques and Future Directions”. In: IEEE Access 10 (2022), pp. 107293–107329.
doi: 10.1109/ACCESS.2022.3209825.
[Man+23] Omid Nejati Manzari et al. “MedViT: A Robust Vision Transformer for Generalized
Medical Image Classification”. In: Computers in Biology and Medicine 157 (2023),
p. 106791. doi: 10.1016/j.compbiomed.2023.106791.
[Mas+11] Jonathan Masci et al. “Stacked Convolutional Auto-Encoders for Hierarchical Feature
Extraction”. In: International Conference on Artificial Neural Networks. Vol. 6791.
2011, pp. 52–59.
[Mik+13a] Tomas Mikolov et al. “Distributed Representations of Words and Phrases and Their
Compositionality”. In: Advances in Neural Information Processing Systems. Vol. 26.
2013, pp. 3111–3119.
[Mik+13b] Tomas Mikolov et al. Efficient Estimation of Word Representations in Vector Space.
2013. arXiv: 1301.3781.
[Mis+21] Pankaj Mishra et al. “VT-ADL: A Vision Transformer Network for Image Anomaly
Detection and Localization”. In: IEEE International Symposium on Industrial Elec-
tronics (ISIE). 2021, pp. 01–06. doi: 10.1109/ISIE45552.2021.9576231.
[Miu+21] Yasuhide Miura et al. “Improving Factual Completeness and Consistency of Image-to-
Text Radiology Report Generation”. In: North American Chapter of the Association
for Computational Linguistics. 2021, pp. 5288–5304. doi: 10.18653/v1/2021.naacl-
main.416.
[Moh23] Mashood Mohammad Mohsan et al. “Vision Transformer and Language Model Based Radiology Report Generation”. In: IEEE Access 11 (2023), pp. 1814–1824. doi: 10.1109/ACCESS.2022.3232719.
[MPC20] Maram Mahmoud A. Monshi, Josiah Poon, and Vera Chung. “Deep Learning in Gen-
erating Radiology Reports: A Survey”. In: Artificial Intelligence in Medicine 106
(2020), p. 101878. issn: 0933-3657. doi: 10.1016/j.artmed.2020.101878.
[Moo+22] Jong Hak Moon et al. “Multi-Modal Understanding and Generation for Medical Im-
ages and Text via Vision-Language Pre-Training”. In: IEEE Journal of Biomedical
and Health Informatics 26.12 (2022), pp. 6070–6080. doi: 10.1109/JBHI.2022.3207502.
[Moo+23] Michael Moor et al. Med-Flamingo: A Multimodal Medical Few-Shot Learner. 2023.
arXiv: 2307.15189.
[OLV19] Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation Learning with
Contrastive Predictive Coding. 2019. arXiv: 1807.03748.
[Ouy+22] Long Ouyang et al. “Training language models to follow instructions with human feed-
back”. In: Advances in neural information processing systems 35 (2022), pp. 27730–
27744.
[Pap+02] Kishore Papineni et al. “Bleu: a Method for Automatic Evaluation of Machine Trans-
lation”. In: Annual Meeting of the Association for Computational Linguistics. 2002,
pp. 311–318. doi: 10.3115/1073083.1073135.
[Par+22] Darsh Parekh et al. “A Review on Autonomous Vehicles: Progress, Methods and
Challenges”. In: Electronics 11.14 (2022). doi: 10.3390/electronics11142162.
[Pel+18] Obioma Pelka et al. “Radiology Objects in COntext (ROCO): A Multimodal Image
Dataset”. In: Intravascular Imaging and Computer Assisted Stenting and Large-Scale
Annotation of Biomedical Data and Expert Label Synthesis. Vol. 11043. Springer In-
ternational Publishing, 2018, pp. 180–189. doi: 10.1007/978-3-030-01364-6_20.
[Pel+23] Chantal Pellegrini et al. RaDialog: A Large Vision-Language Model for Radiology
Report Generation and Conversational Assistance. 2023. arXiv: 2311.18681.
[Pen+17] Yifan Peng et al. “NegBio: A High-Performance Tool for Negation and Uncertainty
Detection in Radiology Reports”. In: AMIA Summits on Translational Science Pro-
ceedings 2018 (2017), pp. 188–196.
[PSM14] Jeffrey Pennington, Richard Socher, and Christopher Manning. “Glove: Global Vec-
tors for Word Representation”. In: Empirical Methods in Natural Language Processing.
Vol. 14. 2014, pp. 1532–1543. doi: 10.3115/v1/D14-1162.
[Rad+21] Alec Radford et al. Learning Transferable Visual Models from Natural Language Su-
pervision. 2021. arXiv: 2103.00020.
[RB21] Abigail Rai and Samarjeet Borah. “Study of Various Methods for Tokenization”. In:
Applications of Internet of Things. 2021, pp. 193–200. doi: 10.1007/978-981-15-
6198-6_18.
[RCR22] Vignav Ramesh, Nathan Chi, and Pranav Rajpurkar. “Improving Radiology Report
Generation Systems by Removing Hallucinated References to Non-existent Priors”.
In: Machine Learning Research. Vol. 193. 2022, pp. 456–473.
[RBK21] René Ranftl, Alexey Bochkovskiy, and Vladlen Koltun. “Vision Transformers for
Dense Prediction”. In: IEEE/CVF International Conference on Computer Vision
(ICCV). 2021, pp. 12159–12168. doi: 10.1109/ICCV48922.2021.01196.
[Ran+23a] Veenu Rani et al. “Self-supervised Learning: A Succinct Review”. In: Archives of
Computational Methods in Engineering 30 (2023). doi: 10.1007/s11831-023-09884-
2.
[Ran+23b] Mercy Ranjit et al. Retrieval Augmented Chest X-Ray Report Generation using Ope-
nAI GPT models. 2023. arXiv: 2305.03660.
[Ren+24] Mengjie Ren et al. Learning or Self-aligning? Rethinking Instruction Fine-tuning.
2024. arXiv: 2402.18243 [cs.CL].
[Rez+19] Seyed Hamid Rezatofighi et al. “Generalized Intersection Over Union: A Metric and a
Loss for Bounding Box Regression”. In: IEEE/CVF Conference on Computer Vision
and Pattern Recognition (CVPR) (2019), pp. 658–666. doi: 10.1109/cvpr.2019.
00075.
[Rob51] Herbert E. Robbins. “A Stochastic Approximation Method”. In: Annals of Mathe-
matical Statistics 22 (1951), pp. 400–407. doi: 10.1214/aoms/1177729586.
[RS18] Alexey Romanov and Chaitanya Shivade. “Lessons from Natural Language Inference
in the Clinical Domain”. In: Conference on Empirical Methods in Natural Language
Processing. 2018, pp. 1586–1596. doi: 10.18653/v1/D18-1187.
[Rüc+22] Johannes Rückert et al. “Overview of ImageCLEFmedical 2022 – Caption Prediction
and Concept Detection”. In: CEUR Workshop Proceedings. Vol. 3180. 2022, pp. 1294–
1307.
[Sch19] Robin M. Schmidt. Recurrent Neural Networks (RNNs): A gentle Introduction and
Overview. 2019. arXiv: 1912.05911.
[See+22] Lalithkumar Seenivasan et al. “Surgical-VQA: Visual Question Answering in Surgical
Scenes Using Transformer”. In: Medical Image Computing and Computer Assisted
Intervention – MICCAI. 2022, pp. 33–43. doi: 10.1007/978-3-031-16449-1_4.
[SB23] Saurav Sengupta and Donald E. Brown. Automatic Report Generation for Histopathol-
ogy images using pre-trained Vision Transformers and BERT. 2023. arXiv: 2312 .
01435.
[SHB16] Rico Sennrich, Barry Haddow, and Alexandra Birch. “Neural Machine Translation
of Rare Words with Subword Units”. In: 54th Annual Meeting of the Association
for Computational Linguistics (Volume 1: Long Papers). 2016, pp. 1715–1725. doi:
10.18653/v1/P16-1162.
[SDK23] Dhruv Sharma, Chhavi Dhiman, and Dinesh Kumar. “Evolution of Visual Data Cap-
tioning Methods, Datasets, and Evaluation Metrics: a Comprehensive Survey”. In:
Expert Systems with Applications 221 (2023), p. 119773. issn: 0957-4174. doi: 10.
1016/j.eswa.2023.119773.
[Sho+19] Mohammad Shoeybi et al. Megatron-LM: Training Multi-Billion Parameter Language
Models Using Model Parallelism. 2019. arXiv: 1909.08053.
[Shr+23] Prashant Shrestha et al. Medical Vision Language Pretraining: A survey. 2023. arXiv:
2312.06224.
[Shu+23] Chang Shu et al. Visual Med-Alpaca: A Parameter-Efficient Biomedical LLM with Vi-
sual Capabilities. [Online; accessed 20-Feb-2024]. 2023. url: https://cambridgeltl.
github.io/visual-med-alpaca/.
[Sin+23] Karan Singhal et al. “Large Language Models Encode Clinical Knowledge”. In: Nature
620 (2023), pp. 172–180. doi: 10.1038/s41586-023-06291-2.
[Smi+20] Akshay Smit et al. CheXbert: Combining Automatic Labelers and Expert Annotations
for Accurate Radiology Report Labeling Using BERT. 2020. arXiv: 2004.09167.
[Sov+21] Petru Soviany et al. “Curriculum Learning: A Survey”. In: International Journal of
Computer Vision 130 (2021), pp. 1526–1565. doi: 10.1007/s11263-022-01611-x.
[Sub+20] Sanjay Subramanian et al. “MedICaT: A Dataset of Medical Images, Captions, and
Textual References”. In: Findings of the Association for Computational Linguistics:
EMNLP. 2020, pp. 2112–2120. doi: 10.18653/v1/2020.findings-emnlp.191.
[Sun+23] Zhiqing Sun et al. Aligning Large Multimodal Models with Factually Augmented RLHF.
2023. arXiv: 2309.14525.
[SB98] R.S. Sutton and A.G. Barto. “Reinforcement Learning: An Introduction”. In: IEEE
Transactions on Neural Networks 9.5 (1998), pp. 1054–1054. doi: 10.1109/TNN.1998.712192.
[TL20] Mingxing Tan and Quoc V. Le. EfficientNet: Rethinking Model Scaling for Convolu-
tional Neural Networks. 2020. arXiv: 1905.11946.
[TBF22] Ajay K. Tanwani, Joelle Barral, and Daniel Freedman. “RepsNet: Combining Vision
with Language for Automated Medical Reports”. In: Medical Image Computing and
Computer Assisted Intervention (MICCAI). 2022, pp. 714–724. doi: 10.1007/978-
3-031-16443-9_68.
[Tay53] Wilson L. Taylor. ““Cloze Procedure”: A New Tool for Measuring Readability”. In:
Journalism & Mass Communication Quarterly 30 (1953), pp. 415–433. doi: 10.1177/
107769905303000401.
[Tha+23] Omkar Thawkar et al. XrayGPT: Chest Radiographs Summarization using Medical
Vision-Language Models. 2023. arXiv: 2306.07971.
[TLZ23] Pang Ting, Peigao Li, and Lijie Zhao. “A Survey on Automatic Generation of Medical
Imaging Reports based on Deep Learning”. In: BioMedical Engineering OnLine 22
(May 2023). doi: 10.1186/s12938-023-01113-y.
[Tou+23a] Hugo Touvron et al. Llama 2: Open Foundation and Fine-Tuned Chat Models. 2023.
arXiv: 2307.09288.
[Tou+23b] Hugo Touvron et al. LLaMA: Open and Efficient Foundation Language Models. 2023.
arXiv: 2302.13971.
[Tou+21] Hugo Touvron et al. “Training Data-Efficient Image Transformers & Distillation
through Attention”. In: The International Conference on Machine Learning. Vol. 139.
2021, pp. 10347–10357.
[Tri+23] Aakash Tripathi et al. Building Flexible, Scalable, and Machine Learning-ready Mul-
timodal Oncology Datasets. 2023. arXiv: 2310.01438.
[Tya+21] Khushal Tyagi et al. “Detecting Pneumonia using Vision Transformer and Comparing
with Other Techniques”. In: International Conference on Electronics, Communication
and Aerospace Technology (ICECA). 2021, pp. 12–16. doi: 10.1109/ICECA52323.
2021.9676146.
[Vas+17] Ashish Vaswani et al. “Attention Is All You Need”. In: Advances in Neural Informa-
tion Processing Systems. Vol. 30. 2017, pp. 5998–6008.
[VC13] Karin Verspoor and Kevin Bretonnel Cohen. “Encyclopedia of Systems Biology”. In:
Springer New York, 2013. Chap. Natural Language Processing, pp. 1495–1498. doi:
10.1007/978-1-4419-9863-7_158.
[WCG20] Changhan Wang, Kyunghyun Cho, and Jiatao Gu. “Neural Machine Translation with
Byte-Level Subwords”. In: AAAI Conference on Artificial Intelligence. 2020, pp. 9154–
9160. doi: 10.1609/aaai.v34i05.6451.
[Wan+22a] Jianfeng Wang et al. GIT: A Generative Image-to-text Transformer for Vision and
Language. 2022. arXiv: 2205.14100.
[Wan+23a] Liyuan Wang et al. A Comprehensive Survey of Continual Learning: Theory, Method
and Application. 2023. arXiv: 2302.00487.
[Wan+23b] Yizhong Wang et al. Self-Instruct: Aligning Language Models with Self-Generated
Instructions. 2023. arXiv: 2212.10560.
[Wan+22b] Zifeng Wang et al. MedCLIP: Contrastive Learning from Unpaired Medical Images
and Text. 2022. arXiv: 2210.10163.
[Wan+22c] Zirui Wang et al. “SimVLM: Simple Visual Language Model Pretraining with Weak
Supervision”. In: International Conference on Learning Representations (ICLR). 2022.
[Waq+23a] Asim Waqas et al. Multimodal Data Integration for Oncology in the Era of Deep
Neural Networks: A Review. 2023. arXiv: 2303.06471.
[Waq+23b] Asim Waqas et al. “Revolutionizing Digital Pathology With the Power of Generative
Artificial Intelligence and Foundation Models”. In: Laboratory Investigation 103.11
(2023), p. 100255. doi: 10.1016/j.labinv.2023.100255.
[Wu+16] Yonghui Wu et al. Google’s Neural Machine Translation System: Bridging the Gap
between Human and Machine Translation. 2016. arXiv: 1609.08144.
[Xie+16] Saining Xie et al. “Aggregated Residual Transformations for Deep Neural Networks”.
In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2016),
pp. 5987–5995. doi: 10.1109/CVPR.2017.634.
[Xie+22] Zhenda Xie et al. “SimMIM: A Simple Framework for Masked Image Modeling”.
In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
(2022), pp. 9643–9653.
[Xin+22] Chao Xin et al. “An Improved Transformer Network for Skin Cancer Classification”.
In: Computers in Biology and Medicine 149 (2022), p. 105939. doi: 10.1016/j.compbiomed.2022.105939.
[Xu+21] Mengya Xu et al. “Learning Domain Adaptation with Model Calibration for Sur-
gical Report Generation in Robotic Surgery”. In: 2021 IEEE International Confer-
ence on Robotics and Automation (ICRA) (2021), pp. 12350–12356. doi: 10.1109/ICRA48506.2021.9561569.
[Yam+18] Rikiya Yamashita et al. “Convolutional Neural Networks: an Overview and Applica-
tion in Radiology”. In: Insights into Imaging 9 (June 2018). doi: 10.1007/s13244-
018-0639-9.
[Yan+22] Xi Yang et al. “A Large Language Model for Electronic Health Records”. In: NPJ
Digital Medicine 5.194 (2022). doi: 10.1038/s41746-022-00742-2.
[Yan+16] Zichao Yang et al. “Hierarchical Attention Networks for Document Classification”.
In: Conference of the North American Chapter of the Association for Computational
Linguistics: Human Language Technologies. 2016, pp. 1480–1489. doi: 10.18653/v1/
N16-1174.
[Yu+23] Feiyang Yu et al. “Evaluating Progress in Automatic Chest X-ray Radiology Report
Generation”. In: Patterns 4 (2023), p. 100802. doi: 10.1016/j.patter.2023.100802.
[Yua+23] Zheng Yuan et al. RAMM: Retrieval-augmented Biomedical Visual Question Answer-
ing with Multi-modal Pre-training. 2023. arXiv: 2303.00534.
[Zel+19] Rowan Zellers et al. “From Recognition to Cognition: Visual Commonsense Rea-
soning”. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition
(CVPR). 2019, pp. 6713–6724. doi: 10.1109/CVPR.2019.00688.
[Zen+20] Guangtao Zeng et al. “MedDialog: Large-scale Medical Dialogue Datasets”. In: Con-
ference on Empirical Methods in Natural Language Processing (EMNLP). 2020, pp. 9241–
9250. doi: 10.18653/v1/2020.emnlp-main.743.
[Zha+23a] Yuexiang Zhai et al. Investigating the Catastrophic Forgetting in Multimodal Large
Language Models. 2023. arXiv: 2309.10313.
[Zha+20a] Li-Ming Zhan et al. “Medical Visual Question Answering via Conditional Reasoning”.
In: The 28th ACM International Conference on Multimedia. 2020, pp. 2345–2354. doi:
10.1145/3394171.3413761.
[Zha+21] Chen Zhang et al. “A Survey on Federated Learning”. In: Knowledge-Based Systems
216 (2021), p. 106775. doi: 10.1016/j.knosys.2021.106775.
[ZNC18] Hanwang Zhang, Yulei Niu, and Shih-Fu Chang. “Grounding Referring Expressions
in Images by Variational Context”. In: IEEE/CVF Conference on Computer Vision
and Pattern Recognition. 2018, pp. 4158–4166. doi: 10.1109/CVPR.2018.00437.
[Zha+23b] Sheng Zhang et al. Large-Scale Domain-Specific Pretraining for Biomedical Vision-
Language Processing. 2023. arXiv: 2303.00915.
[Zha+20b] Tianyi Zhang et al. “BERTScore: Evaluating Text Generation with BERT”. In: In-
ternational Conference on Learning Representations. 2020.
[Zha+23c] Wentao Zhang et al. Adapter Learning in Pretrained Feature Extractor for Continual
Learning of Diseases. 2023. arXiv: 2304.09042.
[Zha+19] Yijia Zhang et al. “BioWordVec, Improving Biomedical Word Embeddings with Sub-
word Information and MeSH”. In: Scientific Data 6.52 (2019). doi: 10.1038/s41597-
019-0055-0.
[Zha+23d] Ruochen Zhao et al. Retrieving Multimodal Information for Augmented Generation:
A Survey. 2023. arXiv: 2303.10868.
[Zhe+19] Liangli Zhen et al. “Deep Supervised Cross-Modal Retrieval”. In: IEEE/CVF Confer-
ence on Computer Vision and Pattern Recognition (CVPR). 2019, pp. 10386–10395.
doi: 10.1109/CVPR.2019.01064.
[Zho+23a] Hongjian Zhou et al. A Survey of Large Language Models in Medicine: Progress,
Application, and Challenge. 2023. arXiv: 2311.05112.
[Zho+23b] Da-Wei Zhou et al. Learning without Forgetting for Vision-Language Models. 2023.
arXiv: 2305.19270.
[Zie+20] Daniel M. Ziegler et al. Fine-Tuning Language Models from Human Preferences. 2020.
arXiv: 1909.08593 [cs.CL].