
Vision-Language Models for Medical Report Generation and Visual Question Answering: A Review

Iryna Hartsock and Ghulam Rasool
Department of Machine Learning, H. Lee Moffitt Cancer Center & Research Institute

arXiv:2403.02469v2 [cs.CV] 15 Apr 2024

Abstract
Medical vision-language models (VLMs) combine computer vision (CV) and natural lan-
guage processing (NLP) to analyze visual and textual medical data. Our paper reviews recent
advancements in developing VLMs specialized for healthcare, focusing on models designed for
medical report generation and visual question answering (VQA). We provide background on
NLP and CV, explaining how techniques from both fields are integrated into VLMs to enable
learning from multimodal data. Key areas we address include the exploration of medical vision-
language datasets, in-depth analyses of architectures and pre-training strategies employed in
recent noteworthy medical VLMs, and comprehensive discussion on evaluation metrics for as-
sessing VLMs’ performance in medical report generation and VQA. We also highlight current
challenges and propose future directions, including enhancing clinical validity and addressing
patient privacy concerns. Overall, our review summarizes recent progress in developing VLMs
to harness multimodal medical data for improved healthcare applications.

1 Introduction
The last decade has witnessed enormous progress in artificial intelligence (AI) and machine learn-
ing (ML), including the development of foundation models (FMs), large language models (LLMs),
and vision-language models (VLMs). These AI/ML developments have started transforming sev-
eral aspects of our daily lives, including healthcare. AI/ML can potentially transform the whole
healthcare continuum by significantly optimizing and improving disease screening and diagnostic
procedures, treatment planning, and post-treatment surveillance and care [Baj+21]. Various com-
puter vision (CV) and natural language processing (NLP) models, more recently LLMs, have been
instrumental in driving this transformative trend [He+23; Zho+23a]. CV models have been trained
and validated for various screening and diagnosis use cases leveraging radiology data from X-rays,
mammograms, magnetic resonance imaging (MRI), computed tomography (CT), and others. Re-
cently, AI models focused on digital pathology using histopathology and immunohistochemistry
data have also shown significant advances in accurate disease diagnosis, prognosis, and biomarker
identification [Waq+23b]. On the other hand, by training models using large datasets of med-
ical literature, clinical notes, and other healthcare-related text, LLMs can extract insights from
electronic health records (EHR) efficiently, assist healthcare professionals in generating concise
summary reports, and facilitate the interpretation of patient information. Noteworthy examples
of such LLMs include GatorTron [Yan+22], ChatDoctor [Li+23c], Med-PaLM (Medical Pathways
Language Model) [Sin+23] and Med-Alpaca [Han+23].
Healthcare data is inherently multimodal, and consequently, AI/ML models often need
to be trained using multiple data modalities, including text (e.g., clinical notes, radiology reports,
surgical pathology reports, etc.), imaging (e.g., radiological scans, digitized histopathology slides,

etc.), and tabular data (e.g., numerical data such as vitals or labs and categorical data such as
race, gender, and others) [Aco+22; Shr+23; Waq+23a; Tri+23; Moh23]. In routine clinical prac-
tice, healthcare professionals utilize a combination of these data modalities for diagnosing and
treating various conditions. Integrating information from diverse data modalities enhances the pre-
cision and thoroughness of disease assessments, diagnoses, treatment planning, and post-treatment
surveillance. The need for AI/ML models to ingest, integrate, and learn from information stemming
from varied data sources is the driving force for multimodal learning [Hua+21; Waq+23a].
The recent progress in multimodal learning has been driven by the development of vision-
language models (VLMs) [Gan+22; Che+23; Moh23]. These cutting-edge models can analyze,
interpret, and derive insights from both visual and textual data. In the medical domain, these
models contribute to developing a more holistic understanding of patient information and improv-
ing the performance of ML models in various clinical tasks. Many of these models, like CLIP
(Contrastive Language–Image Pre-training) [Rad+21], LLaVa (Large Language and Vision Assis-
tant) [Liu+23b], and Flamingo [Ala+22], are tailored to the healthcare domain through training on
extensive medical datasets. Adapting VLMs for medical visual question-answering [Lin+23b] is
particularly noteworthy, empowering healthcare professionals to pose queries regarding medical
images such as CT scans, MRIs, mammograms, ultrasound, X-rays, and more. The question-
answering capability elevates the interactive nature of the AI/ML models in healthcare, facilitating
dynamic and informative exchanges between healthcare providers and the AI system. Furthermore,
adapting VLMs for medical report generation enables them to produce detailed and contextually
relevant reports by amalgamating information from both visual and textual sources. This not only
streamlines the documentation process but also ensures that the generated reports are comprehen-
sive and accurately reflect the subtleties present in the data, further enhancing healthcare workflow
efficiency.
In contrast to previous related surveys [Lin+23b; TLZ23; Shr+23], this review focuses on the
latest advancements in VLMs tailored for medical report generation and visual question-answering.
The overall structure of this review is shown in Fig. 1 and is outlined as follows. In Section 2,
we provide essential background on neural networks, CV, and NLP. In Section 3, we delve into
the exploration of VLMs’ architectures, training strategies, and downstream tasks. The goal of
Section 2 and Section 3 is to ensure the accessibility of this review for readers, irrespective of their
ML background. We split Section 4 into three key sub-sections. In Section 4.1, we describe 17
publicly available vision-language datasets. These datasets encompass medical image-text pairs or
question-answer pairs related to medical images. Next, in Section 4.2, we meticulously outline the
metrics and their formulas, where applicable, employed for evaluating VLMs in the context of report
generation and visual question-answering tasks. In Section 4.3, we conduct a thorough review of 15
recent medical VLMs, with 14 of them being publicly available. To the best of our knowledge, most
of these models have not been reviewed in any previous surveys. Finally, in Section 5, we discuss
the current challenges within the field of medical VLMs, offering insights into potential research
directions that could profoundly influence their future development. The list of medical VLMs and
datasets can also be found on GitHub.

2 Machine Learning (ML) - A Brief Review


2.1 Neural Networks (NNs)
ML and AI, as we understand them today, began to take shape in the late 1940s and early 1950s
[Bal21]. NNs stand out as classical ML models, drawing inspiration from the structure and function-
ing of the human brain. They are composed of layers of interconnected nodes or neurons arranged

Figure 1: Organization of the review paper.

into an input layer, an output layer, and multiple intermediate layers called hidden layers. The
basic NN is a “feedforward NN”, where neurons can be numbered in such a way that a connection
from neuron i to neuron j can exist if and only if i < j [Bal21]. In any NN, the connections
between nodes carry weight, and neurons utilize “activation functions” on their inputs. Activation
functions play a crucial role by introducing non-linearity to the model, enabling it to learn com-
plex nonlinear mappings between inputs and outputs. Common activation functions include the
sigmoid, hyperbolic tangent (tanh), and Rectified Linear Unit (ReLU). NNs utilize a loss function
to quantify the difference between the predicted outputs and the actual targets. The loss function
produces a scalar value, and the goal during training is to minimize this loss value.
Backpropagation, short for backward propagation of errors, is a key algorithm for training deep
neural networks. During the forward pass, input data is fed through the network, predictions are
generated, and a scalar loss value is calculated using the loss function. During backpropagation, we
calculate the gradient of the loss function with respect to the weights of the network. This gradient
information is then used to update the weights in an effort to minimize the difference between
predicted and target values. Backpropagation is an application of the chain rule for computing
derivatives [Bal21]. After the backward pass, the optimization algorithm takes these gradients and
adjusts the learnable parameters (weights and biases) of the NN, which, in turn, will result in the
minimization of the loss value in the next batch. Common optimization methods include gradient
descent, stochastic gradient descent (SGD) [Rob51], Adam (Adaptive Moment Estimation) [KB14],
and many others.
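
As an illustration of this training loop, the following minimal PyTorch sketch performs the forward pass, computes a scalar loss, backpropagates gradients, and applies an SGD update; the layer sizes, learning rate, and synthetic data are arbitrary choices for the example.

import torch
import torch.nn as nn

# A small feedforward NN: input layer -> hidden layer (ReLU) -> output layer.
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 1))
loss_fn = nn.MSELoss()                                 # produces a scalar loss value
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

x = torch.randn(8, 16)                                 # a batch of 8 inputs
y = torch.randn(8, 1)                                  # corresponding targets

for step in range(100):
    pred = model(x)                                    # forward pass
    loss = loss_fn(pred, y)                            # compare predictions with targets
    optimizer.zero_grad()                              # clear old gradients
    loss.backward()                                    # backpropagation (chain rule)
    optimizer.step()                                   # SGD update of weights and biases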

2.2 Natural Language Processing (NLP)


NLP is the analysis of linguistic data, most commonly in the form of textual data such as documents
or publications, using computational methods [VC13]. NLP encompasses a variety of tasks aimed

at understanding, processing, and generating human language. Named entity recognition (NER),
a prominent NLP task, focuses on identifying and classifying entities within the text, such as
names of individuals, medical conditions, etc. For instance, in medical literature, NER can assist
in extracting crucial information from documents. Text summarization in NLP is widely used for
generating coherent summaries of lengthy texts. Sentiment analysis is a task that determines the
emotional tone expressed in a given text, providing valuable insights for applications like social
media monitoring or customer feedback analysis. Machine translation is a fundamental NLP task
in breaking down language barriers by automatically translating text from one language to another.
Question answering in NLP is directed at comprehending and responding to user queries, propelling
advancements in virtual assistants and information retrieval.

2.2.1 Tokenization
The first step in NLP is tokenization, the process of splitting sentences and words into their
smallest meaningful units, called tokens. A morpheme is the smallest unit of a word that cannot
be broken down further [RB21]. One example of a word-
level tokenization method is whitespace tokenization, which segments text based on whitespace
characters. In many NLP applications, subword tokenization methods are preferred due to their
effectiveness in handling out-of-vocabulary words. WordPiece [Wu+16] begins by treating each
character as a token, creating an initial vocabulary. Employing a flexible and adaptive merging
strategy, WordPiece considers any pair of adjacent characters or subword units that enhance the
overall likelihood of the training data. This likelihood reflects the model’s probability of accurately
representing the training data given its current state. In contrast, Byte-Pair Encoding (BPE)
[SHB16] shares similarities with WordPiece but adheres to a more deterministic merging strategy.
In each iteration, BPE merges the most frequent pair of adjacent characters or subword units,
progressing toward a predefined vocabulary size. Byte-level BPE [WCG20] operates at an even
finer granularity, considering individual bytes rather than characters. Byte-level BPE extends the
concept of subword tokenization to bytes, allowing it to capture more nuanced patterns at the byte
level.
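
To make the merging idea concrete, the toy Python sketch below repeatedly merges the most frequent pair of adjacent symbols, as BPE does; it assumes a tiny whitespace-pre-tokenized corpus, whereas real tokenizers add byte-level handling, special tokens, and a stored merge table.

from collections import Counter

def byte_pair_encoding(words, num_merges):
    # Represent each word as a sequence of characters (the initial vocabulary).
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count all adjacent symbol pairs across the corpus.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)      # most frequent adjacent pair
        merges.append(best)
        # Merge that pair wherever it occurs.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    return merges, corpus

merges, corpus = byte_pair_encoding(["low", "lower", "lowest", "low"], num_merges=3)
print(merges)   # e.g., [('l', 'o'), ('lo', 'w'), ...]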

2.2.2 Token Embeddings


Tokens are then often transformed into numerical vectors that capture semantic relationships be-
tween tokens, which are referred to as word or token embeddings. Word2Vec [Mik+13a] is a widely
used word embedding technique that uses two models: Skip-Gram [Mik+13a] and Continuous Bag
of Words (CBOW) [Mik+13b]. In skip-gram, the model predicts context words given a target word,
capturing semantic associations. Conversely, CBOW predicts the target word based on its context,
emphasizing syntactic structures. In both models, the “context word” refers to words within a
specified window around the target word, and the “target word” is the word for which predic-
tions are made. Word2Vec is computationally efficient, making it well-suited for large datasets and
general-purpose applications. Global Vectors (GloVe) [PSM14] is a word embedding model that
distinguishes itself by capturing global semantic relationships. It focuses on the entire corpus rather
than local context windows. The model builds a co-occurrence matrix based on the global statistics
of word pairs and then employs an objective function to generate word vectors that reflect the ratio
of co-occurrence probabilities. GloVe uses an implicit skip-gram approach, capturing word relation-
ships globally, making it ideal for tasks that demand a holistic understanding of word connections.
FastText [Boj+17] is another word embedding that is particularly effective for handling out-of-
vocabulary words and morphologically rich languages. It adopts a sub-word approach, breaking

words into n-grams, and employs a skip-gram training method similar to Word2Vec [Mik+13a] to
learn embeddings for these sub-word units. There are also word embeddings tailored to represent
biomedical and clinical terms better. In tasks where the order of words is not essential, other fea-
ture extraction techniques can be effective, for example, bag-of-words (BoW) in text classification
or term frequency-inverse document frequency (tf-idf) for information retrieval.
In addition to general-purpose word embeddings, there are ones designed for biomedical and
clinical terms. BioWordVec [Zha+19] incorporates MeSH (Medical Subject Headings) terms along
with text from PubMed abstracts and employs the fastText [Boj+17] algorithm to learn improved
biomedical word embeddings. Another prominent approach is Cui2vec [Bea+20], which leverages
diverse multi-modal data derived from medical publications and clinical notes. Cui2vec system-
atically maps medical terms onto a common Concept Unique Identifier (CUI) space, followed by
the construction of a co-occurrence matrix. This matrix captures instances where different CUIs
appear together, which is a foundation for generating word embeddings using techniques such as
GloVe [PSM14] or Word2Vec [Mik+13a]. In most cases, it is common to add positional encodings
to capture the order of tokens in a sequence. Positional encoding vectors, often based on sinusoidal
functions, systematically encode token positions, enriching embeddings with positional information
for utilization in ML models tailored to specific NLP tasks [Ahm+23].
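
A minimal NumPy sketch of the sinusoidal positional encodings mentioned above is shown below; it assumes an even embedding dimension, and the sequence length and dimension are illustrative.

import numpy as np

def sinusoidal_positional_encoding(seq_len, dim):
    # Even dimensions use sine, odd dimensions use cosine, at geometrically spaced frequencies.
    positions = np.arange(seq_len)[:, None]               # (seq_len, 1)
    div_terms = 10000 ** (np.arange(0, dim, 2) / dim)     # one frequency per pair of dimensions
    pe = np.zeros((seq_len, dim))
    pe[:, 0::2] = np.sin(positions / div_terms)
    pe[:, 1::2] = np.cos(positions / div_terms)
    return pe

# Token embeddings (e.g., from Word2Vec or a learned table) plus position information.
embeddings = np.random.randn(10, 64)        # 10 tokens, 64-dimensional embeddings
inputs = embeddings + sinusoidal_positional_encoding(10, 64)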

2.2.3 Recurrent Neural Networks (RNNs)


RNNs are widely employed for pattern detection in sequential data, encompassing diverse types
such as genomic sequences, text, or numerical time series [Sch19]. Operating on the principle of pre-
serving a form of memory, RNNs incorporate a cyclic structure by looping the output of a specific
layer back to the input, facilitating the prediction of subsequent layer outputs. This mechanism em-
powers RNNs to adeptly model sequential and temporal dependencies, capturing information from
preceding time steps within hidden states. Despite their capacity to retain information from past
inputs, RNNs encounter challenges in preserving long-term dependencies within input sequences
due to the vanishing gradient problem. To address this, several RNN variants, including Long
Short-Term Memory (LSTM) [HS97] and Gated Recurrent Unit (GRU) [Cho+14] networks, have
been designed to enhance their ability to capture and utilize long-range dependencies in sequential
data.
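
As a brief illustration, the PyTorch sketch below (dimensions are arbitrary for the example) runs an LSTM over a batch of embedded token sequences and returns the per-step outputs together with the final hidden and cell states.

import torch
import torch.nn as nn

# An LSTM that reads 128-dimensional token embeddings and keeps a 256-dimensional hidden state.
lstm = nn.LSTM(input_size=128, hidden_size=256, num_layers=1, batch_first=True)

batch = torch.randn(4, 20, 128)        # 4 sequences, 20 time steps, 128-dim embeddings
outputs, (h_n, c_n) = lstm(batch)      # outputs: hidden state at every time step
print(outputs.shape)                   # torch.Size([4, 20, 256])
print(h_n.shape)                       # torch.Size([1, 4, 256]) - final hidden state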

2.2.4 Transformers
In recent years, there has been a remarkable advancement in NLP mainly due to the development of
the Transformer models [Vas+17]. Beyond incorporating embeddings and positional encodings, the
Transformer architecture consists of an encoder that processes input data, represented by vectors
obtained from embedded and positionally encoded tokens. The encoder-generated representation
then serves as the input for the subsequent decoder, which transforms these vector representations
into a relevant output tailored to the specific task at hand. A defining characteristic of the Trans-
former lies in its self-attention mechanism, notably the scaled dot-product attention, which proves
instrumental in capturing intricate dependencies within sequences. This mechanism, employed in
both the encoder and decoder, utilizes queries, keys, and values during the attention process.
Queries are projections of the input sequence used to attend to other positions, keys represent the
positions being attended to, and the resulting attention weights are applied to the values. Enhanced by multi-head attention
for parallelization, the self-attention mechanism enables the model to dynamically weigh different
parts of the input sequence, fostering a nuanced understanding of contextual relationships. Each
layer in both the encoder and decoder encompasses sub-layers, including a feedforward NN, fur-

ther augmenting the model’s capacity to capture intricate patterns within the data. In practice,
Transformers face limitations in effectively processing long sequences and exhibit less selectivity
about relevant information when considering all positions in the sequence. Various techniques
have been proposed to address these issues. One such approach, known as hierarchical attention
[Yan+16], strategically reduces computational complexity and enhances contextual sensitivity by
initially computing attention at the word level and then at the sentence level. Another notable ad-
vancement in attention algorithms is the FlashAttention [Dao+22] and FlashAttention-2 [Dao23],
designed to accelerate attention computations significantly.
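
The scaled dot-product attention described above, softmax(QK^T / sqrt(d_k))V, can be sketched in a few lines of NumPy; this is a single head with no masking, and the dimensions are illustrative.

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                        # similarity of each query to each key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)         # softmax over key positions
    return weights @ V                                     # weighted sum of the values

# A sequence of 5 tokens with 64-dimensional query, key, and value projections.
Q = np.random.randn(5, 64); K = np.random.randn(5, 64); V = np.random.randn(5, 64)
print(scaled_dot_product_attention(Q, K, V).shape)         # (5, 64)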
The synergy between the enhanced computational power provided by Graphical Processing
Units (GPUs) and the advancements in attention mechanisms has played a pivotal role in the
development of large language models (LLMs). These models are meticulously trained on vast
datasets with a very large number of parameters. Initial LLMs include but are not limited to,
BERT (Bidirectional Encoder Representations from Transformers) [Dev+19] (the largest version
comprising 340 M parameters), ALBERT (A Lite BERT) [Lan+19] (with the largest variant at 235 M
parameters), and Megatron-LM [Sho+19] (the largest version featuring 8.3 B parameters). The era
of even larger LLMs began in 2020, introducing models like GPT-3 (the third-generation Generative
Pre-trained Transformer) [Bro+20] (175 B parameters) and PaLM (Pathways Language Model)
[Cho+22] (540 B parameters). Some of the most recent LLMs are LLaMA (Large Language Model
Meta AI) [Tou+23b], Vicuna [Chi+23], Llama 2 [Tou+23a], and Mistral [Jia+23]. Note that
encoder-only LLMs can be used to generate token embeddings (e.g., BERT [Dev+19] or GatorTron
[Yan+22]).

2.3 Computer Vision (CV)


CV involves interpreting and understanding the visual world from images or videos [Ji20]. Data in
CV is encoded as numerical values representing the intensity or brightness of pixels. The extraction
of visual patterns like edges, textures, and objects in images or video frames serves as building
blocks for various CV tasks. Image classification is the task of assigning a label to an entire image,
determining the main object or scene. Object detection involves identifying and locating multiple
objects within an image, providing both labels and bounding boxes. Image segmentation divides
an image into meaningful segments, assigning a label to each pixel and outlining the boundaries of
distinct objects or regions. Various ML techniques and models are utilized for these tasks [Mah+22].

2.3.1 Convolutional Neural Networks (CNNs)


CNNs represent a significant advancement in CV [Yam+18]. Besides pooling and fully connected
layers, CNNs also have convolution layers, which apply convolution operations to input data. Dur-
ing a convolution operation, a small filter or kernel slides over the input data. At each position, the
filter performs element-wise multiplications with the local regions of the input. The results of these
multiplications are then summed, creating a new value in the output feature map. This process
is repeated across the entire input, capturing patterns and features at different spatial locations.
The well-known CNNs include Residual Network (ResNet) [He+16], Dense Convolutional Network
(DenseNet) [Hua+22], Efficient Network (EfficientNet) [TL20] and many others.
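
A minimal NumPy sketch of the sliding-window convolution operation described above is shown below; it uses a single channel, stride 1, and no padding, and the 3x3 kernel is an arbitrary example.

import numpy as np

def convolve2d(image, kernel):
    # Slide the kernel over the image; each output value is a sum of element-wise products.
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    feature_map = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            feature_map[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return feature_map

image = np.random.rand(28, 28)                                 # a single-channel image
edge_filter = np.array([[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]])   # a simple vertical-edge kernel
print(convolve2d(image, edge_filter).shape)                    # (26, 26)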

2.3.2 Vision Transformers (ViTs)


Transformer models, which were originally proposed for NLP tasks, have also found valuable ap-
plications in CV. For instance, the ViT model [Dos+21] can capture intricate relationships and
dependencies across the entire image. This is achieved by leveraging the Transformer architecture

and treating images as sequences of smaller patches. Each image patch undergoes a process of
flattening into a vector, followed by passage through an embedding layer. The embedding layer
enriches the flattened image patches, providing a more expressive and continuous representation.
Next, positional encodings are incorporated into the embeddings, conveying information about
the spatial arrangement of the image patches. A distinctive feature of ViTs is the introduction
of a special token designed to capture global information about the entire image. This special
token has an associated learnable token embedding, represented by a vector with its unique set
of parameters. ViTs have achieved notable success in semantic segmentation [RBK21], anomaly
detection [Mis+21], medical image classification [Man+23] and even outperformed CNNs in some
cases [Tya+21; Xin+22].
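
The patch-embedding pipeline described above can be sketched in PyTorch as follows; the patch size, embedding dimension, and zero-initialized class token and positional embeddings are illustrative choices, and a real ViT learns these parameters during training.

import torch
import torch.nn as nn

image = torch.randn(1, 3, 224, 224)               # one RGB image
patch = 16                                        # 16x16 patches -> 14*14 = 196 patches

# Split into non-overlapping patches and flatten each one into a vector.
patches = image.unfold(2, patch, patch).unfold(3, patch, patch)      # (1, 3, 14, 14, 16, 16)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, 196, 3 * patch * patch)

embed = nn.Linear(3 * patch * patch, 768)         # embedding layer for flattened patches
tokens = embed(patches)                           # (1, 196, 768)

cls_token = nn.Parameter(torch.zeros(1, 1, 768))  # learnable special token for global information
pos_embed = nn.Parameter(torch.zeros(1, 197, 768))
tokens = torch.cat([cls_token, tokens], dim=1) + pos_embed           # (1, 197, 768)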

3 Vision-Language Models (VLMs)


Many real-world situations inherently involve a variety of data modalities. For example, autonomous
cars must process information from various sensors like cameras, RADAR, LiDAR, and/or GPS
to ensure safe and effective navigation [Par+22]. Similarly, in cancer care, the fusion of radiology
images with genomic data, digitized histopathology slides, and clinical reports has the potential
to improve diagnosis, treatment planning, and post-treatment surveillance [Boe+21; Waq+23a;
Moh23]. This motivated the development of VLMs, which can handle and understand both NLP
and CV data simultaneously.

3.1 Model Architecture


3.1.1 Single- vs. Dual-Stream VLMs
Based on how different data modalities are fused together in VLMs, they are generally catego-
rized into two groups [Che+23]: (1) single-stream (e.g., VisualBERT [Li+19] and UNITER or
UNiversal Image-TExt Representation Learning [Che+20b]), and (2) dual-stream models (e.g.,
ViLBERT Vision-and-Language BERT [Lu+19] and CLIP or Contrastive Language-Image Pre-
training [Rad+21]).

Single-Stream Models A single-stream VLM adopts an efficient architecture for processing both
visual and textual information within a unified module. This architecture incorporates an early
fusion of distinct data modalities, where feature vectors from various data sources are concatenated
into a single vector (e.g., MedViLL [Moo+22]). Subsequently, this combined representation is fed
into a single stream. One notable advantage of the single-stream design is its parameter efficiency,
achieved by employing the same set of parameters for all modalities. This not only simplifies the
model but also contributes to computational efficiency during both training and inference phases
[Che+23].

Dual-Stream Models A dual-stream VLM extracts visual and textual representations sepa-
rately in parallel streams that do not share parameters. This architecture usually has higher
computational complexity than single-stream architectures. Visual features are generated from
pre-trained vision encoders, such as CNNs or ViTs, and textual features are obtained from pre-
trained text encoders, usually based on a Transformer architecture (e.g., PubMedCLIP [EMD23]).
Both features are then fed into a multimodal fusion module, often leveraging attention mechanisms,
to integrate information from both data modalities and to learn cross-modal representations. This
late fusion approach allows for more intricate interactions between visual and textual information,

enabling the model to capture complex cross-modal dependencies. However, it comes at the cost
of increased computational complexity compared to single-stream architecture.

3.1.2 Encoder vs. Encoder-Decoder VLMs


The learned cross-modal representations can be optionally processed by a decoder before producing
the final output. Consequently, VLMs are classified into two groups: (1) encoder-only (e.g., ALIGN
(A Large-scale ImaGe and Noisy-text embedding) [Jia+21]), and (2) encoder-decoder models (e.g.,
SimVLM (Simple Visual Language Model) [Wan+22c]).

Encoder-only Models These models are advantageous in scenarios where the primary objec-
tive is efficient representation learning. They often exhibit streamlined processing and reduced
computational complexity, making them suitable for tasks requiring compact and informative rep-
resentations. However, these models might lack the capability to generate intricate and detailed
outputs, limiting their use in tasks demanding nuanced responses or creative generation.

Encoder-Decoder Models These models offer the flexibility to generate complex and diverse
outputs, making them well-suited for tasks like image captioning, translation, or any application re-
quiring creative responses. The decoding step allows for the transformation of joint representations
into meaningful outputs. However, this versatility comes at the cost of increased computational
load and complexity.

3.2 Model Training


3.2.1 Transfer Learning
A commonly adopted strategy in ML is to employ pre-trained models and customize them to specific
downstream tasks — a method commonly known as transfer learning. This process typically
involves fine-tuning the model’s parameters using smaller task-specific datasets to address the
intricacies of the target task [Bom+22]. Transfer learning can also be viewed as initializing a model's
parameters for a new task with parameters already optimized on another task, rather than using
random initialization. Transfer learning may involve some modification
of the original model’s architecture. This can include modifications to the final layers or the
introduction of new layers, such as classification or regression layers, tailored to meet the specific
requirements of the task at hand [Bom+22]. The underlying idea is to adapt the pre-trained
model to the specifics of the new task while retaining the knowledge it gained during the initial
pre-training.
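
A minimal transfer-learning sketch in PyTorch/torchvision is shown below; it assumes a recent torchvision version with the weights argument (older versions use pretrained=True), an ImageNet-pre-trained ResNet-18, and a hypothetical 3-class downstream task.

import torch.nn as nn
from torchvision import models

# Start from parameters already optimized on ImageNet instead of random initialization.
model = models.resnet18(weights="IMAGENET1K_V1")

# Freeze the pre-trained backbone so only the new task-specific layer is updated.
for param in model.parameters():
    param.requires_grad = False

# Replace the final layer with a new classification head for the 3-class downstream task.
model.fc = nn.Linear(model.fc.in_features, 3)

# Only the new head's parameters are passed to the optimizer during fine-tuning.
trainable = [p for p in model.parameters() if p.requires_grad]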

3.2.2 Curriculum Learning


Curriculum learning presents an innovative approach when dealing with tasks or data exhibiting a
natural progression or hierarchy. This method involves strategically presenting training examples or
tasks in a designed order, typically based on difficulty or complexity measures [Sov+21]. The recent
medical VLM, LLaVa-Med [Li+23a], adopts curriculum learning during its training phase. This
allows the model to learn gradually, starting with simpler examples and progressing to more intricate
ones. This orchestrated learning sequence enhances the model’s adaptability and performance.

3.2.3 Self-Supervised Learning (SSL)
SSL is a fundamental paradigm in training VLMs, offering a powerful alternative to traditional
supervised learning by allowing models to generate their own labels from the data [Ran+23a]. This
is particularly beneficial when obtaining large amounts of labeled data is challenging or expensive.
In self-supervised learning for VLMs, the models formulate tasks that leverage inherent structures
within the data, enabling them to learn meaningful representations across modalities without ex-
plicit external labels. Contrastive learning, masked language modeling, and masked image modeling
(described in the following sub-section) are examples of self-supervised learning tasks.

3.2.4 Pre-Training Process and Tasks


The pre-training process plays a pivotal role in equipping VLMs with a foundational understanding
of the intricate interplay between visual and textual data. A prevalent strategy involves intensive
pre-training on datasets where images/videos are paired with their corresponding textual descrip-
tions. During pre-training, various tasks guide the model in learning versatile representations for
downstream tasks.

Contrastive Learning (CL) CL encourages the model to learn meaningful representations by


contrasting positive pairs with negative pairs of both visual and textual data [Li+21]. During CL,
the model is trained to map both positive and negative pairs into a shared embedding space. Positive
pairs consist of examples where the visual and textual content are related, such as an image paired
with its corresponding textual description. Conversely, negative pairs consist of examples where the
visual and textual content are unrelated, like an image paired with a randomly selected different
textual description. The objective is to bring positive pairs closer together while pushing negative
pairs farther apart in the shared embedding space. Various contrastive loss functions are employed
to achieve this objective, with the InfoNCE (Noise-Contrastive Estimation) loss [OLV19] being a
common choice. InfoNCE formulates a probabilistic task where the model is trained to maximize
the likelihood of observing positive pairs and minimize the likelihood of observing negative pairs.
The negative log-likelihood of the positive pair is used as the loss. CLIP [Rad+21] employs InfoNCE
loss with cosine similarity. On the other hand, ALIGN [Jia+21] uses a normalized softmax loss. This
loss computes the softmax over cosine similarities between the normalized embeddings of positive
and negative pairs, aiming to boost positive similarity while diminishing negative similarities.
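
A simplified CLIP-style InfoNCE sketch is shown below: it computes symmetric image-to-text and text-to-image cross-entropy over a batch of paired embeddings, where matched pairs sit on the diagonal; the temperature value and embedding sizes are illustrative.

import torch
import torch.nn.functional as F

def clip_style_contrastive_loss(image_emb, text_emb, temperature=0.07):
    # InfoNCE with cosine similarity: matched (i, i) pairs are positives,
    # all other (i, j) pairs in the batch serve as negatives.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature     # scaled cosine similarities
    targets = torch.arange(image_emb.size(0))           # positive pair indices on the diagonal
    loss_i2t = F.cross_entropy(logits, targets)         # images retrieving their texts
    loss_t2i = F.cross_entropy(logits.t(), targets)     # texts retrieving their images
    return (loss_i2t + loss_t2i) / 2

image_emb = torch.randn(32, 512)    # embeddings from a vision encoder
text_emb = torch.randn(32, 512)     # embeddings from a text encoder
print(clip_style_contrastive_loss(image_emb, text_emb))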

Masked Language Modeling (MLM) MLM is a widely used task in NLP [Tay53]. It was
first introduced and applied in the BERT model [Dev+19]. MLM involves randomly selecting a
percentage of tokens within textual data and replacing them with a special token, often denoted as
MASK. The model predicts these masked tokens by taking into account the context on both sides
of them, allowing the model to grasp nuanced contextual information. VLMs such as UNITER
[Che+20b] and VisualBERT [Li+19] leverage MLM for pre-training.

Masked Image Modeling (MIM) Extending the idea of MLM to images gave rise to MIM
[Xie+22]. In MIM, certain patches are masked, prompting the model to predict the contents of
masked regions. This process enables the model to draw context from the entirety of the image, en-
couraging the integration of both local and global visual features. VLMs like UNITER [Che+20b]
and ViLBERT [Lu+19] leverage MIM for enhanced performance. The cross-entropy loss is em-
ployed in MLM and MIM tasks to measure the difference between predicted and actual probability

distributions for the masked elements. Additionally, MLM can be combined with MIM, allowing the
reconstruction of the masked signal in one modality with support from another modality [Kwo+23].
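
A minimal sketch of the masking step used in MLM is shown below; the mask token id, masking rate, and vocabulary size are illustrative, and BERT-style recipes additionally keep or randomize a fraction of the selected tokens.

import torch
import torch.nn.functional as F

def mask_tokens(token_ids, mask_id, mask_prob=0.15):
    # Randomly replace a fraction of tokens with [MASK]; labels are -100 (ignored) elsewhere.
    labels = token_ids.clone()
    masked = torch.rand(token_ids.shape) < mask_prob
    labels[~masked] = -100                     # cross_entropy skips these positions
    inputs = token_ids.clone()
    inputs[masked] = mask_id
    return inputs, labels

token_ids = torch.randint(0, 30000, (4, 32))   # a batch of 4 sequences, 32 tokens each
inputs, labels = mask_tokens(token_ids, mask_id=103)

logits = torch.randn(4, 32, 30000)             # stand-in for model predictions over the vocabulary
loss = F.cross_entropy(logits.view(-1, 30000), labels.view(-1), ignore_index=-100)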

Image-Text Matching (ITM) ITM is another common vision-language pre-training task. Through-
out the training, the model learns to map images and corresponding textual descriptions into a
shared semantic space, where closely aligned vectors represent similar content in both modali-
ties. In single-stream VLMs, the special token [CLS] represents the joint representation for both
modalities. In contrast, in dual-stream VLMs, the visual and textual representations of [CLS]V
and [CLS]T are concatenated. This joint representation is fed into a fully-connected layer followed
by the sigmoid function, predicting a score indicating match or mismatch [Che+23]. Models like
CLIP [Rad+21], ALBEF (ALign the image and text representations BEfore Fusing) [Li+21], and
METER [Dou+22] leverage ITM during pre-training.

Combining Multiple Tasks In VLM pre-training, multiple tasks are often combined in a uni-
fied framework, allowing models to grasp nuanced contextual information across modalities. The
final loss function can combine contrastive loss, cross-entropy loss for masked token prediction, and
other task-specific losses. This comprehensive pre-training approach equips VLMs with versatile
representations for diverse downstream tasks. For example, ALBEF [Li+21] adopts a comprehen-
sive pre-training objective encompassing three tasks: CL, MLM, and ITM. The overall loss is then
computed as the sum of these individual components.

3.2.5 Fine-Tuning Techniques


Following the training, a common practice involves fine-tuning VLMs on smaller datasets tailored
to specific downstream tasks.

Supervised Fine-Tuning (SFT) Before employing SFT, the VLM is pre-trained on an ex-
tensive image-text dataset, establishing a foundational understanding of the complex relationship
between visual and textual representations. SFT involves meticulous fine-tuning on a more fo-
cused dataset, curated to match the nuances of the targeted application. This dual-phase strategy,
encompassing broad pre-training and task-specific fine-tuning, enables the model to benefit from
large-scale generalization while seamlessly adapting to the intricacies of particular applications
[Ouy+22].

Reinforcement Learning from Human Feedback (RLHF) RLHF is a distinct fine-tuning


approach employed to enhance VLMs through the incorporation of human preferences during fine-
tuning [Ouy+22; Lam+22; Zie+20]. RLHF initiates with an initial model, incorporating human-
generated rankings of its outputs to construct a detailed reward model. In contrast to traditional
reinforcement learning (RL) [SB98; Cor+20], which relies solely on environmental interactions,
RLHF strategically integrates human feedback. This human-in-the-loop approach provides a more
nuanced and expert-informed methodology, allowing for the fine-tuning of VLMs in alignment with
human preferences, ultimately leading to improved model outcomes.

Instruction Fine-Tuning (IFT) IFT refers to the process of refining a pre-trained language
model by providing specific instructions or guidance tailored to a particular task or application
[Ren+24]. This process typically involves exposing the model to examples or prompts related to
the desired instructions and updating its parameters based on the feedback received during this
task-specific training phase. The medical VLM RaDialog [Pel+23] employs this fine-tuning technique.

3.3 Parameter-Efficient Fine-Tuning (PEFT)
In this section, we explore strategies for adapting VLMs while keeping the model’s parameters frozen
and only updating newly added layers. In recent years, PEFT has gained prominence, encompassing
various techniques and strategies that aim to make the most effective use of parameters during the
fine-tuning process, particularly in scenarios with limited labeled data for the target task. The
main strategy of PEFT is incorporating task-specific parameters, known as adapters, into a pre-
trained model while retaining its original parameters. The architecture of adapter modules typically
incorporates a bottleneck structure, projecting original features into a reduced dimension, applying
non-linearity, and then projecting back to the original dimension. This thoughtful design ensures
parameter efficiency by limiting the number of added parameters per task. Integrated after each
layer of the pre-trained model, adapter modules capture task-specific details while preserving shared
parameters, enabling the model’s seamless extension to new tasks without significant interference
with previously acquired knowledge.

3.3.1 Low-Rank Adaptation (LoRA)


LoRA is a common adapter-based method [Hu+22]. The adaptation process involves fine-tuning
two smaller low-rank matrices that are decompositions of the larger weight matrix of the pre-trained
model. These smaller matrices constitute the LoRA adapter modules, and the approach focuses
on making low-rank modifications to adapt the model for specific tasks efficiently. Pre-trained
LLMs that are part of medical VLM architectures are often fine-tuned using LoRA (e.g., Visual
Med-Alpaca [Shu+23] and RaDialog [Pel+23]).
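
A minimal sketch of the LoRA idea is shown below: a frozen pre-trained linear layer plus a trainable low-rank update. The rank and scaling values are illustrative, and libraries such as Hugging Face PEFT provide ready-made implementations.

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    # A frozen pre-trained linear layer with a trainable low-rank update: W x + scale * B A x.
    def __init__(self, linear, rank=8, alpha=16):
        super().__init__()
        self.linear = linear
        for p in self.linear.parameters():       # keep the pre-trained weights frozen
            p.requires_grad = False
        self.lora_A = nn.Parameter(torch.randn(rank, linear.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(linear.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        return self.linear(x) + self.scale * (x @ self.lora_A.t() @ self.lora_B.t())

layer = LoRALinear(nn.Linear(768, 768), rank=8)
out = layer(torch.randn(2, 768))                 # only lora_A and lora_B receive gradients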

3.3.2 Prompt Tuning


Prompt tuning involves creating continuous vector representations as input hints [LAC21]. This
allows the model to generate effective prompts during training dynamically. Continuous refinement
of prompts significantly enhances the model’s ability to generate accurate and contextually relevant
responses. This iterative process allows the model to adapt its behavior based on an evolving
understanding of the task. VLMs like Qwen-VL and InstructBLIP used prompt tuning [Bai+23;
Dai+23].

3.3.3 Prefix Token Tuning


Prefix token tuning adds task-specific vectors to the input, specifically to the initial tokens known
as prefix tokens, to guide the model’s behavior for a given task [LL21]. For instance, VL-T5 utilized
different prefixes for questions from various datasets [Cho+21]. These vectors can be trained and
updated independently while keeping the remaining pre-trained model parameters frozen. Prefix
token tuning allows for task-specific adaptation without compromising the pre-trained knowledge
encoded in the majority of the model’s parameters.

3.4 In-Context Learning


In this section, we explore strategies for adapting VLMs using the context only, keeping the model’s
parameters (and PEFT/LoRA adapters, if any) frozen. In our settings, in-context learning may be
considered as using LLMs or VLMs for inference only.

3.4.1 Prompt Engineering
Prompt engineering is a technique that involves enhancing a large pre-trained model with task-
specific instructions, referred to as prompts, to tailor the model’s output for specific tasks [Gu+23].
Examples include instructing the model to generate a radiology report for a specific image (e.g.,
RaDialog [Pel+23]). Prompt engineering can also expose the VLM to a sequence of interconnected
examples or prompts, guiding it to a desired output. Another approach incorporates progressively
structured instructions or questions, refining focus and enhancing the model’s ability to generate
coherent and contextually relevant responses [Gu+23].

3.4.2 Retrieval augmented generation (RAG)


RAG is a form of prompt engineering that involves strategically crafting prompts for both retrieval
and generation phases, allowing for an adaptive and efficient process that leverages external knowl-
edge sources to enhance generative tasks. While the original concept of RAG was developed in
the context of NLP [Lew+20], the principles behind retrieval and generation can be extended to
multimodal learning [Zha+23d], including VLMs. RAG has been used in medical VLMs for tasks
like VQA (e.g., RAMM [Yua+23]) and RG (e.g., CXR-RePaiR-Gen [Ran+23b]). RAG begins with
a retrieval component, which is usually a pre-trained model designed for information retrieval. This
versatile component excels in extracting pertinent information from extensive datasets, catering to
various modalities such as images, text, codes, video, or audio when presented with diverse inputs
[Zha+23d]. Following the retrieval phase, the model returns a set of contexts related to the given
input. The second component is a generative LLM. This component takes the input and the re-
trieved context and generates the final output. The generated output is conditioned not only on
the input but also on the information extracted from the retrieved context. An intrinsic advantage
of RAG lies in its capacity to reduce the reliance on extensive labeled datasets. While the base
model is typically frozen during RAG, there are instances, as seen in RAMM [Yua+23], where
model parameters are updated in the process.
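
A toy, self-contained sketch of the retrieve-then-generate pattern is shown below. The retriever, the hash-based text embedding, and the stand-in generator are hypothetical placeholders for the pre-trained retrieval models and generative LLMs used by actual systems, and the sketch is text-only for brevity.

import numpy as np

class ToyRetriever:
    # Stands in for a pre-trained retrieval model: cosine similarity over a small text corpus.
    def __init__(self, corpus, embed_fn):
        self.corpus = corpus
        self.embed_fn = embed_fn
        self.vectors = np.stack([embed_fn(doc) for doc in corpus])

    def search(self, text, top_k=3):
        q = self.embed_fn(text)
        sims = self.vectors @ q / (np.linalg.norm(self.vectors, axis=1) * np.linalg.norm(q) + 1e-8)
        return [self.corpus[i] for i in np.argsort(-sims)[:top_k]]

def toy_embed(text, dim=64):
    # Deterministic stand-in for a real text encoder.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.standard_normal(dim)

def rag_answer(question, retriever, generate_fn, top_k=2):
    contexts = retriever.search(question, top_k=top_k)            # retrieval phase
    prompt = "Context:\n" + "\n".join(contexts) + "\nQuestion: " + question
    return generate_fn(prompt)                                    # generation conditioned on context

corpus = ["The chest X-ray shows a small left pleural effusion.",
          "No evidence of pneumothorax.",
          "Cardiomegaly is present."]
retriever = ToyRetriever(corpus, toy_embed)
print(rag_answer("Is there a pleural effusion?", retriever,
                 generate_fn=lambda prompt: "[LLM output conditioned on]\n" + prompt))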

3.5 Downstream Tasks


Multimodal downstream tasks leverage the acquired knowledge from pre-training VLMs to excel
in diverse applications that require a joint understanding of visual and textual data.

3.5.1 Report Generation (RG)


RG is a prominent example of a typical medical VLM task, which centers on creating a compre-
hensive summary report of visual data. RG plays a crucial role in automatically summarizing
diagnostic imaging results and reducing the workload of report writing [MPC20; TLZ23; Moh23].
For instance, in radiology, a report generation system could analyze a set of medical images such
as X-rays, CT scans, or MRIs and generate a detailed report summarizing the observed abnormal-
ities, their locations, and potential implications for diagnosis or treatment [LTS23]. A radiology
report usually has several sections: (1) Examination (type of exam), (2) Indication (reasons for
the examination), (3) Comparison (prior exams), (4) Technique (scanning method), (5) Findings
(detailed observations made by a radiologist), and (6) Impression (summary of the major findings)
[MHC20]. In the context of RG, VLMs are usually designed to generate Findings and Impression
sections [Tha+23]. Currently, VLMs tailored for RG are predominantly utilized for radiology im-
ages, with lesser application in other medical imaging domains such as pathology [SB23], robotic
surgery [Xu+21], and ophthalmology [Li+22].

3.5.2 Visual Question Answering (VQA)
VQA is another important visual-language understanding task, where the model needs to com-
prehend images or videos and the posed question to provide a relevant and accurate response
[Ant+15]. The spectrum of questions encountered in VQA is broad, encompassing inquiries about
the presence of specific objects, their locations, or distinctive properties within the image. In the
medical context [Lin+23b], this may involve questions regarding the presence of medical conditions
or abnormalities, such as “What abnormality is seen in the image?” [Ion+21] or “Is there gastric
fullness?” [Lau+18]. Other queries may delve into details like the imaging method used [Aba+19],
the organ system involved [Lau+18], or the presence of specific anatomical structures [Liu+21a].
Questions in VQA fall into two categories. Open-ended questions elicit responses in the form of
phrases or sentences, fostering detailed and nuanced answers [Tha+23]. On the other hand, closed-
ended questions are designed to prompt limited responses, often with predetermined options, such
as a short list of multiple choices, a yes/no response, or a numeric rating [Baz+23]. The task of
VQA is commonly approached as either a classification task, a generation task, or both [Lin+23b].
In the classification approach, models select the correct answer from a predefined set, while in the
generation task, models produce free-form textual responses unconstrained by predefined options.

3.5.3 Other Tasks


Beyond VQA and RG, a spectrum of VLM tasks exists for vision-language understanding
[Che+23]. For instance, referring expression comprehension entails a model locating the specific
area or object in an image that the given phrase or sentence refers to [ZNC18]. Visual commonsense
reasoning involves answering questions about an image, typically presented in a multiple-choice for-
mat, and justifying the answer based on the model’s understanding of the image and common sense
knowledge [Zel+19]. Vision-language retrieval focuses on either generating or retrieving relevant
information from images using textual data, or vice versa, obtaining information from text using
visual data [Zhe+19]. In the context of visual captioning, the model’s role is to generate a concise,
text-based description of either an image or a video [SDK23]. It is worth highlighting that some of these
tasks can seamlessly transition from images to videos, showcasing the adaptability and versatility
of VLMs across diverse visual contexts [Gan+22].

4 Medical VLMs
4.1 Medical Datasets for VLMs
The adaptation of VLMs to various medical tasks is achieved through their pre-training and fine-
tuning using specialized task-specific datasets. Below is the list of vision-language datasets available
in the public domain that contain medical image-text pairs or question-answer (QA) pairs. Most
of them are employed by medical VLMs described in Section 4.3 for pre-training, fine-tuning, and
evaluating VQA and RG tasks. The comparative analysis of these datasets is presented in Table 1.
The last column in Table 1 provides a link to the source of the data on the web with the following
abbreviations: GH - GitHub, PN - PhysioNet, and HF - Hugging Face.

4.1.1 Radiology Objects in Context (ROCO)


ROCO is a dataset composed of image-caption pairs extracted from the open-access biomedical
literature database PubMed Central (PMC) [Pel+18]. ROCO is stratified into two categories: ra-
diology and out-of-class. The radiology group includes 81,825 radiology images, including computed

Table 1: A list of datasets used for developing medical VLMs.

Dataset | # image-text pairs | # QA pairs | Other components | Link
ROCO [Pel+18] | 81,825 | – | – | GH
MIMIC-CXR [Joh+19b] | 377,110 | – | – | PN
MIMIC-CXR-JPG [Joh+19a] | 377,110 | – | pathology labels | PN
MIMIC-NLE [Kay+22] | 38,003 | – | diagnosis labels, evidence labels | GH
CXR-PRO [RCR22] | – | – | 374,139 radiographs and 374,139 reports, but not paired | PN
MS-CXR [Boe+22] | 1,162 | – | bounding box annotations | PN
IU-Xray or Open-I [Dem+15] | 7,470 | – | labels | Site
MedICaT [Sub+20] | 224,567 | – | annotations; inline references to ROCO figures | GH
PMC-OA [Lin+23a] | 1,650,000 | – | – | HF
SLAKE [Liu+21a] | – | 14,028 | 642 annotated images, 5,232 medical triplets | Site
VQA-RAD [Lau+18] | – | 3,515 | 315 radiology images | Site
PathVQA [He+20] | – | 32,799 | 4,998 pathology images | GH
VQA-Med 2019 [Aba+19] | – | 15,292 | 4,200 radiology images | GH
VQA-Med 2020 [Aba+20] | – | 5,000 | 5,000 radiology images for VQA; images and questions for VQG | GH
VQA-Med 2021 [Ion+21] | – | 5,500 | 5,500 radiology images for VQA; images and questions for VQG | GH
EndoVis 2017 [All+19] | – | 472 | bounding box annotations; 97 frames | GH
EndoVis 2018 [All+20] | – | 11,783 | bounding box annotations; 2,007 frames | GH + Site

Note: Abbreviations used are: GH - GitHub, HF - Hugging Face, and PN - PhysioNet

tomography (CT), ultrasound, x-ray, fluoroscopy, positron emission tomography (PET), mammog-
raphy, magnetic resonance imaging (MRI), angiography, and PET-CT. The out-of-class group has
6,127 images, including synthetic radiology images, clinical photos, portraits, compound radiology
images, and digital art. Each image is accompanied by a corresponding caption, keyword, Unified
Medical Language System (UMLS) semantic types (SemTypes), UMLS concept unique identifiers
(CUIs), and a download link. To facilitate model training, the dataset is randomly split into a
training set (65,460 radiology and 4,902 out-of-class images), a validation set (8,183 radiology
and 612 out-of-class images), and a test set (8,182 radiology and 613 out-of-class images) using an
80/10/10 split ratio, respectively.

4.1.2 Medical Information Mart for Intensive Care - Chest X-Ray (MIMIC-CXR)
MIMIC-CXR collection encompasses 377,110 chest X-rays paired with 227,835 associated free-
text radiology reports [Joh+19b]. The dataset is derived from de-identified radiographic studies
conducted at the Beth Israel Deaconess Medical Center in Boston, MA. Each imaging study within
the MIMIC-CXR dataset consists of one or more images, typically featuring lateral and from back-
to-front (posteroanterior, PA) views in Digital Imaging and Communications in Medicine (DICOM)
format.

4.1.3 MIMIC-CXR-JPG
MIMIC-CXR-JPG [Joh+19a] is a pre-processed variant of the MIMIC-CXR dataset [Joh+19b]. In
this version, the original 377,110 images are converted into compressed JPG format. The 227,827
reports associated with these images are enriched with labels for various common pathologies. The
labels are derived from the analysis of the impression, findings, or final sections of the radiology
reports, facilitated by the use of NegBio [Pen+17] and CheXpert (Chest eXpert) [Irv+19] tools.

4.1.4 MIMIC-NLE
MIMIC-NLE dataset is specifically designed for the task of generating natural language explanations
(NLEs) to justify predictions made on medical images, particularly in the context of thoracic
pathologies and chest X-ray findings [Kay+22]. The dataset consists of 38,003 image-NLE pairs
or 44,935 image-diagnosis-NLE triplets, acknowledging instances where a single NLE may explain
multiple diagnoses. NLEs are extracted from MIMIC-CXR [Joh+19b] radiology reports. The
dataset exclusively considers X-ray views from front-to-back (anteroposterior, AP) and back-to-
front (posteroanterior, PA). All NLEs come with diagnosis and evidence (for a diagnosis) labels.
The dataset is split into the training set with 37,016 images, a test set with 273 images, and a
validation set with 714 images.

4.1.5 CXR with Prior References Omitted (CXR-PRO)


CXR-PRO dataset is derived from MIMIC-CXR [Joh+19b]. The dataset consists of 374,139 free-
text radiology reports containing only the impression sections [RCR22]. It also incorporates as-
sociated chest radiographs; however, the radiology reports and chest X-rays are not paired. This
dataset is designed to mitigate the problem of hallucinated references to prior reports often gener-
ated by radiology report generation ML models. The omission of prior references in this dataset
aims to provide a cleaner and more reliable dataset for radiology RG.

4.1.6 Indiana University chest X-rays (IU-Xray)
IU-Xray dataset, also known as the Open-I dataset, is accessible through the National Library
of Medicine’s Open-i service [Dem+15]. The dataset originates from two hospital systems within
the Indiana Network for Patient Care database. This dataset comprises 7,470 DICOM chest X-
rays paired with 3,955 associated radiology reports. The reports typically include sections such as
indications, findings, and impressions, and they are manually annotated using MeSH and RadLex
(Radiology Lexicon) codes to represent clinical findings and diagnoses. Throughout this review, we
will refer to the dataset interchangeably as IU-Xray and Open-I, maintaining consistency with the
nomenclature used in related literature.

4.1.7 Medical Images, Captions, and Textual References (MedICaT)


MedICaT dataset contains 217,060 figures from 131,410 open-access PMC papers focused on radi-
ology images and other medical imagery types [Sub+20]. Excluding figures from ROCO [Pel+18],
the dataset integrates inline references from the S2ORC (Semantic Scholar Open Research Corpus)
[Lo+20] corpus, establishing connections between references and corresponding figures. Addition-
ally, the inline references to ROCO figures are provided separately. MedICaT also contains 7,507
subcaption-subfigure pairs with annotations derived from 2,069 compound figures.

4.1.8 PubMedCentral’s OpenAccess (PMC-OA)


PMC-OA dataset comprises 1.65 M image-caption pairs, derived from PMC papers [Lin+23a]. It
encompasses a variety of diagnostic procedures, including common ones such as ultrasound, MRI,
PET, and radioisotope, and rarer procedures like mitotic and fMRI. Additionally, the dataset covers
a broad spectrum of diseases, with induced cataracts, ear diseases, and low vision being among the
most frequently represented conditions.

4.1.9 MS-CXR
MS-CXR dataset contains image bounding box labels paired with radiology findings, annotated
and verified by two board-certified radiologists [Boe+22]. The dataset consists of 1,162 image-text
pairs of bounding boxes and corresponding text descriptions. The annotations cover 8 different car-
diopulmonary radiological findings and are extracted from MIMIC-CXR [Joh+19b] and REFLACX
(Reports and Eye-tracking data For Localization of Abnormalities in Chest X-rays) [Big+22] (based
on MIMIC-CXR) datasets. The findings include atelectasis, cardiomegaly, consolidation, edema,
lung opacity, pleural effusion, pneumonia, and pneumothorax.

4.1.10 Semantically-Labeled Knowledge-Enhanced (SLAKE)


SLAKE is an English-Chinese bilingual dataset [Liu+21a]. It contains 642 images, including 12
diseases and 39 organs of the whole body. Each image is meticulously annotated with two types
of visual information: masks for semantic segmentation and bounding boxes for object detection.
The dataset includes a total of 14,028 QA pairs, categorized into vision-only or knowledge-based
types and labeled accordingly, encompassing both open- and closed-ended questions. Moreover,
SLAKE incorporates 5,232 medical knowledge triplets in the form of <head, relation, tail>,
where head and tail denote entities (e.g., organ, disease), and relation signify the relationship
between these entities (e.g., function, treatment). An illustrative example of such a triplet is
<pneumonia, location, lung>.

4.1.11 VQA-RAD
VQA-RAD dataset contains 104 head axial single-slice CTs or MRIs, 107 chest x-rays, and 104
abdominal axial CTs [Lau+18]. The images are meticulously chosen from MedPix, an open-access
online medical image database, ensuring each image corresponds to a unique patient. Furthermore,
every selected image has an associated caption and is deliberately devoid of any radiology markings.
Every caption provides details about the imaging plane, modality, and findings generated and
reviewed by expert radiologists. Also, VQA-RAD contains 3,515 QA pairs, with an average of
10 questions per image. Among them, 1,515 are free-form questions and answers, allowing for
unrestricted inquiry. Additionally, 733 pairs involve rephrased questions and answers, introducing
linguistic diversity. Another 1,267 pairs are framed, featuring questions presented in a structured
format, offering consistency and systematic evaluation. Additionally, QA pairs are split into 637
open-ended and 878 closed-ended types. Within the closed-ended group, a predominant focus is on
yes/no questions.

4.1.12 PathVQA
PathVQA is a dataset that encompasses 4,998 pathology images accompanied by a total of 32,799
QA pairs derived from these images [He+20]. The images are sourced from pathology books:
“Textbook of Pathology” and “Basic Pathology”, and the digital library “Pathology Education
Informational Resource”. Out of all QA pairs, 16,465 are of the open-ended type, while the
remaining pairs are of the closed-ended yes/no type. On average, each image is associated with 6.6
questions, which cover a broad spectrum of visual contents, encompassing aspects such as color,
location, appearance, shape, etc.

4.1.13 VQA-Med 2019


VQA-Med 2019 dataset contains 4,200 radiology images obtained from MedPix, an open-access
online medical image database, and 15,292 QA pairs [Aba+19]. The training set consists of 3,200
images and 12,792 QA pairs, with each image having 3 to 4 associated questions. The validation
set includes 500 images and 2,000 QA pairs, and the test set comprises 500 images and 500 QA
pairs. The questions are mainly about modality, imaging plane, organ system, and abnormality.

4.1.14 VQA-Med 2020


VQA-Med 2020 dataset contains 5,000 radiology images obtained from MedPix, an open-access online medical image database, and 5,000 QA pairs [Aba+20]. The training set consists of 4,000 images and 4,000 QA pairs. The validation set comprises 500 images and 500 QA pairs, and the
test set includes 500 images and 500 QA pairs. The questions are focused on abnormalities present
in the images. Additionally, the dataset contains radiology images and questions for the Visual
Question Generation (VQG) task. The training set consists of 780 images and 2,156 associated
questions. The validation set comprises 141 images with 164 questions, and the test set includes
80 images.

4.1.15 VQA-Med 2021


VQA-Med 2021 dataset contains 5,500 radiology images obtained from MedPix, an open-access online medical image database, and 5,500 QA pairs [Ion+21]. The training set consists of 4,500 images and 4,500 QA pairs. The validation set comprises 500 images and 500 QA pairs, and the test set includes 500 images and 500 QA pairs. The questions are focused on abnormalities present in the images. Similar to VQA-Med 2019, the dataset also contains radiology images and questions for the VQG task. The validation set comprises 85 images with 200 questions, and the test set includes 100 images.

4.1.16 Endoscopic Vision (EndoVis) 2017


EndoVis 2017 dataset contains 5 robotic surgery videos (two videos with 8 frames each, one with
18, one with 14, and one with 39 frames) from the MICCAI (Medical Image Computing and
Computer Assisted Interventions) Endoscopic Vision 2017 Challenge [All+19]. It also includes 472
QA pairs with bounding box annotations. These QA pairs are carefully crafted to involve specific
inquiries related to the surgical procedure. Examples of questions include queries such as “What is the state of prograsp forceps?” and “Where is the large needle driver located?”. The inclusion
of bounding box annotations enhances the dataset’s utility for tasks such as object detection or
answer localization.

4.1.17 EndoVis 2018


EndoVis 2018 dataset contains 14 robotic surgery videos (2,007 frames in total) from the MICCAI Endoscopic Vision 2018 Challenge [All+20]. It also includes 11,783 QA pairs regarding organs, surgical tools, and organ-tool interactions. When a question concerns organ-tool interactions, the bounding box contains both the organ and the tool.

4.2 VLM Evaluation Metrics


This section delves into the evaluation process of medical VLMs. This process begins with the careful selection of benchmark datasets and the definition of evaluation metrics tailored to the specific vision-language tasks at hand.

4.2.1 Evaluation Metrics for Report Generation


The prevalent benchmark datasets for medical RG are MIMIC-CXR [Joh+19b] and Open-I [Dem+15].
For more information on these datasets, see Section 4.1. Several metrics are used to evaluate the
effectiveness of VLMs on RG tasks. The most frequently used metrics are outlined below.

Bilingual Evaluation Understudy (BLEU) The BLEU score was originally designed for ma-
chine translation evaluation, but it has been adapted for RG and even VQA in a modified form.
BLEU provides a quantitative measure of how well the machine-generated text aligns with human-
generated reference text [Pap+02]. First, the precision of different n-grams, which are consecutive
sequences of n words, is calculated using the formula:
\[ \text{Precision}(n) = \frac{\#\,\text{overlapping } n\text{-grams}}{\#\,\text{all } n\text{-grams in the model-generated text}}, \qquad (1) \]
where ‘overlapping n-grams’ refer to n-grams in the model-generated text that share common
elements with at least one n-gram in the reference text. To ensure the precision score remains
robust and is not disproportionately affected by repeated n-grams in the model-generated text, a
modification known as clipping is often introduced. This process involves capping the count of each
n-gram in the model-generated text to a maximum count. This maximum count is determined by

the highest count observed in any single reference text for the same n-gram. The final BLEU-n
score is defined as:
\[ \text{BLEU-}n = \text{BP} \times \exp\!\left( \frac{1}{n} \sum_{k=1}^{n} \log \left[ \text{Precision}(k) \right] \right). \qquad (2) \]
In eq. 2, BP is referred to as the brevity penalty and is calculated as:
\[ \text{BP} = \begin{cases} 1 & \text{if } c \geq r \\ e^{(1 - r/c)} & \text{if } c < r, \end{cases} \qquad (3) \]
where c is the length of the model-generated text, and r is the length of the reference text. It is
common to use n = 4. The BLEU score ranges from 0 to 1, where a higher score suggests better
agreement with the reference text. The overall BLEU score of the model is the average of BLEU
scores for each pair of reports.
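
To make Equations (1)-(3) concrete, the following is a minimal Python sketch of sentence-level BLEU-n with clipped n-gram precision and the brevity penalty; the function names and the example texts are ours and are not taken from any specific toolkit.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Return the multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu_n(candidate, reference, n=4):
    """Sentence-level BLEU-n with clipped precision and brevity penalty (Eqs. 1-3)."""
    precisions = []
    for k in range(1, n + 1):
        cand_counts = ngrams(candidate, k)
        ref_counts = ngrams(reference, k)
        # Clip each candidate n-gram count by its count in the reference text.
        overlap = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
        total = max(sum(cand_counts.values()), 1)
        precisions.append(max(overlap / total, 1e-9))  # avoid log(0)
    c, r = len(candidate), len(reference)
    brevity_penalty = 1.0 if c >= r else math.exp(1 - r / c)
    return brevity_penalty * math.exp(sum(math.log(p) for p in precisions) / n)

generated = "no acute cardiopulmonary abnormality is seen".split()
reference = "no acute cardiopulmonary abnormality".split()
print(round(bleu_n(generated, reference), 3))
```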

Recall-Oriented Understudy for Gisting Evaluation (ROUGE) ROUGE is a set of metrics
that evaluate the overlap between the model-generated text and human-generated reference text
[Lin04]. ROUGE-n assesses the overlap of n-grams between model-generated text and reference
text, and it is defined as:
\[ \text{ROUGE-}n = \frac{\#\,\text{overlapping } n\text{-grams}}{\#\,\text{all } n\text{-grams in the reference text}}. \qquad (4) \]
ROUGE-L focuses on measuring the longest common subsequence between model-generated
text Y and reference text X, and it is calculated using the following relationship:
\[ \text{ROUGE-L} = \frac{(1 + \beta^2) \times R \times P}{R + P \times \beta^2}, \qquad (5) \]
where R = LCS(X, Y )/m, P = LCS(X, Y )/n, m is the length of X, n is the length of Y ,
LCS(X, Y ) is the length of a longest common subsequence of X and Y , and β is a parameter that
depends on the specific task and the relative importance of precision (P) and recall (R). There are
other ROUGE score variants. The ROUGE scores range from 0 to 1, where higher scores indicate
similarity between the model-generated text and the reference text. For each ROUGE variant, the
overall score of the model is the average of scores for each instance.
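
A short illustrative sketch of ROUGE-n and ROUGE-L as defined in Equations (4) and (5) follows; it is not tied to the official ROUGE toolkit, and the choice of β in the example is ours.

```python
from collections import Counter

def rouge_n(candidate, reference, n=1):
    """ROUGE-n: n-gram recall against the reference text (Eq. 4)."""
    cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
    ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / max(sum(ref.values()), 1)

def lcs_length(x, y):
    """Length of the longest common subsequence via dynamic programming."""
    dp = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i, xi in enumerate(x, 1):
        for j, yj in enumerate(y, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if xi == yj else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(x)][len(y)]

def rouge_l(candidate, reference, beta=1.2):
    """ROUGE-L F-measure based on the longest common subsequence (Eq. 5)."""
    lcs = lcs_length(reference, candidate)
    recall = lcs / max(len(reference), 1)
    precision = lcs / max(len(candidate), 1)
    if recall == 0 or precision == 0:
        return 0.0
    return (1 + beta ** 2) * recall * precision / (recall + beta ** 2 * precision)

generated = "mild cardiomegaly with no pleural effusion".split()
reference = "mild cardiomegaly without pleural effusion".split()
print(rouge_n(generated, reference, n=1), round(rouge_l(generated, reference), 3))
```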

Metric for Evaluation of Translation with Explicit ORdering (METEOR) METEOR
is an evaluation metric designed to be more forgiving than some other metrics and takes into
account the fluency and meaning of the generated text [BL05]. The METEOR score is computed
as follows:

\[ \text{METEOR} = \frac{10 \times P \times R}{R + 9 \times P} \times (1 - \text{Penalty}), \qquad (6) \]
where
\[ R = \frac{\#\,\text{overlapping 1-grams}}{\#\,\text{1-grams in the reference text}}, \qquad (7) \]
\[ P = \frac{\#\,\text{overlapping 1-grams}}{\#\,\text{1-grams in the model-generated text}}, \qquad (8) \]
\[ \text{Penalty} = \frac{1}{2} \times \left( \frac{\#\,\text{chunks}}{\#\,\text{overlapping 1-grams}} \right)^{3}, \qquad (9) \]

and chunks are groups of adjacent 1-grams in the model-generated text that overlap with adjacent
1-grams in the reference text. The METEOR score ranges from 0 to 1, with higher scores indicating
better alignment between the model-generated text and the reference text. The overall METEOR
score of a model is the average of scores for each instance.
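
A simplified sketch of Equations (6)-(9) is given below; it uses exact unigram matches only and approximates chunking by adjacency in the generated text (the full METEOR metric additionally handles stemming and synonym matching, which is omitted here).

```python
def meteor_simplified(candidate, reference):
    """Simplified METEOR from Eqs. 6-9 using exact unigram matches only."""
    ref_remaining = list(reference)
    matched = []  # positions in the candidate whose unigram also appears in the reference
    for i, token in enumerate(candidate):
        if token in ref_remaining:
            ref_remaining.remove(token)
            matched.append(i)
    m = len(matched)
    if m == 0:
        return 0.0
    precision = m / len(candidate)
    recall = m / len(reference)
    # Approximation: a chunk is a maximal run of adjacent matched positions in the candidate.
    chunks = 1 + sum(1 for a, b in zip(matched, matched[1:]) if b != a + 1)
    penalty = 0.5 * (chunks / m) ** 3
    return (10 * precision * recall) / (recall + 9 * precision) * (1 - penalty)

generated = "there is a small left pleural effusion".split()
reference = "small left pleural effusion is present".split()
print(round(meteor_simplified(generated, reference), 3))
```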

Perplexity Perplexity measures the average uncertainty of a model in predicting each word in a
text [Hao+20]. The formula for perplexity is defined as:
\[ \text{Perplexity} = \exp\!\left( -\frac{1}{n} \sum_{k=1}^{n} \ln P(w_k \mid w_1, w_2, \ldots, w_{k-1}) \right), \qquad (10) \]
where n is the total number of words in the text. The value of the perplexity metric can range from
1 to +∞, and lower values signify a more accurate and confident model in capturing the language
patterns within the given text.
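
A minimal sketch of Equation (10), assuming the per-token conditional probabilities have already been obtained from a language model; the probability values below are hypothetical.

```python
import math

def perplexity(token_probs):
    """Perplexity from conditional probabilities P(w_k | w_1..w_{k-1}) (Eq. 10)."""
    n = len(token_probs)
    avg_neg_log_likelihood = -sum(math.log(p) for p in token_probs) / n
    return math.exp(avg_neg_log_likelihood)

# Hypothetical probabilities assigned by a model to each word of a short report.
probs = [0.42, 0.18, 0.65, 0.07, 0.30]
print(round(perplexity(probs), 2))
```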

BERTScore BERTScore metric was initially designed for evaluating models that use BERT
[Dev+19] embeddings [Zha+20b]. However, it can also leverage other word embeddings to evaluate
the similarity between model-generated and reference text. The BERTScore of a single text pair is
calculated according to the relationship:
\[ \text{BERTScore} = \frac{2 \times P \times R}{P + R}, \qquad (11) \]
where P (precision) is computed by matching each token in the model-generated text to its most similar token in the reference text (using the maximum cosine similarity between token embeddings) and averaging these maximum similarities over the number of tokens in the model-generated text, and R (recall) is computed analogously by averaging over the number of tokens in the reference text. The BERTScore of the
model is the average of BERTScores across all text pairs.
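
A simplified numpy sketch of Equation (11) using greedy maximum-cosine-similarity matching over precomputed token embeddings is shown below; the official BERTScore implementation additionally applies importance weighting and baseline rescaling, which are omitted here, and the random embeddings are placeholders for encoder outputs.

```python
import numpy as np

def bertscore(cand_emb, ref_emb):
    """Simplified BERTScore (Eq. 11) from token embeddings of shape (tokens, dim)."""
    # L2-normalize so that dot products equal cosine similarities.
    cand = cand_emb / np.linalg.norm(cand_emb, axis=1, keepdims=True)
    ref = ref_emb / np.linalg.norm(ref_emb, axis=1, keepdims=True)
    sim = cand @ ref.T                      # pairwise cosine similarities
    precision = sim.max(axis=1).mean()      # best reference match per generated token
    recall = sim.max(axis=0).mean()         # best generated match per reference token
    return 2 * precision * recall / (precision + recall)

# Hypothetical token embeddings (e.g., from a BERT encoder) for two short texts.
rng = np.random.default_rng(0)
generated_tokens = rng.normal(size=(7, 768))
reference_tokens = rng.normal(size=(5, 768))
print(round(float(bertscore(generated_tokens, reference_tokens)), 3))
```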

RadGraph F1 RadGraph F1 is a novel metric that measures overlap in clinical entities and
relations extracted from radiology reports [Yu+23]. The RadGraph F1 score is computed in the
following way. First, the RadGraph model maps model-generated and reference reports into graph
representations with clinical entities represented as nodes and their relations as edges between
them. Second, the number of nodes that match between the two graphs based on clinical entity
text and labels (entity type) is determined. Third, the number of edges that match between the
two graphs based on their start and end entities and labels (relation type) is calculated. Lastly, the
F1 score is separately computed for clinical entities and relations, and then the RadGraph F1 score
for a report pair is the average of these two scores. The overall model performance is determined
by averaging RadGraph F1 scores across all report pairs.
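
The matching step can be illustrated with plain Python sets, assuming entities and relations have already been extracted by the RadGraph model; the data structures and example extractions below are ours, not the RadGraph output format.

```python
def f1(matched, n_generated, n_reference):
    """F1 from the number of matched items in generated and reference graphs."""
    if n_generated == 0 or n_reference == 0:
        return 0.0
    precision = matched / n_generated
    recall = matched / n_reference
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

def radgraph_style_f1(gen_entities, ref_entities, gen_relations, ref_relations):
    """Average of entity-level and relation-level F1 for one report pair."""
    entity_f1 = f1(len(gen_entities & ref_entities), len(gen_entities), len(ref_entities))
    relation_f1 = f1(len(gen_relations & ref_relations), len(gen_relations), len(ref_relations))
    return (entity_f1 + relation_f1) / 2

# Hypothetical extractions: entities as (text, type), relations as (head, label, tail).
gen_e = {("effusion", "Observation"), ("left lung", "Anatomy")}
ref_e = {("effusion", "Observation"), ("lung", "Anatomy")}
gen_r = {("effusion", "located_at", "left lung")}
ref_r = {("effusion", "located_at", "lung")}
print(radgraph_style_f1(gen_e, ref_e, gen_r, ref_r))
```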

Human evaluation Human evaluation plays a crucial role in assessing VLMs’ quality in medical
RG. Human evaluation can be performed for RG in various ways. For instance, in [Jeo+23], expert
radiologists evaluate the performance of the X-REM model in the RG task as follows. Initially,
each report is segmented into lines, and radiologists assign scores to each line based on five error
categories. These scores reflect the severity of errors, with higher values indicating more severe
errors. Two metrics are utilized to obtain a comprehensive measure of the overall severity of errors
in a report. Maximum Error Severity (MES) represents the highest score across all lines in the
report. In contrast, Average Error Severity (AES) is calculated by averaging the scores across all
lines in the report. According to radiologists, 18% of model-generated reports received an MES
score of 0, while 24% received an AES score of 0.

Additional Evaluation Metrics for Report Generation The next few metrics are designed
for classification evaluation, and RG can be viewed as such a task. In [Moo+22], [Lee+23], and
[Pel+23], these metrics are computed based on the 14 labels obtained from applying the CheXpert
[Irv+19] or CheXbert [Smi+20] labeler to the reference reports as well as the model-generated
reports. In this context, reports bearing accurate diagnosis labels are categorized as positive, while
those with inaccurate labels are regarded as negative. The following metrics are also called clinical efficacy metrics; a short computational sketch of them is given after the list.
• Accuracy measures the ratio of correct predictions (true positives and true negatives) to the total number of predictions.

• Precision measures the accuracy of positive predictions made by a model. The precision
score is calculated by considering the ratio of true positive predictions to the total number of
instances that the model predicted as positive:
\[ \text{Precision} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}}. \qquad (12) \]
High precision indicates that the model has a low rate of false positives.

• Recall is a metric that assesses the ability of a model to predict all positive classes. Recall is
defined as the ratio of correctly predicted positive observations to the total actual positives:
\[ \text{Recall} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}}. \qquad (13) \]
A high recall means that the model effectively identifies most of the actual positive in-
stances.

• F1 Score assesses the overall model’s performance by balancing precision and recall into a
single value. The F1 score is defined as:
\[ \text{F1} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}. \qquad (14) \]
The F1 scores range from 0 to 1, with higher values indicating better performance. In multi-
class classification, it is common to compute the macro-F1 score by averaging the F1 scores
calculated independently for each class. This method ensures an unbiased evaluation of the
model’s performance across all classes, assigning equal importance to each class, irrespective
of its size or prevalence in the dataset.
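
As referenced above, a compact sketch of how these clinical efficacy metrics can be computed from CheXpert-style labels of generated and reference reports is shown below; the binary label vectors are hypothetical.

```python
def clinical_efficacy(pred_labels, ref_labels):
    """Accuracy, precision, recall, and F1 over binary (finding present/absent) labels."""
    tp = sum(p == 1 and r == 1 for p, r in zip(pred_labels, ref_labels))
    tn = sum(p == 0 and r == 0 for p, r in zip(pred_labels, ref_labels))
    fp = sum(p == 1 and r == 0 for p, r in zip(pred_labels, ref_labels))
    fn = sum(p == 0 and r == 1 for p, r in zip(pred_labels, ref_labels))
    accuracy = (tp + tn) / len(ref_labels)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1

# Hypothetical CheXpert-style labels for one report pair (1 = finding present).
predicted = [1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1]
reference = [1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0]
print(clinical_efficacy(predicted, reference))
```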

4.2.2 Evaluation Metrics for VQA


The common benchmark datasets for medical VQA include VQA-RAD [Lau+18], SLAKE [Liu+21a],
and PathVQA [He+20]. For more information on these datasets, see Section 4.1. Various metrics
are available for VQA evaluation, and many of those used for RG can also be applied to VQA. To
avoid redundancy with already mentioned metrics, only a few are highlighted below.

Accuracy Accuracy is a fundamental metric for gauging overall model correctness in VQA eval-
uation. It is determined by calculating the proportion of correctly predicted answers to the total
number of questions. Sometimes, the average accuracy is computed by applying the model to
various testing datasets, providing a comprehensive assessment of its performance across diverse
scenarios. For a detailed comparison of accuracies among different medical VLMs discussed in
Section 4.3, refer to Table 3.

Exact Match Exact match metric computes the ratio of generated answers that match exactly
(excluding punctuation) the correct answer. However, this measure is rather strict, as it may not
give credit to valuable answers that, despite being semantically correct, diverge from an exact lexical
match with the correct answer. This metric is more suitable for evaluating answers to close-ended
questions than open-ended ones.
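
A minimal sketch of exact-match scoring with light normalization (lower-casing and punctuation stripping) follows; the normalization choices are ours and other implementations may differ.

```python
import string

def normalize(text):
    """Lower-case and strip punctuation and extra whitespace."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())

def exact_match(predictions, references):
    """Fraction of generated answers that exactly match the reference answers."""
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return hits / len(references)

print(exact_match(["Pleural effusion.", "yes"], ["pleural effusion", "no"]))  # 0.5
```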

Human Evaluation Human evaluation is valuable for assessing a model’s performance and
applies not only to tasks such as VQA but also to RG. Human evaluation can be performed for
VQA in various ways. For instance, in [Moo+23], the human evaluation process of the Med-Flamingo model employs an application featuring a user-friendly interface. Within this interface, medical experts are empowered to evaluate each VQA problem individually, assigning scores ranging from 0 to 10. The final scores of the few-shot performance are 5.61 on VQA-RAD, 1.81 on PathVQA, and 4.33 on the specifically curated Visual USMLE dataset. In contrast, the scores for zero-shot performance are lower, with 3.82 on VQA-RAD, 1.72 on PathVQA, and 4.18 on Visual USMLE.

4.3 Medical Models


In this part of the review paper, we provide an overview of existing medical VLMs tailored for
VQA and/or RG. The information is organized chronologically based on the first appearance of
the model. Our focus is mainly on recently introduced open-source or publicly available models. A
summary of these VLMs is presented in Table 2.

4.3.1 Medical Vision Language Learner (MedViLL)


MedViLL can process medical images to generate associated reports [Moo+22]. The model employs
ResNet-50 [He+16], trained on ImageNet [Den+09], for extracting visual features v. The model also
leverages the base BERT [Dev+19] embedding layer to extract textual features t from clinical reports,
which are initially segmented into a sequence of tokens using a WordPiece [Wu+16] tokenizer. Both
textual and visual features incorporate positional information to capture the spatial relationships
and sequential order of elements in the input data. To generate a cross-modal representation,
vectors v and t, along with special tokens [CLS], [SEP]V , [SEP]L are concatenated in a single
vector as follows: (CLS, v, SEPV , t, SEPL ). The cross-modal representations are then fed into
the BERT model. MedViLL is pre-trained on two tasks: MLM and ITM. The MLM task
employs a bidirectional auto-regressive (BAR) self-attention mask, promoting the integration of
image and language features. For MLM, a negative log-likelihood loss function is used. The ITM
task encourages learning visual and textual features by predicting matching pairs and employs
a loss function based on predictions for matching and non-matching pairs. The model is pre-
trained on 89,395 image-report pairs from the MIMIC-CXR [Joh+19b] dataset and then fine-tuned for downstream tasks on 3,547 pairs from the Open-I [Dem+15] dataset, an additional dataset
comprising radiographic image-report pairs. Only AP view X-rays are included in the analysis of
both datasets. VQA is performed on the VQA-RAD [Lau+18] dataset (see Table 3), including open
and close-ended questions, where the output representation of [CLS] is used to predict a one-hot
encoded answer. For radiology RG fine-tuning, the model uses a sequence-to-sequence (S2S) mask
instead of BAR and generates reports by sequentially recovering MASK tokens. RG is evaluated
on MIMIC-CXR [Joh+19b] and Open-I [Dem+15]. MedViLL achieves a BLEU-4 score of 0.066, a
perplexity value of 4.185, and using a CheXpert labeler [Irv+19] an accuracy of 84.1%, a precision
value of 0.698, a recall value of 0.559, and an F1 score of 0.621 on MIMIC-CXR. Additionally, it achieves a BLEU-4 score of 0.049, a perplexity value of 5.637, an accuracy of 73.4%, a precision value of 0.512, a recall value of 0.594, and an F1 score of 0.550 on Open-I.

Table 2: A list of medical VLMs developed for VQA and RG.

Model | Stream | Decoder | Architecture | VQA | RG | Datasets | Code
MedViLL [Moo+22] | single | No | RN50 + BERT | + | + | MIMIC-CXR, Open-I, VQA-RAD | GH
PubMedCLIP [EMD23] | dual | No | ViT-B/32 or RN50 or RN50×4 + Transformer + BAN | + | – | ROCO, SLAKE, VQA-RAD | GH
RepsNet [TBF22] | dual | No | ResNeXt-101 + BERT + BAN + GPT-2 | + | + | VQA-RAD, IU-Xray | Site
BiomedCLIP [Zha+23b] | dual | No | ViT-B/16 + PubMedBERT + METER | + | – | PMC-15, SLAKE, VQA-RAD | HF
UniXGen [Lee+23] | single | No | VQGAN + Transformer | – | + | MIMIC-CXR | GH
RAMM [Yua+23] | dual | No | Swin Transformer + PubMedBERT + multimodal encoder w/ retrieval-attention module | + | – | PMCPM, ROCO, MIMIC-CXR, SLAKE, VQA-RAD, VQA-Med 2019, VQA-Med 2021 | GH
X-REM [Jeo+23] | dual | No | ALBEF (ViT-B/16 + BERT + multimodal encoder) | – | + | MIMIC-CXR, MedNLI, RadNLI | GH
Visual Med-Alpaca [Shu+23] | single | No | DePlot or Med-GIT + prompt manager + LLaMa-7B + GPT-3.5-Turbo | + | – | ROCO; MedDialog, MEDIQA QA, MEDIQA RQE, MedQA, PubMedQA | GH
CXR-RePaiR-Gen [Ran+23b] | dual | No | ALBEF + FAISS retriever + prompt manager + text-davinci-003 or GPT-3.5-Turbo or GPT-4 | – | + | CXR-PRO, MS-CXR | –
LLaVa-Med [Li+23a] | single | No | ViT-L/14 + projection layer + LLaMa-7B | + | – | PMC-15 + GPT-4, VQA-RAD, SLAKE, PathVQA | GH
XrayGPT [Tha+23] | single | No | MedCLIP + linear transformation layer + Vicuna-7B | + | + | MIMIC-CXR, Open-I | GH
CAT-ViL DeiT [BIR23] | single | No | RN18 + tokenizer + CAT-ViL fusion module + DeiT | + | – | EndoVis 2017, EndoVis 2018 | GH
MUMC [Li+23b] | dual | Yes | ViT-B/12 + BERT + multimodal encoder + answer decoder | + | – | ROCO, MedICaT, ImageCLEF Caption, VQA-RAD, SLAKE, PathVQA | GH
Med-Flamingo [Moo+23] | single | No | ViT-L/14 + perceiver resampler + LLaMa-7B | + | – | MTB, PMC-OA, VQA-RAD, PathVQA, Visual USMLE | GH
RaDialog [Pel+23] | single | No | BioViL-T + BERT + prompt manager + Vicuna-7B | + | + | MIMIC-CXR, Instruct | GH

4.3.2 PubMedCLIP
PubMedCLIP [EMD23] is a CLIP-based [Rad+21] model pre-trained on ROCO [Pel+18] dataset,
consisting of over 80K image-caption pairs sourced from PMC articles. The model utilizes a CLIP
text encoder, which is based on the Transformer [Vas+17] architecture, and three distinct CLIP
visual encoders: ViT-B/32 [Dos+21], ResNet-50, and ResNet-50×4 [He+16]. Following the con-
trastive learning approach in CLIP, the model generates joint representations by computing cosine
similarity between textual and visual features. The pre-training objective involves the computa-
tion of cross-entropy loss values for both vision and language. These losses are then averaged to
derive an overall loss value. Following pre-training, the model is repurposed as a pre-trained visual
encoder for VQA. The visual feature in VQA is the concatenation of the model’s output with a
convolutional denoising autoencoder (CDAE) [Mas+11] output, an image denoising module. The
question is encoded using a GloVe [PSM14] word embedding followed by an LSTM [HS97]. The im-
age and question features are combined using bilinear attention networks (BAN) [KJZ18], and the
resulting representations are passed through an answer classifier, which is a two-layer feedforward
NN. The VQA loss is determined by combining the classification and image reconstruction losses.
During the VQA fine-tuning, the SLAKE (English) [Liu+21a] and VQA-RAD [Lau+18] datasets,
comprising both open- and close-ended questions, are employed. The model’s effectiveness is evalu-
ated in the context of two existing Medical VQA (MedVQA) methods: Mixture of Enhanced Visual
Features (MEVF) [Zha+20a] and question-conditioned reasoning (QCR) [Liu+23a]. The assess-
ment involved replacing the visual encoder component in MEVF and QCR with PubMedCLIP and
subsequently evaluating the model’s performance. PubMedCLIP in the QCR framework achieves
better accuracies on VQA-RAD and SLAKE datasets than in the MEVF framework. The highest
accuracies of PubMedCLIP in the QCR framework on both datasets are shown in Table 3.
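
The symmetric image-text contrastive objective described above can be sketched in PyTorch as follows; this is a generic CLIP-style loss under our own naming and temperature choice, not the authors' code.

```python
import torch
import torch.nn.functional as F

def clip_style_loss(image_features, text_features, temperature=0.07):
    """Average of image-to-text and text-to-image cross-entropy over cosine similarities."""
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)
    logits = image_features @ text_features.t() / temperature  # cosine similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)  # matching pairs on the diagonal
    loss_i2t = F.cross_entropy(logits, targets)        # vision loss
    loss_t2i = F.cross_entropy(logits.t(), targets)    # language loss
    return (loss_i2t + loss_t2i) / 2

# Hypothetical batch of 8 aligned image/caption embeddings of dimension 512.
img = torch.randn(8, 512)
txt = torch.randn(8, 512)
print(clip_style_loss(img, txt).item())
```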

4.3.3 RepsNet
RepsNet is designed for VQA tasks. It can generate automated medical reports and interpret med-
ical images. The model employs a modified version of the pre-trained ResNeXt-101 [Xie+16] as its
image encoder and utilizes pre-trained BERT [Dev+19] as the text encoder, with text tokenization
done through WordPiece [Wu+16]. Fusion of image and question features is achieved using BAN
[KJZ18]. To align images with textual descriptions, the model employs bidirectional contrastive
learning [Che+20a]. For VQA tasks, the model is fine-tuned and evaluated on VQA-RAD [Lau+18]
(see Table 3). In contrast, for RG, fine-tuning and evaluation are done using IU-Xray [Dem+15]
dataset. The model categorizes answers through classification for close-ended questions and gener-
ates answers using a modified version of the GPT-2 language decoder based on image features and
prior context. The BLEU-2 and BLEU-4 scores of RepsNet on the IU-Xray dataset are 0.44 and
0.27, respectively.

4.3.4 BiomedCLIP
BiomedCLIP is pre-trained on the specifically curated PMC-15 dataset that consists of 15 M
figure-caption pairs derived from the PMC articles [Zha+23b]. However, the models is not publicly
available. The model architecture is similar to CLIP [Rad+21], except that the text encoder is a
pre-trained PubMedBERT [Gu+21] model with WordPiece tokenizer [Wu+16]. The model uses
ViT-B/16 [Dos+21] as the visual data encoder. During pre-training, the model adopts a contrastive

learning approach, and to mitigate memory usage, it utilizes the sharding contrastive loss [Che+22].
For adaptation to VQA, the model incorporates the METER [Dou+22] framework. This involves
deploying a Transformer-based co-attention multimodal fusion module that produces cross-modal
representations. These representations are then fed into a classifier for the final prediction of
answers. The model is evaluated on VQA-RAD [Lau+18] and SLAKE (English) [Liu+21a] datasets
(see Table 3).

4.3.5 Unified chest X-ray and report Generation model (UniXGen)


UniXGen is a unified model that can generate both reports and view-specific X-rays [Lee+23]. The
model tokenizes chest X-rays leveraging VQGAN [ERO21], a generative model that amalgamates
generative adversarial networks (GANs) with vector quantization (VQ) techniques. VQGAN em-
ploys an encoder to transform input images into continuous representations, subsequently using
vector quantization to discretize them into learnable codebook vectors. Additionally, VQGAN
incorporates a decoder, translating these discrete codes back into images during the generation
process. For chest X-rays, multiple views from the same study are tokenized into sequences of
discrete visual tokens, demarcated by special tokens to distinguish perspectives. In the case of
radiology reports, the model uses the byte-level BPE [WCG20] tokenizer, augmented with sinu-
soid positional embedding for enhanced representation. The model is based on the Transformer
architecture [Vas+17] with a multimodal causal attention mask, ensuring that each position in the
sequence attends to all previous positions and not future ones. During training, multiple views of
chest X-rays and a report embedding are concatenated randomly and fed into the Transformer.
The model is optimized using the negative log-likelihood loss function. The model is trained on
208,534 studies sampled from the MIMIC-CXR [Joh+19b] dataset. Each study contains at most
three chest X-rays representing PA (from back to front), AP (from front to back), and lateral views.
UniXGen achieves a BLEU-4 score of 0.050, and using a CheXpert labeler [Irv+19] a precision score
of 0.431, a recall value of 0.410, and an F1 score of 0.420 on MIMIC-CXR dataset.

4.3.6 Retrieval-Augmented bioMedical Multi-modal Pretrain-and-Finetune Paradigm (RAMM)

RAMM is a retrieval-augmented VLM tailored for biomedical VQA [Yua+23]. The model uses
Swin Transformer [Liu+21b] as the image encoder and PubMedBERT [Gu+21] as the text encoder.
The visual and textual features are then fused by the multimodal encoder, a 6-layer Transformer
[Vas+17]. The model is pre-trained on the MIMIC-CXR [Joh+19b] and ROCO [Pel+18] datasets
along with a newly curated PMC-Patients-Multi-modal (PMCPM) dataset, consisting of 398,000
image-text pairs sampled from PMC-OA [Lin+23a] dataset. The pre-training objective function
of the model is the sum of three tasks: contrastive learning, ITM, and MLM. Using contrastive
learning, the model aligns images and texts using the cosine similarity metric. The VQA task
is viewed as a classification problem, and the model is optimized using the cross-entropy loss
function. During model fine-tuning, the retrieval-attention module fuses the representations of
the image-question input with four representations of the retrieved image-text pairs from the pre-
trained datasets. This allows the model to focus on relevant parts of the retrieved information
when generating answers. The model is evaluated on VQA-Med 2019 [Aba+19], VQA-Med 2021
[Ion+21], VQA-RAD [Lau+18], and SLAKE [Liu+21a] datasets (see Table 3).

4.3.7 Contrastive X-Ray REport Match (X-REM)
X-REM is a retrieval-based radiology RG model that uses an ITM score to measure the similarity
of a chest X-ray image and radiology report for report retrieval [Jeo+23]. The VLM backbone
of the model is ALBEF [Li+21]. ALBEF utilizes ViT-B/16 [Dos+21] as its image encoder and
initializes the text encoder with the first 6 layers of the BERT [Dev+19] base model. The multi-
modal encoder in ALBEF, responsible for combining visual and textual features to generate ITM
scores, is initialized using the final six layers of the BERT base model. X-REM leverages ALBEF’s
pre-trained weights and performs further pre-training on X-rays paired with extracted impression
sections (2,192 pairs), findings sections (1,597 pairs), or both (2,192 pairs) from the MIMIC-CXR
[Joh+19b] dataset. Subsequently, the model is fine-tuned on the ITM task, where the scoring mech-
anism involves using the logit value for the positive class as the similarity score for image-text pairs.
To address the positive skewness in medical datasets, 14 clinical labels obtained from the CheXbert
[Smi+20] labeler are utilized. The model efficiently manages the computational burden associated
with ITM scores by employing ALBEF’s pre-aligned unimodal embeddings. This involves narrowing
down the candidate reports based on high cosine similarity with the input image before computing
ITM scores. Additionally, the text encoder undergoes fine-tuning on natural language inference
(NLI) task, utilizing datasets such as MedNLI [RS18] and RadNLI [Miu+21]. This step is crucial
for preventing the retrieval of multiple reports with overlapping or conflicting information. X-REM
achieves a BLEU-2 score of 0.186 on the MIMIC-CXR (Findings only) dataset. The BERTScore
of the model is 0.386 on MIMIC-CXR (Findings only) and is 0.287 on MIMIC-CXR (Impressions
and Findings). The human evaluation of X-REM is described in Section 4.2.
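
The two-stage retrieval described above (cosine-similarity shortlisting followed by ITM re-scoring) can be sketched as follows; the itm_score callable stands in for the fine-tuned multimodal encoder and is assumed, not provided here.

```python
import numpy as np

def retrieve_report(image_emb, report_embs, reports, itm_score, top_k=50):
    """Shortlist reports by cosine similarity, then re-rank the shortlist with an ITM scorer."""
    image_emb = image_emb / np.linalg.norm(image_emb)
    report_embs = report_embs / np.linalg.norm(report_embs, axis=1, keepdims=True)
    sims = report_embs @ image_emb                    # pre-aligned unimodal similarity
    candidates = np.argsort(-sims)[:top_k]            # narrow down before the costly ITM pass
    best = max(candidates, key=lambda i: itm_score(image_emb, reports[i]))
    return reports[best]

# Hypothetical corpus: 1,000 report embeddings (dim 256) and a dummy ITM scorer.
rng = np.random.default_rng(1)
corpus_embs = rng.normal(size=(1000, 256))
corpus_text = [f"report {i}" for i in range(1000)]
dummy_itm = lambda img, txt: rng.random()             # placeholder for the multimodal encoder
print(retrieve_report(rng.normal(size=256), corpus_embs, corpus_text, dummy_itm))
```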

4.3.8 Visual Med-Alpaca


Visual Med-Alpaca is a biomedical foundational model designed for addressing multimodal biomed-
ical tasks like VQA [Shu+23]. The model is constructed in the following way. First, image inputs go
through a classifier to determine the appropriate module for transforming visual information into
an intermediate text format. The currently supported modules include DePlot [Liu+22], utilized
for interpreting plots and charts, and Med-GIT [Wan+22a], fine-tuned specifically on the ROCO
[Pel+18] dataset for understanding radiology images. The prompt manager then amalgamates
textual information extracted from images and text inputs to construct the prompt for the LLM
model, LLaMA-7B [Tou+23b]. However, before generating responses, LLaMa-7B undergoes both
standard fine-tuning and LoRA [Hu+22] fine-tuning on a carefully curated set of 54,000 medical
question-answer pairs. The questions within this set are derived from question-answering datasets
such as MEDIQA QA [BSD19], MEDIQA RQE [BSD19], MedQA [Jin+21], MedDialog [Zen+20],
and PubMedQA [Jin+19], with their corresponding answers synthesized using GPT-3.5-Turbo in
the self-instruct [Wan+23b] manner. Human experts then meticulously filter and edit the obtained
question-answer pairs to ensure quality and relevance. The evaluation of this model is still ongoing
[Shu+23].

4.3.9 Contrastive X-ray-Report Pair Retrieval based Generation (CXR-RePaiR-Gen)


CXR-RePaiR-Gen is designed for radiology RG and incorporates the RAG framework to mitigate the issue of hallucinated references [Ran+23b]. The model leverages the pre-trained ALBEF
[Lan+19] previously utilized in CXR-ReDonE [RCR22]. The ALBEF model consists of a ViT-B/16
[Dos+21] image encoder and the first 6 layers of BERT [Dev+19] as the text encoder, producing
contrastively aligned image and text embeddings. Textual features are indexed in a vector database,
Facebook AI Similarity Search (FAISS). When given a radiology image input, embeddings from

the reports or sentences corpus with the highest dot-product similarity to the image embedding are
retrieved. The CXR-PRO [RCR22] dataset is employed for text retrieval to gather relevant impres-
sions for generating the radiology report. The retrieved impression sections from the CXR-PRO
dataset serve as the context for the prompt to an LLM, along with instructions to generate the
radiology report. Two distinct prompts are employed for generating free-text reports: one for the
text-davinci-003 model and another for RG in a conversational setting with the GPT-3.5-Turbo
and GPT-4 models. The model is evaluated on MS-CXR [Boe+22] and CXR-PRO datasets. There
is no code provided for this model yet. CXR-RePaiR-Gen reaches a BERTScore score of 0.2865 on
the CXR-PRO dataset when based on GPT-4. Additionally, CXR-RePaiR-Gen achieves a score of
0.1970 on MS-CXR when based on text-davinci-003. The model attains a RadGraph F1 score of
0.1061 on the CXR-PRO dataset when based on GPT-4 and 0.0617 on the MS-CXR dataset when
it is based on text-davinci-003. In these instances, the CXR-RePaiR-Gen utilizes three retrieval
samples per input during the RAG process.
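
A hedged sketch of this retrieval-augmented flow is given below: indexing aligned text embeddings in FAISS, retrieving the impressions most similar to the image embedding by dot product, and assembling them into a prompt. The embeddings are random placeholders, and the final prompt would be sent to whichever LLM is chosen; no LLM call is shown.

```python
import numpy as np
import faiss  # Facebook AI Similarity Search (faiss-cpu package)

def build_index(text_embeddings):
    """Index contrastively aligned text embeddings for dot-product search."""
    index = faiss.IndexFlatIP(text_embeddings.shape[1])
    index.add(text_embeddings.astype(np.float32))
    return index

def retrieve_context(index, image_embedding, impressions, k=3):
    """Retrieve the k impressions whose embeddings best match the image embedding."""
    _, ids = index.search(image_embedding.astype(np.float32).reshape(1, -1), k)
    return [impressions[i] for i in ids[0]]

# Hypothetical embeddings and impression corpus.
rng = np.random.default_rng(0)
corpus = [f"impression sentence {i}" for i in range(500)]
index = build_index(rng.normal(size=(500, 256)))
context = retrieve_context(index, rng.normal(size=256), corpus, k=3)
prompt = ("Write a radiology report consistent with the following retrieved impressions:\n"
          + "\n".join(context))
print(prompt)
```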

4.3.10 Large Language and Vision Assistant for BioMedicine (LLaVa-Med)


LLaVa-Med is an adaptation of the VLM LLaVa [Liu+23b], specifically tailored for the medical
domain through training on instruction-following datasets [Li+23a]. The visual features are gen-
erated by the pre-trained CLIP [Rad+21] visual encoder ViT-L/14 [Dos+21]. The encoder can be
substituted with BiomedCLIP [Zha+23b]. These features are passed through a linear projection
layer, which converts them into tokens, and then, together with the tokenized instructions, are fed
into the LLM LLaMa-7B [Tou+23b]. The LLM can be substituted with Vicuna [Chi+23]. Af-
ter initializing with the general-domain LLaVA, the model undergoes fine-tuning using curriculum
learning. First, the model tries to understand and connect visual elements in biomedical images
to the corresponding words or descriptions in the language model’s knowledge. To achieve that, a
dataset consisting of 600,000 image-caption pairs from the PMC-15 dataset, which was originally
employed in the training of BiomedCLIP, is utilized. These image-caption pairs are transformed
into an instruction-following dataset, where the instructions prompt the model to describe the
corresponding image concisely or in detail. Given the language instruction and image input, the
model is then prompted to predict the original caption. During this stage, the visual encoder and
language model weights are kept frozen, with updates exclusively applied to the linear projection
layer. The second stage of training focuses on aligning the model to follow diverse instructions. For
this purpose, another instruction-following dataset is generated from PMC-15. For this dataset,
instructions are designed to guide the GPT-4 model to generate multi-round questions and an-
swers from the image caption and sentences from the original PMC paper that mentions the image
[Li+23a]. In this training phase, the model undergoes training on a set of 60,000 images, each
accompanied by its respective caption and multi-round questions and answers. Throughout this
process, the weights of the visual encoder remain unchanged, preserving the previously acquired
visual features. Meanwhile, the pre-trained weights of both the projection layer and the language
model undergo continuous updates. This approach enables the model to effectively respond to a
variety of instructions and perform well in generating dynamic and informative multi-round conver-
sational content. Lastly, for VQA, the model is fine-tuned and evaluated on VQA-RAD [Lau+18],
SLAKE [Liu+21a], and PathVQA [He+20] (see Table 3).
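
The first-stage training regime (frozen vision encoder and LLM, trainable linear projection) can be sketched in PyTorch with our own simplified stand-in modules; the shapes and dummy encoders are illustrative only.

```python
import torch
import torch.nn as nn

class Stage1Wrapper(nn.Module):
    """Toy stand-in for stage-1 training: only the projection layer receives gradients."""
    def __init__(self, vision_encoder, language_model, vision_dim=64, llm_dim=128):
        super().__init__()
        self.vision_encoder = vision_encoder
        self.language_model = language_model
        self.projection = nn.Linear(vision_dim, llm_dim)  # maps visual features to LLM token space
        for module in (self.vision_encoder, self.language_model):
            for p in module.parameters():
                p.requires_grad = False  # keep pre-trained weights frozen

    def forward(self, images, text_embeddings):
        visual_features = self.vision_encoder(images)          # (batch, v_tokens, vision_dim)
        visual_tokens = self.projection(visual_features)       # (batch, v_tokens, llm_dim)
        llm_inputs = torch.cat([visual_tokens, text_embeddings], dim=1)
        return self.language_model(llm_inputs)

# Dummy encoder/LLM stand-ins so the sketch runs end to end.
model = Stage1Wrapper(nn.Linear(32, 64), nn.Linear(128, 128))
optimizer = torch.optim.AdamW(model.projection.parameters(), lr=2e-5)  # only the projection trains
out = model(torch.randn(2, 5, 32), torch.randn(2, 7, 128))
print(out.shape)  # torch.Size([2, 12, 128])
```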

4.3.11 XrayGPT
XrayGPT is a conversational medical VLM specifically developed for analyzing chest radiographs
[Tha+23]. The VLM uses MedCLIP [Wan+22b] as a vision encoder to generate visual features.
These features undergo a meticulous transformation process: initially, they are mapped to a lower-
dimensional space through a linear projection head and subsequently translated into tokens via a
linear transformation layer. At its core, the model incorporates two text queries: (1) the assistant
query plays a role in contextualizing the model’s behavior and defining its purpose as “You are a
helpful healthcare virtual assistant”, (2) the doctor’s query serves as a prompt that guides the model
in providing information relevant to chest X-ray analysis. Tokens generated from a visual input are
concatenated with the tokenized queries and then fed into the medical LLM, which generates the
summary of the chest x-ray. The LLM employed in this architecture is Vicuna-7B [Chi+23], fine-
tuned on a rich dataset consisting of 100,000 real conversations between patients and doctors, along with 20,000 radiology conversations sourced from ShareGPT.com. During training, the weights of
both the vision encoder and the LLM remain frozen while the weights in the linear transformation
layer undergo updates. The model is first trained on 213,514 image-text pairs from the pre-processed MIMIC-CXR [Joh+19b] dataset and then on 3,000 image-text pairs from the Open-I [Dem+15] dataset.
XrayGPT achieves ROUGE-1 = 0.3213, ROUGE-2 = 0.0912, and ROUGE-L = 0.1997 on MIMIC-
CXR dataset.

4.3.12 Co-Attention gaTed Vision-Language Data-efficient image Transformer (CAT-ViL DeiT)
CAT-ViL DeiT stands out as a specialized VLM tailored for VQA within surgical scenarios, with
a unique focus on answer localization [BIR23]. The architecture incorporates a ResNet-18 [He+16]
as the visual encoder, pre-trained on ImageNet [Den+09], and a customized pre-trained BERT
tokenizer [See+22] for the text encoder. Central to its functionality is the Co-Attention gaTed Vision-Language (CAT-ViL) module, which enables interaction between visual and textual features and fuses them via a gating mechanism to obtain optimized multimodal embeddings. The
model further integrates a pre-trained Data-efficient image Transformer (DeiT) [Tou+21] module
to process these multimodal embeddings, aiming to acquire an optimal joint representation for
comprehensive visual and textual understanding. In the context of VQA, the model adopts a stan-
dard classification head, while for answer localization within images, it employs the detection with
transformers (DETR) [Car+20] head. The overall loss function comprises cross-entropy as the clas-
sification loss and L1-norm, along with the generalized intersection over union (GIoU) [Rez+19],
serving as the localization loss. The model is trained on 1,560 frames and 9,014 QA pairs from
the surgical datasets EndoVis 2018 [All+20]. The model achieved an accuracy of 61.92% on the
remaining data from EndoVis 2018 and 45.55% on EndoVis 2017 [All+19] dataset.
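
A generic sketch of gated fusion of visual and textual embeddings is shown below; the actual CAT-ViL module also applies co-attention before gating, so this simplified version only illustrates the gating step under our own naming.

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Fuse visual and textual embeddings with a learned sigmoid gate."""
    def __init__(self, dim=256):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, visual, textual):
        g = torch.sigmoid(self.gate(torch.cat([visual, textual], dim=-1)))  # per-feature gate in [0, 1]
        return g * visual + (1 - g) * textual  # convex combination of the two modalities

fusion = GatedFusion(dim=256)
fused = fusion(torch.randn(4, 256), torch.randn(4, 256))
print(fused.shape)  # torch.Size([4, 256])
```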

4.3.13 Masked image and text modeling with Unimodal and Multimodal Contrastive
losses (MUMC)
MUMC utilizes a ViT-B/12 [Dos+21] as its image encoder, the first 6 layers of BERT [Dev+19]
as its text encoder, and the last 6 layers of BERT as its multimodal encoder [Li+23b]. The
multimodal encoder incorporates cross-attention layers to align visual and textual features. For
pre-training, the model employs a combination of contrastive learning, MLM, and ITM objectives.
Also, the model utilizes a newly introduced masked image strategy, randomly masking 25% of image
patches as a data augmentation technique. This exposes the model to a greater variety of visual
contexts and enables learning representations that are more robust to partially occluded inputs.
The pre-training is performed on the ROCO [Pel+18], MedICaT [Sub+20], and Image Retrieval
in Cross-Language Evaluation Forum (ImageCLEF) caption [Rüc+22] datasets. For downstream
VQA tasks, an answering decoder is added on top of the multimodal encoder to generate answer
text tokens. The encoder weights are initialized from pre-training, and the model is fine-tuned and
evaluated on VQA-RAD [Lau+18], SLAKE [Liu+21a], and PathVQA [He+20] (see Table 3).
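
The masked-image augmentation (randomly dropping 25% of patches) can be sketched as follows on a tensor of image patch embeddings; this is a simplified variant that zeroes masked patches rather than removing them, and the shapes are illustrative.

```python
import torch

def mask_patches(patch_tokens, mask_ratio=0.25):
    """Randomly zero out a fraction of image patch tokens as a data augmentation."""
    batch, num_patches, _ = patch_tokens.shape
    num_masked = int(num_patches * mask_ratio)
    scores = torch.rand(batch, num_patches)                # random score per patch
    masked_idx = scores.argsort(dim=1)[:, :num_masked]     # lowest-scoring patches get masked
    keep = torch.ones(batch, num_patches, 1)
    keep.scatter_(1, masked_idx.unsqueeze(-1), 0.0)        # zero the keep-mask at masked positions
    return patch_tokens * keep                             # masked patches are zeroed

patches = torch.randn(2, 196, 768)  # e.g., 14x14 patches from a ViT
masked = mask_patches(patches)
print((masked.abs().sum(dim=-1) == 0).float().mean().item())  # ≈ 0.25 of patches masked
```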

4.3.14 Med-Flamingo
Med-Flamingo is a multimodal few-shot learner model based on the Flamingo [Ala+22] architec-
ture, adapted to the medical domain [Moo+23]. The model is pre-trained on the MTB [Moo+23]
dataset, a newly curated collection comprising 4,721 segments from various Medical TextBooks,
encompassing both textual content and images. Each segment is designed to contain at least one
image and up to 10 images, with a specified maximum length. Also, it is pre-trained on 1.3M image-caption pairs from the PMC-OA [Lin+23a] dataset. The model’s few-shot capabilities are
achieved through training on these mixed text and image datasets, enabling it to generalize and
perform diverse multimodal tasks with only a few examples. The model utilizes a pre-trained frozen
CLIP vision encoder ViT-L/14 for visual feature generation. To convert these visual features into
a fixed number of tokens, the model employs a module known as the perceiver resampler, which
is trained from scratch. Subsequently, these tokens, along with tokenized text inputs, undergo
further processing in a pre-trained frozen LLM LLaMA-7B [Tou+23b], enhanced with strategically
inserted gated cross-attention layers that are also trained from scratch. This augmentation not
only facilitates the learning of novel relationships but also bolsters training stability. The model’s
performance is evaluated on established benchmarks such as VQA-RAD [Lau+18] and PathVQA
[He+20], demonstrating its effectiveness in medical visual question-answering. The exact match
scores for Med-Flamingo demonstrate a few-shot performance of 0.200 on VQA-RAD and 0.303 on
PathVQA. In contrast, the zero-shot performance yields an exact match score of 0.000 on VQA-
RAD and 0.120 on PathVQA. Additionally, it is evaluated on a specifically created Visual United
States Medical Licensing Examination (USMLE) dataset, comprising 618 challenging open-ended
USMLE-style questions augmented with images, case vignettes, and tables of laboratory measure-
ments, covering a diverse range of medical specialties. The human evaluation of the Med-Flamingo
model on VQA-RAD, PathVQA, and Visual USMLE datasets is described in Section 4.2.

4.3.15 RaDialog
RaDialog is a VLM that integrates automated radiology RG with conversational assistance [Pel+23].
The model incorporates BioViL-T [Ban+23], a hybrid model that fuses the strengths of ResNet-50
[He+16] and Transformer [Vas+17] architectures. Pre-trained on radiology images and reports,
BioViL-T serves as a vision encoder that generates patch-wise visual features. The extracted
features undergo alignment through a BERT [Dev+19] model, transforming them into a concise
representation of 32 tokens. The model incorporates the CheXpert classifier to offer organized find-
ings in medical images. These findings are generated based on labels obtained from the CheXbert
[Smi+20] model. The classifier is trained independently using labels predicted by CheXbert from
the findings section of radiology reports. The model integrates visual features, structured findings,
and the directive “Write a radiology report” into a singular prompt, which is used as input for the
LLM, a Vicuna-7B [Chi+23] model fine-tuned using LoRA [Hu+22]. The training is performed
on X-ray image-report pairs from MIMIC-CXR [Joh+19b] dataset. RaDialog achieves a BLEU-4
score of 0.095, ROUGE-L score of 0.2710, METEOR score of 0.14, and BERTScore of 0.400 on
the MIMIC-CXR dataset. To address the challenge of catastrophic forgetting during training and
ensure the model’s capability across diverse downstream tasks, it is specifically trained on the newly
created Instruct [Pel+23] dataset. This dataset is meticulously curated to encompass a spectrum of 8 diverse tasks: RG, NLE, complete CheXpert QA, binary CheXpert QA, region QA, summarization, report correction, and reformulation of reports in simple language.

Table 3: The comparison of medical VLMs’ accuracies on VQA tasks. The highest accuracy in each column is marked with an asterisk (*).

Model | SLAKE open-ended | SLAKE close-ended | VQA-RAD open-ended | VQA-RAD close-ended | PathVQA open-ended | PathVQA close-ended | VQA-Med 2019 | VQA-Med 2021
MedViLL [Moo+22] | – | – | 59.50% | 77.70% | – | – | – | –
PubMedCLIP [EMD23] | 78.40% | 82.50% | 60.10% | 80.00% | – | – | – | –
RepsNet [TBF22] | – | – | – | 87.05%* | – | – | – | –
BiomedCLIP [Zha+23b] | 82.50%* | 89.70% | 67.60% | 79.80% | – | – | – | –
RAMM [Yua+23] | 82.48% | 91.59%* | 67.60% | 85.29% | – | – | 82.13%* | 39.20%*
LLaVa-Med [Li+23a] | – | 84.19% | – | 85.34% | – | 91.21%* | – | –
MUMC [Li+23b] | – | – | 71.50%* | 84.20% | 39.00%* | 90.4% | – | –

Carefully formulated
prompts accompany each task, tailored to elicit specific responses from the model. For instance,
some prompts involve answering questions about particular X-ray regions. RaDialog trained on
the Instruct dataset achieves an F1 score of 0.397 on the binary CheXpert QA task and 0.403 on
the complete CheXpert QA task. In contrast, RaDialog without being trained on Instruct achieves
lower F1 scores of 0.018 and 0.098, respectively.

5 Challenges and Potential Future Directions


In medical AI, the future holds great promise and, concurrently, poses notable challenges [Aco+22].
As technology advances, the integration of VLMs into the healthcare sector has the potential to
revolutionize diagnostics, treatment planning, and patient care. Future medical VLMs may offer
enhanced capabilities in understanding complex clinical scenarios, generating detailed reports of
medical images, and facilitating seamless communication between healthcare professionals and AI
systems. However, these advancements come with challenges.
A significant challenge in developing effective medical VLMs is the limited availability of ML-
ready diverse and representative medical datasets. This limitation restricts the comprehensive
training of VLMs, impeding their ability to understand the complexities of diverse and rare clinical
scenarios [Moo+23]. VLMs with large context windows and RAG present a potential solution by
increasing the model’s context through the incorporation of retrieved relevant information. While
RAG usually involves a frozen model during training, exploring the pre-training of VLMs within
the RAG framework opens up a new avenue of research [Zha+23d]. This innovative approach could
potentially enhance the robustness of VLMs, especially in handling new and unforeseen medical
cases. Furthermore, the pressing concerns surrounding patient data privacy highlight the need
for innovative solutions, like federated learning (FL). FL offers a promising strategy to alleviate
the scarcity of medical data while prioritizing patient privacy [Zha+21]. In this decentralized

learning method, models are trained across multiple institutions, and only model weights are shared,
not the data. Thus, it effectively addresses major concerns about patient privacy while enabling
collaborative model training across diverse datasets.
Traditional metrics may fall short in capturing the nuanced complexities of clinical language,
posing a barrier to reliable evaluations of VLM performance [Yu+23]. This issue becomes partic-
ularly evident when evaluating the accuracy of medical reports or addressing open-ended medical
queries, where metrics need to discern clinically relevant distinctions. Therefore, the development
and adoption of specialized metrics tailored for medical RG and VQA is imperative. Such metrics
are pivotal not only for evaluating model performance but also for assessing aspects like generaliza-
tion, efficiency, and robustness. Establishing these metrics will significantly contribute to fostering
precise evaluations and continual advancements in the capabilities of medical VLMs.
The issue of hallucinations in generative VLMs poses a significant challenge to their reliability
and practical application [Liu+24]. Hallucinations refer to instances where VLMs generate outputs
that are not grounded in the provided images or are inconsistent with the established knowledge.
In medical contexts, these hallucinations can have serious consequences, leading to inaccurate
diagnostic information or treatment recommendations. One identified root cause of hallucinations
is the lack of alignment between visual and textual information [Sun+23]. Training VLMs to
effectively align these data modalities is crucial in mitigating the risk of hallucinations. For instance,
LLaVA-RLHF [Sun+23] achieved hallucination reduction by incorporating RLHF to align different
modalities. Further research is needed into building medical VLMs that ground their generation in factual medical knowledge and exhibit minimal hallucination.
Overcoming catastrophic forgetting poses an additional challenge in the development of medical
VLMs. Catastrophic forgetting occurs when a model learning new information inadvertently erases
or distorts previously acquired knowledge, potentially compromising its overall competence. Strik-
ing a balance during fine-tuning can be crucial; moderate fine-tuning can be helpful to adapt the
model to a specific task, while excessive fine-tuning can lead to catastrophic forgetting [Zha+23a;
KBR23]. Leveraging methodologies from continual learning [Wan+23a; Zho+23b; CR24; KBR23;
KBR24] might be useful in the context of medical VLMs, where the ability to adapt and accumulate
knowledge across diverse clinical tasks is paramount. Continual learning focuses on training models
to sequentially learn from and adapt to new data over time while retaining knowledge from previ-
ously encountered tasks [KBR24]. Also, incorporating adapters within the framework of continual
learning can be a valuable tool in mitigating catastrophic forgetting [Zha+23c].
Finally, clinical validation and adoption of VLMs necessitate a collaborative bridge between
medical experts and AI/ML researchers. Trust, alignment with clinical needs, and ethical deploy-
ment are critical components for successfully integrating these models into healthcare workflows.
Establishing robust collaborations ensures a dynamic synergy, combining domain expertise with
technological advancements. This synergy is essential for the responsible and effective deployment
of medical VLMs in healthcare.

Declaration of Competing Interest


The authors declare that they have no known competing financial interests or personal relationships
that could have appeared to influence the work reported in this paper.

Acknowledgements
This work was partly supported by NSF awards 1903466, 2234836, and 2234468.

References
[Aba+20] Asma Ben Abacha et al. “Overview of the VQA-Med Task at ImageCLEF 2020:
Visual Question Answering and Generation in the Medical Domain”. In: CLEF 2020
Working Notes. CEUR Workshop Proceedings. 2020.
[Aba+19] Asma Ben Abacha et al. “VQA-Med: Overview of the Medical Visual Question An-
swering Task at ImageCLEF 2019”. In: Conference and Labs of the Evaluation Forum.
2019.
[Aco+22] Julián N Acosta et al. “Multimodal Biomedical AI”. In: Nature Medicine 28.9 (2022),
pp. 1773–1784. doi: 10.1038/s41591-022-01981-2.
[Ahm+23] Sabeen Ahmed et al. “Transformers in time-series analysis: A tutorial”. In: Circuits,
Systems, and Signal Processing 42.12 (2023), pp. 7433–7466.
[Ala+22] Jean-Baptiste Alayrac et al. “Flamingo: A Visual Language Model for Few-Shot
Learning”. In: Advances in Neural Information Processing Systems. Vol. 35. 2022,
pp. 23716–23736.
[All+19] Max Allan et al. 2017 Robotic Instrument Segmentation Challenge. 2019. arXiv: 1902.06426.
[All+20] Max Allan et al. 2018 Robotic Scene Segmentation Challenge. 2020. arXiv: 2001.11190.
[Ant+15] Stanislaw Antol et al. “VQA: Visual Question Answering”. In: IEEE International
Conference on Computer Vision (ICCV). 2015, pp. 2425–2433. doi: 10.1109/ICCV.
2015.279.
[Bai+23] Jinze Bai et al. Qwen-VL: A Versatile Vision-Language Model for Understanding,
Localization, Text Reading, and Beyond. 2023. arXiv: 2308.12966.
[BIR23] Long Bai, Mobarakol Islam, and Hongliang Ren. “CAT-ViL: Co-attention Gated
Vision-Language Embedding for Visual Question Localized-Answering in Robotic
Surgery”. In: Medical Image Computing and Computer Assisted Intervention – MIC-
CAI. 2023, pp. 397–407. doi: 10.1007/978-3-031-43996-4_38.
[Baj+21] Junaid Bajwa et al. “Artificial Intelligence in Healthcare: Transforming the Practice
of Medicine”. In: Future Healthcare Journal 8.2 (2021), e188–e194. doi: 10.7861/
fhj.2021-0095.
[Bal21] Pierre Baldi. Deep Learning in Science. Cambridge University Press, 2021. doi: 10.
1017/9781108955652.
[BL05] Satanjeev Banerjee and Alon Lavie. “METEOR: An Automatic Metric for MT Eval-
uation with Improved Correlation with Human Judgments”. In: ACL Workshop on
Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Sum-
marization. 2005, pp. 65–72.
[Ban+23] Shruthi Bannur et al. Learning to Exploit Temporal Structure for Biomedical Vision-
Language Processing. 2023. arXiv: 2301.04558.
[Baz+23] Yakoub Bazi et al. “Vision–Language Model for Visual Question Answering in Medical
Imagery”. In: Bioengineering 10.3 (2023), p. 380. doi: 10.3390/bioengineering10030380.
[Bea+20] Andrew Beam et al. “Clinical Concept Embeddings Learned from Massive Sources
of Multimodal Medical Data”. In: Pacific Symposium on Biocomputing 25 (2020),
pp. 295–306. doi: 10.1142/9789811215636_0027.

[BSD19] Asma Ben Abacha, Chaitanya Shivade, and Dina Demner-Fushman. “Overview of the
MEDIQA 2019 Shared Task on Textual Inference, Question Entailment and Question
Answering”. In: BioNLP Workshop and Shared Task. 2019, pp. 370–379.
[Big+22] Ricardo Bigolin Lanfredi et al. “REFLACX, a Dataset of Reports and Eye-tracking
Data for Localization of Abnormalities in Chest X-rays”. In: Scientific Data 9.1
(2022). doi: 10.1038/s41597-022-01441-z.
[Boe+22] Benedikt Boecking et al. “Making the Most of Text Semantics to Improve Biomedical
Vision–Language Processing”. In: Computer Vision – ECCV. 2022, pp. 1–21. doi:
10.1007/978-3-031-20059-5_1.
[Boe+21] Kevin Michael Boehm et al. “Harnessing Multimodal Data Integration to Advance
Precision Oncology”. In: Nature Reviews Cancer 22 (2021), pp. 114–126. doi: 10.
1038/s41568-021-00408-3.
[Boj+17] Piotr Bojanowski et al. “Enriching Word Vectors with Subword Information”. In:
Transactions of the Association for Computational Linguistics 5 (2017), pp. 135–146.
issn: 2307-387X. doi: 10.1162/tacl_a_00051.
[Bom+22] Rishi Bommasani et al. On the Opportunities and Risks of Foundation Models. 2022.
arXiv: 2108.07258 [cs.LG].
[Bro+20] Tom B. Brown et al. “Language Models are Few-Shot Learners”. In: Advances in
Neural Information Processing Systems. Vol. 33. 2020, pp. 1877–1901.
[CR24] Yuliang Cai and Mohammad Rostami. Dynamic Transformer Architecture for Con-
tinual Learning of Multimodal Tasks. 2024. arXiv: 2401.15275.
[Car+20] Nicolas Carion et al. “End-to-End Object Detection with Transformers”. In: European
conference on computer vision. 2020, pp. 213–229. doi: 10.1007/978-3-030-58452-
8_13.
[Che+23] Feilong Chen et al. “VLP: A Survey on Vision-Language Pre-Training”. In: Machine
Intelligence Research 20 (2023), pp. 38–56. doi: 10.1007/s11633-022-1369-5.
[Che+20a] Ting Chen et al. A Simple Framework for Contrastive Learning of Visual Represen-
tations. 2020. arXiv: 2002.05709.
[Che+20b] Yen-Chun Chen et al. “UNITER: UNiversal Image-TExt Representation Learning”.
In: European Conference on Computer Vision. 2020, pp. 104–120. doi: 10.1007/978-
3-030-58577-8_7.
[Che+22] Mehdi Cherti et al. Reproducible Scaling Laws for Contrastive Language-Image Learn-
ing. 2022. arXiv: 2212.07143.
[Chi+23] Wei-Lin Chiang et al. Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%*
ChatGPT Quality. 2023. url: https://lmsys.org/blog/2023-03-30-vicuna/.
[Cho+21] Jaemin Cho et al. “Unifying Vision-and-Language Tasks via Text Generation”. In:
International Conference on Machine Learning. Vol. 139. 2021, pp. 1931–1942.
[Cho+14] Kyunghyun Cho et al. “Learning Phrase Representations using RNN Encoder–Decoder
for Statistical Machine Translation”. In: Conference on Empirical Methods in Natural
Language Processing. 2014, pp. 1724–1734. doi: 10.3115/v1/D14-1179.
[Cho+22] Aakanksha Chowdhery et al. “PaLM: Scaling Language Modeling with Pathways”.
In: Journal of Machine Learning Research 24.240 (2022), pp. 1–113.

[Cor+20] Antonio Coronato et al. “Reinforcement Learning for Intelligent Healthcare Applica-
tions: A Survey”. In: Artificial Intelligence in Medicine 109 (2020), p. 101964. issn:
0933-3657. doi: 10.1016/j.artmed.2020.101964.
[Dai+23] Wenliang Dai et al. InstructBLIP: Towards General-purpose Vision-Language Models
with Instruction Tuning. 2023. arXiv: 2305.06500.
[Dao23] Tri Dao. FlashAttention-2: Faster Attention with Better Parallelism and Work Par-
titioning. 2023. arXiv: 2307.08691.
[Dao+22] Tri Dao et al. “FlashAttention: Fast and Memory-Efficient Exact Attention with IO-
Awareness”. In: Advances in Neural Information Processing Systems. 2022.
[Dem+15] Dina Demner-Fushman et al. “Preparing a Collection of Radiology Examinations
for Distribution and Retrieval”. In: Journal of the American Medical Informatics
Association (JAMIA) 23.2 (2015), pp. 304–310. doi: 10.1093/jamia/ocv080.
[Den+09] Jia Deng et al. “ImageNet: A Large-Scale Hierarchical Image Database”. In: 2009
IEEE Conference on Computer Vision and Pattern Recognition. 2009, pp. 248–255.
doi: 10.1109/CVPR.2009.5206848.
[Dev+19] Jacob Devlin et al. “BERT: Pre-Training of Deep Bidirectional Transformers for Lan-
guage Understanding”. In: Conference of the North American Chapter of the Associ-
ation for Computational Linguistics. Vol. 1. 2019, pp. 4171–4186. doi: 10.18653/v1/
N19-1423.
[Dos+21] Alexey Dosovitskiy et al. “An Image is Worth 16x16 Words: Transformers for Im-
age Recognition at Scale”. In: International Conference on Learning Representations.
2021.
[Dou+22] Zi-Yi Dou et al. “An Empirical Study of Training End-to-End Vision-and-Language
Transformers”. In: IEEE/CVF Conference on Computer Vision and Pattern Recog-
nition (CVPR). 2022, pp. 18145–18155. doi: 10.1109/CVPR52688.2022.01763.
[EMD23] Sedigheh Eslami, Christoph Meinel, and Gerard De Melo. “PubMedCLIP: How Much
Does CLIP Benefit Visual Question Answering in the Medical Domain?” In: Find-
ings of the Association for Computational Linguistics. 2023, pp. 1181–1193. doi:
10.18653/v1/2023.findings-eacl.88.
[ERO21] Patrick Esser, Robin Rombach, and Björn Ommer. “Taming Transformers for High-
Resolution Image Synthesis”. In: 2021 IEEE/CVF Conference on Computer Vision
and Pattern Recognition (CVPR). 2021, pp. 12868–12878. doi: 10.1109/CVPR46437.
2021.01268.
[Gan+22] Zhe Gan et al. Vision-Language Pre-training: Basics, Recent Advances, and Future
Trends. 2022. arXiv: 2210.09263.
[Gu+23] Jindong Gu et al. A Systematic Survey of Prompt Engineering on Vision-Language
Foundation Models. 2023. arXiv: 2307.12980.
[Gu+21] Yu Gu et al. “Domain-Specific Language Model Pretraining for Biomedical Natu-
ral Language Processing”. In: ACM Transactions on Computing for Healthcare 3.1
(2021), p. 23. doi: 10.1145/3458754.
[Han+23] Tianyu Han et al. MedAlpaca – An Open-Source Collection of Medical Conversational
AI Models and Training Data. 2023. arXiv: 2304.08247.
[Hao+20] Yiding Hao et al. Probabilistic Predictions of People Perusing: Evaluating Metrics
of Language Model Performance for Psycholinguistic Modeling. 2020. arXiv: 2009.
03954.
[He+23] Kai He et al. A Survey of Large Language Models for Healthcare: from Data, Tech-
nology, and Applications to Accountability and Ethics. 2023. arXiv: 2310.05694.
[He+16] Kaiming He et al. “Deep Residual Learning for Image Recognition”. In: IEEE Con-
ference on Computer Vision and Pattern Recognition. 2016, pp. 770–778. doi: 10.
1109/CVPR.2016.90.
[He+20] Xuehai He et al. PathVQA: 30000+ Questions for Medical Visual Question Answer-
ing. 2020. arXiv: 2003.10286.
[HS97] Sepp Hochreiter and Jürgen Schmidhuber. “Long Short-Term Memory”. In: Neural
Computation 9.8 (1997), pp. 1735–1780. doi: 10.1162/neco.1997.9.8.1735.
[Hu+22] Edward J Hu et al. “LoRA: Low-Rank Adaptation of Large Language Models”. In:
International Conference on Learning Representations. 2022.
[Hua+22] Gao Huang et al. “Convolutional Networks with Dense Connectivity”. In: IEEE
Transactions on Pattern Analysis and Machine Intelligence 44.12 (2022), pp. 8704–
8716. doi: 10.1109/TPAMI.2019.2918284.
[Hua+21] Yu Huang et al. “What Makes Multimodal Learning Better than Single (Provably)”.
In: Advances in Neural Information Processing Systems. 2021.
[Ion+21] Bogdan Ionescu et al. “Overview of the ImageCLEF 2021: Multimedia Retrieval in
Medical, Nature, Internet and Social Media Applications”. In: Experimental IR Meets
Multilinguality, Multimodality, and Interaction. 2021, pp. 345–370. doi: 10.1007/978-3-030-85251-1_23.
[Irv+19] Jeremy A. Irvin et al. “CheXpert: A Large Chest Radiograph Dataset with Uncer-
tainty Labels and Expert Comparison”. In: AAAI Conference on Artificial Intelli-
gence. Vol. 33. 2019, pp. 590–597. doi: 10.1609/aaai.v33i01.3301590.
[Jeo+23] Jaehwan Jeong et al. Multimodal Image-Text Matching Improves Retrieval-based Chest
X-Ray Report Generation. 2023. arXiv: 2303.17579.
[Ji20] Qiang Ji. “5 - Computer vision applications”. In: Probabilistic Graphical Models for
Computer Vision. Computer Vision and Pattern Recognition. Academic Press, 2020,
pp. 191–297. doi: 10.1016/B978-0-12-803467-5.00010-1.
[Jia+21] Chao Jia et al. “Scaling Up Visual and Vision-Language Representation Learning
with Noisy Text Supervision”. In: International Conference on Machine Learning.
Vol. 139. 2021, pp. 4904–4916.
[Jia+23] Albert Q. Jiang et al. Mistral 7B. 2023. arXiv: 2310.06825.
[Jin+21] Di Jin et al. “What Disease does This Patient Have? A Large-Scale Open Domain
Question Answering Dataset from Medical Exams”. In: Applied Sciences 11.14 (2021),
p. 6421. doi: 10.3390/app11146421.
[Jin+19] Qiao Jin et al. “PubMedQA: A Dataset for Biomedical Research Question Answer-
ing”. In: Conference on Empirical Methods in Natural Language Processing. 2019,
pp. 2567–2577. doi: 10.18653/v1/D19-1259.
[Joh+19a] Alistair E. W. Johnson et al. MIMIC-CXR-JPG, a Large Publicly Available Database
of Labeled Chest Radiographs. 2019. arXiv: 1901.07042.
[Joh+19b] Alistair E. W. Johnson et al. “MIMIC-CXR, a De-Identified Publicly Available Database
of Chest Radiographs with Free-Text Reports”. In: Scientific Data 6.317 (2019). doi:
10.1038/s41597-019-0322-0.
[Kay+22] Maxime Kayser et al. “Explaining Chest X-ray Pathologies in Natural Language”.
In: International Conference on Medical Image Computing and Computer-Assisted
Intervention (MICCAI). Vol. 13435. 2022, pp. 701–713. doi: 10.1007/978-3-031-
16443-9_67.
[KBR23] Hikmat Khan, Nidhal C Bouaynaya, and Ghulam Rasool. “The Importance of Ro-
bust Features in Mitigating Catastrophic Forgetting”. In: 2023 IEEE Symposium on
Computers and Communications (ISCC). IEEE. 2023, pp. 752–757.
[KBR24] Hikmat Khan, Nidhal Carla Bouaynaya, and Ghulam Rasool. “Brain-Inspired Contin-
ual Learning: Robust Feature Distillation and Re-consolidation for Class Incremental
Learning”. In: IEEE Access (2024).
[KJZ18] Jin-Hwa Kim, Jaehyun Jun, and Byoung-Tak Zhang. “Bilinear Attention Networks”.
In: Advances in Neural Information Processing Systems 31. Vol. 31. 2018, pp. 1564–
1574.
[KB14] Diederik P. Kingma and Jimmy Ba. “Adam: A Method for Stochastic Optimization”.
In: International Conference on Learning Representations (2014).
[Kwo+23] Gukyeong Kwon et al. Masked Vision and Language Modeling for Multi-modal Rep-
resentation Learning. 2023. arXiv: 2208.02131.
[Lam+22] Nathan Lambert et al. “Illustrating Reinforcement Learning from Human Feedback
(RLHF)”. In: Hugging Face Blog (2022). https://huggingface.co/blog/rlhf.
[Lan+19] Zhenzhong Lan et al. “ALBERT: A Lite BERT for Self-Supervised Learning of Lan-
guage Representations”. In: International Conference on Learning Representations.
2019.
[Lau+18] Jason J Lau et al. “A Dataset of Clinically Generated Visual Questions and Answers
about Radiology Images”. In: Scientific Data 5 (2018), p. 180251. doi: 10.1038/sdata.2018.251.
[Lee+23] Hyungyung Lee et al. UniXGen: A Unified Vision-Language Model for Multi-View
Chest X-ray Generation and Report Generation. 2023. arXiv: 2302.12172.
[LAC21] Brian Lester, Rami Al-Rfou, and Noah Constant. The Power of Scale for Parameter-
Efficient Prompt Tuning. 2021. arXiv: 2104.08691.
[Lew+20] Patrick Lewis et al. “Retrieval-Augmented Generation for Knowledge-Intensive NLP
Tasks”. In: Neural Information Processing Systems. Vol. 33. 2020, pp. 9459–9474.
[Li+23a] Chunyuan Li et al. LLaVA-Med: Training a Large Language-and-Vision Assistant for
Biomedicine in One Day. 2023. arXiv: 2306.00890.
[Li+21] Junnan Li et al. “Align Before Fuse: Vision and Language Representation Learning
with Momentum Distillation”. In: Advances in Neural Information Processing Sys-
tems. 2021.
[Li+19] Liunian Harold Li et al. VisualBERT: A Simple and Performant Baseline for Vision
and Language. 2019. arXiv: 1908.03557.
[Li+22] Mingjie Li et al. “Cross-modal Clinical Graph Transformer for Ophthalmic Report
Generation”. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition
(CVPR). 2022, pp. 20624–20633. doi: 10.1109/CVPR52688.2022.02000.
[Li+23b] Pengfei Li et al. “Masked Vision and Language Pre-Training with Unimodal and
Multimodal Contrastive Losses for Medical Visual Question Answering”. In: Medical
Image Computing and Computer Assisted Intervention (MICCAI). 2023, pp. 374–383.
doi: 10.1007/978-3-031-43907-0_36.
[LL21] Xiang Lisa Li and Percy Liang. Prefix-Tuning: Optimizing Continuous Prompts for
Generation. 2021. arXiv: 2101.00190.
[Li+23c] Yunxiang Li et al. “ChatDoctor: A Medical Chat Model Fine-Tuned on a Large Lan-
guage Model Meta-AI (LLaMA) Using Medical Domain Knowledge”. In: Cureus 15.6
(2023). doi: 10.7759/cureus.40895.
[Lin04] Chin-Yew Lin. “ROUGE: A Package for Automatic Evaluation of Summaries”. In:
Text Summarization Branches Out. 2004, pp. 74–81.
[Lin+23a] Weixiong Lin et al. PMC-CLIP: Contrastive Language-Image Pre-Training using
Biomedical Documents. 2023. arXiv: 2303.07240.
[Lin+23b] Zhihong Lin et al. “Medical Visual Question Answering: A Survey”. In: Artificial
Intelligence in Medicine 143 (2023), p. 102611. doi: 10.1016/j.artmed.2023.102611.
[Liu+23a] Bo Liu et al. “Medical Visual Question Answering via Conditional Reasoning and Con-
trastive Learning”. In: IEEE Transactions on Medical Imaging 42.5 (2023), pp. 1532–
1545. doi: 10.1109/TMI.2022.3232411.
[Liu+21a] Bo Liu et al. “Slake: A Semantically-Labeled Knowledge-Enhanced Dataset for Medi-
cal Visual Question Answering”. In: IEEE 18th International Symposium on Biomed-
ical Imaging (ISBI) (2021), pp. 1650–1654. doi: 10.1109/ISBI48211.2021.9434010.
[LTS23] Chang Liu, Yuanhe Tian, and Yan Song. A Systematic Review of Deep Learning-based
Research on Radiology Report Generation. 2023. arXiv: 2311.14199.
[Liu+22] Fangyu Liu et al. DePlot: One-Shot Visual Language Reasoning by Plot-to-Table
Translation. 2022. arXiv: 2212.10505.
[Liu+24] Hanchao Liu et al. A Survey on Hallucination in Large Vision-Language Models. 2024.
arXiv: 2402.00253.
[Liu+23b] Haotian Liu et al. Visual Instruction Tuning. 2023. arXiv: 2304.08485.
[Liu+21b] Ze Liu et al. “Swin Transformer: Hierarchical Vision Transformer using Shifted Win-
dows”. In: International Conference on Computer Vision (ICCV). 2021, pp. 9992–
10002. doi: 10.1109/ICCV48922.2021.00986.
[Lo+20] Kyle Lo et al. “S2ORC: The Semantic Scholar Open Research Corpus”. In: Annual
Meeting of the Association for Computational Linguistics. 2020, pp. 4969–4983. doi:
10.18653/v1/2020.acl-main.447.
[Lu+19] Jiasen Lu et al. “ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representa-
tions for Vision-and-Language Tasks”. In: Advances in Neural Information Processing
Systems. 2019, pp. 13–23.
[MHC20] Thusitha Mabotuwana, Christopher S Hall, and Nathan Cross. “Framework for Ex-
tracting Critical Findings in Radiology Reports”. In: Journal of Digital Imaging 33.4
(2020), pp. 988–995. doi: 10.1007/s10278-020-00349-7.
[Mah+22] Supriya Mahadevkar et al. “A Review on Machine Learning Styles in Computer Vision
- Techniques and Future Directions”. In: IEEE Access 10 (2022), pp. 107293–107329.
doi: 10.1109/ACCESS.2022.3209825.
[Man+23] Omid Nejati Manzari et al. “MedViT: A Robust Vision Transformer for Generalized
Medical Image Classification”. In: Computers in Biology and Medicine 157 (2023),
p. 106791. doi: 10.1016/j.compbiomed.2023.106791.
[Mas+11] Jonathan Masci et al. “Stacked Convolutional Auto-Encoders for Hierarchical Feature
Extraction”. In: International Conference on Artificial Neural Networks. Vol. 6791.
2011, pp. 52–59.
[Mik+13a] Tomas Mikolov et al. “Distributed Representations of Words and Phrases and Their
Compositionality”. In: Advances in Neural Information Processing Systems. Vol. 26.
2013, pp. 3111–3119.
[Mik+13b] Tomas Mikolov et al. Efficient Estimation of Word Representations in Vector Space.
2013. arXiv: 1301.3781.
[Mis+21] Pankaj Mishra et al. “VT-ADL: A Vision Transformer Network for Image Anomaly
Detection and Localization”. In: IEEE International Symposium on Industrial Elec-
tronics (ISIE). 2021, pp. 01–06. doi: 10.1109/ISIE45552.2021.9576231.
[Miu+21] Yasuhide Miura et al. “Improving Factual Completeness and Consistency of Image-to-
Text Radiology Report Generation”. In: North American Chapter of the Association
for Computational Linguistics. 2021, pp. 5288–5304. doi: 10.18653/v1/2021.naacl-
main.416.
[Moh23] Mashood Mohammad Mohsan et al. “Vision Transformer and Language Model Based Radiology Report Generation”. In: IEEE Access 11 (2023), pp. 1814–1824. doi: 10.1109/ACCESS.2022.3232719.
[MPC20] Maram Mahmoud A. Monshi, Josiah Poon, and Vera Chung. “Deep Learning in Gen-
erating Radiology Reports: A Survey”. In: Artificial Intelligence in Medicine 106
(2020), p. 101878. issn: 0933-3657. doi: 10.1016/j.artmed.2020.101878.
[Moo+22] Jong Hak Moon et al. “Multi-Modal Understanding and Generation for Medical Im-
ages and Text via Vision-Language Pre-Training”. In: IEEE Journal of Biomedical
and Health Informatics 26.12 (2022), pp. 6070–6080. doi: 10.1109/JBHI.2022.3207502.
[Moo+23] Michael Moor et al. Med-Flamingo: A Multimodal Medical Few-Shot Learner. 2023.
arXiv: 2307.15189.
[OLV19] Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation Learning with
Contrastive Predictive Coding. 2019. arXiv: 1807.03748.
[Ouy+22] Long Ouyang et al. “Training language models to follow instructions with human feedback”. In: Advances in Neural Information Processing Systems 35 (2022), pp. 27730–27744.
[Pap+02] Kishore Papineni et al. “Bleu: a Method for Automatic Evaluation of Machine Trans-
lation”. In: Annual Meeting of the Association for Computational Linguistics. 2002,
pp. 311–318. doi: 10.3115/1073083.1073135.
[Par+22] Darsh Parekh et al. “A Review on Autonomous Vehicles: Progress, Methods and
Challenges”. In: Electronics 11.14 (2022). doi: 10.3390/electronics11142162.
[Pel+18] Obioma Pelka et al. “Radiology Objects in COntext (ROCO): A Multimodal Image
Dataset”. In: Intravascular Imaging and Computer Assisted Stenting and Large-Scale
Annotation of Biomedical Data and Expert Label Synthesis. Vol. 11043. Springer In-
ternational Publishing, 2018, pp. 180–189. doi: 10.1007/978-3-030-01364-6_20.
[Pel+23] Chantal Pellegrini et al. RaDialog: A Large Vision-Language Model for Radiology
Report Generation and Conversational Assistance. 2023. arXiv: 2311.18681.
[Pen+17] Yifan Peng et al. “NegBio: A High-Performance Tool for Negation and Uncertainty
Detection in Radiology Reports”. In: AMIA Summits on Translational Science Pro-
ceedings 2018 (2017), pp. 188–196.
[PSM14] Jeffrey Pennington, Richard Socher, and Christopher Manning. “Glove: Global Vec-
tors for Word Representation”. In: Empirical Methods in Natural Language Processing.
Vol. 14. 2014, pp. 1532–1543. doi: 10.3115/v1/D14-1162.
[Rad+21] Alec Radford et al. Learning Transferable Visual Models from Natural Language Su-
pervision. 2021. arXiv: 2103.00020.
[RB21] Abigail Rai and Samarjeet Borah. “Study of Various Methods for Tokenization”. In:
Applications of Internet of Things. 2021, pp. 193–200. doi: 10.1007/978-981-15-
6198-6_18.
[RCR22] Vignav Ramesh, Nathan Chi, and Pranav Rajpurkar. “Improving Radiology Report
Generation Systems by Removing Hallucinated References to Non-existent Priors”.
In: Machine Learning Research. Vol. 193. 2022, pp. 456–473.
[RBK21] René Ranftl, Alexey Bochkovskiy, and Vladlen Koltun. “Vision Transformers for
Dense Prediction”. In: IEEE/CVF International Conference on Computer Vision
(ICCV). 2021, pp. 12159–12168. doi: 10.1109/ICCV48922.2021.01196.
[Ran+23a] Veenu Rani et al. “Self-supervised Learning: A Succinct Review”. In: Archives of
Computational Methods in Engineering 30 (2023). doi: 10.1007/s11831-023-09884-
2.
[Ran+23b] Mercy Ranjit et al. Retrieval Augmented Chest X-Ray Report Generation using Ope-
nAI GPT models. 2023. arXiv: 2305.03660.
[Ren+24] Mengjie Ren et al. Learning or Self-aligning? Rethinking Instruction Fine-tuning.
2024. arXiv: 2402.18243 [cs.CL].
[Rez+19] Seyed Hamid Rezatofighi et al. “Generalized Intersection Over Union: A Metric and a
Loss for Bounding Box Regression”. In: IEEE/CVF Conference on Computer Vision
and Pattern Recognition (CVPR) (2019), pp. 658–666. doi: 10.1109/cvpr.2019.
00075.
[Rob51] Herbert E. Robbins. “A Stochastic Approximation Method”. In: Annals of Mathe-
matical Statistics 22 (1951), pp. 400–407. doi: 10.1214/aoms/1177729586.
[RS18] Alexey Romanov and Chaitanya Shivade. “Lessons from Natural Language Inference
in the Clinical Domain”. In: Conference on Empirical Methods in Natural Language
Processing. 2018, pp. 1586–1596. doi: 10.18653/v1/D18-1187.
[Rüc+22] Johannes Rückert et al. “Overview of ImageCLEFmedical 2022 – Caption Prediction
and Concept Detection”. In: CEUR Workshop Proceedings. Vol. 3180. 2022, pp. 1294–
1307.
[Sch19] Robin M. Schmidt. Recurrent Neural Networks (RNNs): A gentle Introduction and
Overview. 2019. arXiv: 1912.05911.
[See+22] Lalithkumar Seenivasan et al. “Surgical-VQA: Visual Question Answering in Surgical
Scenes Using Transformer”. In: Medical Image Computing and Computer Assisted
Intervention – MICCAI. 2022, pp. 33–43. doi: 10.1007/978-3-031-16449-1_4.
[SB23] Saurav Sengupta and Donald E. Brown. Automatic Report Generation for Histopathol-
ogy images using pre-trained Vision Transformers and BERT. 2023. arXiv: 2312.01435.
[SHB16] Rico Sennrich, Barry Haddow, and Alexandra Birch. “Neural Machine Translation
of Rare Words with Subword Units”. In: 54th Annual Meeting of the Association
for Computational Linguistics (Volume 1: Long Papers). 2016, pp. 1715–1725. doi:
10.18653/v1/P16-1162.
[SDK23] Dhruv Sharma, Chhavi Dhiman, and Dinesh Kumar. “Evolution of Visual Data Cap-
tioning Methods, Datasets, and Evaluation Metrics: a Comprehensive Survey”. In:
Expert Systems with Applications 221 (2023), p. 119773. issn: 0957-4174. doi: 10.
1016/j.eswa.2023.119773.
[Sho+19] Mohammad Shoeybi et al. Megatron-LM: Training Multi-Billion Parameter Language
Models Using Model Parallelism. 2019. arXiv: 1909.08053.
[Shr+23] Prashant Shrestha et al. Medical Vision Language Pretraining: A survey. 2023. arXiv:
2312.06224.
[Shu+23] Chang Shu et al. Visual Med-Alpaca: A Parameter-Efficient Biomedical LLM with Vi-
sual Capabilities. [Online; accessed 20-Feb-2024]. 2023. url: https://cambridgeltl.github.io/visual-med-alpaca/.
[Sin+23] Karan Singhal et al. “Large Language Models Encode Clinical Knowledge”. In: Nature
620 (2023), pp. 172–180. doi: 10.1038/s41586-023-06291-2.
[Smi+20] Akshay Smit et al. CheXbert: Combining Automatic Labelers and Expert Annotations
for Accurate Radiology Report Labeling Using BERT. 2020. arXiv: 2004.09167.
[Sov+21] Petru Soviany et al. “Curriculum Learning: A Survey”. In: International Journal of
Computer Vision 130 (2021), pp. 1526–1565. doi: 10.1007/s11263-022-01611-x.
[Sub+20] Sanjay Subramanian et al. “MedICaT: A Dataset of Medical Images, Captions, and
Textual References”. In: Findings of the Association for Computational Linguistics:
EMNLP. 2020, pp. 2112–2120. doi: 10.18653/v1/2020.findings-emnlp.191.
[Sun+23] Zhiqing Sun et al. Aligning Large Multimodal Models with Factually Augmented RLHF.
2023. arXiv: 2309.14525.
[SB98] R.S. Sutton and A.G. Barto. “Reinforcement Learning: An Introduction”. In: IEEE
Transactions on Neural Networks 9.5 (1998), pp. 1054–1054. doi: 10.1109/TNN.1998.712192.
[TL20] Mingxing Tan and Quoc V. Le. EfficientNet: Rethinking Model Scaling for Convolu-
tional Neural Networks. 2020. arXiv: 1905.11946.
[TBF22] Ajay K. Tanwani, Joelle Barral, and Daniel Freedman. “RepsNet: Combining Vision
with Language for Automated Medical Reports”. In: Medical Image Computing and
Computer Assisted Intervention (MICCAI). 2022, pp. 714–724. doi: 10.1007/978-
3-031-16443-9_68.
[Tay53] Wilson L. Taylor. ““Cloze Procedure”: A New Tool for Measuring Readability”. In:
Journalism & Mass Communication Quarterly 30 (1953), pp. 415–433. doi: 10.1177/
107769905303000401.
[Tha+23] Omkar Thawkar et al. XrayGPT: Chest Radiographs Summarization using Medical
Vision-Language Models. 2023. arXiv: 2306.07971.
[TLZ23] Pang Ting, Peigao Li, and Lijie Zhao. “A Survey on Automatic Generation of Medical
Imaging Reports based on Deep Learning”. In: BioMedical Engineering OnLine 22
(May 2023). doi: 10.1186/s12938-023-01113-y.
[Tou+23a] Hugo Touvron et al. Llama 2: Open Foundation and Fine-Tuned Chat Models. 2023.
arXiv: 2307.09288.
[Tou+23b] Hugo Touvron et al. LLaMA: Open and Efficient Foundation Language Models. 2023.
arXiv: 2302.13971.
[Tou+21] Hugo Touvron et al. “Training Data-Efficient Image Transformers & Distillation
through Attention”. In: International Conference on Machine Learning. Vol. 139.
2021, pp. 10347–10357.
[Tri+23] Aakash Tripathi et al. Building Flexible, Scalable, and Machine Learning-ready Mul-
timodal Oncology Datasets. 2023. arXiv: 2310.01438.
[Tya+21] Khushal Tyagi et al. “Detecting Pneumonia using Vision Transformer and Comparing
with Other Techniques”. In: International Conference on Electronics, Communication
and Aerospace Technology (ICECA). 2021, pp. 12–16. doi: 10.1109/ICECA52323.
2021.9676146.
[Vas+17] Ashish Vaswani et al. “Attention Is All You Need”. In: Advances in Neural Informa-
tion Processing Systems. Vol. 30. 2017, pp. 5998–6008.
[VC13] Karin Verspoor and Kevin Bretonnel Cohen. “Natural Language Processing”. In: Encyclopedia of Systems Biology. Springer New York, 2013, pp. 1495–1498. doi: 10.1007/978-1-4419-9863-7_158.
[WCG20] Changhan Wang, Kyunghyun Cho, and Jiatao Gu. “Neural Machine Translation with
Byte-Level Subwords”. In: AAAI Conference on Artificial Intelligence. 2020, pp. 9154–
9160. doi: 10.1609/aaai.v34i05.6451.
[Wan+22a] Jianfeng Wang et al. GIT: A Generative Image-to-text Transformer for Vision and
Language. 2022. arXiv: 2205.14100.
[Wan+23a] Liyuan Wang et al. A Comprehensive Survey of Continual Learning: Theory, Method
and Application. 2023. arXiv: 2302.00487.
[Wan+23b] Yizhong Wang et al. Self-Instruct: Aligning Language Models with Self-Generated
Instructions. 2023. arXiv: 2212.10560.
[Wan+22b] Zifeng Wang et al. MedCLIP: Contrastive Learning from Unpaired Medical Images
and Text. 2022. arXiv: 2210.10163.
[Wan+22c] Zirui Wang et al. “SimVLM: Simple Visual Language Model Pretraining with Weak
Supervision”. In: International Conference on Learning Representations (ICLR). 2022.
[Waq+23a] Asim Waqas et al. Multimodal Data Integration for Oncology in the Era of Deep
Neural Networks: A Review. 2023. arXiv: 2303.06471.
[Waq+23b] Asim Waqas et al. “Revolutionizing Digital Pathology With the Power of Generative
Artificial Intelligence and Foundation Models”. In: Laboratory Investigation 103.11
(2023), p. 100255. doi: 10.1016/j.labinv.2023.100255.
[Wu+16] Yonghui Wu et al. Google’s Neural Machine Translation System: Bridging the Gap
between Human and Machine Translation. 2016. arXiv: 1609.08144.
[Xie+16] Saining Xie et al. “Aggregated Residual Transformations for Deep Neural Networks”.
In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2016),
pp. 5987–5995. doi: 10.1109/CVPR.2017.634.
[Xie+22] Zhenda Xie et al. “SimMIM: A Simple Framework for Masked Image Modeling”.
In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
(2022), pp. 9643–9653.
[Xin+22] Chao Xin et al. “An Improved Transformer Network for Skin Cancer Classification”.
In: Computers in Biology and Medicine 149 (2022), p. 105939. doi: 10.1016/j.compbiomed.2022.105939.
[Xu+21] Mengya Xu et al. “Learning Domain Adaptation with Model Calibration for Sur-
gical Report Generation in Robotic Surgery”. In: 2021 IEEE International Confer-
ence on Robotics and Automation (ICRA) (2021), pp. 12350–12356. doi: 10.1109/ICRA48506.2021.9561569.
[Yam+18] Rikiya Yamashita et al. “Convolutional Neural Networks: an Overview and Applica-
tion in Radiology”. In: Insights into Imaging 9 (June 2018). doi: 10.1007/s13244-
018-0639-9.
[Yan+22] Xi Yang et al. “A Large Language Model for Electronic Health Records”. In: NPJ
Digital Medicine 5.194 (2022). doi: 10.1038/s41746-022-00742-2.
[Yan+16] Zichao Yang et al. “Hierarchical Attention Networks for Document Classification”.
In: Conference of the North American Chapter of the Association for Computational
Linguistics: Human Language Technologies. 2016, pp. 1480–1489. doi: 10.18653/v1/
N16-1174.
[Yu+23] Feiyang Yu et al. “Evaluating Progress in Automatic Chest X-ray Radiology Report
Generation”. In: Patterns 4 (2023), p. 100802. doi: 10.1016/j.patter.2023.100802.
[Yua+23] Zheng Yuan et al. RAMM: Retrieval-augmented Biomedical Visual Question Answer-
ing with Multi-modal Pre-training. 2023. arXiv: 2303.00534.
[Zel+19] Rowan Zellers et al. “From Recognition to Cognition: Visual Commonsense Rea-
soning”. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition
(CVPR). 2019, pp. 6713–6724. doi: 10.1109/CVPR.2019.00688.
[Zen+20] Guangtao Zeng et al. “MedDialog: Large-scale Medical Dialogue Datasets”. In: Con-
ference on Empirical Methods in Natural Language Processing (EMNLP). 2020, pp. 9241–
9250. doi: 10.18653/v1/2020.emnlp-main.743.
[Zha+23a] Yuexiang Zhai et al. Investigating the Catastrophic Forgetting in Multimodal Large
Language Models. 2023. arXiv: 2309.10313.
[Zha+20a] Li-Ming Zhan et al. “Medical Visual Question Answering via Conditional Reasoning”.
In: The 28th ACM International Conference on Multimedia. 2020, pp. 2345–2354. doi:
10.1145/3394171.3413761.
[Zha+21] Chen Zhang et al. “A Survey on Federated Learning”. In: Knowledge-Based Systems
216 (2021), p. 106775. doi: 10.1016/j.knosys.2021.106775.
[ZNC18] Hanwang Zhang, Yulei Niu, and Shih-Fu Chang. “Grounding Referring Expressions
in Images by Variational Context”. In: IEEE/CVF Conference on Computer Vision
and Pattern Recognition. 2018, pp. 4158–4166. doi: 10.1109/CVPR.2018.00437.
[Zha+23b] Sheng Zhang et al. Large-Scale Domain-Specific Pretraining for Biomedical Vision-
Language Processing. 2023. arXiv: 2303.00915.
[Zha+20b] Tianyi Zhang et al. “BERTScore: Evaluating Text Generation with BERT”. In: In-
ternational Conference on Learning Representations. 2020.
[Zha+23c] Wentao Zhang et al. Adapter Learning in Pretrained Feature Extractor for Continual
Learning of Diseases. 2023. arXiv: 2304.09042.
[Zha+19] Yijia Zhang et al. “BioWordVec, Improving Biomedical Word Embeddings with Sub-
word Information and MeSH”. In: Scientific Data 6.52 (2019). doi: 10.1038/s41597-
019-0055-0.
[Zha+23d] Ruochen Zhao et al. Retrieving Multimodal Information for Augmented Generation:
A Survey. 2023. arXiv: 2303.10868.
[Zhe+19] Liangli Zhen et al. “Deep Supervised Cross-Modal Retrieval”. In: IEEE/CVF Confer-
ence on Computer Vision and Pattern Recognition (CVPR). 2019, pp. 10386–10395.
doi: 10.1109/CVPR.2019.01064.
[Zho+23a] Hongjian Zhou et al. A Survey of Large Language Models in Medicine: Progress,
Application, and Challenge. 2023. arXiv: 2311.05112.
[Zho+23b] Da-Wei Zhou et al. Learning without Forgetting for Vision-Language Models. 2023.
arXiv: 2305.19270.
[Zie+20] Daniel M. Ziegler et al. Fine-Tuning Language Models from Human Preferences. 2020.
arXiv: 1909.08593 [cs.CL].