
MGPATH: Vision-Language Model with Multi-Granular Prompt Learning for Few-Shot WSI Classification

Anh-Tien Nguyen1,2,3   Duy Minh Ho Nguyen6,7,8,∗   Nghiem Tuong Diep8,∗
Trung Quoc Nguyen8   Nhat Ho5   Jacqueline Michelle Metsch1,2
Miriam Cindy Maurer1,2   Daniel Sonntag4,8   Hanibal Bohnenberger1,2
Anne-Christin Hauschild1,2

1 University of Göttingen, Germany
2 University Medical Center Göttingen, Germany
3 Max Planck Institute for Multidisciplinary Sciences, Germany
4 University of Oldenburg, Germany
5 The University of Texas at Austin†, USA
6 Max Planck Research School for Intelligent Systems (IMPRS-IS), Germany
7 University of Stuttgart, Germany
8 German Research Center for Artificial Intelligence (DFKI), Germany

∗ Equal second contribution

arXiv:2502.07409v1 [cs.CV] 11 Feb 2025

Abstract
Whole slide pathology image classification presents challenges due to gigapixel image sizes
and limited annotation labels, hindering model generalization. This paper introduces a prompt
learning method to adapt large vision-language models for few-shot pathology classification. We
first extend the Prov-GigaPath vision foundation model, pre-trained on 1.3 billion pathology
image tiles, into a vision-language model by adding adaptors and aligning it with medical
text encoders via contrastive learning on 923K image-text pairs. The model is then used to
extract visual features and text embeddings from few-shot annotations and is fine-tuned with
learnable prompt embeddings. Unlike prior methods that combine prompts with frozen features
using prefix embeddings or self-attention, we propose multi-granular attention that models
interactions between learnable prompts and both individual image patches and groups of patches.
This approach improves the model’s ability to capture both fine-grained details and broader
context, enhancing its recognition of complex patterns across sub-regions. To further improve
accuracy, we leverage (unbalanced) optimal transport-based visual-text distance to secure model
robustness by mitigating perturbations that might occur during the data augmentation process.
Empirical experiments on lung, kidney, and breast pathology modalities validate the effectiveness
of our approach: we surpass several recent competitors and consistently improve
performance across diverse architectures, including CLIP, PLIP, and Prov-GigaPath-integrated
PLIP. We release our implementation and pre-trained models at MGPATH.

1 Introduction
Whole slide imaging (WSI) [41] has become essential in modern pathology for capturing high-
resolution digital representations of entire tissue samples, enabling easier digital storage, sharing,
and remote analysis [43]. Unlike conventional methods that depend on examining slides under a

Equal second contribution

1
microscope, WSI provides faster, detailed structural and cellular insights essential for disease diagnosis
across multiple tissue layers, which is particularly valuable in cancer screening [3, 8]. Nevertheless,
WSIs are massive images, often containing billions of pixels [13, 57], making detailed annotations
and analysis difficult and expensive. To tackle these challenges, machine learning techniques
incorporating few-shot and weakly supervised learning have been developed [36, 27, 31, 50, 53].
Among these, multiple instance learning (MIL) and vision-language models (VLMs) have gained
particular attention for their ability to effectively manage limited annotations and interpret complex
whole-slide pathology images.

Figure 1: Unlike previous methods that add prompts at prefix positions or through patch-level attention - disrupting structural correlations - our MGPath framework integrates prompts at both regional and individual patch levels (multi-granular attention).

In MIL [21, 65, 27, 31, 58, 54], each WSI is first divided into smaller patches or instances. These instances are converted into feature embeddings using pre-trained vision encoders before being grouped into a "bag", i.e., a whole slide-level representation for the entire WSI. The MIL model mainly focuses on learning ensemble functions to identify patterns in specific patches, contributing to the overall label prediction for each bag (e.g., cancerous or non-cancerous), hence reducing the need for detailed annotations. Nonetheless, these methods often struggle to select relevant patches due to complex correlations and tissue variability [14, 47]. To overcome those
obstacles, VLMs [33, 19, 20, 53] have emerged as a promising solution, combining slide-level visual
features with textual descriptions to enrich contextual understanding and support predictions
in sparse data scenarios with approaches such as zero-shot learning [66, 1]. Specifically, VLMs
incorporate multi-scale images [53, 18], permitting the extraction of global and local WSI features
at different resolutions. To adapt the pre-trained vision-language model efficiently, prompt learning
[75, 16] is employed where learnable prompts are treated as part of the input text to guide the
model, and contextual prompts [28, 68] are integrated into feature embeddings using a self-attention
mechanism [62]. Despite their strong classification performance across diverse tasks, these approaches
still encounter certain limitations.

First, (i) adapting prompt learning with frozen visual features often neglects the hierarchical
relationships among learnable prompts and the visual features they interact with - specifically,
the multi-granular attention from prompts to individual patches and to groups of patches. This
limitation lessens the model’s ability to capture interdependence across distinct scales — from
fine-grained local features to broader contextual information, leading to less accurate comprehension
of complex patterns in pathology images. Second, (ii) many VLMs rely on the CLIP architecture
[48], which was not explicitly pre-trained on pathology images, thereby limiting its adaptability
in few-shot settings, especially when the architecture is primarily frozen and prompt learning is
applied. While recent works have incorporated PLIP [19], a model pre-trained
on 200K pathology image-text pairs curated from Twitter, and have shown significant improvements,
an open question remains whether scaling pre-training to millions or billions of pathology-specific
samples could further boost performance. Lastly, (iii) most VLM models for whole-slide pathology
rely on cosine similarity to align visual and textual features. This metric, however, can struggle
with multiple text descriptions for sub-regions [5] and with augmented data perturbations [39], as it
lacks the precision to capture fine-grained alignments between varied image-text pairs.

In this work, we present MGPath, a novel VLM method developed to address the challenges in
whole-slide pathology classification. Our approach begins by adapting Prov-GigaPath [66] - one
of the largest pre-trained vision models trained on 1.3 billion pathology image patches - into a
vision-language model. We accomplish this through contrastive learning with a pre-trained text
encoder from the PLIP model [19], which was trained on approximately 200K pathology image-text
pairs. To strengthen this alignment, we collected an additional 923K image-text pairs from ARCH
[15], PatchGastricADC22 [61] and Quilt-1M [20] and trained adaptor-based cross-alignment [16, 4]
between Prov-GigaPath’s visual encoder and PLIP’s text encoder. Crucially, only lightweight
adaptors are updated, making this process highly parameter-efficient. To the best of our knowledge,
MGPath is the first parameter-efficient vision-language model trained for pathology at this data
scale — utilizing 923K image-text pairs compared to the 200K in PLIP, and further benefiting from
Prov-GigaPath’s 1.3 billion sample pre-training.

Next, we leverage these pre-trained models for few-shot WSI tasks by introducing multi-granular
prompt learning. First, visual embeddings and descriptive text prompts are generated for image
patches at different resolutions using large language models, which have been shown to improve
performance [18, 53, 46]. Unlike prior methods that concatenate or use basic attention on individual
patches [28, 75, 68, 53], our attention integrates learnable prompts with frozen visual features at
both fine- and coarse-grained perspectives (Figure 1). We represent image patches from each WSI
as a spatial graph, using bounding box coordinates to enable region-level aggregation through
message passing along local connections. This spatial structure is encoded as tokens within the
Key-Value matrices, which interact with Query matrices derived from prompt embeddings. By
directing attention from Query to Key-Value matrices across both patch and region levels, our
approach effectively captures hierarchical information, enriching feature representation and selectively
emphasizing features across diverse tissue areas.

Finally, to measure the distance between prompt-fused visual embedding and multiple text prompts,
we resort to the optimal transport (OT) method [40, 45, 51, 5, 12, 39, 69], providing flexibility
in aligning heterogeneous data distributions. This property is beneficial in few-shot WSI classifi-
cation when it can (i) handle data augmentation with noise, as OT can adapt to perturbations
without losing meaningful structural relationships, and (ii) capture imbalances masses between
two modality embeddings when text prompts only describe sub-regions in WSI samples. Through
extensive evaluations of three datasets with various architectures (CLIP-ResNet50, CLIP-ViTB16,
PLIP, and (Prov-GigaPath)-integrated PLIP), we observe that MGPath demonstrates consistent
improvements over several state-of-the-art MIL and VLM methods in the literature (14 competitors). As an
example, MGPath with (Prov-GigaPath)-PLIP variant outperforms MSCPT [18] by 5% in F1 and
8% in AUC on the TCGA-BRCA dataset. Additionally, it also surpasses two state-of-the-art VLMs
models, CONCH [32] and QUILT [20], by approximately 6% in accuracy on TCGA-BRCA.

2 Related Work
2.1 Large-scale Pre-trained Models for Pathology
Recent advancements in large-scale pre-trained models for pathology can be broadly classified into
two categories. Vision models, such as Virchow [20], Hibou [37], UNI [6], and Prov-GigaPath [66]
leverage massive pathology image datasets to learn robust visual representations. Among these,
Prov-GigaPath stands out as the largest model, trained on 1.3 billion pathology image patches, and
excels in resolving complex tissue patterns at high resolution. On the other hand, vision-language
models (VLMs) like PLIP [19] (trained on 200K image-text pairs), CONCH [32] (1.17M), or QUILTNET [20]
(1M), integrate visual and textual information to enhance contextual understanding and improve
pathology slide interpretation. In contrast, our MGPath combines the strengths of both approaches by
using a parameter-efficient adaptor to link Prov-GigaPath (the largest pre-trained vision encoder)
with a text encoder from VLMs like PLIP or CONCH, leveraging both rich visual features and semantic
textual embeddings. Although we use the PLIP text encoder in our experiments due to its common
use in baselines, the method can be extended to other larger pre-trained text models.

2.2 Few-shot learning in WSI


MIL treats a WSI as a bag of patches and aggregates these instances into a bag of features, with early
methods using non-parametric techniques like mean or max pooling. However, since disease-related
patches are rare, these methods can overwhelm useful information with irrelevant data. To address
this, attention-based methods, graph neural Networks (GNNs), and Transformer-based methods
have been introduced [34, 7, 21, 26, 52, 73]. In contrast, VLMs have gained popularity through
contrastive learning, aligning image-text pairs to enhance performance on a variety of tasks. While
collecting large-scale pathology image-text pairs remains challenging, models like MI-Zero, PLIP,
and CONCH have been trained on hundreds of thousands to over a million pathology image-text pairs
[33, 19, 32]. Some approaches also integrate multi-magnification images and multi-scale text to
mimic pathologists’ diagnostic processes, especially for detecting subtle abnormalities [53, 18]. Our
MGPath extends on the VLMs strategy by further amplifying the benefits of using a large pre-trained
pathology VLM model and introducing a new parameter-efficient multi-granular prompt learning to
adapt these models to few-shot settings.

2.3 Prompt Learning for Vision-Language Adaptations


Prompt tuning was proposed to transfer large pre-trained models to task-specific downstream tasks
and has shown strong results in multimodal models like CLIP. Rather than designing heuristic templates,
several methods like CoOp [75], CoCoOp [74], or MaPLe [22], among others [49, 55], have allowed models
to determine optimal prompts from multiple perspectives, such as domain generalization [17, 68],
knowledge prototype [71, 29], or diversity [35, 55]. However, these approaches focus on natural
images and do not address the unique challenges of whole-slide pathology images, which require
multi-scale and structural contextual information. While a few current methods typically integrate
prompts with frozen visual features via self-attention [53, 46], these approaches might struggle
with the complex relationships in WSIs. Our solution introduces multi-granular prompt learning,
bridging attention on both individual image patches and spatial groups to better align with the
hierarchical structure of WSI data.

3 Methods
Figure 2 provides an overview of the key steps in our method. Before diving into the details of
each section, we first introduce our PLIP model enhanced by Prov-GigaPath through the use of
adaptors.


Figure 2: The pipeline of the proposed MGPath method. Low- and high-resolution image patches
are processed with large language models to generate visual contextual descriptions (Section 3.2).
Visual prompts are integrated with frozen features through multi-granular attention at both patch
and group-of-patch levels (Section 3.3). The final output is obtained by aligning visual and text embeddings
using optimal transport (Section 3.4).

3.1 Bridging Pathology Visual and Text Encoders


To leverage Prov-GigaPath’s extensive pre-trained visual features for pathology, we implement
lightweight adaptors that map image patch-level features to an embedding space aligned with the
PLIP text encoder. These adaptors allow us to train joint image-text representations with parameter
efficiency by updating only the adaptor weights.
Given a set of collected pathology image-text pairs $\{(I_i, T_i)\,|\, i = 1, 2, \ldots, N\}$ (Sec. 4), we denote by $E_I(\cdot)$ the pre-trained vision encoder from Prov-GigaPath, extracting patch-level features, and by $E_T(\cdot)$ the pre-trained text encoder from the PLIP model. Given a batch of $B$ samples, the image and text embeddings are computed as $x_i = E_I(I_i) \in \mathbb{R}^{d_v}$, $t_i = E_T(T_i) \in \mathbb{R}^{d_t}$. We then design two trainable adaptors $A_I(\cdot)$ and $A_T(\cdot)$ that map $(x_i, t_i)$ into the same hidden dimension $\mathbb{R}^d$ and minimize the noise contrastive loss [42]:

$$\mathcal{L}_{\mathrm{con}} = \mathbb{E}_B\left[-\log \frac{\exp\big(\cos(A_I(x_i), A_T(t_i))/\tau\big)}{\sum_j \exp\big(\cos(A_I(x_i), A_T(t_j))/\tau\big)}\right], \tag{1}$$

where $\cos(\cdot)$ is the cosine similarity and $\tau$ denotes the temperature of the softmax function. For parameter efficiency, we train only the adaptors $A_I(\cdot)$, $A_T(\cdot)$ while keeping the Prov-GigaPath visual encoder and PLIP text encoder frozen. After optimizing Eq.(1), we use the outputs of the
adaptors as visual and text embeddings for downstream tasks. Unless otherwise specified, we refer
to this model as GigaPath-PLIP.
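To make the adaptor training concrete, the following is a minimal sketch, assuming the frozen encoder outputs are available as precomputed tensors; the MLP adaptor design, hidden width, and output dimension are illustrative assumptions rather than the exact GigaPath-PLIP configuration.

```python
# Sketch of the adaptor-based contrastive alignment in Eq.(1), assuming precomputed
# (frozen) Prov-GigaPath image embeddings and PLIP text embeddings.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Adaptor(nn.Module):
    def __init__(self, in_dim: int, hidden_dim: int = 768, out_dim: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim), nn.GELU(), nn.Linear(hidden_dim, out_dim)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

def contrastive_loss(x: torch.Tensor, t: torch.Tensor,
                     A_I: Adaptor, A_T: Adaptor, tau: float = 0.07) -> torch.Tensor:
    """InfoNCE-style loss over a batch of paired image/text embeddings (Eq. 1)."""
    zi = F.normalize(A_I(x), dim=-1)          # adapted image embeddings, (B, d)
    zt = F.normalize(A_T(t), dim=-1)          # adapted text embeddings,  (B, d)
    logits = zi @ zt.T / tau                  # cosine similarities / temperature
    labels = torch.arange(x.size(0), device=x.device)
    return F.cross_entropy(logits, labels)    # matched pairs lie on the diagonal

# Toy tensors standing in for frozen encoder outputs:
x = torch.randn(8, 1536)   # Prov-GigaPath patch-level features (d_v = 1536)
t = torch.randn(8, 512)    # PLIP text features (d_t = 512)
A_I, A_T = Adaptor(1536), Adaptor(512)
loss = contrastive_loss(x, t, A_I, A_T)
loss.backward()            # only the adaptors receive gradients here
```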

3.2 Multi-Magnification Descriptive Text Prompts


To improve vision-language models (VLMs) for whole-slide image (WSI) analysis, designing effective
text prompts is essential. Pathologists typically examine WSIs by first assessing tissue structures at
low magnification before zooming in to analyze finer details such as nuclear size and shape. Inspired
by this diagnostic workflow and the inherently multi-scale nature of WSIs, recent studies [53, 18]
have introduced dual-scale visual descriptive text prompts to guide VLMs, leading to significant
improvements in classification performance. Building on this observation, we further extend and
refine this strategy to enhance model effectiveness.
First, to ensure that generated prompts remain robust across varying WSI magnifications, we
design shared prompts that combine both high- and low-scale descriptive elements, treating them
as contextual embeddings. Specifically, we leverage the API of a frozen language model (GPT-4)
and query it with the prompt shown in Figure 3:

LLM Prompt
What visually descriptive features characterize {class name} at both low and
high resolutions within the whole-slide image? Please summarize into a single
paragraph.

Figure 3: LLM template prompt.

In the above query, we replace {class name} with specific categories, e.g., invasive ductal
carcinoma (IDC) and invasive lobular carcinoma (ILC) in the TCGA-BRCA dataset.
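For illustration, one hypothetical way to issue the query in Figure 3 is sketched below; the paper only states that a frozen GPT-4 API is used, so the specific client library (the OpenAI Python SDK) and model identifier are assumptions.

```python
# Hypothetical sketch of querying GPT-4 with the template from Figure 3.
from openai import OpenAI

TEMPLATE = (
    "What visually descriptive features characterize {class_name} at both low and "
    "high resolutions within the whole-slide image? Please summarize into a single paragraph."
)

def describe_class(class_name: str) -> str:
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": TEMPLATE.format(class_name=class_name)}],
    )
    return response.choices[0].message.content

# e.g. descriptions for the TCGA-BRCA classes
contexts = {c: describe_class(c) for c in
            ["invasive ductal carcinoma (IDC)", "invasive lobular carcinoma (ILC)"]}
```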
Second, at each low/high scale, rather than inserting a single learnable text prompt of length K
alongside a frozen contextual prompt from LLMs [53, 18], we propose using M learnable prompts.
This approach aims to capture different sub-regions or structural features within each patch that
might be overlooked with only a single prompt. Specifically, we define visual descriptive text prompts
for both low and high resolutions as follows:
$$\mathbf{T}^{(l)} = \Big\{ T_i^{(l)} = [\omega_i^{(l)}]_1\, [\omega_i^{(l)}]_2 \ldots [\omega_i^{(l)}]_K\, [\text{LLM context}] \Big\}_{i=1}^{M},$$
$$\mathbf{T}^{(h)} = \Big\{ T_i^{(h)} = [\omega_i^{(h)}]_1\, [\omega_i^{(h)}]_2 \ldots [\omega_i^{(h)}]_K\, [\text{LLM context}] \Big\}_{i=1}^{M}, \tag{2}$$

where $[\omega_i^{\beta}]_j$, $j \in [1, \ldots, K]$, $i \in [1, \ldots, M]$ are the $KM$ trainable textual prompt tokens for each resolution $\beta \in \{l, h\}$.
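A minimal sketch of how the M learnable prompt sequences in Eq.(2) could be parameterized is given below, assuming the frozen LLM context has already been embedded into the text encoder's token space; the dimensions, the number of prompts, and the concatenation point are assumptions.

```python
# Sketch of Eq.(2): M prompt sequences of K trainable token embeddings per resolution,
# each followed by the frozen LLM context embedding.
import torch
import torch.nn as nn

class DescriptivePrompts(nn.Module):
    def __init__(self, M: int = 4, K: int = 16, token_dim: int = 512):
        super().__init__()
        # Trainable tokens [w_i]_1 ... [w_i]_K for low (l) and high (h) magnification.
        self.omega = nn.ParameterDict({
            "l": nn.Parameter(torch.randn(M, K, token_dim) * 0.02),
            "h": nn.Parameter(torch.randn(M, K, token_dim) * 0.02),
        })

    def forward(self, llm_context_tokens: torch.Tensor, scale: str) -> torch.Tensor:
        # llm_context_tokens: (L, token_dim), frozen embedding of the GPT-4 description.
        M = self.omega[scale].shape[0]
        context = llm_context_tokens.unsqueeze(0).expand(M, -1, -1)
        # (M, K + L, token_dim): each of the M prompts ends with the shared context.
        return torch.cat([self.omega[scale], context], dim=1)

prompts = DescriptivePrompts()
ctx = torch.randn(32, 512)                 # stand-in for a tokenized LLM description
T_low = prompts(ctx, "l")                  # fed to the (frozen) text encoder E_T
```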

3.3 Granularity-aware Visual Prompt Learning


We propose to adapt visual prompts to frozen features extracted by a pre-trained vision encoder in the VLM model by taking into account both the image-patch level and spatial groupings of patches. Specifically, for each WSI $W$, we denote by $\{W^{(l)}, W^{(h)}\}$ the representations of $W$ at low and high magnification. We define a bag of multiple instances of $W$ as $I = \{I^{(l)}, I^{(h)}\}$ where $I^{(l)} \in \mathbb{R}^{N_l\times N_b\times N_b\times 3}$, $I^{(h)} \in \mathbb{R}^{N_h\times N_b\times N_b\times 3}$, with $N_l$, $N_h$ indicating the number of low- and high-resolution image patches and $N_b$ the patch size. Following prior works [53, 21, 34, 18], we employ a non-overlapping sliding window technique to extract the patches $I$ from the WSI.

3.3.1 Patches-based Prompting


The frozen image encoder $E_I(\cdot)$ (or $A_I(E_I(\cdot))$ in the case of GigaPath-PLIP) is used to map the patches $I$ into feature vectors $H = \{H^{(l)} \in \mathbb{R}^{N_l\times d}, H^{(h)} \in \mathbb{R}^{N_h\times d}\}$, where $d$ denotes the feature dimension. To effectively consolidate the extensive set of patch features into a final slide-level representation, we introduce a set of learnable visual prompts $p_v \in \mathbb{R}^{N_p\times d}$, which facilitate the progressive merging of patch features in $H^{(l)}$ (similarly for $H^{(h)}$) (Figure 2). In particular, we formulate $p_v$ as the Query and take all features in $H^{(l)}$ as the Keys $K_p^{(l)}$ and Values $V_p^{(l)}$ in self-attention [62]. We then associate $p_v$ with the patch features as:

$$p_{v,p}^{(l)} = \mathrm{Normalize}\!\left(\mathrm{SoftMax}\!\left(\frac{p_v\, K_p^{(l)\top}}{\sqrt{d}}\right) V_p^{(l)} + p_v\right), \tag{3}$$
where Normalize(·) and SoftMax(·) indicate the layer normalization operator and the softmax activation function, respectively. Intuitively, Eq.(3) computes the correlations between the visual prompt $p_v$ and all individual patch features in $H^{(l)}$, subsequently grouping patches with high similarity to form fused prompt embeddings. However, since cancerous tissues in WSIs often appear as large, contiguous regions of adjacent image patches, this motivates the introduction of spatial patch group-based attention.
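The patch-level branch of Eq.(3) can be sketched as follows, assuming frozen patch features; the Key/Value projections and the placement of the residual connection inside the layer normalization are design assumptions.

```python
# Sketch of the patch-level prompt attention in Eq.(3): learnable prompts p_v act as
# the Query over frozen patch features H (N x d).
import torch
import torch.nn as nn
import torch.nn.functional as F

class PatchPromptAttention(nn.Module):
    def __init__(self, d: int, n_prompts: int = 16):
        super().__init__()
        self.p_v = nn.Parameter(torch.randn(n_prompts, d) * 0.02)  # learnable prompts (Query)
        self.key = nn.Linear(d, d, bias=False)
        self.value = nn.Linear(d, d, bias=False)
        self.norm = nn.LayerNorm(d)

    def forward(self, H: torch.Tensor) -> torch.Tensor:
        # H: (N, d) frozen patch features of one WSI at one magnification.
        K, V = self.key(H), self.value(H)                              # Keys / Values
        attn = F.softmax(self.p_v @ K.T / K.shape[-1] ** 0.5, dim=-1)  # (N_p, N)
        return self.norm(attn @ V + self.p_v)                          # Eq.(3), residual inside the norm

module = PatchPromptAttention(d=512)
H_low = torch.randn(3000, 512)        # e.g. low-magnification patch embeddings
p_vp = module(H_low)                  # (16, 512) prompt-fused slide tokens
```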

3.3.2 Spatial Patch Group-based Prompting


We build spatial correlations for the multiple instances in $I$ by using their image-patch coordinates inside each WSI $W$. In particular, taking $I^{(l)} = \{I_1^{(l)}, I_2^{(l)}, \ldots, I_{N_l}^{(l)}\}$ with their corresponding extracted features $H^{(l)} = \{H_1^{(l)}, H_2^{(l)}, \ldots, H_{N_l}^{(l)}\}$, we construct a graph $G^{(l)} = (V^{(l)}, E^{(l)})$ to capture regional tissue structures, where the set of vertices is $V^{(l)} = I^{(l)}$ and $E^{(l)} \in \{0,1\}^{N_l\times N_l}$ is the set of edges. Edges in $E^{(l)}$ can be defined by linking patches to their K-nearest neighbors based on the coordinates. We define the node-feature embedding as $X^{(l)} = H^{(l)} \in \mathbb{R}^{N_l\times d}$, which associates each vertex $v_i^{(l)}$ with its node feature $x_i^{(l)} = H_i^{(l)}$.

We next design a trainable message-passing network $g_\epsilon(\cdot)$ based on the graph attention layer (GAT) [63] to capture the feature representation of each node and its local neighbors. The message passing of the GAT layer is formulated as:

$$\alpha_{i,j} = \frac{\exp\!\big(\sigma(a_s^{\top}\Theta_s x_i^{(l)} + a_t^{\top}\Theta_t x_j^{(l)})\big)}{\sum_{k\in\mathcal{N}(i)\cup\{i\}} \exp\!\big(\sigma(a_s^{\top}\Theta_s x_i^{(l)} + a_t^{\top}\Theta_t x_k^{(l)})\big)}, \qquad x_i^{(l)\prime} = \alpha_{i,i}\,\Theta_s x_i^{(l)} + \sum_{j\in\mathcal{N}(i)} \alpha_{i,j}\,\Theta_t x_j^{(l)}, \tag{4}$$

where $x_i^{(l)\prime}$ is the aggregated feature of $x_i^{(l)}$ and its local region after the GAT layer, $\sigma(\cdot)$ is the LeakyReLU activation function, $\mathcal{N}(i)$ denotes the neighboring nodes of the $i$-th node, $\alpha_{i,j}$ are the attention coefficients, and $a_s, a_t, \Theta_s, \Theta_t$ are the weight parameters of $g_\epsilon(\cdot)$.

After message passing with $g_\epsilon(\cdot)$, the graph of patch-image features $G^{(l)}$ is updated to $G^{(l)\prime}$, where each node now represents a super-node that encapsulates its corresponding feature region. We then stack all feature nodes in $G^{(l)\prime}$ into $H_{gr}^{(l)}$ and treat them as another set of Keys $K_{gr}^{(l)}$ and Values $V_{gr}^{(l)}$ for region-level features. Similar to Eq.(3), we associate the prompt $p_v$ with those group-level features:

$$p_{v,gr}^{(l)} = \mathrm{Normalize}\!\left(\mathrm{SoftMax}\!\left(\frac{p_v\, K_{gr}^{(l)\top}}{\sqrt{d}}\right) V_{gr}^{(l)} + p_v\right). \tag{5}$$

The final output of our multi-granular attention is computed as:

$$p_v^{(l)} = (1-\alpha)\cdot p_{v,p}^{(l)} + \alpha\cdot p_{v,gr}^{(l)}, \tag{6}$$

which interpolates between image patches and spatial patch groups.
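A compact sketch of the group-level branch (Eqs. 4-6) is given below; it assumes torch_geometric for the GAT layer, the 4-neighbor grid connectivity described in Section 4.1, and illustrative layer sizes, so it should be read as an approximation of the method rather than the exact implementation. Setting α = 0.2 mirrors the ratio reported as best in the ablations (Table 4).

```python
# Sketch of the spatial patch-group branch: grid edges, GAT message passing (Eq. 4),
# group-level prompt attention (Eq. 5), and interpolation with the patch branch (Eq. 6).
import torch
import torch.nn.functional as F
from torch_geometric.nn import GATConv

def grid_edges(coords: torch.Tensor) -> torch.Tensor:
    """coords: (N, 2) integer grid positions of patches -> edge_index (2, E)."""
    pos = {tuple(c.tolist()): i for i, c in enumerate(coords)}
    edges = []
    for (x, y), i in pos.items():
        for dx, dy in [(-1, 0), (1, 0), (0, -1), (0, 1)]:  # left/right/top/bottom neighbours
            j = pos.get((x + dx, y + dy))
            if j is not None:
                edges.append((i, j))
    return torch.tensor(edges, dtype=torch.long).T

d = 512
gat = GATConv(d, d, heads=1)                     # message-passing network g_eps (Eq. 4)

H = torch.randn(3000, d)                         # frozen patch features H^(l)
coords = torch.stack(torch.meshgrid(
    torch.arange(60), torch.arange(50), indexing="ij"), dim=-1).reshape(-1, 2)
H_gr = gat(H, grid_edges(coords))                # super-node (region-level) features

p_v = torch.randn(16, d, requires_grad=True)     # learnable prompts (shared Query)
def prompt_attn(p, feats):                       # same form as Eq.(3)/(5)
    attn = F.softmax(p @ feats.T / d ** 0.5, dim=-1)
    return F.layer_norm(attn @ feats + p, (d,))

alpha = 0.2                                      # best ratio reported in the ablations
p_v_final = (1 - alpha) * prompt_attn(p_v, H) + alpha * prompt_attn(p_v, H_gr)  # Eq.(6)
```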

3.4 Optimal Transport for Visual-Text Alignment


Given the descriptive text prompts $\mathbf{T}^{(l)}$ and $\mathbf{T}^{(h)}$ (Eq.(2)) and the visual prompt-guided slide features $p_v^{(l)}$ and $p_v^{(h)}$ (Eq.(6)) for low and high resolutions, our goal is to maximize the similarity between slide and text embeddings for each class $c$. Rather than relying on cosine distance, as in prior works [75, 74, 72, 46, 56], we propose using an optimal transport (OT)-based distance to capture a more nuanced cross-alignment between the visual and text domains. Although OT has been explored for
nuanced cross-alignment between visual and text domains. Although OT has been explored for
prompt learning in natural images and multi-modal learning [24, 5, 38, 51], we are the first to adapt
it for whole-slide imaging (WSI), effectively handling the alignment of multi-magnification patches
to capture rich structural details across scales.
Recap OT: Given two sets of points (features), we can represent the corresponding discrete
distributions as follows:
$$\mu = \sum_{i=1}^{M} p_i \delta_{f_i}, \qquad \nu = \sum_{j=1}^{N} q_j \delta_{g_j}, \tag{7}$$

where $\delta_f$ and $\delta_g$ represent Dirac delta functions centered at $f$ and $g$, respectively, and $M$ and $N$ indicate the dimensions of the empirical distributions. The weight vectors $p = \{p_i\}_{i=1}^{M}$ and $q = \{q_j\}_{j=1}^{N}$ lie within the $M$- and $N$-dimensional simplex, respectively, meaning they satisfy $\sum_{i=1}^{M} p_i = 1$ and $\sum_{j=1}^{N} q_j = 1$. The discrete optimal transport problem can then be expressed as:

$$T^{*} = \operatorname*{arg\,min}_{T\in\mathbb{R}^{M\times N}} \sum_{i=1}^{M}\sum_{j=1}^{N} T_{ij} C_{ij} \quad \text{s.t.} \quad T\mathbf{1}_N = \mu,\; T^{\top}\mathbf{1}_M = \nu, \tag{8}$$

where $T^{*}$ denotes the optimal transport plan, which is optimized to minimize the total distance between the two probability vectors, and $C$ is the cost matrix measuring the distance between $f_i$ and $g_j$. We then define the OT distance between $\mu$ and $\nu$ as:

$$d_{OT}(\mu, \nu) = \langle T^{*}, C\rangle. \tag{9}$$
Objective functions: Given the visual prompt-guided slide features $p_v^{(l)} \in \mathbb{R}^{N_p\times d}$ in Eq.(6) and the descriptive text prompts $\mathbf{T}^{(l)}$ in Eq.(2), we compute the textual embedding for $\mathbf{T}^{(l)}$ as $p_t^{(l)} = E_T(\mathbf{T}^{(l)}) \in \mathbb{R}^{M\times d}$. We next denote $\mathbf{T}_c^{(l)}$ as the input text prompts, $p_{t_c}^{(l)}$ as the extracted textual embedding, and $p_{v_c}^{(l)}$ as the visual prompt-guided slide features associated with class $c$. We then aim to minimize the distance between $\mathbf{T}_c^{(l)}$ and $p_{v_c}^{(l)}$, denoted $d_{OT}\big(\mathbf{T}_c^{(l)}, p_{v_c}^{(l)}\big)$ in the paper, by computing the optimal transport distance between $p_{t_c}^{(l)}$ and $p_{v_c}^{(l)}$. Specifically, we treat $p_{t_c}^{(l)} \rightarrow F = \{f_i\}_{i=1}^{M}$ and $p_{v_c}^{(l)} \rightarrow G = \{g_j\}_{j=1}^{N_p}$ and compute the cost matrix as $C = 1 - F^{\top}G \in \mathbb{R}^{M\times N_p}$, which is used to compute $T^{*}$ in Eq.(8) and thereby estimate the optimal transport distance defined in Eq.(9). Following the same procedure, we can also compute $d_{OT}\big(\mathbf{T}_c^{(h)}, p_{v_c}^{(h)}\big)$ for the high-resolution image patches. Then, the prediction probability is written as:

$$P_c = \frac{\exp\!\Big(2 - \sum_{k\in\{l,h\}} \lambda_k\, d_{OT}\big(\mathbf{T}_c^{(k)}, p_{v_c}^{(k)}\big)\Big)}{\sum_{c'=1}^{C} \exp\!\Big(2 - \sum_{k\in\{l,h\}} \lambda_k\, d_{OT}\big(\mathbf{T}_{c'}^{(k)}, p_{v_{c'}}^{(k)}\big)\Big)}, \tag{10}$$

where $\lambda_k$ controls the contribution of each resolution. Finally, we can train the model with the cross-entropy loss:

$$\mathcal{L}_{\mathrm{class}} = \mathrm{Cross}(P, \mathrm{GT}), \tag{11}$$

where $\mathrm{Cross}(\cdot)$ is the cross-entropy and GT denotes the slide-level ground truth.
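For clarity, a small sketch of how Eqs.(10) and (11) turn OT distances into class probabilities and a training loss is given below; the distances themselves are assumed to be precomputed (e.g. with the Sinkhorn solver sketched in Appendix D.1), and the choice of λ_k = 1 for both resolutions is an assumption.

```python
# Sketch of Eqs.(10)-(11): per-class, per-resolution OT distances -> logits -> cross-entropy.
import torch
import torch.nn.functional as F

def class_logits(d_ot: torch.Tensor, lambdas=(1.0, 1.0)) -> torch.Tensor:
    """d_ot: (C, 2) OT distances per class for the low/high resolutions (k in {l, h})."""
    weighted = sum(lam * d_ot[:, k] for k, lam in enumerate(lambdas))
    return 2.0 - weighted                       # logits; softmax over classes gives P_c of Eq.(10)

d_ot = torch.rand(3, 2, requires_grad=True)     # toy distances for C = 3 classes
logits = class_logits(d_ot)
loss = F.cross_entropy(logits.unsqueeze(0), torch.tensor([1]))  # Eq.(11), slide label = class 1
loss.backward()
```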

The details of the solvers for Eq.(9) and of a relaxed version with unbalanced optimal transport are
presented in Sections D.1 and D.2 of the Appendix. Intuitively, using OT in this case offers several
key advantages over cosine similarity. Pathology images exhibit complex, heterogeneous patterns
that can be described from multiple perspectives. OT models these relationships as a distribution,
enabling a more holistic alignment that handles variability and incomplete details while reducing
noise from irrelevant prompts. This enhances the model’s ability to generalize to unseen or complex
disease cases.

4 Experiments
4.1 Settings
Datasets for contrastive learning. PatchGastricADC22 [61] consists of approximately 262K
patches derived from WSIs of H&E-stained gastric adenocarcinoma specimens, each paired with
associated diagnostic captions collected from the University of Health and Welfare, Mita Hospital,
Japan. QUILT-1M [20] includes approximately 653K images and one million pathology image-text
pairs, gathered from 1,087 hours of educational histopathology videos presented by pathologists
on YouTube. ARCH [15] is a pathology multiple-instance captioning dataset containing pathology
images at the bag and tile level. However, our work focuses on tile-level images from all datasets
for our contrastive training strategy. In total, we collected approximately 923K images from these
datasets.

Downstream tasks. For the classification task, the proposed method was evaluated on three
datasets from the Cancer Genome Atlas Data Portal [60]: TCGA-NSCLC, TCGA-RCC, and TCGA-BRCA.
We followed the ViLa-MIL [53] experimental settings for TCGA-NSCLC and TCGA-RCC, randomly
selecting proportions for training, validation, and testing. For TCGA-BRCA, we adopted the training
and testing slide IDs from MSCPT [18]. A detailed description is included in the appendix.

Implementation Details. We followed the ViLa-MIL preprocessing pipeline for tissue region selec-
tion and patch cropping. To integrate our attention module with CLIP50 and PLIP, we extracted tile-
level embeddings from their frozen vision encoders (1024-dimensional for CLIP50 and 512 for PLIP).
We used the visual encoder of Prov-GigaPath to produce 1536-dimensional embeddings. To align it
with PLIP’s frozen text encoder, we developed two MLP-based adaptors that project both encoders
into a shared feature space during a contrastive learning process, using datasets outlined in Section 4.
To implement spatial attention, we use a Graph Attention Network (GAT) to model spatial relationships between WSI patches. Each tile-level embedding serves as a node, connected to its left, right, top, and bottom neighbors, ensuring local spatial dependencies are captured. We then integrate spatial patch group-based attention $p_{v,gr}$ into patch-based attention $p_{v,p}$ using Equation 6. The hyperparameter $\alpha$ (0 to 1) controls the balance between spatial context and prototype-based guidance.

Table 1: Comparison of methods on TCGA-BRCA with few-shot settings. Results are shown for AUC, F1, and Accuracy (ACC). FVM denotes foundation vision-language models.

TCGA-BRCA
Methods                               # Param.   AUC           F1            ACC
CLIP ImageNet Pretrained
Max-pooling                           197K       60.42±4.35    56.40±3.58    68.55±6.54
Mean-pooling                          197K       66.64±4.21    60.70±2.78    71.73±3.59
ABMIL [21]                            461K       69.24±3.90    61.72±3.36    72.77±3.15
CLAM-SB [34]                          660K       67.80±5.14    60.51±5.01    72.46±4.36
CLAM-MB [34]                          660K       60.81±4.87    55.48±4.96    67.31±4.19
TransMIL [52]                         2.54M      65.62±3.20    60.75±4.04    67.52±4.16
DSMIL [26]                            462K       66.18±3.08    59.35±3.18    67.52±1.56
RRT-MIL [59]                          2.63M      66.33±4.30    61.14±5.93    71.21±8.94
CoOp [75]                             337K       68.86±4.35    61.64±2.40    71.08±3.22
CoCoOp [74]                           370K       69.13±4.27    61.48±2.62    72.41±1.87
Metaprompt [72]                       360K       69.12±4.46    63.39±4.38    74.65±7.20
TOP [46]                              2.11M      69.74±3.14    63.39±4.62    74.41±5.27
ViLa-MIL [53]                         2.77M      72.25±6.16    62.04±2.38    75.01±6.14
MSCPT [18]                            1.35M      74.56±4.54    65.59±1.85    75.82±2.38
MGPath (ViT)                          592K       74.96±6.98    64.60±5.39    77.10±2.39
FVM
CONCH [32]                            110M       84.11±15.44   65.63±10.81   73.24±8.89
QUILT [20]                            63M        73.48±10.57   63.78±8.72    73.26±10.13
PLIP Pathology Pretrained
Max-pooling                           197K       66.50±2.74    61.50±2.88    71.57±4.82
Mean-pooling                          197K       71.62±2.41    66.34±2.96    74.45±2.49
ABMIL [21]                            461K       72.41±4.25    63.04±3.62    74.09±4.38
CLAM-SB [34]                          660K       72.34±6.17    65.51±3.28    76.16±4.36
CLAM-MB [34]                          660K       73.41±3.76    66.11±1.94    77.88±2.30
TransMIL [52]                         2.54M      74.98±6.01    67.50±6.00    77.04±6.14
DSMIL [26]                            462K       71.44±2.72    64.48±1.64    75.26±2.28
RRT-MIL [59]                          2.63M      71.21±6.46    64.15±1.38    75.92±5.10
CoOp [75]                             337K       71.53±2.45    64.84±2.40    74.22±5.02
CoCoOp [74]                           370K       72.65±4.63    66.63±3.55    66.98±3.35
Metaprompt [72]                       360K       74.86±4.25    65.03±1.81    77.88±3.22
TOP [46]                              2.11M      76.13±6.01    66.55±1.72    78.58±5.30
ViLa-MIL [53]                         2.77M      74.06±4.62    66.03±1.81    78.12±4.88
MSCPT [18]                            1.35M      75.55±5.25    67.46±2.43    79.14±2.63
MGPath                                592K       79.02±6.43    68.25±4.42    79.65±1.72
MGPath (PLIP-G)                       5.35M      87.36±1.85    73.13±3.49    79.56±4.77

4.2 Comparison to State-of-the-Art

We compare our MGPath with state-of-the-art multi-instance learning methods, including Max-pooling, Mean-pooling, ABMIL [21], CLAM [34], TransMIL [52], DSMIL [26], GTMIL [73], DTMIL [70], RRT-MIL [59], and IBMIL [31], and vision-language methods, including CoOp [75], CoCoOp [74], Metaprompt [72], TOP [46], ViLa-MIL [53], MSCPT [18], QUILT [20], and CONCH [32]. Among these, QUILT and CONCH are foundation VLMs.

We provide different versions of our MGPath, including a CLIP backbone with ResNet-50 (CLIP50) for TCGA-NSCLC and TCGA-RCC and a ViT-B/16 backbone for TCGA-BRCA. We also provide a version using the PLIP backbone, as well as GigaPath-PLIP, which was pre-trained on pathology data.

4.3 Results on Few-shot and Zero-shot Settings.


MGPath with CLIP and PLIP backbones outperforms several competitive MIL and VLM
methods. As shown in Tables 1 and 2, our MGPath, based on CLIP50 and PLIP, outperforms several

Table 2: Comparison of methods on TCGA-NSCLC, and TCGA-RCC datasets with few-shot
settings. Results are shown for AUC, F1, and Accuracy (ACC).

TCGA-NSCLC TCGA-RCC
Methods # Param.
AUC F1 ACC AUC F1 ACC

Max-pooling 197K 53.0±6.0 45.8±8.9 53.3±3.4 67.4±4.9 46.7±11.6 54.1±4.8


Mean-pooling 197K 67.4±7.2 61.1±5.5 61.9±5.5 83.3±6.0 60.9±8.5 62.3±7.4
ABMIL [21] 461K 60.5±15.9 56.8±11.8 61.2±6.1 83.6±3.1 64.4±4.2 65.7±4.7
CLAM-SB [34] 660K 66.7±13.6 59.9±13.8 64.0±7.7 90.1±2.2 75.3±7.4 77.6±7.0
CLAM-MB [34] 660K 68.8±12.5 60.3±11.1 63.0±9.3 90.9±4.1 76.2±4.4 78.6±4.9
TransMIL [52] 2.54M 64.2±8.5 57.5±6.4 59.7±5.4 89.4±5.6 73.0±7.8 75.3±7.2
DSMIL [26] 462K 67.9±8.0 61.0±7.0 61.3±7.0 87.6±4.5 71.5±6.6 72.8±6.4
GTMIL [73] N/A 66.0±15.3 61.1±12.3 63.8±9.9 81.1±13.3 71.1±15.7 76.1±12.9
DTMIL [70] 986.7K 67.5±10.3 57.3±11.3 66.6±7.5 90.0±4.6 74.4±5.3 76.8±5.2
IBMIL [31] N/A 69.2±7.4 57.4±8.3 66.9±6.5 90.5±4.1 75.1±5.2 77.2±4.2
ViLa-MIL [53] 8.8M/47M 74.7±3.5 67.0±4.9 67.7±4.4 92.6±3.0 78.3±6.9 80.3±6.2
CONCH [32] 110M 89.46±10.2 78.5±9.31 78.78±9.1 88.08±4.59 78.21±4.2 71.67±19.4
QUILT [20] 63M 79.66±13.19 72.30±13.35 72.42±13.24 96.92±1.6 78.46±5.55 86.34±1.56
MGPath (CLIP) 1.6M/39M 77.2±1.3 70.9±2.0 71.0±2.1 92.1 ± 2.8 76.5 ± 5.2 81.7 ± 2.9
MGPath (PLIP) 592K 83.6 ± 4.5 76.41 ± 4.8 76.5 ± 4.8 94.7 ± 1.6 78.6 ± 4.9 83.6 ± 3.5
MGPath (PLIP-G) 5.35M 93.02±2.99 84.64±4.75 84.77±4.67 98.2±0.31 88.33±3.41 91.72±1.74

baseline models and achieves significant improvements over other VLMs with similar architectures,
such as ViLa and MSCPT. The performance gain is particularly notable with the PLIP backbone.
For example, on TCGA-BRCA using CLIP (ViT), MGPath achieves an accuracy of 77.10%, compared
to 75.82% for MSCPT and 75.01% for ViLA-MIL. Additionally, with the PLIP backbone, MGPath
surpasses MSCPT and ViLa-MIL by margins of approximately 3.5% to 5%.

GigaPath-PLIP is a strong pre-trained VLM. We validated our whole-slide vision foundation


model, pre-trained on 1.3 billion pathology images, using the PLIP text encoder. By incorporat-
ing pathology-specific features from Prov-GigaPath [66], the integrated MGPath (PLIP-G) model
demonstrated strong performance across multiple metrics on the TCGA-NSCLC, TCGA-RCC, and
TCGA-BRCA datasets. When compared to other foundation VLMs such as CONCH and QUILT, our
model consistently outperforms them. For example, we achieve a 3% improvement in AUC over
CONCH on both the TCGA-BRCA and TCGA-NSCLC datasets.

Table 3: Zero-shot classification performance on TCGA-NSCLC, TCGA-RCC, and TCGA-BRCA


datasets. Metrics reported include balanced accuracy (B-Acc) and weighted F1-score (W-F1).

TCGA-NSCLC TCGA-RCC TCGA-BRCA Average


Zero-shot
B-Acc W-F1 B-Acc W-F1 B-Acc W-F1 B-Acc W-F1
QuiltNet 61.3 56.1 59.1 51.8 51.3 40.1 57.23 49.33
CONCH 80.0 79.8 72.9 69.1 64.0 61.2 72.3 70.03
PLIP 70.0 68.5 50.7 46.0 64.7 63.8 61.8 59.43
PLIP-G (Ours) 72.7 72.6 81.3 81.4 70.0 69.9 74.67 74.63

GigaPath-PLIP achieves competitive performance in zero-shot tasks. We evaluate
the zero-shot capabilities of our model on three datasets and compare its performance against
foundation VLMs such as CONCH, QUILT, and PLIP. The results, summarized in Table 3, show
that the proposed VLM model achieves the best average performance across datasets, followed by
CONCH and PLIP. This consistent top-tier performance across multiple benchmarks underscores
the robustness and generalizability of our model.

4.4 Ablation Studies


PLIP enhanced Prov-GigaPath. We validate the use of Prov-GigaPath and PLIP under the following settings: (i) using the full vision-language PLIP model; (ii) combining Prov-GigaPath with PLIP through MLP adaptors pre-trained on the large-scale dataset; (iii) integrating Prov-GigaPath with PLIP through adaptor layers that were randomly initialized; (iv) utilizing Prov-GigaPath and an adaptor layer to map to the class output, training only the MLP and the last FFN layer of the slide encoder; and (v) using only Prov-GigaPath and an MLP layer to map to the class output, training the MLP, the query matrix of the last layer, and the last FFN layer of the slide encoder. Table 5 shows that using Prov-GigaPath combined with PLIP boosts the final performance compared to only using PLIP or Prov-GigaPath.

Table 4: Ablation studies on multi-granular attention (M-Gran), the ratio combining the two attention levels (α), and message passing network types.

TCGA-NSCLC
Configurations               AUC         F1          ACC
MGPath (CLIP)                76.2±2.2    69.0±3.5    69.3±2.8
- w/o M-Gran (CLIP)          74.6±2.2    67.8±2.4    67.8±2.5
MGPath (PLIP-G)              91.7±3.6    84.2±4.6    84.4±4.5
- w/o M-Gran (PLIP-G)        90.6±4.5    82.4±5.7    82.5±5.7
MGPath, α = 0.2              76.2±2.2    69.0±3.5    69.3±2.8
- α = 0.5                    73.7±3.1    67.4±2.6    67.8±2.7
- α = 0.8                    72.2±5.2    66.4±5.5    66.8±5.2
TCGA-RCC
MGPath (CLIP)                92.1±2.8    76.5±5.2    81.7±2.9
- w/o M-Gran (CLIP)          91.6±3.5    72.3±6.4    80.2±4.4
MGPath (PLIP-G)              98.1±0.6    85.7±1.1    89.9±2.0
- w/o M-Gran (PLIP-G)        98.1±0.6    85.0±4.0    89.3±3.0

Table 5: Ablation studies on adaptor learning for Prov-GigaPath and PLIP. PLIP-G denotes the mixed version between Prov-GigaPath and PLIP.

TCGA-NSCLC
Methods                               # Param.   AUC         F1          ACC
MGPath (PLIP)                         592K       83.6±4.5    76.41±4.8   76.5±4.8
MGPath (PLIP-G)                       5.35M      91.7±3.6    84.2±4.6    84.4±4.5
MGPath Random Adaptors                5.35M      91.4±4.2    82.8±5.7    83.0±5.6
GigaPath Tuning (MLP + last FFN)      4.7M       62.7±3.5    64.66±5.3   52.8±3.4
GigaPath Tuning (MLP + last Q-ViT)    5.8M       83.1±6.9    74.3±7.5    75.8±6.1

Table 6: Contribution of OT and multiple descriptive text prompts.

TCGA-NSCLC
MGPath (OT, 4 text prompts)      76.2±2.2    69.0±3.5    69.3±2.8
MGPath (OT, 2 text prompts)      77.2±1.3    70.9±2.0    71.0±2.1
MGPath (Cosine, 2 text prompts)  75.8±3.7    68.3±4.5    68.4±4.5
TCGA-RCC
MGPath (OT, 4 text prompts)      92.1±2.8    76.5±5.2    81.7±2.9
MGPath (OT, 2 text prompts)      92.1±2.6    75.6±3.9    80.4±2.4
MGPath (Cosine, 4 text prompts)  91.8±2.8    75.9±4.3    80.5±2.6

Multi-Granular Prompt Learning. In Table 4, we show the performance of MGPath with and without multi-granular attention (M-Gran) for CLIP (rows 1 and 2) and PLIP-G (rows 3 and 4) on the TCGA-NSCLC dataset. Using M-Gran improves the final performance of MGPath; the same holds on the TCGA-RCC dataset. Table 4 also shows the impact of the ratio α when combining graph-based spatial attention with patch-based attention on TCGA-NSCLC. With a ratio of 0.2/0.8 (0.2 for spatial attention obtained from the graph structure and 0.8 for prototype-guided attention), MGPath achieves the highest performance.

OT as Alignment between Contextual Prompts. Table 6 validates the use of OT in our
MGPath on TCGA-NSCLC and TCGA-RCC. We see that using OT boosts the performance of
MGPath compared to using cosine similarity. It also shows that the optimal number
of prompt vectors depends on the dataset. In the appendix, we also evaluate a variant using
unbalanced optimal transport (UoT). We observe that both UoT and OT provide good alignment
quality, with UoT slightly outperforming OT. However, this advantage comes at the cost of increased
running time.

4.5 Discussion
While we demonstrate significant improvements in few-shot and zero-shot WSI classification across
several settings, this paper does not explore other important challenges. For example, how can we
scale the current attention mechanism to handle even larger image patches (e.g., using Flash Attention
[11]), or extend the model from classification to tumor segmentation tasks [23]. Additionally, the
potential for extending GigaPath to integrate with other large-scale VLM models, such as CONCH
[32], remains unexplored.

5 Conclusion
High-resolution WSI is crucial for cancer diagnosis and treatment but presents challenges in data
analysis. Recent VLM approaches, which utilize few-shot and weakly supervised learning, have
shown promise in handling complex whole-slide pathology images with limited annotations. However,
many overlook the hierarchical relationships between visual and textual embeddings, ignoring the
connections between global and local pathological details or relying on non-pathology-specific
pre-trained models like CLIP. Additionally, previous metrics lack precision in capturing fine-grained
alignments between image-text pairs. To address these gaps, (i) we propose MGPath, which integrates
Prov-GigaPath with PLIP, cross-aligning them with 923K domain-specific image-text pairs. (ii)
Our multi-granular prompt learning approach captures hierarchical tissue details effectively, (iii)
while OT-based visual-text distance ensures robustness against data augmentation perturbations.
Extensive experiments on three cancer subtyping datasets demonstrate that MGPath achieves state-
of-the-art results in WSI classification. We expect that this work will pave the way for combining
large-scale domain-specific models with multi-granular prompt learning and optimal transport to
enhance few-shot learning in pathology.

Supplement to “MGPATH: Vision-Language Model with
Multi-Granular Prompt Learning for Few-Shot WSI Classification”

A Description of Dataset Splitting


TCGA-BRCA. This dataset contains 1056 whole slide images of breast invasive carcinoma. To conduct
fair experiments, we adopted the training and testing slides provided by the GitHub repository of
MSCPT [18]. In the MSCPT setup, 20% of the dataset was allocated for training, while the
remaining 80% (833 slides) served as the test set. A fixed set of 16-shot WSIs was randomly sampled
from the training set. Additionally, MSCPT specified the exact training and testing slides used in
its experiments. However, 35 slides produced errors during our pre-processing steps; thus, we
replaced them with other slides (the same number of WSIs per class) downloaded from the
Cancer Genome Atlas (TCGA) Data Portal (GDC) [60].

TCGA-RCC & TCGA-NSCLC. We adopt the same data splitting as in ViLa-MIL [53], using 16-shot
samples for training in each dataset. For testing, 192 samples were used for TCGA-RCC and 197
samples were used for TCGA-NSCLC.

A.1 Other hyper-parameters


For all experiments, we trained MGPath with the Adam optimizer, using a learning rate of $9\times 10^{-6}$
and a weight decay of $1\times 10^{-5}$, to fine-tune all versions of MGPath presented in Tables 1 and 2.
The training process was conducted for a maximum of 200 epochs with a batch size of 1. The
best checkpoints are selected based on the validation F1 score.
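As a reference, a minimal sketch of this optimizer setup is shown below, assuming a generic PyTorch module in which only the prompt, adaptor, and GAT parameters remain trainable.

```python
# Sketch of the training configuration described above (assumed PyTorch);
# `model` stands for any MGPath variant with frozen encoders.
import torch

def build_optimizer(model: torch.nn.Module) -> torch.optim.Adam:
    # Only parameters left trainable (prompts, adaptors, GAT, etc.) are optimized.
    trainable = [p for p in model.parameters() if p.requires_grad]
    return torch.optim.Adam(trainable, lr=9e-6, weight_decay=1e-5)
```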

A.2 Baseline Setups


TCGA-BRCA: The baselines in Table 1 are sourced from the MSCPT [18] paper, where various
methods are evaluated using two backbones: Vision Transformer (ViT) [2] from the CLIP model
(top section of Table 1) and PLIP [19] (bottom section of Table 1). In this context, we introduce
three variations of MGPath - ViT, PLIP, and GigaPath-PLIP (abbreviated as PLIP-G), where all
versions utilize frozen vision and text encoders.

TCGA-RCC & TCGA-NSCLC: The baselines in Table 2 are adapted from ViLa-MIL [53], where methods
employ ResNet-50 from the CLIP model as the primary backbone. We present MGPath results
using three architectures: ResNet-50, PLIP, and PLIP-G. With ResNet-50, we follow the ViLa-MIL
approach by training the text encoder and reporting performance for this setup. To assess efficiency,
we provide the total parameter counts for both ViLa-MIL and MGPath, considering scenarios with
frozen backbones and trainable text encoders. For PLIP and PLIP-G, all visual and text encoders
are kept frozen.

CONCH & QUILT: We download the pre-trained weights of these foundation models and adapt them for zero-shot evaluation on the TCGA datasets following the authors' guidelines from [32], which randomly sample 75 samples per class. For the few-shot settings, since official implementations are not provided, we initialize the models with their pre-trained weights, fully fine-tune the text encoder, and evaluate on the same subsets that we use for the other baselines. While CONCH
provides prompts for the datasets in its publication, QUILT does not. Therefore, we fine-tune the
model using CONCH’s prompts and our own generated prompts for QUILT.

Table 7: Comparison of message passing algorithms in MGPath, including GAT-CONV, Graph Isomorphism Network (GIN), and Graph Convolutional Network (GCN). Performance is evaluated on the TCGA-NSCLC dataset using 5-fold cross-validation.

TCGA-NSCLC
Configurations           AUC         F1          ACC
MGPath (GAT-CONV)        77.2±1.3    70.9±2.0    71.0±2.1
MGPath (GIN)             77.1±2.9    69.8±3.9    69.9±4.0
MGPath (GCN)             75.1±2.9    67.6±2.5    67.1±2.8

Figure 4: AUC performance comparison over epochs for PLIP (blue) and PLIP combined with GigaPath (red). GigaPath significantly enhances the AUC, achieving more stable and higher values, particularly in the early epochs.

B Impact of PLIP enhanced Prov-GigaPath


Figure 4 presents the AUC curves for three randomly selected folds, illustrating the impact of
Prov-GigaPath on model performance. The results show that integrating Prov-GigaPath leads
to consistently higher AUC values across all folds, demonstrating its effectiveness in enhancing
the proposed model. Notably, the improvements are most pronounced during the early training
epochs, where the model converges faster and achieves more stable performance compared to the
baseline. This suggests that Prov-GigaPath facilitates better feature extraction and generalization,
ultimately leading to a more robust model.

C Ablation Study on Message Passing Networks


In Table 7, we evaluate the performance of MGPath (CLIP) using the Graph Attention Network
(GAT-CONV) against alternatives like the Graph Isomorphism Network (GIN) [67] and the Graph
Convolutional Network (GCN) [25]. The results show that MGPath (GIN) achieves comparable
performance to MGPath (GAT-CONV), however, with higher variance. In contrast, MGPath
(GAT-CONV) significantly outperforms the GCN-based version, likely due to GAT’s ability to
dynamically assign attention weights to neighboring image patches, enabling it to prioritize the
most relevant neighbors for each node.

D Additional Details on Optimal Transport Distance


The following paragraphs provide detailed information on the implementation of (unbalanced) optimal transport (OT) [64, 44], and specifically on the alignment of prompt-guided visual-text distances in MGPath.

D.1 OT Formulation and Efficient Solver
Given two sets of feature embeddings $F = \{f_i\}_{i=1}^{M} \in \mathbb{R}^{M\times d}$ and $G = \{g_j\}_{j=1}^{N} \in \mathbb{R}^{N\times d}$, we can represent them as two discrete distributions $\mu$ and $\nu$ by:

$$\mu = \sum_{i=1}^{M} p_i \delta_{f_i}, \qquad \nu = \sum_{j=1}^{N} q_j \delta_{g_j}, \tag{12}$$

where $\delta_{f_i}$ and $\delta_{g_j}$ represent Dirac delta functions centered at $f_i$ and $g_j$, respectively, and the weights are elements of the marginals $p = \{p_i\}_{i=1}^{M}$ and $q = \{q_j\}_{j=1}^{N}$, which can be chosen as uniform weights with $\sum_{i=1}^{M} p_i = 1$ and $\sum_{j=1}^{N} q_j = 1$.
Then we can compute the distance between $F$ and $G$ through $\mu$ and $\nu$ (Eq.(9)) as

$$d_{OT}(\mu, \nu) = \langle T^{*}, C\rangle, \tag{13}$$

where

$$T^{*} = \operatorname*{arg\,min}_{T\in\mathbb{R}^{M\times N}} \sum_{i=1}^{M}\sum_{j=1}^{N} T_{ij} C_{ij} \quad \text{s.t.} \quad T\mathbf{1}_N = \mu,\; T^{\top}\mathbf{1}_M = \nu, \tag{14}$$

with $C \in \mathbb{R}^{M\times N}$ the cost matrix measuring the distance between $f_i \in \mu$ and $g_j \in \nu$.
Because directly solving Eq.(14) has a high computational cost ($O(n^3\log n)$ with $n$ proportional to $M$ and $N$), the Sinkhorn algorithm [10] is used to approximate the solution by solving a regularized problem:

$$T^{*} = \operatorname*{arg\,min}_{T\in\mathbb{R}^{M\times N}} \sum_{i=1}^{M}\sum_{j=1}^{N} T_{ij} C_{ij} - \lambda H(T) \quad \text{s.t.} \quad T\mathbf{1}_N = \mu,\; T^{\top}\mathbf{1}_M = \nu, \tag{15}$$

where $H(T) = -\sum_{ij} T_{ij}\log T_{ij}$ is an entropy function and $\lambda > 0$ is the regularization parameter. The optimization problem in Eq.(15) is strictly convex, allowing us to obtain a solution efficiently within a few iterations as outlined below:

$$T^{*} = \mathrm{diag}(a_t)\, \exp(-C/\lambda)\, \mathrm{diag}(b_t), \tag{16}$$

where $t$ is the iteration index, $a_t = \mu / \big(\exp(-C/\lambda)\, b_{t-1}\big)$ and $b_t = \nu / \big(\exp(-C/\lambda)^{\top} a_t\big)$, with the initialization $b_0 = \mathbf{1}$. In our experiments, we used $t = 100$ and $\lambda = 0.1$ based on validation performance.
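A minimal sketch of this regularized solver, under the assumption of uniform marginals and the cosine-based cost of Section 3.4, is given below; it mirrors Eqs.(15)-(16) but is not the authors' exact implementation.

```python
# Sketch of the entropic OT solver (Eqs. 15-16) with cost C = 1 - F^T G and uniform marginals.
import torch

def sinkhorn_distance(F_feat: torch.Tensor, G_feat: torch.Tensor,
                      lam: float = 0.1, iters: int = 100) -> torch.Tensor:
    """F_feat: (M, d) text-prompt embeddings, G_feat: (N, d) prompt-guided slide tokens."""
    F_n = torch.nn.functional.normalize(F_feat, dim=-1)
    G_n = torch.nn.functional.normalize(G_feat, dim=-1)
    C = 1.0 - F_n @ G_n.T                                   # cost matrix (M, N)
    M, N = C.shape
    mu = torch.full((M,), 1.0 / M)                          # uniform marginal
    nu = torch.full((N,), 1.0 / N)                          # uniform marginal
    Kmat = torch.exp(-C / lam)                              # Gibbs kernel exp(-C / lambda)
    b = torch.ones(N)
    for _ in range(iters):                                  # Sinkhorn iterations (Eq. 16)
        a = mu / (Kmat @ b)
        b = nu / (Kmat.T @ a)
    T = torch.diag(a) @ Kmat @ torch.diag(b)                # transport plan T*
    return (T * C).sum()                                    # d_OT = <T*, C>  (Eq. 9/13)

d = sinkhorn_distance(torch.randn(4, 512), torch.randn(16, 512))
```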

D.2 Relaxed Marginal Constraints with Unbalanced Optimal Transport


Due to strict marginal constraints in Eq (14), optimal transport may be unrealistic in real-world
scenarios where data distributions are noisy, incomplete, or unbalanced. The Unbalanced Optimal
Transport (UoT) [9, 30] addresses this challenge by relaxing the marginal constraints, allowing for
partial matching through penalties on mass creation or destruction. In particular, UoT solves
$$T^{*} = \operatorname*{arg\,min}_{T\in\mathbb{R}^{M\times N}} \sum_{i=1}^{M}\sum_{j=1}^{N} T_{ij} C_{ij} - \lambda H(T) + \rho_1\,\mathrm{KL}(T\mathbf{1}_N\,\|\,\mu) + \rho_2\,\mathrm{KL}(T^{\top}\mathbf{1}_M\,\|\,\nu), \tag{17}$$

where $\rho_1$ and $\rho_2$ represent the marginal regularization parameters, and $\mathrm{KL}(P\,\|\,Q)$ denotes the Kullback-Leibler divergence between two positive vectors. Similar to the classical OT formulation, there are solvers based on the Sinkhorn algorithm that can address Eq.(17) [45]. However, these
solvers typically require more iteration steps to converge to optimal solutions due to the added
complexity introduced by the relaxed marginal constraints.

E Unbalanced Optimal Transport (UoT)

Table 8: MGPath performance and running time (in seconds) comparison between OT and UoT.

TCGA-NSCLC
Methods
AUC ↑ F1 ↑ ACC ↑ Time (s) ↓
MGPath (OT, 4 text prompts) 76.2±2.2 69.0±3.5 69.3±2.8 1482
MGPath (UoT, 4 text prompts) 77.0±1.8 70.2±3.4 70.4±3.3 3260
TCGA-RCC
MGPath (OT, 4 text prompts) 92.1±2.8 76.5±5.2 81.7±2.9 1451
MGPath (UoT, 4 text prompts) 92.8±2.4 76.8±4.7 82.4 ± 2.4 3049

To compare the performance of MGPath using optimal transport versus unbalanced optimal transport (Section D.2), given the more flexible constraints in UoT, we conducted an additional experiment. Specifically, we test on the TCGA-NSCLC and TCGA-RCC datasets with the CLIP architecture (ResNet-50) using a 4-text-prompt setting. Table 8 presents our findings, where the running time is reported in seconds, averaged across five folds. The results show that UoT outperforms OT with an approximate 1% improvement across all metrics. However, UoT is approximately 2 times slower than OT. This increase is attributed to the added flexibility and complexity introduced by relaxing the marginal constraints in the UoT formulation. Given this trade-off, we choose OT as the main distance in MGPath and leave the UoT version for further evaluation. It is also worth noting that our OT formulation leverages approximate solutions through the regularized formulation (Eq.(15)) and produces smoothed optimal mappings $T^{*}$, which can implicitly help the model adapt to perturbations, similar to UoT.

References
[1] F. Ahmed, A. Sellergren, L. Yang, S. Xu, B. Babenko, A. Ward, N. Olson, A. Mohtashamian,
Y. Matias, G. S. Corrado, et al. Pathalign: A vision-language model for whole slide images in
histopathology. arXiv preprint arXiv:2406.19578, 2024. (Cited on page 2.)

[2] A. Dosovitskiy et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv
preprint arXiv:2010.11929, 2020. (Cited on page 14.)

[3] J. Barker, A. Hoogi, A. Depeursinge, and D. L. Rubin. Automated classification of brain tumor
type in whole-slide digital pathology images using local representative tiles. Medical image
analysis, 30:60–71, 2016. (Cited on page 2.)

[4] Q. Cao, Z. Xu, Y. Chen, C. Ma, and X. Yang. Domain-controlled prompt learning. In
Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 936–944, 2024.
(Cited on page 3.)

[5] G. Chen, W. Yao, X. Song, X. Li, Y. Rao, and K. Zhang. Plot: Prompt learning with optimal
transport for vision-language models. International Conference on Learning Representations,
2023. (Cited on pages 3 and 8.)

[6] R. J. Chen, T. Ding, M. Y. Lu, D. F. Williamson, G. Jaume, A. H. Song, B. Chen, A. Zhang,


D. Shao, M. Shaban, et al. Towards a general-purpose foundation model for computational
pathology. Nature Medicine, 30(3):850–862, 2024. (Cited on page 4.)

[7] R. J. Chen, M. Y. Lu, M. Shaban, C. Chen, T. Y. Chen, D. F. Williamson, and F. Mahmood.


Whole slide images are 2d point clouds: Context-aware survival prediction using patch-based
graph convolutional networks. In Medical Image Computing and Computer Assisted Intervention–
MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1,
2021, Proceedings, Part VIII 24, pages 339–349. Springer, 2021. (Cited on page 4.)

[8] S. Cheng, S. Liu, J. Yu, G. Rao, Y. Xiao, W. Han, W. Zhu, X. Lv, N. Li, J. Cai, et al.
Robust whole slide image analysis for cervical cancer screening using deep learning. Nature
communications, 12(1):5639, 2021. (Cited on page 2.)

[9] L. Chizat, G. Peyré, B. Schmitzer, and F.-X. Vialard. Scaling algorithms for unbalanced
optimal transport problems. Mathematics of Computation, 87(314):2563–2609, 2018. (Cited on
page 16.)

[10] M. Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. Advances in


neural information processing systems, 26, 2013. (Cited on page 16.)

[11] T. Dao, D. Fu, S. Ermon, A. Rudra, and C. Ré. Flashattention: Fast and memory-efficient exact
attention with io-awareness. Advances in Neural Information Processing Systems, 35:16344–
16359, 2022. (Cited on page 13.)

[12] S. Dong, Z. Pan, Y. Fu, D. Xu, K. Shi, Q. Yang, Y. Shi, and C. Zhuo. Partial unbalanced
feature transport for cross-modality cardiac image segmentation. IEEE Transactions on Medical
Imaging, 42(6):1758–1773, 2023. (Cited on page 3.)

[13] N. Farahani, A. V. Parwani, and L. Pantanowitz. Whole slide imaging in pathology: advantages,
limitations, and emerging perspectives. Pathology and Laboratory Medicine International, pages
23–33, 2015. (Cited on page 2.)

[14] M. Gadermayr and M. Tschuchnig. Multiple instance learning for digital pathology: A review
of the state-of-the-art, limitations & future potential. Computerized Medical Imaging and
Graphics, page 102337, 2024. (Cited on page 2.)

[15] J. Gamper and N. Rajpoot. Multiple instance captioning: Learning representations from
histopathology textbooks and articles. In Proceedings of the IEEE/CVF conference on computer
vision and pattern recognition, pages 16549–16559, 2021. (Cited on pages 3 and 9.)

[16] P. Gao, S. Geng, R. Zhang, T. Ma, R. Fang, Y. Zhang, H. Li, and Y. Qiao. Clip-adapter:
Better vision-language models with feature adapters. International Journal of Computer Vision,
132(2):581–595, 2024. (Cited on pages 2 and 3.)

[17] C. Ge, R. Huang, M. Xie, Z. Lai, S. Song, S. Li, and G. Huang. Domain adaptation via prompt
learning. IEEE Transactions on Neural Networks and Learning Systems, 2023. (Cited on page 4.)

[18] M. Han, L. Qu, D. Yang, X. Zhang, X. Wang, and L. Zhang. Mscpt: Few-shot whole slide
image classification with multi-scale and context-focused prompt tuning. arXiv preprint
arXiv:2408.11505, 2024. (Cited on pages 2, 3, 4, 6, 7, 9, 10, and 14.)

[19] Z. Huang, F. Bianchi, M. Yuksekgonul, T. J. Montine, and J. Zou. A visual–language foundation


model for pathology image analysis using medical twitter. Nature medicine, 29(9):2307–2316,
2023. (Cited on pages 2, 3, 4, and 14.)

[20] W. Ikezogwo, S. Seyfioglu, F. Ghezloo, D. Geva, F. Sheikh Mohammed, P. K. Anand, R. Krishna,


and L. Shapiro. Quilt-1m: One million image-text pairs for histopathology. Advances in neural
information processing systems, 36, 2024. (Cited on pages 2, 3, 4, 9, 10, and 11.)

[21] M. Ilse, J. Tomczak, and M. Welling. Attention-based deep multiple instance learning. In
International conference on machine learning, pages 2127–2136. PMLR, 2018. (Cited on pages 2,
4, 7, 10, and 11.)

[22] M. U. Khattak, H. Rasheed, M. Maaz, S. Khan, and F. S. Khan. Maple: Multi-modal prompt
learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
Recognition, pages 19113–19122, 2023. (Cited on page 4.)

[23] M. Khened, A. Kori, H. Rajkumar, G. Krishnamurthi, and B. Srinivasan. A generalized
deep learning framework for whole-slide image segmentation and analysis. Scientific reports,
11(1):11579, 2021. (Cited on page 13.)

[24] K. Kim, Y. Oh, and J. C. Ye. Zegot: Zero-shot segmentation through optimal transport of
text prompts. arXiv preprint arXiv:2301.12171, 2023. (Cited on page 8.)

[25] T. N. Kipf and M. Welling. Semi-supervised classification with graph convolutional networks.
International Conference on Learning Representations, 2017. (Cited on page 15.)

[26] B. Li, Y. Li, and K. W. Eliceiri. Dual-stream multiple instance learning network for whole slide
image classification with self-supervised contrastive learning. In Proceedings of the IEEE/CVF
conference on computer vision and pattern recognition, pages 14318–14328, 2021. (Cited on
pages 4, 10, and 11.)

[27] H. Li, C. Zhu, Y. Zhang, Y. Sun, Z. Shui, W. Kuang, S. Zheng, and L. Yang. Task-specific
fine-tuning via variational information bottleneck for weakly-supervised pathology whole slide
image classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition, pages 7454–7463, 2023. (Cited on page 2.)

[28] X. L. Li and P. Liang. Prefix-tuning: Optimizing continuous prompts for generation. Proceedings
of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th
International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 2021.
(Cited on pages 2 and 3.)

[29] Z. Li, L. Zhao, Z. Zhang, H. Zhang, D. Liu, T. Liu, and D. N. Metaxas. Steering prototypes
with prompt-tuning for rehearsal-free continual learning. In Proceedings of the IEEE/CVF
Winter Conference on Applications of Computer Vision, pages 2523–2533, 2024. (Cited on page 4.)

[30] M. Liero, A. Mielke, and G. Savaré. Optimal entropy-transport problems and a new Hellinger–
Kantorovich distance between positive measures. Inventiones mathematicae, 211(3):969–1117,
2018. (Cited on page 16.)

[31] T. Lin, Z. Yu, H. Hu, Y. Xu, and C.-W. Chen. Interventional bag multi-instance learning on
whole-slide pathological images. In Proceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition, pages 19830–19839, 2023. (Cited on pages 2, 10, and 11.)

[32] M. Y. Lu, B. Chen, D. F. Williamson, R. J. Chen, I. Liang, T. Ding, G. Jaume, I. Odintsov,
L. P. Le, G. Gerber, et al. A visual-language foundation model for computational pathology.
Nature Medicine, 30(3):863–874, 2024. (Cited on pages 3, 4, 10, 11, 13, and 14.)

[33] M. Y. Lu, B. Chen, A. Zhang, D. F. Williamson, R. J. Chen, T. Ding, L. P. Le, Y.-S.
Chuang, and F. Mahmood. Visual language pretrained multiple instance zero-shot transfer for
histopathology images. In Proceedings of the IEEE/CVF conference on computer vision and
pattern recognition, pages 19764–19775, 2023. (Cited on pages 2 and 4.)

[34] M. Y. Lu, D. F. Williamson, T. Y. Chen, R. J. Chen, M. Barbieri, and F. Mahmood. Data-
efficient and weakly supervised computational pathology on whole-slide images. Nature biomedical
engineering, 5(6):555–570, 2021. (Cited on pages 4, 7, 10, and 11.)

[35] Y. Lu, J. Liu, Y. Zhang, Y. Liu, and X. Tian. Prompt distribution learning. In Proceedings of
the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5206–5215,
2022. (Cited on page 4.)

[36] A. Madabhushi and G. Lee. Image analysis and machine learning in digital pathology: Challenges
and opportunities. Medical image analysis, 33:170–175, 2016. (Cited on page 2.)

[37] D. Nechaev, A. Pchelnikov, and E. Ivanova. Hibou: A family of foundational vision transformers
for pathology. arXiv preprint arXiv:2406.05074, 2024. (Cited on page 4.)

[38] D. M. Nguyen, N. T. Diep, T. Q. Nguyen, H.-B. Le, T. Nguyen, T. Nguyen, T. Nguyen, N. Ho,
P. Xie, R. Wattenhofer, et al. Logra-med: Long context multi-graph alignment for medical
vision-language model. arXiv preprint arXiv:2410.02615, 2024. (Cited on page 8.)

[39] D. M. Nguyen, A. T. Le, T. Q. Nguyen, N. T. Diep, T. Nguyen, D. Duong-Tran, J. Peters,
L. Shen, M. Niepert, and D. Sonntag. Dude: Dual distribution-aware context prompt learning
for large vision-language model. Asian Conference on Machine Learning (ACML), 2024. (Cited
on page 3.)

[40] H. Nguyen, K. Le, Q. Nguyen, T. Pham, H. Bui, and N. Ho. On robust optimal transport:
Computational complexity and barycenter computation. In Advances in Neural Information
Processing Systems, 2021. (Cited on page 3.)

[41] M. K. K. Niazi, A. V. Parwani, and M. N. Gurcan. Digital pathology and artificial intelligence.
The lancet oncology, 20(5):e253–e261, 2019. (Cited on page 1.)

[42] A. v. d. Oord, Y. Li, and O. Vinyals. Representation learning with contrastive predictive
coding. arXiv preprint arXiv:1807.03748, 2018. (Cited on page 5.)

[43] L. Pantanowitz, P. N. Valenstein, A. J. Evans, K. J. Kaplan, J. D. Pfeifer, D. C. Wilbur, L. C.
Collins, and T. J. Colgan. Review of the current state of whole slide imaging in pathology.
Journal of pathology informatics, 2(1):36, 2011. (Cited on page 1.)

[44] G. Peyré, M. Cuturi, et al. Computational optimal transport: With applications to data science.
Foundations and Trends® in Machine Learning, 11(5-6):355–607, 2019. (Cited on page 15.)

[45] K. Pham, K. Le, N. Ho, T. Pham, and H. Bui. On unbalanced optimal transport: An analysis
of Sinkhorn algorithm. In International Conference on Machine Learning, pages 7673–7682.
PMLR, 2020. (Cited on pages 3 and 17.)

[46] L. Qu, K. Fu, M. Wang, Z. Song, et al. The rise of ai language pathologists: Exploring two-level
prompt learning for few-shot weakly-supervised whole slide image classification. Advances in
Neural Information Processing Systems, 36, 2024. (Cited on pages 3, 4, 8, and 10.)

[47] L. Qu, D. Yang, D. Huang, Q. Guo, R. Luo, S. Zhang, and X. Wang. Pathology-knowledge
enhanced multi-instance prompt learning for few-shot whole slide image classification. arXiv
preprint arXiv:2407.10814, 2024. (Cited on page 2.)

[48] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell,
P. Mishkin, J. Clark, et al. Learning transferable visual models from natural language supervision.
In International conference on machine learning, pages 8748–8763. PMLR, 2021. (Cited on
page 2.)

[49] Y. Rao, W. Zhao, G. Chen, Y. Tang, Z. Zhu, G. Huang, J. Zhou, and J. Lu. Denseclip: Language-
guided dense prediction with context-aware prompting. In Proceedings of the IEEE/CVF
conference on computer vision and pattern recognition, pages 18082–18091, 2022. (Cited on
page 4.)

[50] J. Ryu, A. V. Puche, J. Shin, S. Park, B. Brattoli, J. Lee, W. Jung, S. I. Cho, K. Paeng, C.-Y.
Ock, et al. Ocelot: overlapped cell on tissue dataset for histopathology. In Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 23902–23912, 2023.
(Cited on page 2.)

[51] T. Séjourné, G. Peyré, and F.-X. Vialard. Unbalanced optimal transport, from theory to
numerics. Handbook of Numerical Analysis, 24:407–471, 2023. (Cited on pages 3 and 8.)

[52] Z. Shao, H. Bian, Y. Chen, Y. Wang, J. Zhang, X. Ji, et al. Transmil: Transformer-based
correlated multiple instance learning for whole slide image classification. Advances in neural
information processing systems, 34:2136–2147, 2021. (Cited on pages 4, 10, and 11.)

[53] J. Shi, C. Li, T. Gong, Y. Zheng, and H. Fu. Vila-mil: Dual-scale vision-language multiple
instance learning for whole slide image classification. In Proceedings of the IEEE/CVF Confer-
ence on Computer Vision and Pattern Recognition, pages 11248–11258, 2024. (Cited on pages 2, 3,
4, 6, 7, 9, 10, 11, and 14.)

[54] J. Shi, L. Tang, Y. Li, X. Zhang, Z. Gao, Y. Zheng, C. Wang, T. Gong, and C. Li. A
structure-aware hierarchical graph-based multiple instance learning framework for pt staging
in histopathological image. IEEE Transactions on Medical Imaging, 42(10):3000–3011, 2023.
(Cited on page 2.)

[55] M. Shu, W. Nie, D.-A. Huang, Z. Yu, T. Goldstein, A. Anandkumar, and C. Xiao. Test-time
prompt tuning for zero-shot generalization in vision-language models. Advances in Neural
Information Processing Systems, 35:14274–14289, 2022. (Cited on page 4.)

[56] S. P. Singh and M. Jaggi. Model fusion via optimal transport. Advances in Neural Information
Processing Systems, 33:22045–22055, 2020. (Cited on page 8.)

[57] A. H. Song, G. Jaume, D. F. Williamson, M. Y. Lu, A. Vaidya, T. R. Miller, and F. Mahmood.
Artificial intelligence for digital and computational pathology. Nature Reviews Bioengineering,
1(12):930–949, 2023. (Cited on page 2.)

[58] W. Tang, S. Huang, X. Zhang, F. Zhou, Y. Zhang, and B. Liu. Multiple instance learning
framework with masked hard instance mining for whole slide image classification. In Proceedings
of the IEEE/CVF International Conference on Computer Vision, pages 4078–4087, 2023. (Cited
on page 2.)

[59] W. Tang, F. Zhou, S. Huang, X. Zhu, Y. Zhang, and B. Liu. Feature re-embedding: To-
wards foundation model-level performance in computational pathology. In Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11343–11352, 2024.
(Cited on page 10.)

[60] The Cancer Genome Atlas (TCGA). Genomic Data Commons Data Portal (GDC).
https://portal.gdc.cancer.gov/projects/TCGA-BRCA. Accessed 07 Jul. 2023. (Cited on pages 9
and 14.)

[61] M. Tsuneki and F. Kanavati. Inference of captions from histopathological patches. In Inter-
national Conference on Medical Imaging with Deep Learning, pages 1235–1250. PMLR, 2022.
(Cited on pages 3 and 9.)

[62] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and
I. Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems,
30, 2017. (Cited on pages 2 and 7.)

[63] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Lio, and Y. Bengio. Graph attention
networks. arXiv preprint arXiv:1710.10903, 2017. (Cited on page 7.)

[64] C. Villani et al. Optimal transport: old and new, volume 338. Springer, 2009. (Cited on page 15.)

[65] G. Xu, Z. Song, Z. Sun, C. Ku, Z. Yang, C. Liu, S. Wang, J. Ma, and W. Xu. Camel: A weakly
supervised learning framework for histopathology image segmentation. In Proceedings of the
IEEE/CVF International Conference on Computer Vision, pages 10682–10691, 2019. (Cited on
page 2.)

[66] H. Xu, N. Usuyama, J. Bagga, S. Zhang, R. Rao, T. Naumann, C. Wong, Z. Gero, J. González,
Y. Gu, et al. A whole-slide foundation model for digital pathology from real-world data. Nature,
pages 1–8, 2024. (Cited on pages 2, 3, 4, and 11.)

[67] K. Xu, W. Hu, J. Leskovec, and S. Jegelka. How powerful are graph neural networks?
International Conference on Learning Representations, 2019. (Cited on page 15.)

[68] H. Yao, R. Zhang, and C. Xu. Tcp: Textual-based class-aware prompt tuning for visual-
language model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
Recognition, pages 23438–23448, 2024. (Cited on pages 2, 3, and 4.)

[69] F. Zhan, Y. Yu, K. Cui, G. Zhang, S. Lu, J. Pan, C. Zhang, F. Ma, X. Xie, and C. Miao.
Unbalanced feature transport for exemplar-based image translation. In Proceedings of the
IEEE/CVF conference on computer vision and pattern recognition, pages 15028–15038, 2021.
(Cited on page 3.)

[70] H. Zhang, Y. Meng, Y. Zhao, Y. Qiao, X. Yang, S. E. Coupland, and Y. Zheng. Dtfd-mil:
Double-tier feature distillation multiple instance learning for histopathology whole slide image
classification. In Proceedings of the IEEE/CVF conference on computer vision and pattern
recognition, pages 18802–18812, 2022. (Cited on pages 10 and 11.)

[71] Y. Zhang, H. Fei, D. Li, T. Yu, and P. Li. Prompting through prototype: A prototype-based
prompt learning on pretrained vision-language models. arXiv preprint arXiv:2210.10841, 2022.
(Cited on page 4.)

[72] C. Zhao, Y. Wang, X. Jiang, Y. Shen, K. Song, D. Li, and D. Miao. Learning domain invariant
prompt for vision-language models. IEEE Transactions on Image Processing, 2024. (Cited on
pages 8 and 10.)

[73] Y. Zheng, R. H. Gindra, E. J. Green, E. J. Burks, M. Betke, J. E. Beane, and V. B. Kolachalama.
A graph-transformer for whole slide image classification. IEEE transactions on medical imaging,
41(11):3003–3015, 2022. (Cited on pages 4, 10, and 11.)

[74] K. Zhou, J. Yang, C. C. Loy, and Z. Liu. Conditional prompt learning for vision-language
models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition,
pages 16816–16825, 2022. (Cited on pages 4, 8, and 10.)

[75] K. Zhou, J. Yang, C. C. Loy, and Z. Liu. Learning to prompt for vision-language models.
International Journal of Computer Vision, 130(9):2337–2348, 2022. (Cited on pages 2, 3, 4, 8,
and 10.)
