MGPATH: Vision-Language Model With Multi-Granular Prompt Learning For Few-Shot WSI Classification
Abstract
Whole slide pathology image classification presents challenges due to gigapixel image sizes
and limited annotation labels, hindering model generalization. This paper introduces a prompt
learning method to adapt large vision-language models for few-shot pathology classification. We
first extend the Prov-GigaPath vision foundation model, pre-trained on 1.3 billion pathology
image tiles, into a vision-language model by adding adaptors and aligning it with medical
text encoders via contrastive learning on 923K image-text pairs. The model is then used to
extract visual features and text embeddings from few-shot annotations and is fine-tuned with
learnable prompt embeddings. Unlike prior methods that combine prompts with frozen features
via prefix embeddings or self-attention, we propose multi-granular attention that models
interactions between learnable prompts and both individual image patches and groups of patches.
This approach improves the model’s ability to capture both fine-grained details and broader
context, enhancing its recognition of complex patterns across sub-regions. To further improve
accuracy, we leverage an (unbalanced) optimal transport-based visual-text distance that improves
model robustness by mitigating perturbations that may arise during data augmentation.
Empirical experiments on lung, kidney, and breast pathology datasets validate the effectiveness
of our approach: we surpass several recent competitors and consistently improve
performance across diverse architectures, including CLIP, PLIP, and Prov-GigaPath-integrated
PLIP. We release our implementation and pre-trained models at MGPATH.
1 Introduction
Whole slide imaging (WSI) [41] has become essential in modern pathology for capturing high-
resolution digital representations of entire tissue samples, enabling easier digital storage, sharing,
and remote analysis [43]. Unlike conventional methods that depend on examining slides under a
microscope, WSI provides faster, detailed structural and cellular insights essential for disease diagnosis
across multiple tissue layers, which is particularly valuable in cancer screening [3, 8]. Nevertheless,
WSIs are massive images, often containing billions of pixels [13, 57], making detailed annotations
and analysis difficult and expensive. To tackle these challenges, machine learning techniques
incorporating few-shot and weakly supervised learning have been developed [36, 27, 31, 50, 53].
Among these, multiple instance learning (MIL) and vision-language models (VLMs) have gained
particular attention for their ability to effectively manage limited annotations and interpret complex
whole-slide pathology images.
Figure 1: Unlike previous methods that add prompts at prefix positions or via patch-level attention, disrupting structural correlations, our MGPath framework integrates prompts at both regional and individual patch levels (multi-granular attention).

In MIL-based approaches, patch-level feature embeddings are extracted using pre-trained vision encoders before being grouped into a "bag", i.e., a whole slide-level representation for the entire WSI. The MIL model mainly focuses on learning ensemble functions to identify patterns in specific patches, contributing to the overall label prediction for each bag (e.g., cancerous or non-cancerous), hence reducing the need for detailed annotations. Nonetheless, these methods often struggle to select relevant patches due to complex correlations and tissue variability [14, 47]. To overcome those
obstacles, VLMs [33, 19, 20, 53] have emerged as a promising solution, combining slide-level visual
features with textual descriptions to enrich contextual understanding and support predictions
in sparse data scenarios with approaches such as zero-shot learning [66, 1]. Specifically, VLMs
incorporate multi-scale images [53, 18], permitting the extraction of global and local WSI features
at different resolutions. To adapt the pre-trained vision-language model efficiently, prompt learning
[75, 16] is employed where learnable prompts are treated as part of the input text to guide the
model, and contextual prompts [28, 68] are integrated into feature embeddings using a self-attention
mechanism [62]. Despite their strong classification performance across diverse tasks, these approaches
still encounter certain limitations.
First, (i) adapting prompt learning with frozen visual features often neglects the hierarchical
relationships among learnable prompts and the visual features they interact with - specifically,
the multi-granular attention between prompts and both individual patches and groups of patches. This
limitation weakens the model's ability to capture interdependence across distinct scales, from
fine-grained local features to broader contextual information, leading to less accurate comprehension
of complex patterns in pathology images. Second, (ii) many VLMs rely on the CLIP architecture
[48], which was not explicitly pre-trained on pathology images, thereby limiting its adaptability
in few-shot settings, especially when the architecture is primarily frozen and prompt learning is
applied. While recent works have incorporated PLIP [19], a model pre-trained
on 200K pathology image-text pairs curated from Twitter, and have shown significant improvements,
an open question remains whether scaling pre-training to millions or billions of pathology-specific
samples could further boost performance. Lastly, (iii) most VLM models for whole-slide pathology
rely on cosine similarity to align visual and textual features. This metric, however, can struggle
with multiple text descriptions for sub-regions [5] and with augmented data perturbations [39], as it
lacks the precision to capture fine-grained alignments between varied image-text pairs.
In this work, we present MGPath, a novel VLM method developed to address the challenges in
whole-slide pathology classification. Our approach begins by adapting Prov-GigaPath [66] - one
of the largest pre-trained vision models trained on 1.3 billion pathology image patches - into a
vision-language model. We accomplish this through contrastive learning with a pre-trained text
encoder from the PLIP model [19], which was trained on approximately 200K pathology image-text
pairs. To strengthen this alignment, we collected an additional 923K image-text pairs from ARCH
[15], PatchGastricADC22 [61] and Quilt-1M [20] and trained adaptor-based cross-alignment [16, 4]
between Prov-GigaPath’s visual encoder and PLIP’s text encoder. Crucially, only lightweight
adaptors are updated, making this process highly parameter-efficient. To the best of our knowledge,
MGPath is the first parameter-efficient vision-language model trained for pathology at this data
scale — utilizing 923K image-text pairs compared to the 200K in PLIP, and further benefiting from
Prov-GigaPath’s 1.3 billion sample pre-training.
Next, we leverage these pre-trained models for few-shot WSI tasks by introducing multi-granular
prompt learning. First, visual embeddings and descriptive text prompts are generated for image
patches at different resolutions using large language models, which have been shown to improve
performance [18, 53, 46]. Unlike prior methods that concatenate or use basic attention on individual
patches [28, 75, 68, 53], our attention integrates learnable prompts with frozen visual features at
both fine- and coarse-grained perspectives (Figure 1). We represent image patches from each WSI
as a spatial graph, using bounding box coordinates to enable region-level aggregation through
message passing along local connections. This spatial structure is encoded as tokens within the
Key-Value matrices, which interact with Query matrices derived from prompt embeddings. By
directing attention from Query to Key-Value matrices across both patch and region levels, our
approach effectively captures hierarchical information, enriching feature representation and selectively
emphasizing features across diverse tissue areas.
Finally, to measure the distance between prompt-fused visual embedding and multiple text prompts,
we resort to the optimal transport (OT) method [40, 45, 51, 5, 12, 39, 69], providing flexibility
in aligning heterogeneous data distributions. This property is beneficial in few-shot WSI classification, as OT can (i) handle data augmentation with noise, adapting to perturbations
without losing meaningful structural relationships, and (ii) capture imbalanced masses between
the two modality embeddings when text prompts describe only sub-regions of WSI samples. Through
extensive evaluations on three datasets with various architectures (CLIP-ResNet50, CLIP-ViT-B/16,
PLIP, and (Prov-GigaPath)-integrated PLIP), we observe that MGPath demonstrates consistent
improvements over several state-of-the-art MIL and VLM methods from the literature (14 competitors). As an
example, MGPath with the (Prov-GigaPath)-PLIP variant outperforms MSCPT [18] by 5% in F1 and
8% in AUC on the TCGA-BRCA dataset. It also surpasses two state-of-the-art VLM
models, CONCH [32] and QUILT [20], by approximately 6% in accuracy on TCGA-BRCA.
2 Related Work
2.1 Large-scale Pre-trained Models for Pathology
Recent advancements in large-scale pre-trained models for pathology can be broadly classified into
two categories. Vision models, such as Virchow [20], Hibou [37], UNI [6], and Prov-GigaPath [66]
leverage massive pathology image datasets to learn robust visual representations. Among these,
Prov-GigaPath stands out as the largest model, trained on 1.3 billion pathology image patches, and
excels in resolving complex tissue patterns at high resolution. On the other hand, vision-language
models (VLMs) like PLIP [19] (trained on 200K image-text pairs), CONCH [32] (1.17M), and QUILTNET [20]
(1M) integrate visual and textual information to enhance contextual understanding and improve
pathology slide interpretation. In contrast, our MGPath combines the strengths of both approaches by
using a parameter-efficient adaptor to link Prov-GigaPath (the largest pre-trained vision encoder)
with a text encoder from VLMs like PLIP or CONCH, leveraging both rich visual features and semantic
textual embeddings. Although we use the PLIP text encoder in our experiments due to its common
use in baselines, the method can be extended to other larger pre-trained text models.
3 Methods
Figure 2 provides an overview of the key steps in our method. Before diving into the details of
each section, we first introduce our PLIP model enhanced by Prov-GigaPath through the use of
adaptors.
Figure 2: The pipeline of the proposed MGPath method. Low- and high-resolution image patches are processed with large language models to generate visual contextual descriptions (Section 3.2). Visual prompts are integrated with frozen features through multi-granular attention at both patch and group-of-patch levels (Section 3.3). The final output is obtained by aligning visual and text embeddings using optimal transport (Section 3.4).
We align the two encoders by training lightweight adaptors $A_I(\cdot)$ and $A_T(\cdot)$ on top of the frozen visual and text backbones with a symmetric contrastive objective over paired image and text samples $(v_i, t_i)$:
\[
\mathcal{L}_{\mathrm{align}} = -\frac{1}{2B}\sum_{i=1}^{B}\left[\log \frac{\exp\big(\cos(A_I(v_i), A_T(t_i))/\tau\big)}{\sum_{j=1}^{B}\exp\big(\cos(A_I(v_i), A_T(t_j))/\tau\big)} + \log \frac{\exp\big(\cos(A_I(v_i), A_T(t_i))/\tau\big)}{\sum_{j=1}^{B}\exp\big(\cos(A_I(v_j), A_T(t_i))/\tau\big)}\right], \qquad (1)
\]
where $\cos(\cdot)$ is the cosine similarity and $\tau$ denotes the temperature of the softmax function. For parameter efficiency, we train only the adaptors $A_I(\cdot)$ and $A_T(\cdot)$ while keeping the Prov-GigaPath visual encoder and the PLIP text encoder frozen. After optimizing Eq. (1), we use the outputs of the adaptors as visual and text embeddings for downstream tasks. Unless otherwise specified, we refer to this model as GigaPath-PLIP.
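To make the adaptor-based alignment concrete, the following is a minimal sketch under stated assumptions (two-layer MLP adaptors, a standard symmetric InfoNCE-style loss, and illustrative dimensions and hyperparameters); it is not the released implementation.

```python
# Minimal sketch (not the released code): small MLP adaptors on top of frozen encoders,
# trained with a symmetric InfoNCE-style contrastive loss. Dimensions, the temperature,
# and the optimizer settings are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Adaptor(nn.Module):
    def __init__(self, in_dim: int, out_dim: int = 512):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, out_dim), nn.GELU(),
                                 nn.Linear(out_dim, out_dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.normalize(self.net(x), dim=-1)   # unit norm so dot products are cosines

adaptor_img = Adaptor(in_dim=1536)   # Prov-GigaPath tile embeddings (1536-d)
adaptor_txt = Adaptor(in_dim=512)    # PLIP text embeddings (512-d)

def alignment_loss(img_feats: torch.Tensor, txt_feats: torch.Tensor, tau: float = 0.07):
    """Symmetric contrastive loss between paired image/text adaptor outputs (cf. Eq. (1))."""
    zi, zt = adaptor_img(img_feats), adaptor_txt(txt_feats)
    logits = zi @ zt.t() / tau                              # cosine similarity / temperature
    targets = torch.arange(zi.size(0), device=zi.device)    # matching pairs on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

# Only the adaptors are optimized; both backbone encoders stay frozen.
optimizer = torch.optim.AdamW(
    list(adaptor_img.parameters()) + list(adaptor_txt.parameters()), lr=1e-4)
```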
LLM Prompt
What visually descriptive features characterize {class name} at both low and
high resolutions within the whole-slide image? Please summarize into a single
paragraph.
In the above query, we replace {class name} with specific categories; for example, invasive ductal
carcinoma (IDC) and invasive lobular carcinoma (ILC) in the TCGA-BRCA dataset.
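As an illustration, filling this template per class can be done as in the short sketch below (class names taken from TCGA-BRCA; the LLM call itself is omitted and the variable names are ours).

```python
# Sketch: instantiating the LLM query per class. Variable names are ours; the actual
# LLM call is omitted, and its answer is kept frozen as the [LLM context] in Eq. (2).
LLM_TEMPLATE = (
    "What visually descriptive features characterize {class_name} at both low and "
    "high resolutions within the whole-slide image? Please summarize into a single paragraph."
)

brca_classes = ["invasive ductal carcinoma (IDC)", "invasive lobular carcinoma (ILC)"]
queries = {name: LLM_TEMPLATE.format(class_name=name) for name in brca_classes}
```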
Second, at each low/high scale, rather than inserting a single learnable text prompt of length K
alongside a frozen contextual prompt from LLMs [53, 18], we propose using M learnable prompts.
This approach aims to capture different sub-regions or structural features within each patch that
might be overlooked with only a single prompt. Specifically, we define visual descriptive text prompts
for both low and high resolutions as follows:
\[
\mathbf{T}^{(l)} = \Big\{ [\omega_i^{(l)}]_1\, [\omega_i^{(l)}]_2 \ldots [\omega_i^{(l)}]_K\, [\text{LLM context}] \Big\}_{i=1}^{M}, \qquad
\mathbf{T}^{(h)} = \Big\{ [\omega_i^{(h)}]_1\, [\omega_i^{(h)}]_2 \ldots [\omega_i^{(h)}]_K\, [\text{LLM context}] \Big\}_{i=1}^{M}, \qquad (2)
\]
where $[\omega_i^{\beta}]_j$, $j \in \{1, \ldots, K\}$, $i \in \{1, \ldots, M\}$, are the $KM$ trainable textual prompts for each resolution $\beta \in \{l, h\}$.
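A minimal sketch of how such prompts can be realized is given below, assuming the learnable tokens live in the text encoder's embedding space and are simply prepended to the frozen LLM-context token embeddings; shapes, initialization, and the concatenation scheme are illustrative assumptions.

```python
# Minimal sketch: M learnable prompts of K tokens each, prepended to the frozen
# token embeddings of the LLM-generated context, as in Eq. (2). The concatenation
# scheme, initialization, and dimensions are assumptions for illustration.
import torch
import torch.nn as nn

class DescriptivePrompts(nn.Module):
    def __init__(self, M: int = 4, K: int = 16, embed_dim: int = 512):
        super().__init__()
        self.omega = nn.Parameter(torch.randn(M, K, embed_dim) * 0.02)  # [omega_i]_1..K, i = 1..M

    def forward(self, llm_context_emb: torch.Tensor) -> torch.Tensor:
        # llm_context_emb: (L_ctx, embed_dim) frozen embeddings of the LLM description
        ctx = llm_context_emb.unsqueeze(0).expand(self.omega.size(0), -1, -1)
        return torch.cat([self.omega, ctx], dim=1)          # (M, K + L_ctx, embed_dim)

# One module per resolution (low/high) and per class, following Eq. (2).
prompts_low, prompts_high = DescriptivePrompts(), DescriptivePrompts()
```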
Each WSI $W$ is cropped into image patches at two scales: low and high magnification. We define a bag of multiple instances of $W$ as $\mathcal{I} = \{I^{(l)}, I^{(h)}\}$, where $I^{(l)} \in \mathbb{R}^{N_l \times N_b \times N_b \times 3}$ and $I^{(h)} \in \mathbb{R}^{N_h \times N_b \times N_b \times 3}$, with $N_l$ and $N_h$ indicating the number of low- and high-resolution image patches and $N_b$ the patch size. Following prior works [53, 21, 34, 18], we employ a non-overlapping sliding window technique to extract the patches $\mathcal{I}$ from the WSI.
where $x_i^{(l)\prime}$ is the aggregated feature of $x_i^{(l)}$ and its local region after the GAT layer, $\sigma(\cdot)$ is the LeakyReLU activation function, $\mathcal{N}(i)$ denotes the neighboring nodes of the $i$-th node, $\alpha_{i,j}$ are the attention coefficients, and $a_s, a_t, \Theta_s, \Theta_t$ are the weight parameters of $g_{\epsilon}(\cdot)$.

After message passing with $g_{\epsilon}(\cdot)$, the graph of patch-image features $G^{(l)}$ is updated to $G^{(l)\prime}$, where each node now represents a super-node that encapsulates its corresponding feature region. We then squeeze all feature nodes in $G^{(l)\prime}$ into $H_{gr}^{(l)}$ and treat them as another set of Keys $K_{gr}^{(l)}$ and Values $V_{gr}^{(l)}$ for region-level features. Similar to Eq. (3), we associate the prompt $p_v$ with those group-level features:
\[
p_{v,gr}^{(l)} = \mathrm{Normalize}\left(\mathrm{SoftMax}\left(\frac{p_v K_{gr}^{(l)\top}}{\sqrt{d}}\right) V_{gr}^{(l)} + p_v\right). \qquad (5)
\]
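To illustrate how the prompts interact with patch- and region-level features, here is a minimal sketch under stated assumptions (a single-head dot-product attention, one torch_geometric GAT layer for region aggregation, and a simple α-weighted combination analogous to Eq. (6)); module and helper names are ours, not the released code.

```python
# Minimal sketch of multi-granular prompt attention: learnable prompts act as Queries
# over (i) frozen patch embeddings and (ii) GAT-aggregated region super-nodes, and the
# two outputs are mixed with a weight alpha. Shapes, the single attention head, and the
# mixing scheme are our assumptions for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch_geometric.nn import GATConv

def build_grid_edges(coords: torch.Tensor) -> torch.Tensor:
    """4-neighbour edges between tiles from integer grid coordinates (N, 2)."""
    edges = []
    index = {(int(r), int(c)): i for i, (r, c) in enumerate(coords.tolist())}
    for (r, c), i in index.items():
        for dr, dc in [(0, 1), (1, 0), (0, -1), (-1, 0)]:
            j = index.get((r + dr, c + dc))
            if j is not None:
                edges.append((i, j))
    return torch.tensor(edges, dtype=torch.long).t()        # shape (2, E)

class MultiGranularAttention(nn.Module):
    def __init__(self, d: int = 512, num_prompts: int = 4, alpha: float = 0.2):
        super().__init__()
        self.p_v = nn.Parameter(torch.randn(num_prompts, d) * 0.02)  # visual prompts (Queries)
        self.gat = GATConv(d, d)     # message passing -> region-level super-nodes
        self.alpha, self.d = alpha, d

    def attend(self, kv: torch.Tensor) -> torch.Tensor:
        # Eq. (3)/(5)-style attention: frozen features serve as both Keys and Values.
        attn = F.softmax(self.p_v @ kv.t() / self.d ** 0.5, dim=-1)
        return F.normalize(attn @ kv + self.p_v, dim=-1)

    def forward(self, patch_feats: torch.Tensor, edge_index: torch.Tensor) -> torch.Tensor:
        p_patch = self.attend(patch_feats)                          # fine-grained (patch level)
        p_region = self.attend(self.gat(patch_feats, edge_index))   # coarse-grained (region level)
        return self.alpha * p_region + (1 - self.alpha) * p_patch   # prompt-guided slide feature
```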
Given two sets of feature embeddings $F = \{f_i\}_{i=1}^{M}$ and $G = \{g_j\}_{j=1}^{N}$, we represent them as two discrete distributions:
\[
\mu = \sum_{i=1}^{M} p_i \delta_{f_i}, \qquad \nu = \sum_{j=1}^{N} q_j \delta_{g_j}, \qquad (7)
\]
where $\delta_f$ and $\delta_g$ represent Dirac delta functions centered at $f$ and $g$, respectively, and $M$ and $N$ indicate the dimensions of the empirical distributions. The weight vectors $p = \{p_i\}_{i=1}^{M}$ and $q = \{q_j\}_{j=1}^{N}$ lie within the $M$- and $N$-dimensional simplex, respectively, meaning they satisfy $\sum_{i=1}^{M} p_i = 1$ and $\sum_{j=1}^{N} q_j = 1$. The discrete optimal transport problem can then be expressed as:
\[
T^{*} = \underset{T \in \mathbb{R}^{M \times N}}{\arg\min} \; \sum_{i=1}^{M} \sum_{j=1}^{N} T_{ij} C_{ij} \quad \text{s.t.} \quad T \mathbf{1}_N = \mu, \; T^{\top} \mathbf{1}_M = \nu, \qquad (8)
\]
where $T^{*}$ denotes the optimal transport plan, which is optimized to minimize the total transport cost between the two distributions, and $C$ is the cost matrix measuring the distance between $f_i$ and $g_j$. We then define the OT distance between $\mu$ and $\nu$ as:
\[
d_{OT}(\mu, \nu) = \langle T^{*}, C \rangle. \qquad (9)
\]
Objective functions: Given the visual prompt-guided slide features $p_v^{(l)} \in \mathbb{R}^{N_p \times d}$ in Eq. (6) and the descriptive text prompts $\mathbf{T}^{(l)}$ in Eq. (2), we compute the textual embedding for $\mathbf{T}^{(l)}$ as $p_t^{(l)} = E_T(\mathbf{T}^{(l)}) \in \mathbb{R}^{M \times d}$. We next denote $\mathbf{T}_c^{(l)}$ as the input text prompts, $p_{t_c}^{(l)}$ as the extracted textual embedding, and $p_{v_c}^{(l)}$ as the visual prompt-guided slide features associated with class $c$. We then aim to minimize the distance between $\mathbf{T}_c^{(l)}$ and $p_{v_c}^{(l)}$, denoted $d_{OT}\big(\mathbf{T}_c^{(l)}, p_{v_c}^{(l)}\big)$ in the paper, by computing the optimal transport distance between $p_{t_c}^{(l)}$ and $p_{v_c}^{(l)}$. Specifically, we treat $p_{t_c}^{(l)} \rightarrow F = \{f_i\}_{i=1}^{M}$ and $p_{v_c}^{(l)} \rightarrow G = \{g_j\}_{j=1}^{N_p}$ and compute the cost matrix $C = 1 - F^{\top} G \in \mathbb{R}^{M \times N_p}$, which is used to compute $T^{*}$ in Eq. (8) for estimating the optimal transport distance defined in Eq. (9). Following the same procedure, we can also compute $d_{OT}\big(\mathbf{T}_c^{(h)}, p_{v_c}^{(h)}\big)$ at the high-resolution image patches. Then, the prediction probability is written as:
\[
P_c = \frac{\exp\Big(2 - \sum_{k \in \{l,h\}} \lambda_k \, d_{OT}\big(\mathbf{T}_c^{(k)}, p_{v_c}^{(k)}\big)\Big)}{\sum_{c'=1}^{C} \exp\Big(2 - \sum_{k \in \{l,h\}} \lambda_k \, d_{OT}\big(\mathbf{T}_{c'}^{(k)}, p_{v_{c'}}^{(k)}\big)\Big)}, \qquad (10)
\]
where $\lambda_k$ controls the contribution of each resolution. Finally, we train the model with the cross-entropy loss:
\[
\mathcal{L}_{class} = \mathrm{Cross}(P, \mathrm{GT}). \qquad (11)
\]
The details for solvers of Eq.(9) and a relaxed version with unbalanced optimal transport are
presented in Sections D and D.1 of the Appendix. Intuitively, using OT in this case offers several
key advantages over cosine similarity. Pathology images exhibit complex, heterogeneous patterns
that can be described from multiple perspectives. OT models these relationships as a distribution,
enabling a more holistic alignment that handles variability and incomplete details while reducing
noise from irrelevant prompts. This enhances the model’s ability to generalize to unseen or complex
disease cases.
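For illustration, a minimal sketch of the OT-based prediction in Eqs. (8)-(10) is given below, assuming uniform marginals, the POT (Python Optimal Transport) package's entropic solver, and placeholder values for λ_k and the regularization strength.

```python
# Sketch: class probabilities from entropic OT distances between class-wise text-prompt
# embeddings (M x d) and prompt-guided slide features (N_p x d) at each resolution.
import numpy as np
import ot  # POT: Python Optimal Transport

def ot_distance(F: np.ndarray, G: np.ndarray, reg: float = 0.1) -> float:
    F = F / np.linalg.norm(F, axis=1, keepdims=True)
    G = G / np.linalg.norm(G, axis=1, keepdims=True)
    C = 1.0 - F @ G.T                                   # cost matrix as in Eq. (8)
    p = np.full(F.shape[0], 1.0 / F.shape[0])           # uniform marginals
    q = np.full(G.shape[0], 1.0 / G.shape[0])
    return float(ot.sinkhorn2(p, q, C, reg))            # <T*, C>, Eq. (9)

def class_probabilities(text_emb, slide_emb, lambdas=(0.5, 0.5)):
    # text_emb[c][k]: embeddings of class c at resolution k (0: low, 1: high);
    # slide_emb[k]: prompt-guided slide features at resolution k.
    scores = np.array([
        2.0 - sum(lam * ot_distance(text_emb[c][k], slide_emb[k])
                  for k, lam in enumerate(lambdas))
        for c in range(len(text_emb))
    ])
    scores = np.exp(scores)
    return scores / scores.sum()                         # Eq. (10)
```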
4 Experiments
4.1 Settings
Datasets for contrastive learning. PatchGastricADC22 [61] consists of approximately 262K
patches derived from WSIs of H&E-stained gastric adenocarcinoma specimens, each paired with
associated diagnostic captions collected from the University of Health and Welfare, Mita Hospital,
Japan. QUILT-1M [20] includes approximately 653K images and one million pathology image-text
pairs, gathered from 1,087 hours of educational histopathology videos presented by pathologists
on YouTube. ARCH [15] is a pathology multiple-instance captioning dataset containing pathology
images at the bag and tile level. However, our work focuses on tile-level images from all datasets
for our contrastive training strategy. In total, we collected approximately 923K images from these
datasets.
Downstream tasks. For the classification task, the proposed method was evaluated on three
datasets from the Cancer Genome Atlas Data Portal [60]: TCGA-NSCLC, TCGA-RCC, and TCGA-BRCA.
We followed the ViLa-MIL [53] experimental settings for TCGA-NSCLC and TCGA-RCC, randomly
selecting proportions for training, validation, and testing. For TCGA-BRCA, we adopted the training
and testing slide IDs from MSCPT [18]. A detailed description is included in the appendix.
Implementation Details. We followed the ViLa-MIL preprocessing pipeline for tissue region selection and patch cropping. To integrate our attention module with CLIP50 and PLIP, we extracted tile-level embeddings from their frozen vision encoders (1024-dimensional for CLIP50 and 512-dimensional for PLIP). We used the visual encoder of Prov-GigaPath to produce 1536-dimensional embeddings. To align it with PLIP's frozen text encoder, we developed two MLP-based adaptors that project both encoders into a shared feature space during a contrastive learning process, using the datasets outlined in Section 4.1.
To implement spatial attention, we use a Graph Attention Network (GAT) to model spatial relationships between WSI patches. Each tile-level embedding serves as a node, connected to its left, right, top, and bottom neighbors, ensuring local spatial dependencies are captured. We then integrate spatial patch group-based attention $p_{v,gr}$ into patch-based attention $p_{v,p}$ using Equation 6. The hyperparameter α (ranging from 0 to 1) controls the balance between spatial context and prototype-based guidance.

Table 1: Comparison of methods on TCGA-BRCA with few-shot settings. Results are shown for AUC, F1, and Accuracy (ACC). FVM denotes foundation vision-language models.

| Methods | # Param. | AUC | F1 | ACC |
|---|---|---|---|---|
| Max-pooling | 197K | 60.42±4.35 | 56.40±3.58 | 68.55±6.54 |
| Mean-pooling | 197K | 66.64±4.21 | 60.70±2.78 | 71.73±3.59 |
| CLIP ImageNet Pretrained | | | | |
| ABMIL [21] | 461K | 69.24±3.90 | 61.72±3.36 | 72.77±3.15 |
| CLAM-SB [34] | 660K | 67.80±5.14 | 60.51±5.01 | 72.46±4.36 |
| CLAM-MB [34] | 660K | 60.81±4.87 | 55.48±4.96 | 67.31±4.19 |
| TransMIL [52] | 2.54M | 65.62±3.20 | 60.75±4.04 | 67.52±4.16 |
| DSMIL [26] | 462K | 66.18±3.08 | 59.35±3.18 | 67.52±1.56 |
| RRT-MIL [59] | 2.63M | 66.33±4.30 | 61.14±5.93 | 71.21±8.94 |
| CoOp [75] | 337K | 68.86±4.35 | 61.64±2.40 | 71.08±3.22 |
| CoCoOp [74] | 370K | 69.13±4.27 | 61.48±2.62 | 72.41±1.87 |
| Metaprompt [72] | 360K | 69.12±4.46 | 63.39±4.38 | 74.65±7.20 |
| TOP [46] | 2.11M | 69.74±3.14 | 63.39±4.62 | 74.41±5.27 |
| ViLa-MIL [53] | 2.77M | 72.25±6.16 | 62.04±2.38 | 75.01±6.14 |
| MSCPT [18] | 1.35M | 74.56±4.54 | 65.59±1.85 | 75.82±2.38 |
| MGPath (ViT) | 592K | 74.96±6.98 | 64.60±5.39 | 77.10±2.39 |
| FVM | | | | |
| ABMIL [21] | 461K | 72.41±4.25 | 63.04±3.62 | 74.09±4.38 |
| CLAM-SB [34] | 660K | 72.34±6.17 | 65.51±3.28 | 76.16±4.36 |
| CLAM-MB [34] | 660K | 73.41±3.76 | 66.11±1.94 | 77.88±2.30 |
| TransMIL [52] | 2.54M | 74.98±6.01 | 67.50±6.00 | 77.04±6.14 |
| DSMIL [26] | 462K | 71.44±2.72 | 64.48±1.64 | 75.26±2.28 |
| RRT-MIL [59] | 2.63M | 71.21±6.46 | 64.15±1.38 | 75.92±5.10 |
| CoOp [75] | 337K | 71.53±2.45 | 64.84±2.40 | 74.22±5.02 |
| CoCoOp [74] | 370K | 72.65±4.63 | 66.63±3.55 | 66.98±3.35 |
| Metaprompt [72] | 360K | 74.86±4.25 | 65.03±1.81 | 77.88±3.22 |
| TOP [46] | 2.11M | 76.13±6.01 | 66.55±1.72 | 78.58±5.30 |
| ViLa-MIL [53] | 2.77M | 74.06±4.62 | 66.03±1.81 | 78.12±4.88 |
| MSCPT [18] | 1.35M | 75.55±5.25 | 67.46±2.43 | 79.14±2.63 |
| MGPath | 592K | 79.02±6.43 | 68.25±4.42 | 79.65±1.72 |
| MGPath (PLIP-G) | 5.35M | 87.36±1.85 | 73.13±3.49 | 79.56±4.77 |

4.2 Comparison to State-of-the-Art

We compare MGPath against MIL-based methods, including max/mean pooling, ABMIL [21], CLAM [34], TransMIL [52], DSMIL [26], GTMIL [73], DTMIL [70], RRT-MIL [59], and IBMIL [31], and vision-language methods, including CoOp [75], CoCoOp [74], Metaprompt [72], TOP [46], ViLa-MIL [53], MSCPT [18], QUILT [20], and CONCH [32]. Among these, QUILT and CONCH are foundation VLMs.

We provide different versions of our MGPath, including a CLIP ResNet-50 backbone (CLIP50) for TCGA-NSCLC and TCGA-RCC and a ViT-B/16 backbone for TCGA-BRCA. We also provide a version using the PLIP backbone, as well as GigaPath-PLIP, which was pre-trained on pathology data.
Table 2: Comparison of methods on TCGA-NSCLC and TCGA-RCC datasets with few-shot settings. Results are shown for AUC, F1, and Accuracy (ACC), together with parameter counts (# Param.), for each dataset.
MGPath consistently outperforms the baseline models and achieves significant improvements over other VLMs with similar architectures, such as ViLa-MIL and MSCPT. The performance gain is particularly notable with the PLIP backbone. For example, on TCGA-BRCA using CLIP (ViT), MGPath achieves an accuracy of 77.10%, compared to 75.82% for MSCPT and 75.01% for ViLa-MIL. Additionally, with the PLIP backbone, MGPath surpasses MSCPT and ViLa-MIL by margins of approximately 3.5% to 5%.
GigaPath-PLIP achieves competitive performance in zero-shot tasks. We evaluate
the zero-shot capabilities of our model on three datasets and compare its performance against
foundation VLMs such as CONCH, QUILT, and PLIP. The results, summarized in Table 3, show
that the proposed VLM model achieves the best average performance across datasets, followed by
CONCH and PLIP. This consistent top-tier performance across multiple benchmarks underscores
the robustness and generalizability of our model.
Multi-Granular Prompt Learning. In Table 5, we show the performance of MGPath with and
without multi-granular attention (M-Gran) for CLIP (rows 1 and 2) and PLIP-G (rows 3 and 4) on the TCGA-NSCLC dataset. Using M-Gran improves the final performance of MGPath, and the same holds on the TCGA-RCC dataset. Table 5 also shows the impact of the ratio used to combine graph-based spatial attention with non-graph attention on TCGA-NSCLC: with a ratio of 0.2/0.8 (0.2 for spatial attention obtained from the graph structure and 0.8 for prototype-guided attention), MGPath achieves the highest performance.
OT as Alignment between Contextual Prompts. Table 6 validates the use of OT in our
MGPath on TCGA-NSCLC and TCGA-RCC. We see that using OT boosts the performance of MGPath (rows 1 and 2) compared to cosine similarity (rows 3 and 4). It also shows that the optimal number of prompt vectors depends on the dataset. In the appendix, we additionally evaluate a variant using unbalanced optimal transport (UoT). We observe that both UoT and OT provide good alignment quality, with UoT slightly outperforming OT. However, this advantage comes at the cost of increased running time.
4.5 Discussion
While we demonstrate significant improvements in few-shot and zero-shot WSI classification across
several settings, this paper does not explore other important challenges. For example, how can we
scale the current attention mechanism to handle even larger numbers of image patches (e.g., using FlashAttention [11]), or extend the model from classification to tumor segmentation tasks [23]? Additionally, the potential for extending Prov-GigaPath to integrate with other large-scale VLM models, such as CONCH [32], remains unexplored.
5 Conclusion
High-resolution WSI is crucial for cancer diagnosis and treatment but presents challenges in data
analysis. Recent VLM approaches, which utilize few-shot and weakly supervised learning, have
shown promise in handling complex whole-slide pathology images with limited annotations. However,
many overlook the hierarchical relationships between visual and textual embeddings, ignoring the
connections between global and local pathological details or relying on non-pathology-specific
pre-trained models like CLIP. Additionally, previous metrics lack precision in capturing fine-grained
alignments between image-text pairs. To address these gaps, (i) we propose MGPath, which integrates
Prov-GigaPath with PLIP, cross-aligning them with 923K domain-specific image-text pairs. (ii)
Our multi-granular prompt learning approach captures hierarchical tissue details effectively, (iii)
while OT-based visual-text distance ensures robustness against data augmentation perturbations.
Extensive experiments on three cancer subtyping datasets demonstrate that MGPath achieves state-
of-the-art results in WSI classification. We expect that this work will pave the way for combining
large-scale domain-specific models with multi-granular prompt learning and optimal transport to
enhance few-shot learning in pathology.
Supplement to “MGPATH: Vision-Language Model with
Multi-Granular Prompt Learning for Few-Shot WSI Classification”
TCGA-RCC & TCGA-NSCLC. We adopt the same data splitting as in ViLa-MIL [53], using 16-shot
samples for training in each dataset. For testing, 192 samples were used for TCGA-RCC and 197
samples were used for TCGA-NSCLC.
TCGA-RCC & TCGA-NSCLC: The baselines in Table 4 are adapted from ViLa-MIL [53] where methods
employ ResNet-50 from the CLIP model as the primary backbone. We present MGPath results
using three architectures: ResNet-50, PLIP, and PLIP-G. With ResNet-50, we follow the ViLa-MIL
approach by training the text encoder and reporting performance for this setup. To assess efficiency,
we provide the total parameter counts for both ViLa-MIL and MGPath, considering scenarios with
frozen backbones and trainable text encoders. For PLIP and PLIP-G, all visual and text encoders
are kept frozen.
CONCH & QUILT: We download the pre-trained weights of these foundation models and adapt them for zero-shot evaluation on TCGA datasets following the authors' guidelines from [32], which randomly select 75 samples per class. For few-shot settings, since official implementations are not provided, we initialize the models with their pre-trained weights, fully fine-tune the text encoder, and evaluate on the same subsets that we use for the other baselines. While CONCH provides prompts for the datasets in its publication, QUILT does not; therefore, we fine-tune the model using CONCH's prompts and our own generated prompts for QUILT.
Comparison of graph convolution layers on TCGA-NSCLC:

| Configurations | AUC | F1 | ACC |
|---|---|---|---|
| MGPath (GAT CONV) | 77.2±1.3 | 70.9±2.0 | 71.0±2.1 |
| MGPath (GIN) | 77.1±2.9 | 69.8±3.9 | 69.9±4.0 |
| MGPath (GCN) | 75.1±2.9 | 67.6±2.5 | 67.1±2.8 |

Figure 4: AUC performance comparison over epochs for PLIP (blue) and PLIP combined with GigaPath (red). GigaPath significantly enhances the AUC, achieving more stable and higher values, particularly in the early epochs.
D.1 OT Formulation and Efficient Solver
Given two sets of feature embeddings $F = \{f_i\}_{i=1}^{M} \in \mathbb{R}^{M \times d}$ and $G = \{g_j\}_{j=1}^{N} \in \mathbb{R}^{N \times d}$, we can represent them as two discrete distributions $\mu$ and $\nu$ by:
\[
\mu = \sum_{i=1}^{M} p_i \delta_{f_i}, \qquad \nu = \sum_{j=1}^{N} q_j \delta_{g_j}, \qquad (12)
\]
where $\delta_{f_i}$ and $\delta_{g_j}$ represent Dirac delta functions centered at $f_i$ and $g_j$, respectively, and the weights are elements of the marginals $p = \{p_i\}_{i=1}^{M}$ and $q = \{q_j\}_{j=1}^{N}$, which can be selected as uniform weights with $\sum_{i=1}^{M} p_i = 1$ and $\sum_{j=1}^{N} q_j = 1$.

Then we can compute the distance between $F$ and $G$ through $\mu$ and $\nu$ (Eq. (9)) as
\[
d_{OT}(\mu, \nu) = \langle T^{*}, C \rangle, \qquad (13)
\]
where
\[
T^{*} = \underset{T \in \mathbb{R}^{M \times N}}{\arg\min} \; \sum_{i=1}^{M} \sum_{j=1}^{N} T_{ij} C_{ij} \quad \text{s.t.} \quad T \mathbf{1}_N = \mu, \; T^{\top} \mathbf{1}_M = \nu, \qquad (14)
\]
with $C \in \mathbb{R}^{M \times N}$ the cost matrix measuring the distance between $f_i \in \mu$ and $g_j \in \nu$.

Because directly solving Eq. (14) has a high computational cost ($\mathcal{O}(n^3 \log n)$ with $n$ proportional to $M$ and $N$), the Sinkhorn algorithm [10] is used to approximate the solution by solving a regularized problem:
\[
T^{*} = \underset{T \in \mathbb{R}^{M \times N}}{\arg\min} \; \sum_{i=1}^{M} \sum_{j=1}^{N} T_{ij} C_{ij} - \lambda H(T) \quad \text{s.t.} \quad T \mathbf{1}_N = \mu, \; T^{\top} \mathbf{1}_M = \nu, \qquad (15)
\]
where $H(T) = -\sum_{ij} T_{ij} \log T_{ij}$ is an entropy function and $\lambda > 0$ is the regularization parameter. The optimization problem in Eq. (15) is strictly convex, allowing us to obtain a solution efficiently with few iterations as outlined below:
\[
T^{*} = \mathrm{diag}(a_t) \exp(-C/\lambda)\, \mathrm{diag}(b_t), \qquad (16)
\]
where $t$ is the iteration index, $a_t = \mu / \big(\exp(-C/\lambda)\, b_{t-1}\big)$ and $b_t = \nu / \big(\exp(-C/\lambda)^{\top} a_t\big)$, with the initialization $b_0 = \mathbf{1}$. In our experiments, we used $t = 100$ and $\lambda = 0.1$ based on validation performance.
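For concreteness, a minimal NumPy sketch of the Sinkhorn iterations in Eq. (16) is given below, assuming uniform marginals and the stated defaults (t = 100 iterations, λ = 0.1); it is illustrative rather than the released solver.

```python
import numpy as np

def sinkhorn(C: np.ndarray, mu: np.ndarray, nu: np.ndarray,
             lam: float = 0.1, n_iters: int = 100) -> np.ndarray:
    """Entropic-regularized OT plan T* = diag(a) K diag(b), with K = exp(-C / lam)."""
    K = np.exp(-C / lam)                    # (M, N) Gibbs kernel
    b = np.ones(C.shape[1])                 # b_0 = 1
    for _ in range(n_iters):
        a = mu / (K @ b)                    # a_t = mu / (K b_{t-1})
        b = nu / (K.T @ a)                  # b_t = nu / (K^T a_t)
    return a[:, None] * K * b[None, :]      # transport plan T*

# OT distance <T*, C> between two sets of normalized embeddings F (M x d) and G (N x d):
F, G = np.random.randn(4, 512), np.random.randn(64, 512)
F /= np.linalg.norm(F, axis=1, keepdims=True)
G /= np.linalg.norm(G, axis=1, keepdims=True)
C = 1.0 - F @ G.T
mu, nu = np.full(4, 1 / 4), np.full(64, 1 / 64)
d_ot = np.sum(sinkhorn(C, mu, nu) * C)      # Eq. (9)
```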
A relaxed, unbalanced formulation replaces the hard marginal constraints of Eq. (15) with KL penalties:
\[
T^{*} = \underset{T \in \mathbb{R}^{M \times N}_{+}}{\arg\min} \; \sum_{i=1}^{M} \sum_{j=1}^{N} T_{ij} C_{ij} - \lambda H(T) + \rho_1 \mathrm{KL}(T \mathbf{1}_N \,\|\, \mu) + \rho_2 \mathrm{KL}(T^{\top} \mathbf{1}_M \,\|\, \nu), \qquad (18)
\]
here, $\rho_1$ and $\rho_2$ represent the marginal regularization parameters, and $\mathrm{KL}(P\|Q)$ denotes the Kullback-Leibler divergence between two positive vectors. Similar to the classical OT formulation, there are solvers based on the Sinkhorn algorithm that can address Eq. (18) [45]. However, these solvers typically require more iteration steps to converge to optimal solutions due to the added complexity introduced by the relaxed marginal constraints.
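A minimal sketch of the corresponding scaling iterations follows, based on the generalized Sinkhorn updates for KL-relaxed marginals [9, 45]; ρ1, ρ2, λ, and the iteration count are illustrative defaults rather than the values used in our experiments.

```python
import numpy as np

def sinkhorn_unbalanced(C: np.ndarray, mu: np.ndarray, nu: np.ndarray,
                        lam: float = 0.1, rho1: float = 1.0, rho2: float = 1.0,
                        n_iters: int = 200) -> np.ndarray:
    """Unbalanced entropic OT for Eq. (18): KL-relaxed marginals (cf. [9, 45])."""
    K = np.exp(-C / lam)
    f1, f2 = rho1 / (rho1 + lam), rho2 / (rho2 + lam)   # damping exponents from the KL terms
    a, b = np.ones(C.shape[0]), np.ones(C.shape[1])
    for _ in range(n_iters):
        a = (mu / (K @ b)) ** f1
        b = (nu / (K.T @ a)) ** f2
    return a[:, None] * K * b[None, :]                   # relaxed transport plan
```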
Table 8: MGPath performance and running time (in seconds) comparison between OT and UoT.

| Methods | AUC ↑ | F1 ↑ | ACC ↑ | Time (s) ↓ |
|---|---|---|---|---|
| TCGA-NSCLC | | | | |
| MGPath (OT, 4 text prompts) | 76.2±2.2 | 69.0±3.5 | 69.3±2.8 | 1482 |
| MGPath (UoT, 4 text prompts) | 77.0±1.8 | 70.2±3.4 | 70.4±3.3 | 3260 |
| TCGA-RCC | | | | |
| MGPath (OT, 4 text prompts) | 92.1±2.8 | 76.5±5.2 | 81.7±2.9 | 1451 |
| MGPath (UoT, 4 text prompts) | 92.8±2.4 | 76.8±4.7 | 82.4±2.4 | 3049 |
References
[1] F. Ahmed, A. Sellergren, L. Yang, S. Xu, B. Babenko, A. Ward, N. Olson, A. Mohtashamian,
Y. Matias, G. S. Corrado, et al. Pathalign: A vision-language model for whole slide images in
histopathology. arXiv preprint arXiv:2406.19578, 2024. (Cited on page 2.)
[2] A. Dosovitskiy et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv
preprint arXiv:2010.11929, 2020. (Cited on page 14.)
[3] J. Barker, A. Hoogi, A. Depeursinge, and D. L. Rubin. Automated classification of brain tumor
type in whole-slide digital pathology images using local representative tiles. Medical image
analysis, 30:60–71, 2016. (Cited on page 2.)
[4] Q. Cao, Z. Xu, Y. Chen, C. Ma, and X. Yang. Domain-controlled prompt learning. In
Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 936–944, 2024.
(Cited on page 3.)
[5] G. Chen, W. Yao, X. Song, X. Li, Y. Rao, and K. Zhang. Plot: Prompt learning with optimal
transport for vision-language models. International Conference on Learning Representations,
2023. (Cited on pages 3 and 8.)
[8] S. Cheng, S. Liu, J. Yu, G. Rao, Y. Xiao, W. Han, W. Zhu, X. Lv, N. Li, J. Cai, et al.
Robust whole slide image analysis for cervical cancer screening using deep learning. Nature
communications, 12(1):5639, 2021. (Cited on page 2.)
[9] L. Chizat, G. Peyré, B. Schmitzer, and F.-X. Vialard. Scaling algorithms for unbalanced
optimal transport problems. Mathematics of Computation, 87(314):2563–2609, 2018. (Cited on
page 16.)
[11] T. Dao, D. Fu, S. Ermon, A. Rudra, and C. Ré. Flashattention: Fast and memory-efficient exact
attention with io-awareness. Advances in Neural Information Processing Systems, 35:16344–
16359, 2022. (Cited on page 13.)
[12] S. Dong, Z. Pan, Y. Fu, D. Xu, K. Shi, Q. Yang, Y. Shi, and C. Zhuo. Partial unbalanced
feature transport for cross-modality cardiac image segmentation. IEEE Transactions on Medical
Imaging, 42(6):1758–1773, 2023. (Cited on page 3.)
[13] N. Farahani, A. V. Parwani, and L. Pantanowitz. Whole slide imaging in pathology: advantages,
limitations, and emerging perspectives. Pathology and Laboratory Medicine International, pages
23–33, 2015. (Cited on page 2.)
[14] M. Gadermayr and M. Tschuchnig. Multiple instance learning for digital pathology: A review
of the state-of-the-art, limitations & future potential. Computerized Medical Imaging and
Graphics, page 102337, 2024. (Cited on page 2.)
[15] J. Gamper and N. Rajpoot. Multiple instance captioning: Learning representations from
histopathology textbooks and articles. In Proceedings of the IEEE/CVF conference on computer
vision and pattern recognition, pages 16549–16559, 2021. (Cited on pages 3 and 9.)
[16] P. Gao, S. Geng, R. Zhang, T. Ma, R. Fang, Y. Zhang, H. Li, and Y. Qiao. Clip-adapter:
Better vision-language models with feature adapters. International Journal of Computer Vision,
132(2):581–595, 2024. (Cited on pages 2 and 3.)
[17] C. Ge, R. Huang, M. Xie, Z. Lai, S. Song, S. Li, and G. Huang. Domain adaptation via prompt
learning. IEEE Transactions on Neural Networks and Learning Systems, 2023. (Cited on page 4.)
[18] M. Han, L. Qu, D. Yang, X. Zhang, X. Wang, and L. Zhang. Mscpt: Few-shot whole slide
image classification with multi-scale and context-focused prompt tuning. arXiv preprint
arXiv:2408.11505, 2024. (Cited on pages 2, 3, 4, 6, 7, 9, 10, and 14.)
[21] M. Ilse, J. Tomczak, and M. Welling. Attention-based deep multiple instance learning. In
International conference on machine learning, pages 2127–2136. PMLR, 2018. (Cited on pages 2,
4, 7, 10, and 11.)
[22] M. U. Khattak, H. Rasheed, M. Maaz, S. Khan, and F. S. Khan. Maple: Multi-modal prompt
learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
Recognition, pages 19113–19122, 2023. (Cited on page 4.)
[24] K. Kim, Y. Oh, and J. C. Ye. Zegot: Zero-shot segmentation through optimal transport of
text prompts. arXiv preprint arXiv:2301.12171, 2023. (Cited on page 8.)
[25] T. N. Kipf and M. Welling. Semi-supervised classification with graph convolutional networks.
International Conference on Learning Representations, 2017. (Cited on page 15.)
[26] B. Li, Y. Li, and K. W. Eliceiri. Dual-stream multiple instance learning network for whole slide
image classification with self-supervised contrastive learning. In Proceedings of the IEEE/CVF
conference on computer vision and pattern recognition, pages 14318–14328, 2021. (Cited on
pages 4, 10, and 11.)
[27] H. Li, C. Zhu, Y. Zhang, Y. Sun, Z. Shui, W. Kuang, S. Zheng, and L. Yang. Task-specific
fine-tuning via variational information bottleneck for weakly-supervised pathology whole slide
image classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition, pages 7454–7463, 2023. (Cited on page 2.)
[28] X. L. Li and P. Liang. Prefix-tuning: Optimizing continuous prompts for generation. Proceedings
of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th
International Joint Conference on Natural Language Processing (Volume 1: Long Papers, 2021.
(Cited on pages 2 and 3.)
[29] Z. Li, L. Zhao, Z. Zhang, H. Zhang, D. Liu, T. Liu, and D. N. Metaxas. Steering prototypes
with prompt-tuning for rehearsal-free continual learning. In Proceedings of the IEEE/CVF
Winter Conference on Applications of Computer Vision, pages 2523–2533, 2024. (Cited on page 4.)
[30] M. Liero, A. Mielke, and G. Savaré. Optimal entropy-transport problems and a new hellinger–
kantorovich distance between positive measures. Inventiones mathematicae, 211(3):969–1117,
2018. (Cited on page 16.)
[31] T. Lin, Z. Yu, H. Hu, Y. Xu, and C.-W. Chen. Interventional bag multi-instance learning on
whole-slide pathological images. In Proceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition, pages 19830–19839, 2023. (Cited on pages 2, 10, and 11.)
[35] Y. Lu, J. Liu, Y. Zhang, Y. Liu, and X. Tian. Prompt distribution learning. In Proceedings of
the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5206–5215,
2022. (Cited on page 4.)
[36] A. Madabhushi and G. Lee. Image analysis and machine learning in digital pathology: Challenges
and opportunities. Medical image analysis, 33:170–175, 2016. (Cited on page 2.)
[37] D. Nechaev, A. Pchelnikov, and E. Ivanova. Hibou: A family of foundational vision transformers
for pathology. arXiv preprint arXiv:2406.05074, 2024. (Cited on page 4.)
[38] D. M. Nguyen, N. T. Diep, T. Q. Nguyen, H.-B. Le, T. Nguyen, T. Nguyen, T. Nguyen, N. Ho,
P. Xie, R. Wattenhofer, et al. Logra-med: Long context multi-graph alignment for medical
vision-language model. arXiv preprint arXiv:2410.02615, 2024. (Cited on page 8.)
[40] H. Nguyen, K. Le, Q. Nguyen, T. Pham, H. Bui, and N. Ho. On robust optimal transport:
Computational complexity and barycenter computation. In Advances in NeurIPS, 2021. (Cited
on page 3.)
[41] M. K. K. Niazi, A. V. Parwani, and M. N. Gurcan. Digital pathology and artificial intelligence.
The lancet oncology, 20(5):e253–e261, 2019. (Cited on page 1.)
[42] A. v. d. Oord, Y. Li, and O. Vinyals. Representation learning with contrastive predictive
coding. arXiv preprint arXiv:1807.03748, 2018. (Cited on page 5.)
[44] G. Peyré, M. Cuturi, et al. Computational optimal transport: With applications to data science.
Foundations and Trends® in Machine Learning, 11(5-6):355–607, 2019. (Cited on page 15.)
[45] K. Pham, K. Le, N. Ho, T. Pham, and H. Bui. On unbalanced optimal transport: An analysis
of Sinkhorn algorithm. In International Conference on Machine Learning, pages 7673–7682.
PMLR, 2020. (Cited on pages 3 and 17.)
[46] L. Qu, K. Fu, M. Wang, Z. Song, et al. The rise of ai language pathologists: Exploring two-level
prompt learning for few-shot weakly-supervised whole slide image classification. Advances in
Neural Information Processing Systems, 36, 2024. (Cited on pages 3, 4, 8, and 10.)
[47] L. Qu, D. Yang, D. Huang, Q. Guo, R. Luo, S. Zhang, and X. Wang. Pathology-knowledge
enhanced multi-instance prompt learning for few-shot whole slide image classification. arXiv
preprint arXiv:2407.10814, 2024. (Cited on page 2.)
[49] Y. Rao, W. Zhao, G. Chen, Y. Tang, Z. Zhu, G. Huang, J. Zhou, and J. Lu. Denseclip: Language-
guided dense prediction with context-aware prompting. In Proceedings of the IEEE/CVF
conference on computer vision and pattern recognition, pages 18082–18091, 2022. (Cited on
page 4.)
[50] J. Ryu, A. V. Puche, J. Shin, S. Park, B. Brattoli, J. Lee, W. Jung, S. I. Cho, K. Paeng, C.-Y.
Ock, et al. Ocelot: overlapped cell on tissue dataset for histopathology. In Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 23902–23912, 2023.
(Cited on page 2.)
[51] T. Séjourné, G. Peyré, and F.-X. Vialard. Unbalanced optimal transport, from theory to
numerics. Handbook of Numerical Analysis, 24:407–471, 2023. (Cited on pages 3 and 8.)
[52] Z. Shao, H. Bian, Y. Chen, Y. Wang, J. Zhang, X. Ji, et al. Transmil: Transformer-based
correlated multiple instances learning for whole slide image classification. Advances in neural
information processing systems, 34:2136–2147, 2021. (Cited on pages 4, 10, and 11.)
[53] J. Shi, C. Li, T. Gong, Y. Zheng, and H. Fu. Vila-mil: Dual-scale vision-language multiple
instance learning for whole slide image classification. In Proceedings of the IEEE/CVF Confer-
ence on Computer Vision and Pattern Recognition, pages 11248–11258, 2024. (Cited on pages 2, 3,
4, 6, 7, 9, 10, 11, and 14.)
[54] J. Shi, L. Tang, Y. Li, X. Zhang, Z. Gao, Y. Zheng, C. Wang, T. Gong, and C. Li. A
structure-aware hierarchical graph-based multiple instance learning framework for pt staging
in histopathological image. IEEE Transactions on Medical Imaging, 42(10):3000–3011, 2023.
(Cited on page 2.)
[55] M. Shu, W. Nie, D.-A. Huang, Z. Yu, T. Goldstein, A. Anandkumar, and C. Xiao. Test-time
prompt tuning for zero-shot generalization in vision-language models. Advances in Neural
Information Processing Systems, 35:14274–14289, 2022. (Cited on page 4.)
[56] S. P. Singh and M. Jaggi. Model fusion via optimal transport. Advances in Neural Information
Processing Systems, 33:22045–22055, 2020. (Cited on page 8.)
[58] W. Tang, S. Huang, X. Zhang, F. Zhou, Y. Zhang, and B. Liu. Multiple instance learning
framework with masked hard instance mining for whole slide image classification. In Proceedings
of the IEEE/CVF International Conference on Computer Vision, pages 4078–4087, 2023. (Cited
on page 2.)
[59] W. Tang, F. Zhou, S. Huang, X. Zhu, Y. Zhang, and B. Liu. Feature re-embedding: To-
wards foundation model-level performance in computational pathology. In Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11343–11352, 2024.
(Cited on page 10.)
[60] The Cancer Genome Atlas (TCGA). Genomic Data Commons Data Portal (GDC). https:
//portal.gdc.cancer.gov/projects/TCGA-BRCA. Accessed 07 Jul. 2023. (Cited on pages 9
and 14.)
[61] M. Tsuneki and F. Kanavati. Inference of captions from histopathological patches. In Inter-
national Conference on Medical Imaging with Deep Learning, pages 1235–1250. PMLR, 2022.
(Cited on pages 3 and 9.)
[62] A. Vaswani et al. Attention is all you need. Advances in Neural Information Processing Systems,
2017. (Cited on pages 2 and 7.)
[63] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Lio, and Y. Bengio. Graph attention
networks. arXiv preprint arXiv:1710.10903, 2017. (Cited on page 7.)
[64] C. Villani et al. Optimal transport: old and new, volume 338. Springer, 2009. (Cited on page 15.)
[65] G. Xu, Z. Song, Z. Sun, C. Ku, Z. Yang, C. Liu, S. Wang, J. Ma, and W. Xu. Camel: A weakly
supervised learning framework for histopathology image segmentation. In Proceedings of the
IEEE/CVF International Conference on computer vision, pages 10682–10691, 2019. (Cited on
page 2.)
[66] H. Xu, N. Usuyama, J. Bagga, S. Zhang, R. Rao, T. Naumann, C. Wong, Z. Gero, J. González,
Y. Gu, et al. A whole-slide foundation model for digital pathology from real-world data. Nature,
pages 1–8, 2024. (Cited on pages 2, 3, 4, and 11.)
[67] K. Xu, W. Hu, J. Leskovec, and S. Jegelka. How powerful are graph neural networks?
International Conference on Learning Representations, 2019. (Cited on page 15.)
[68] H. Yao, R. Zhang, and C. Xu. Tcp: Textual-based class-aware prompt tuning for visual-
language model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
Recognition, pages 23438–23448, 2024. (Cited on pages 2, 3, and 4.)
[69] F. Zhan, Y. Yu, K. Cui, G. Zhang, S. Lu, J. Pan, C. Zhang, F. Ma, X. Xie, and C. Miao.
Unbalanced feature transport for exemplar-based image translation. In Proceedings of the
IEEE/CVF conference on computer vision and pattern recognition, pages 15028–15038, 2021.
(Cited on page 3.)
[70] H. Zhang, Y. Meng, Y. Zhao, Y. Qiao, X. Yang, S. E. Coupland, and Y. Zheng. Dtfd-mil:
Double-tier feature distillation multiple instance learning for histopathology whole slide image
classification. In Proceedings of the IEEE/CVF conference on computer vision and pattern
recognition, pages 18802–18812, 2022. (Cited on pages 10 and 11.)
[71] Y. Zhang, H. Fei, D. Li, T. Yu, and P. Li. Prompting through prototype: A prototype-based
prompt learning on pretrained vision-language models. arXiv preprint arXiv:2210.10841, 2022.
(Cited on page 4.)
[72] C. Zhao, Y. Wang, X. Jiang, Y. Shen, K. Song, D. Li, and D. Miao. Learning domain invariant
prompt for vision-language models. IEEE Transactions on Image Processing, 2024. (Cited on
pages 8 and 10.)
[74] K. Zhou, J. Yang, C. C. Loy, and Z. Liu. Conditional prompt learning for vision-language
models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition,
pages 16816–16825, 2022. (Cited on pages 4, 8, and 10.)
[75] K. Zhou, J. Yang, C. C. Loy, and Z. Liu. Learning to prompt for vision-language models.
International Journal of Computer Vision, 130(9):2337–2348, 2022. (Cited on pages 2, 3, 4, 8,
and 10.)