MSCPT: Few-Shot Whole Slide Image Classification With Multi-Scale and Context-Focused Prompt Tuning
Abstract—Multiple instance learning (MIL) has become the standard paradigm for weakly supervised classification of whole slide images (WSIs). However, this paradigm relies on a large number of labelled WSIs for training. The lack of training data and the presence of rare diseases present significant challenges for these methods. Prompt tuning combined with pre-trained Vision-Language models (VLMs) is an effective solution to the Few-shot Weakly Supervised WSI Classification (FSWC) task. Nevertheless, applying prompt tuning methods designed for natural images to WSIs presents three significant challenges: 1) they fail to fully exploit the prior knowledge from the VLM's text modality; 2) they overlook the essential multi-scale and contextual information in WSIs, leading to suboptimal results; and 3) they lack exploration of instance aggregation methods. To address these problems, we propose a Multi-Scale and Context-focused Prompt Tuning (MSCPT) method for FSWC tasks. Specifically, MSCPT employs a frozen large language model to generate pathological visual descriptions as multi-scale language prior knowledge, guiding hierarchical prompt tuning. Additionally, we design a graph prompt tuning module to learn essential contextual information within the WSI, and finally, a non-parametric cross-guided instance aggregation module is introduced to obtain the WSI-level features. Based on two VLMs, extensive experiments and visualizations on three datasets demonstrate the powerful performance of our MSCPT.

Index Terms—whole slide image classification, prompt tuning, few-shot learning, multimodal.

This project was funded by the National Natural Science Foundation of China 82090052. Minghao Han and Linhao Qu are the co-first authors. (Corresponding author: Lihua Zhang.) Minghao Han, Dingkang Yang, Xukun Zhang, and Lihua Zhang are with the Academy for Engineering and Technology, Fudan University, Shanghai 200433, China (e-mail: {mhhan22, zhangxk21}@m.fudan.edu.cn, {dkyang20, lihuazhang}@fudan.edu.cn). Linhao Qu is with the Digital Medical Research Center, School of Basic Medical Science, Fudan University, Shanghai 200032, China (e-mail: [email protected]). Xiaoying Wang is with the Zhongshan Hospital, Fudan University, Shanghai 200032, China (e-mail: [email protected]).

Fig. 1. Motivation of our MSCPT. (a) Traditional MIL-based methods mainly focus on instance aggregation and require a large amount of training data. (b) Prompt tuning methods for natural images incorporate a set of trainable parameters into the input space for training, enabling pre-trained VLMs to be applied to downstream tasks. However, those methods are only suitable for single images and are no longer adequate for WSI-level tasks due to the enormous size of WSIs. (c) MSCPT leverages pathological visual descriptions combined with multimodal hierarchical prompt tuning to explore the potential of VLMs. For simplicity, we only depict the data flow diagram for a single scale.

I. INTRODUCTION

Developing automated analysis frameworks using Whole Slide Images (WSIs) is crucial in clinical practice [1]–[4], as WSIs are widely regarded as the "gold standard" for cancer diagnosis, typing, staging, and prognosis analysis [5], [6]. Given the enormous size of WSIs (roughly 40,000 × 40,000 pixels), multiple instance learning (MIL) [7] has become the dominant method. As shown in Fig. 1a, traditional MIL-based methods typically follow a four-step paradigm: patch cutting, feature extraction, feature aggregation, and classification. Most MIL-based methods are conducted under weak supervision at the bag level, as creating instance-level labels is quite labor-intensive [8], [9]. This weak supervision paradigm has led to a problem: a large number of WSIs are required to train an effective model [10], [11]. In clinical practice, patient privacy concerns, rare diseases, and the difficulty of preparing pathology slides make accumulating a large number of WSIs very challenging [12], [13].

Vision-Language models (VLMs) have shown excellent generalization ability to downstream tasks [14]–[20]. Recently, researchers have proposed specialized VLMs for analyzing pathological images, including MI-Zero [21], PLIP [22], and Conch [23]. These VLMs, extensively pre-trained on abundant image-text pairs, contain significant prior knowledge. If the prior knowledge of VLMs can be fully exploited with a few training samples, it can partially alleviate the data scarcity problem in WSI classification tasks.
Therefore, we aim to explore a novel "data-efficient" method based on VLMs to improve model performance on the Few-shot Weakly Supervised WSI Classification (FSWC) [8] task.

Nevertheless, there is a gap between generally pre-trained VLMs and specific downstream tasks. Under few-shot scenarios, researchers often employ prompt tuning to bridge this gap with the help of a few training samples [14], [18], [24], [25]. As shown in Fig. 1b, prompt tuning aims to learn a set of trainable continuous vectors and incorporate these vectors into the input space for training, effectively adapting the fixed pre-trained VLMs to specific downstream tasks.
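For readers unfamiliar with this idea, the snippet below is a minimal, illustrative sketch of CoOp-style text prompt tuning: a few trainable context vectors are prepended to frozen class-name token embeddings before they enter the frozen text encoder. The module name, dimensions, and class embeddings are hypothetical placeholders, not the interface of any particular VLM.

```python
import torch
import torch.nn as nn

class LearnableTextPrompt(nn.Module):
    """CoOp-style prompt: n_ctx trainable context vectors shared across classes."""
    def __init__(self, n_ctx: int, embed_dim: int, class_token_embeds: torch.Tensor):
        super().__init__()
        # Trainable continuous prompt vectors (the only parameters that get gradients).
        self.ctx = nn.Parameter(torch.randn(n_ctx, embed_dim) * 0.02)
        # Frozen token embeddings of each class name: (n_classes, n_name_tokens, embed_dim).
        self.register_buffer("class_token_embeds", class_token_embeds)

    def forward(self) -> torch.Tensor:
        n_classes = self.class_token_embeds.shape[0]
        # Broadcast the shared context in front of every class-name embedding.
        ctx = self.ctx.unsqueeze(0).expand(n_classes, -1, -1)
        # (n_classes, n_ctx + n_name_tokens, embed_dim) -> fed to the frozen text encoder.
        return torch.cat([ctx, self.class_token_embeds], dim=1)

# Toy usage: two classes, 8 name tokens each, 512-dim embeddings.
prompt = LearnableTextPrompt(n_ctx=4, embed_dim=512,
                             class_token_embeds=torch.randn(2, 8, 512))
print(prompt().shape)   # torch.Size([2, 12, 512])
```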
However, existing prompt tuning methods for natural images (such as CoOp [25], CoCoOp [24], and MetaPrompt [26]) are only effective for single images (i.e., at the patch level). Since each WSI typically contains tens of thousands of patches, these methods are ineffective for WSI-level tasks. Also, studies indicate that the multi-scale information [27] and the contextual information [28] in WSIs play a significant role in cancer analysis, but those methods fail to capture this crucial information. Additionally, in training VLMs, the image-text pairs contain more than just information about the category. They also include more details about the image, such as contextual properties of the object [14] and descriptions of the cellular microenvironment [22], [23]. However, existing prompt tuning methods have primarily focused on image category information without emphasizing a detailed analysis of image content, which has left the full potential of pathological VLMs underexplored.

To address the aforementioned issues, we propose Multi-Scale and Context-focused Prompt Tuning (MSCPT) for WSI classification in weakly supervised and few-shot scenarios. Our framework fully leverages the characteristic that VLMs are trained with image-text pairs, operating at dual magnification scales: 1) at low magnification, we provide the VLM with pathological visual descriptions at the tissue level (such as the infiltration between tumor tissue and other normal tissues); 2) at high magnification, pathological visual descriptions at the cellular level (such as cell morphology, nuclear changes, and the formation of various organelles) are provided to the VLM. These multi-scale pathological visual descriptions help the VLM identify regions that are helpful for cancer analysis and achieve optimal results even with limited training samples.

As illustrated in Fig. 1c, the core idea behind MSCPT is to incorporate prior knowledge at the tissue and cellular scales into WSI-level tasks. Specifically, we first use a frozen large language model (LLM) to generate multi-scale pathological visual descriptions, leveraging them as prior knowledge.

Secondly, we design a Multi-scale Hierarchical Prompt Tuning (MHPT) module to hierarchically combine pathological visual descriptions from multiple scales and enhance prompt effectiveness. Inspired by Metaprompt [26], a dual-path asymmetric framework is adopted, asymmetrically freezing the image encoder and text encoder at different scales for prompt tuning. This asymmetric framework enables us to freeze half of the encoders to reduce the number of trainable parameters. Specifically, MHPT contains low-level and high-level prompts for the low- and high-magnification visual descriptions, as well as global trainable prompts. The MHPT module employs the transformer layers in the text encoder to effectively learn the interactions among the three distinct prompts.

Furthermore, the Image-text Similarity-based Graph Prompt Tuning (ISGPT) module is introduced to extract contextual information. Precisely, we do not follow previous approaches [29], [30] that use patch positions or patch feature similarity to construct graph neural networks (GNNs). Instead, we propose to use the similarity between patches and pathological visual descriptions as the basis for building GNNs. We believe that using image-text pairs to build GNNs is more effective for capturing global features than methods relying on patch positions and image feature similarity, and the corresponding ablation experiments confirm this hypothesis.

Finally, impressed by the powerful zero-shot capabilities of VLMs [21]–[23], we fully leverage the similarity between patches and pathological visual descriptions to aggregate instances. The Non-Parametric Cross-Guided Pooling (NPCGP) module, utilizing the top-K algorithm for instance aggregation, is introduced to further reduce the risk of overfitting in few-shot scenarios. Overall, our contributions are summarised as follows:

1) MSCPT demonstrates that high-level concepts from pathological descriptions combined with low-level image representations can enhance few-shot weakly supervised WSI classification.
2) MSCPT achieves excellent performance by introducing only a limited number of trainable parameters (∼0.9% of the pre-trained VLM). Additionally, MSCPT is applicable to fine-tune any VLM for WSI-level tasks.
3) Extensive experiments and visualizations on three datasets and two VLMs confirm that MSCPT's performance is state-of-the-art in few-shot scenarios, surpassing other traditional MIL-based and prompt tuning methods.

II. RELATED WORK

A. Multiple Instance Learning in Whole Slide Images

Due to the high resolution of Whole Slide Images (WSIs) and the challenges of detailed labelling, weakly supervised methods based on Multiple Instance Learning (MIL) have emerged as the mainstream for WSI analysis. MIL-based methods treat a WSI as a bag and all of its patches as instances, considering a bag positive if it contains at least one positive instance. Within the MIL framework, an aggregation step is required to aggregate all instances into bag features. The most primitive aggregation methods are non-parametric mean pooling and max pooling. However, since disease-related instances are only a small fraction [31], these non-parametric aggregation methods treat all instances equally, causing useful information to be overwhelmed by irrelevant data. Subsequently, attention-based methods (such as ABMIL [32], DSMIL [33], and CLAM [31]) were introduced, assigning different weights to each instance and aggregating the instances based on these weights. Furthermore, MIL methods based on Graph Neural Networks (GNNs) [29], [30] and Transformers [1], [34] have also been proposed to capture both local and global contextual information of WSIs.
Fig. 2. We develop MSCPT based on the dual-path asymmetric framework, which feeds patches and pathological visual descriptions at multiple scales into different encoders. MSCPT utilizes a large language model to generate multi-scale pathological visual descriptions. These descriptions are combined using Multi-scale Hierarchical Prompt Tuning (MHPT) to integrate information across multiple scales. Then, Image-text Similarity-based Graph Prompt Tuning (ISGPT) is employed to learn contextual information at each scale. Finally, Non-Parametric Cross-Guided Pooling (NPCGP) aggregates instances guided by the pathological visual descriptions to produce the final Whole Slide Image classification result.
Those methods have shown significant improvements in recent years. Still, the cost of enhancing model performance is an increase in parameters, requiring a large amount of data to train a well-performing model. In many cases, training data faces a scarcity issue. Therefore, this paper proposes MSCPT, which leverages Vision-Language models combined with pathological descriptions from large language models to enhance performance in few-shot scenarios.
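As a concrete reference for the aggregation step discussed above, the following is a minimal sketch of attention-based instance pooling in the spirit of ABMIL [32]; it is a simplified illustration under assumed embedding sizes, not the exact architecture of any cited method.

```python
import torch
import torch.nn as nn

class AttentionMILPooling(nn.Module):
    """Weights each instance embedding and sums them into a single bag embedding."""
    def __init__(self, in_dim: int = 512, hidden_dim: int = 128):
        super().__init__()
        self.attention = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, instances: torch.Tensor) -> torch.Tensor:
        # instances: (num_patches, in_dim) patch embeddings of one WSI (the bag)
        scores = self.attention(instances)                 # (num_patches, 1)
        weights = torch.softmax(scores, dim=0)             # attention over instances
        bag_embedding = (weights * instances).sum(dim=0)   # (in_dim,)
        return bag_embedding

pooling = AttentionMILPooling()
bag = pooling(torch.randn(1000, 512))   # one WSI with 1000 patch embeddings
print(bag.shape)                        # torch.Size([512])
```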
B. Vision-Language Models

Vision-Language models (VLMs) are rapidly developing in various fields. During training, VLMs use contrastive learning to reduce the distance between paired images and texts and to increase the distance between unpaired ones. CLIP [14] collected over 400M image-text pairs from the internet and used contrastive learning to align them, resulting in compatibility across various tasks. Compared to natural images, gathering pairs of pathological images and corresponding descriptions is challenging. To address this issue, MI-Zero [21] first pretrains image and text encoders using unpaired data, and then aligns them in a common space using 33,480 pathological image-text pairs. Huang et al. gathered over 450K pathological image-text pairs from Twitter and LAION [35] and developed PLIP [22]. Lu et al. trained Conch [23] on over 1.17M pathological image-text pairs, and it performs well on downstream tasks. Pretrained VLMs have significant potential, but effective methods to leverage them for WSI-level tasks are lacking. In this paper, we propose using pathological visual descriptions as prior knowledge to unleash the potential of VLMs.

C. Prompt Tuning in Vision-Language Models

Prompt tuning has demonstrated remarkable efficiency and effectiveness, whether in text-only or multimodal settings [18], [24], [25]. CLIP demonstrated remarkable zero-shot performance with hand-crafted prompts, but the results can vary significantly depending on the prompt used, due to sensitivity to prompt changes. Therefore, CoOp [25] and CoCoOp [24] proposed that the model itself should determine the choice of prompts. Khattak et al. argued that optimizing prompt tuning within a single branch is not ideal. They introduced MaPLe [18] as a solution to enhance the alignment between visual and language representations. Regrettably, these innovative methods are highly applicable to natural images but do not consider the enormous size of WSIs and the crucial multi-scale and contextual information needed for WSI analysis.

To our knowledge, Qu et al. have conducted research on fine-tuning CLIP for FSWC tasks with TOP [8]. Shi et al. also proposed ViLa-MIL [36] based on CLIP, which helps with WSI classification by introducing multi-scale language prior knowledge. These two studies are exceptional, pushing the boundaries of VLM capabilities and boosting model performance in few-shot scenarios. However, these methods are all based on CLIP and do not investigate the performance of models on pathological VLMs. Moreover, due to the large number of patches in a WSI, they have to focus solely on the text and neglect visual prompt tuning.
Additionally, they do not consider the crucial contextual information in WSIs. Although ViLa-MIL takes multi-scale information into account, it merely integrates information using a late fusion approach without fully exploring the interactions between these scales. We validated our proposed MSCPT on both a general VLM (i.e., CLIP) and a pathology-specific VLM (i.e., PLIP). By utilizing the zero-shot capability of the VLM to initially select a subset of patches closely related to cancer, we then conduct visual prompt tuning on these patches. Additionally, we adopt an intermediate fusion approach to integrate pathology prior knowledge at multiple scales, leveraging the transformer layers to hierarchically learn the relationships between them. Ultimately, we also utilize image-text similarity to construct GNNs to capture contextual information within the WSI.

III. METHOD

In this section, we introduce our few-shot weakly-supervised WSI classification model, named Multi-scale and Context-focused Prompt Tuning (MSCPT), as illustrated in Fig. 2. MSCPT utilizes a dual-path asymmetric structure as its foundation while conducting hierarchical prompt tuning on both the textual and visual modalities.

A. Problem Formulation

Given a dataset $X = \{X_1, X_2, ..., X_N\}$ consisting of $N$ WSIs, each WSI is cropped into non-overlapping small patches, named instances. All instances belonging to the same WSI collectively form a bag. In weakly-supervised WSI tasks, only the labels of bags are known. The bag labels $Y_i \in \{0, 1\}, i = \{1, 2, ..., N\}$ and the instance labels $\{y_{i,j}, j = 1, 2, ..., M_i\}$ have the following relationship:

$$Y_i = \begin{cases} 0, & \text{if } \sum_j y_{i,j} = 0, \\ 1, & \text{else.} \end{cases} \qquad (1)$$

2) Patch Selection: Due to the high resolution of WSIs, dividing them into non-overlapping patches results in a large number of patches. However, research has shown that only a few patches contain crucial information [31]. By preliminarily identifying patches closely linked to cancer analysis, we can notably diminish the computational resources demanded by visual prompt tuning. The powerful zero-shot ability of the VLMs allows for an initial screening of cancer-related patches.

Specifically, we utilize $F_{img}$ to extract visual embeddings from patches while leveraging $F_{text}$ to extract textual embeddings from the category prompts. Following this, the similarities between patches and prompts are computed. Then, the top $n$ patches with the highest similarity scores are selected for each category. To enhance the robustness of patch selection, we generate 50 sets of manual category templates and average their embeddings following [21]. For a WSI $X_i$, we choose patches $x^{l}_{i,j}, j = 1, 2, ..., n_l$ at low magnification. Due to our unique architecture, we solely perform patch selection and visual prompt tuning at low magnification.
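A minimal sketch of this zero-shot screening step is given below. It assumes patch and template embeddings have already been extracted and L2-normalized by the VLM encoders; the tensor names and the value of n are illustrative, with template averaging following the description above.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def select_patches(patch_embeds: torch.Tensor,      # (M, d), L2-normalized patch embeddings
                   template_embeds: torch.Tensor,   # (K, T, d): K classes, T prompt templates
                   n: int = 30) -> torch.Tensor:
    # Average the T manual category templates per class, then re-normalize.
    class_embeds = template_embeds.mean(dim=1)
    class_embeds = F.normalize(class_embeds, dim=-1)

    # Cosine similarity between every patch and every class prompt: (M, K).
    sims = patch_embeds @ class_embeds.t()

    # Keep the top-n patches per class and merge the indices.
    top_idx = sims.topk(k=n, dim=0).indices          # (n, K)
    return torch.unique(top_idx.flatten())

selected = select_patches(F.normalize(torch.randn(5000, 512), dim=-1),
                          F.normalize(torch.randn(2, 50, 512), dim=-1))
print(selected.shape)   # indices of the retained low-magnification patches
```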
C. Multi-scale Visual Descriptions Construction

In this part, we aim to generate pathological visual descriptions as pathological language prior knowledge to guide the hierarchical prompt tuning and instance aggregation. To reduce the manual workload, large language models (LLMs) are employed to generate descriptions related to different diseases. That is, we enter the question "We are studying {Cancer Category}. Please list $C^l$ visual descriptions at 5× magnification and $C^h$ visual descriptions at 20× magnification observed in H&E-stained histological images of {Cancer Sub-category}." into the LLM. We then obtain the multi-scale visual description sets $T^{low} = \{T^{low}_{k,c} \mid 0 \le k \le K, 0 \le c \le C^l\}$ and $T^{high} = \{T^{high}_{k,c} \mid 0 \le k \le K, 0 \le c \le C^h\}$, where $K$ represents the number of WSI categories, and $C^l$ and $C^h$ denote the counts of low-level and high-level descriptions, respectively.
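The query described above can be assembled programmatically. The sketch below only formats the question text and leaves the actual LLM call abstract (the commented-out call_llm is a hypothetical placeholder), since the construction is not tied to a specific API.

```python
QUESTION_TEMPLATE = (
    "We are studying {cancer_category}. Please list {c_low} visual descriptions at "
    "5x magnification and {c_high} visual descriptions at 20x magnification observed "
    "in H&E-stained histological images of {subtype}."
)

def build_queries(cancer_category: str, subtypes: list[str],
                  c_low: int = 10, c_high: int = 30) -> dict[str, str]:
    """One query per WSI category (subtype); the answers form T_low and T_high."""
    return {
        subtype: QUESTION_TEMPLATE.format(
            cancer_category=cancer_category, subtype=subtype,
            c_low=c_low, c_high=c_high)
        for subtype in subtypes
    }

queries = build_queries("renal cell carcinoma", ["CCRCC", "PRCC", "CHRCC"])
for subtype, question in queries.items():
    print(subtype, "->", question[:60], "...")
    # descriptions = call_llm(question)   # hypothetical call to an LLM such as GPT-4
```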
Fig. 3. Details of the Prompted Hierarchical High-Level Text Encoder. The Multi-scale Hierarchical Prompt Tuning (MHPT) module utilizes the transformer layers to integrate pathological visual descriptions from different scales.

The pathological prior knowledge contained in $T^{low}$ and $T^{high}$ can help improve the multi-scale information processing capability of MSCPT. To achieve this purpose, we propose the Multi-scale Hierarchical Prompt Tuning (MHPT) module. Its core component, the prompted hierarchical high-level text encoder, is depicted in Fig. 3.

1) Multi-scale Prompts Construction: For each transformer layer of the text encoder, we introduce a learnable vector called the global prompt $p_{glob}$ to learn and integrate information from the high-level text prompts $p_{high}$ and the low-level text prompts $p_{low}$. As an example, consider the construction of multi-scale prompts for a high-level pathological visual description $T^{high}_{k,c}$. After tokenization and embedding, $T^{high}_{k,c}$ is transformed into $p^{high}_{0}$. The low-level text prompts $p_{low}$ are then obtained based on $T^{low}$. More specifically, a set of descriptions $T^{low}$ is fed into the frozen $F^{low}_{text}$, and the last token of each transformer layer is extracted. These tokens are then fed into a prompt generator $g$, formulated as:

$$p^{l}_{low,i} = g(d^{l}_{low}), \qquad (3)$$

where $d^{l}_{low}$ is the last token of $T^{low}_{k,i}$ at the $l$-th layer, and the generator $g$ is a basic multilayer perceptron that aligns vectors of different scales into a common embedding space. These tokens are then concatenated to obtain the low-level text prompts $p^{l}_{low}$.

2) Hierarchical Prompt Tuning: After obtaining the three prompts, to capture more complex associations between pathological visual descriptions at multiple scales, hierarchical prompt tuning is performed on $F^{high}_{text}$, which can be expressed as:

$$[C^{i}, \_, \_, p^{i}_{high}, E^{i}] = T_i([C^{i-1}, p^{i-1}_{glob}, p^{i-1}_{low}, p^{i-1}_{high}, E^{i-1}]), \quad i = 1, 2, 3, ..., L, \qquad (4)$$

where $C^{i}$ and $E^{i}$ represent the class token [CLS] and the last token [EOT] of the $i$-th transformer layer $T_i$, and $L$ signifies the number of transformer layers. Lastly, by projecting the last token of the last transformer layer through the textual projection head $TextProj$ into the joint embedding space, the final textual representation $z^{high}_{k,c}$ for $T^{high}_{k,c}$ is obtained:

$$z^{high}_{k,c} = TextProj(E^{L}). \qquad (5)$$
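To make Eqs. (3) and (4) concrete, the following is a simplified sketch of the prompt generator g and of one hierarchically prompted transformer layer. The module names, prompt lengths, and the use of a generic nn.TransformerEncoderLayer as the frozen block are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

class PromptGenerator(nn.Module):
    """g(.) in Eq. (3): an MLP aligning low-scale tokens to the high-scale prompt space."""
    def __init__(self, low_dim: int, high_dim: int):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(low_dim, high_dim), nn.ReLU(),
                                 nn.Linear(high_dim, high_dim))

    def forward(self, d_low: torch.Tensor) -> torch.Tensor:
        # d_low: (num_low_level_descriptions, low_dim) last tokens at one encoder layer
        return self.mlp(d_low)

class HierarchicalPromptLayer(nn.Module):
    """One step of Eq. (4): a frozen transformer block applied to [CLS, p_glob, p_low, p_high, E]."""
    def __init__(self, block: nn.Module, embed_dim: int, n_glob: int = 2):
        super().__init__()
        self.block = block                                    # frozen transformer layer
        self.p_glob = nn.Parameter(torch.randn(n_glob, embed_dim) * 0.02)

    def forward(self, cls_tok, p_low, p_high, tokens):
        x = torch.cat([cls_tok, self.p_glob, p_low, p_high, tokens], dim=0)
        x = self.block(x.unsqueeze(0)).squeeze(0)             # (seq_len, dim)
        n_c, n_g, n_l = cls_tok.shape[0], self.p_glob.shape[0], p_low.shape[0]
        n_h = p_high.shape[0]
        # Keep the updated [CLS], p_high and description tokens; the p_glob and p_low
        # outputs are discarded and re-injected at the next layer (the "_" in Eq. 4).
        return x[:n_c], x[n_c + n_g + n_l: n_c + n_g + n_l + n_h], x[n_c + n_g + n_l + n_h:]

# Toy usage with a single frozen transformer layer and random embeddings.
embed_dim = 512
block = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=8, batch_first=True)
for p in block.parameters():
    p.requires_grad_(False)
layer = HierarchicalPromptLayer(block, embed_dim)
p_low = PromptGenerator(low_dim=512, high_dim=embed_dim)(torch.randn(4, 512))  # Eq. (3)
cls_out, p_high_out, tokens_out = layer(torch.randn(1, embed_dim), p_low,
                                        torch.randn(4, embed_dim), torch.randn(20, embed_dim))
print(cls_out.shape, p_high_out.shape, tokens_out.shape)
```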
E. Image-text Similarity-based Graph Prompt Tuning

Some studies have shown that the interactions between different areas of a WSI and their structural information play a crucial role in cancer analysis [28]. However, current prompt tuning methods are unable to capture this information. To address this, we propose the Image-text Similarity-based Graph Prompt Tuning (ISGPT) module. More specifically, we deviate from conventional methods that utilize patch coordinates or patch feature similarity to construct graph neural networks (GNNs) [29], [30]. Our approach instead utilizes the similarity between patches and pathological visual descriptions as the foundation for building GNNs. We treat the patches as nodes and construct the adjacency matrix $A$ by calculating the semantic similarity $S$ between the patch embeddings and the description embeddings. Specifically, after the patches and descriptions have passed through the encoders from Section III-D, patch embeddings $P \in \mathbb{R}^{M \times d}$ and description embeddings $Z \in \mathbb{R}^{KC \times d}$ are obtained, respectively. The formula for the semantic similarity $S \in \mathbb{R}^{M \times KC}$ is:

$$s_{i,j} = \frac{\exp(\cos(P_i, Z_j)/\tau)}{\sum_{m=1}^{K \times C} \exp(\cos(P_i, Z_m)/\tau)}, \qquad (6)$$

where $\tau$ is the temperature coefficient and $\cos(\cdot,\cdot)$ represents the cosine similarity. $K$ represents the number of WSI categories and $d$ is the embedding dimensionality. $C$ and $M$ denote the number of pathological descriptions and patches at a given scale, respectively. Subsequently, the calculation formula for the adjacency matrix $A \in \mathbb{R}^{M \times M}$ is written as:

$$a_{i,j} = \frac{\exp(\cos(S_i, S_j)/\tau)}{\sum_{m=1}^{M} \exp(\cos(S_i, S_m)/\tau)}, \qquad (7)$$

where $S_i \in \mathbb{R}^{KC}$ represents the semantic similarity between the $i$-th patch embedding and all description embeddings (i.e., the $i$-th row of $S$). We avoid constructing $A$ based on patch coordinates or patch feature similarity, as such an approach might overlook few but significant patches when focusing only on Euclidean distance or patch feature similarity. Subsequent experimental results demonstrate the superior performance of our method for constructing $A$. We choose the Graph Convolutional Network (GCN) [39] as the graph learning model. The GCN operation in the $l$-th GCN layer is defined as follows:

$$F_{GCN}(A, H^{(l)}) = \sigma(\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} H^{(l)} W^{(l)}). \qquad (8)$$

Here $\tilde{A} = A + I$, where $I$ is the identity matrix and $\sigma(\cdot)$ denotes an activation function. $\tilde{D}_{i,i} = \sum_j \tilde{A}_{i,j}$, and $W^{(l)}$ is the layer-specific trainable weight matrix.
$H^{(l)} \in \mathbb{R}^{M \times d}$ is the input embedding matrix of all nodes. Therefore, the patch embeddings after graph prompt tuning at the high and low scales are represented as:

$$\tilde{P}^{high} = F^{high}_{GCN}(A^{high}, P^{high}), \qquad (9)$$

$$\tilde{P}^{low} = F^{low}_{GCN}(A^{low}, P^{low}). \qquad (10)$$
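A condensed sketch of the ISGPT graph construction (Eqs. (6)–(8)) is shown below, using a single GCN layer; the temperature value, activation choice, and tensor shapes are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def cosine_softmax(a: torch.Tensor, b: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    """Row-wise softmax over temperature-scaled cosine similarities (Eqs. 6 and 7)."""
    a_n = F.normalize(a, dim=-1)
    b_n = F.normalize(b, dim=-1)
    return torch.softmax((a_n @ b_n.t()) / tau, dim=-1)   # tau value is illustrative

class GCNLayer(nn.Module):
    """One layer of Eq. (8): sigma(D^-1/2 (A+I) D^-1/2 H W), with ReLU as sigma."""
    def __init__(self, dim: int):
        super().__init__()
        self.weight = nn.Linear(dim, dim, bias=False)

    def forward(self, adj: torch.Tensor, h: torch.Tensor) -> torch.Tensor:
        a_tilde = adj + torch.eye(adj.shape[0], device=adj.device)
        d_inv_sqrt = a_tilde.sum(dim=1).clamp(min=1e-8).pow(-0.5)
        norm_adj = d_inv_sqrt[:, None] * a_tilde * d_inv_sqrt[None, :]
        return torch.relu(self.weight(norm_adj @ h))

# P: patch embeddings at one scale; Z: description embeddings at the same scale
# (here 3 classes x 10 descriptions per class).
P, Z = torch.randn(200, 512), torch.randn(3 * 10, 512)
S = cosine_softmax(P, Z)          # (M, K*C)  image-text semantic similarity
A = cosine_softmax(S, S)          # (M, M)    adjacency built from similarity profiles
P_tilde = GCNLayer(512)(A, P)     # patch embeddings after graph prompt tuning
print(P_tilde.shape)
```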
F. Non-Parametric Cross-Guided Pooling

Impressed by the powerful zero-shot capability of pre-trained VLMs, we considered employing a similar non-parametric approach for instance aggregation. We propose Non-Parametric Cross-Guided Pooling (NPCGP) to aggregate instances into bag features. In NPCGP, we compute semantic similarities between the patch embeddings $\tilde{P}$ after graph tuning and the pathological visual description embeddings $Z$, both within the same scale and across scales. The reason for calculating similarities both within and across scales is our concern that the pathological visual descriptions provided by the LLM may contain scale-related inaccuracies; this procedure therefore bolsters the robustness of the feature aggregation strategy. Lastly, the bag-level unnormalized probability distribution $Logits$ is obtained through the top-K max-pooling operator $h_{topK}$:

$$Logits_{high} = h_{topK}(\tilde{P}_{high} \cdot Z^{T}_{high}) + h_{topK}(\tilde{P}_{high} \cdot Z^{T}_{low}), \qquad (11)$$

$$Logits_{low} = h_{topK}(\tilde{P}_{low} \cdot Z^{T}_{low}) + h_{topK}(\tilde{P}_{low} \cdot Z^{T}_{high}), \qquad (12)$$

$$Logits_{overall} = \frac{1}{2}(Logits_{high} + Logits_{low}). \qquad (13)$$

Following previous work [26], we use the cross-entropy loss to optimize the three distributions $Logits_{overall}$, $Logits_{high}$, and $Logits_{low}$, but only $Logits_{overall}$ is used during model inference.
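The sketch below illustrates one plausible reading of Eqs. (11)–(13), in which $h_{topK}$ averages the top-K patch–description similarities of each class; it assumes the description embeddings are ordered class by class, and the shapes and the K value are illustrative.

```python
import torch

def topk_pool(sim: torch.Tensor, n_classes: int, k: int = 10) -> torch.Tensor:
    """h_topK: reduce an (M, K*C) patch-description similarity map to K class logits.
    One plausible reading: average the top-k similarities per class, pooled over all
    patches and all descriptions of that class (descriptions ordered class by class)."""
    m = sim.shape[0]
    per_class = sim.view(m, n_classes, -1).permute(1, 0, 2).reshape(n_classes, -1)
    return per_class.topk(k, dim=-1).values.mean(dim=-1)      # (K,)

def npcgp_logits(p_high, p_low, z_high, z_low, n_classes: int, k: int = 10):
    # Eqs. (11)-(13): same-scale plus cross-scale guidance, then averaging.
    logits_high = topk_pool(p_high @ z_high.t(), n_classes, k) + \
                  topk_pool(p_high @ z_low.t(), n_classes, k)
    logits_low = topk_pool(p_low @ z_low.t(), n_classes, k) + \
                 topk_pool(p_low @ z_high.t(), n_classes, k)
    return 0.5 * (logits_high + logits_low), logits_high, logits_low

# 200 high-scale and 150 low-scale patch embeddings, 3 classes,
# 30 high-level and 10 low-level descriptions per class.
overall, high, low = npcgp_logits(torch.randn(200, 512), torch.randn(150, 512),
                                  torch.randn(3 * 30, 512), torch.randn(3 * 10, 512),
                                  n_classes=3)
print(overall.shape)   # torch.Size([3])
```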
IV. EXPERIMENTAL RESULTS

A. Experimental Settings

1) Datasets: To comprehensively assess the performance of our Multi-Scale and Context-focused Prompt Tuning (MSCPT), three real datasets from The Cancer Genome Atlas (TCGA) Data Portal were used: TCGA-NSCLC, TCGA-BRCA, and TCGA-RCC.

TCGA-NSCLC is a dataset of 1041 non-small cell lung cancer (NSCLC) WSIs, including 530 lung adenocarcinoma (LUAD) and 511 lung squamous cell carcinoma (LUSC) slides. 20% of the dataset (209 slides) is used for training, and the remaining 80% (832 slides) is used for testing.

TCGA-BRCA is a dataset comprising 1056 breast invasive carcinoma (BRCA) WSIs. This dataset includes 845 slides of invasive ductal carcinoma (IDC) and 211 slides of invasive lobular carcinoma (ILC). 20% of them (223 slides) are randomly selected as the training set, and the remaining 80% (833 slides) are used as the testing set.

TCGA-RCC is a renal cell carcinoma (RCC) WSI dataset containing 873 slides. Precisely, it consists of 121 slides of chromophobe renal cell carcinoma (CHRCC), 455 slides of clear-cell renal cell carcinoma (CCRCC), and 297 slides of papillary renal cell carcinoma (PRCC). Likewise, 20% of the dataset (175 slides) is randomly taken out for training, while the remaining 698 slides are reserved for testing.

2) Evaluation Metrics: For all datasets, we leverage Accuracy (ACC), Area Under the Curve (AUC), and macro F1-score (F1) to evaluate model performance. To reduce the impact of the data split on model evaluation, we follow ViLa-MIL [36] and employ five fixed seeds to perform five rounds of dataset splitting, model training, and testing. We report the mean and standard deviation of the metrics over the five seeds.

3) Model Zoo: Thirteen influential approaches were employed for comparison, including traditional MIL-based methods: Mean pooling, Max pooling, ABMIL [32], CLAM [31], TransMIL [1], DSMIL [33], and RRT-MIL [40]; prompt tuning methods for natural images: CoOp [25], CoCoOp [24], and Metaprompt [26]; and prompt tuning methods for WSIs: TOP [8] and ViLa-MIL [36]. To adapt the prompt tuning methods designed for natural images (CoOp, CoCoOp, and Metaprompt) to WSI-level tasks, we integrated an attention-based instance aggregation module [32] into them.

4) Implementation Details: Following CLAM [31], the original WSIs were initially processed using the Otsu thresholding algorithm to remove the background. Subsequently, the WSIs were segmented into multiple non-overlapping patches of 256 × 256 pixels at 5× and 20× magnification levels. We applied our MSCPT to CLIP [14] and PLIP [22], both of which use ViT-B/16 [38] as their visual tower. Apart from MSCPT, Metaprompt, and DSMIL, which utilized inputs of both 5× and 20× magnification patches, the remaining methods solely relied on 20× magnification patches as inputs.

For all methods, the Adam optimizer was employed with a learning rate of 1e-4 and a weight decay of 1e-5, and the batch size was set to 1. All methods were trained for a fixed number of epochs (100 for CLIP and 50 for PLIP) with early stopping. We chose GPT-4 [41] to generate the pathological visual descriptions, providing 10 low-level visual descriptions and 30 high-level visual descriptions for each category of WSIs (i.e., $C^l = 10$ and $C^h = 30$). For MSCPT and Metaprompt, we utilized the zero-shot capability of the VLMs to select 30 patches per category at 5× magnification. The lengths of the global prompts $p_{glob}$ in both the image and text encoders were uniformly set to 2. In this paper, unless explicitly stated otherwise, all experiments are conducted with 16 training samples per category. All the work was conducted using the PyTorch library on a workstation with eight NVIDIA A800 GPUs. All code and details are released at https://github.com/Hanminghao/MSCPT.
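A compact sketch of this training and evaluation protocol is given below; the model is a stand-in placeholder, the data are random, and only the optimizer settings, seed loop, and metric definitions mirror the description above.

```python
import numpy as np
import torch
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

def evaluate(y_true: np.ndarray, y_prob: np.ndarray) -> dict:
    """ACC, macro F1, and (one-vs-rest) AUC, as reported for all three datasets."""
    y_pred = y_prob.argmax(axis=1)
    auc = roc_auc_score(y_true, y_prob[:, 1]) if y_prob.shape[1] == 2 \
        else roc_auc_score(y_true, y_prob, multi_class="ovr")
    return {"ACC": accuracy_score(y_true, y_pred),
            "F1": f1_score(y_true, y_pred, average="macro"),
            "AUC": auc}

# Optimizer settings from the implementation details (the model itself is omitted).
model = torch.nn.Linear(512, 2)          # stand-in for the trainable prompt parameters
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=1e-5)

# Five fixed seeds -> five rounds of splitting/training/testing; report mean and std.
results = []
for seed in [0, 1, 2, 3, 4]:
    torch.manual_seed(seed); np.random.seed(seed)
    # ... split dataset, train with batch size 1 and early stopping, then test ...
    y_true = np.random.randint(0, 2, size=100)          # placeholder test labels
    y_prob = np.random.dirichlet([1, 1], size=100)      # placeholder predicted probabilities
    results.append(evaluate(y_true, y_prob))
print({k: (np.mean([r[k] for r in results]), np.std([r[k] for r in results]))
       for k in results[0]})
```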
TABLE I
CANCER SUB-TYPING RESULTS ON TCGA-NSCLC, TCGA-BRCA, AND TCGA-RCC. WE PROVIDE MEAN AND STANDARD DEVIATION RESULTS UNDER FIVE RANDOM SEEDS. EACH DATASET BLOCK REPORTS AUC / F1 / ACC.

Method | Params. | TCGA-NSCLC (AUC / F1 / ACC) | TCGA-BRCA (AUC / F1 / ACC) | TCGA-RCC (AUC / F1 / ACC)

CLIP:
CLAM-MB [31] | 660K | 69.65±3.61 / 64.52±3.22 / 65.14±2.69 | 67.98±4.86 / 60.68±6.47 / 74.09±3.52 | 93.59±1.16 / 78.72±2.18 / 81.03±2.06
TransMIL [1] | 2.54M | 64.82±8.01 / 59.17±10.87 / 62.00±5.18 | 65.31±6.02 / 57.72±2.48 / 68.12±4.11 | 94.17±1.23 / 79.63±1.52 / 81.86±1.41
DSMIL [33] | 462K | 66.00±9.23 / 63.87±7.00 / 64.11±6.65 | 66.18±10.08 / 59.35±8.01 / 67.52±11.56 | 91.53±5.17 / 78.38±6.56 / 80.69±6.47
RRT-MIL [40] | 2.63M | 66.47±6.73 / 62.10±6.17 / 63.20±5.24 | 66.33±4.30 / 61.14±5.93 / 71.21±8.94 | 93.89±1.91 / 81.04±2.11 / 83.30±2.24
CoOp [25] | 337K | 69.06±4.06 / 63.87±3.77 / 64.27±3.55 | 68.86±3.45 / 61.64±2.40 / 72.10±3.22 | 94.18±1.72 / 79.88±2.40 / 82.15±1.96
CoCoOp [24] | 370K | 64.37±2.28 / 60.95±1.55 / 61.37±1.36 | 66.50±3.02 / 59.64±2.90 / 71.07±4.93 | 85.68±2.66 / 67.72±3.49 / 71.00±2.90
Metaprompt [26] | 360K | 75.94±3.01 / 70.35±3.09 / 70.41±3.09 | 69.12±4.12 / 63.39±4.28 / 74.65±7.20 | 94.18±1.56 / 80.03±2.06 / 82.52±2.15
TOP [8] | 2.11M | 73.56±3.14 / 68.19±1.22 / 68.77±2.53 | 69.75±4.66 / 61.32±6.12 / 71.68±2.56 | 93.56±1.22 / 79.66±1.97 / 80.79±1.05
ViLa-MIL [36] | 2.77M | 74.85±7.62 / 68.74±5.86 / 68.87±5.97 | 70.13±3.86 / 62.04±2.28 / 71.93±2.31 | 93.34±1.49 / 79.40±1.13 / 81.81±0.92
MSCPT (ours) | 1.35M | 78.67±3.93 / 72.47±3.13 / 72.67±2.96 | 74.56±4.54 / 65.59±1.85 / 75.82±2.38 | 95.04±1.31 / 83.78±2.19 / 85.62±2.14

PLIP (pathology-pretrained):
Max-pooling | 197K | 71.78±4.13 / 66.40±3.51 / 66.66±3.42 | 66.66±2.36 / 60.32±2.24 / 71.57±4.83 | 95.18±0.63 / 81.63±0.92 / 84.30±1.30
Mean-pooling | 197K | 70.55±6.64 / 65.32±5.60 / 65.50±5.55 | 71.62±2.41 / 64.62±2.96 / 74.45±2.49 | 94.75±0.51 / 82.22±0.67 / 85.24±0.80
ABMIL [32] | 461K | 78.54±4.29 / 72.06±3.79 / 72.12±3.78 | 72.18±1.28 / 64.49±1.74 / 74.63±1.31 | 96.51±0.63 / 85.66±1.97 / 87.94±1.92
CLAM-SB [31] | 660K | 80.56±4.57 / 73.15±4.05 / 73.27±3.97 | 73.49±2.12 / 65.22±2.61 / 75.05±3.88 | 96.41±0.36 / 84.71±1.60 / 87.25±1.34
CLAM-MB [31] | 660K | 80.68±3.63 / 73.15±3.00 / 73.32±2.83 | 74.33±1.76 / 66.11±1.94 / 76.11±2.03 | 96.58±0.59 / 85.20±0.83 / 87.85±0.79
TransMIL [1] | 2.54M | 73.40±10.33 / 66.92±7.94 / 67.21±7.63 | 70.52±2.45 / 62.06±1.67 / 70.14±2.77 | 96.35±0.54 / 83.70±0.80 / 86.33±0.46
DSMIL [33] | 462K | 77.75±7.22 / 72.84±6.31 / 73.08±6.00 | 70.14±4.11 / 63.01±2.78 / 71.48±5.37 | 93.01±6.05 / 79.58±9.16 / 82.87±7.32
RRT-MIL [40] | 2.63M | 76.30±10.01 / 70.86±7.47 / 71.01±7.44 | 72.77±2.20 / 65.74±2.34 / 74.38±4.01 | 96.09±1.06 / 83.94±2.05 / 86.56±2.28
CoOp [25] | 337K | 77.92±5.48 / 71.58±4.74 / 71.63±4.75 | 73.77±2.83 / 64.88±1.26 / 74.14±3.38 | 95.76±0.80 / 83.23±2.07 / 85.90±1.63
CoCoOp [24] | 370K | 72.62±8.45 / 66.63±5.83 / 66.97±5.85 | 71.21±4.20 / 62.95±3.95 / 73.57±6.31 | 95.81±0.42 / 83.18±1.35 / 86.02±1.03
Metaprompt [26] | 360K | 78.31±5.66 / 72.03±4.60 / 71.86±4.61 | 73.98±2.15 / 65.50±2.05 / 75.56±4.58 | 95.75±0.48 / 83.52±1.46 / 86.62±1.43
TOP [8] | 2.11M | 78.91±3.79 / 72.33±4.89 / 72.91±4.61 | 74.06±2.66 / 65.17±2.16 / 76.51±1.79 | 95.06±0.51 / 82.86±1.35 / 86.14±0.98
ViLa-MIL [36] | 2.77M | 80.98±2.52 / 73.81±3.64 / 73.94±3.56 | 74.86±2.45 / 66.03±1.81 / 77.35±1.63 | 95.72±0.60 / 83.85±1.10 / 86.53±1.03
MSCPT (ours) | 1.35M | 84.29±3.97 / 76.39±5.69 / 76.54±5.49 | 75.55±5.25 / 67.46±2.43 / 79.14±2.63 | 96.94±0.36 / 87.01±1.51 / 89.28±1.22
B. Comparisons with State-of-the-Art

The experimental results under the 16-shot setting are displayed in Table I. We observed some intriguing insights: complex and parameter-heavy methods like TransMIL and RRT-MIL underperformed despite their strong performance with full-data training. Conversely, less parameterized methods such as ABMIL and CLAM exhibited slightly better performance. This is because traditional MIL-based methods require a lot of WSIs for training, and the more parameters they have, the more training data is needed. Furthermore, after adapting the prompt tuning methods designed for natural images (i.e., CoOp, CoCoOp, and Metaprompt) to WSI-level tasks, these methods outperform traditional MIL-based methods when based on CLIP and achieve comparable performance when using PLIP. Their relatively few parameters contribute to this result. Additionally, we found that Metaprompt outperforms CoOp across most metrics, thanks to its integration of visual prompt tuning and multi-scale information. This result motivates us to pursue visual prompt tuning and develop more effective multi-scale information integration modules. Although prompt tuning methods designed for Few-shot Weakly Supervised WSI Classification tasks have a relatively higher number of parameters, they exhibit the best performance. This is because the VLMs' prior knowledge is effectively exploited under the guidance of visual descriptive text prompts, alleviating the demand for extensive training data.

Compared to other methods, our proposed MSCPT exhibits significant improvements in all evaluation metrics across the three datasets and two VLMs. Compared to the top-performing traditional MIL-based methods, MSCPT shows improvements of 0.3-13.0% in AUC, 2.0-12.3% in F1, and 1.5-11.6% in ACC across the three datasets and two VLMs. Overall, MSCPT shows greater performance improvements when based on CLIP than on PLIP. This is attributed to the specialized pre-training of PLIP on pathological images, which enhances its encoding capabilities for patches and thereby reduces the reliance on textual descriptions. Compared to the top-performing prompt tuning method for natural images, MSCPT improved the AUC, F1, and ACC by 1.0-8.2%, 2.8-6.7%, and 1.6-6.9%, respectively.

Prompt tuning methods explicitly designed for WSIs exhibit superior performance. This is attributed to their incorporation of priors into pre-trained VLMs and leveraging those priors to guide prompt tuning. Additionally, ViLa-MIL introduces multi-scale information compared to TOP, positioning it as the second-best overall performer. In comparison to ViLa-MIL, MSCPT shows improvements across all datasets, with AUC increasing by 0.9-5.7%, F1 by 2.2-6.2%, and ACC by 2.3-5.5%. This is because MSCPT performs prompt tuning on both the textual and visual modalities. Furthermore, MSCPT
TABLE II
CORE COMPONENTS ABLATION EXPERIMENT ON THE TCGA-NSCLC DATASET BASED ON PLIP.

TABLE IV
ABLATION EXPERIMENT OF DIFFERENT INSTANCE AGGREGATION METHODS ON THE TCGA-NSCLC DATASET BASED ON PLIP.
Fig. 4. Visualization of the original WSI, the similarity score map for patch selection, heatmaps generated using MSCPT, attention heatmaps of the baseline (i.e., Metaprompt) and of the best-performing traditional MIL-based method, the selected high-similarity patches, and the patches with the highest similarity scores using MSCPT. The area surrounded by the red line in the original WSI is the tumor area. Panel (a) shows a CLIP-based TCGA-BRCA case (compared against a CLAM-MB heatmap at 20×); panel (b) shows a PLIP-based TCGA-RCC case (compared against an ABMIL heatmap at 20×).
Fig. 5. Experiments on TCGA-BRCA with 16, 8, 4, 2, and 1-shot settings. (a) and (b) are CLIP-based results, while (c) and (d) are PLIP-based results. Each panel plots the score (%) against the number of training samples per class for CLAM-MB, CoOp, Metaprompt, ViLa-MIL, and MSCPT (ours).
in Section IV-D support this point. We also conducted an ablation study on cross-guidance. Instead of computing cross-scale cosine similarity during feature aggregation, we only calculated the cosine similarity between patch and description embeddings at the same scale. Removing cross-guidance led to a drop in performance across all metrics, likely because LLMs may produce descriptions with incorrect scales.

4) Effects of Large Language Models: To verify the impact of different LLMs on model performance, we compared the performance of MSCPT when using descriptions generated by different LLMs (i.e., Gemini-1.5-Pro [42], Claude-3 [43], Llama-3 [44], GPT-3.5 [45], and GPT-4 [41]). The results obtained using CLIP on TCGA-RCC are presented in Table V. When generating descriptions using Claude-3, MSCPT performs comparably to the baseline. However, MSCPT outperforms the baseline when using the other LLMs. This demonstrates MSCPT's robustness across different LLMs and highlights the benefit of accurate pathological visual descriptions: providing a more accurate description helps improve model performance.

D. Visualization

As shown in Fig. 4, we visualize a case of TCGA-RCC based on PLIP and a case of TCGA-BRCA based on CLIP. As depicted in Fig. 4a, during patch selection, CLIP assigned high similarity scores not only to tumor regions but also to non-tumor areas. This outcome arose because CLIP was not specifically designed for pathological images, resulting in a less-than-optimal zero-shot capability for this type of imagery. However, after prompt tuning using MSCPT, the model correctly assigned high scores to the actual tumor regions, while the regions that originally received high scores
dropped to lower score ranges (red arrows in Fig. 4a). Meanwhile, CLAM-MB struggled to differentiate tumor from non-tumor tissue. Similarly, Metaprompt assigned high attention weights to certain non-tumor tissues (red arrows in Fig. 4a).

When selecting patches using PLIP, the model could roughly identify tumor regions but also assigned high scores to a small number of non-tumor areas. However, this issue was mitigated with MSCPT (red arrows in Fig. 4b). While ABMIL could also determine instance importance, it tended to assign higher scores to certain non-tumor regions compared to MSCPT (yellow arrows in Fig. 4b). Due to PLIP's improved ability to represent pathological images, Metaprompt produced visualization results comparable to MSCPT.

E. Results with Fewer Training Samples

To further validate MSCPT's performance, we conducted experiments on TCGA-BRCA with 16-, 8-, 4-, 2-, and 1-shot settings. Based on the results in Table I, we selected several well-performing models (i.e., CLAM-MB, CoOp, Metaprompt, ViLa-MIL, and MSCPT) for these experiments. It is also worth noting that with limited training samples, sample selection significantly impacts model performance [8]. To address this, we conducted dataset splitting, model training, and testing using ten different seeds, excluding the two best and two worst results to calculate the average. Due to the sample imbalance in TCGA-BRCA, we report only AUC and macro F1-score, as shown in Fig. 5. When using CLIP as the base model, MSCPT underperforms compared to CLAM-MB and Metaprompt in the 1- and 2-shot settings, likely due to MSCPT's larger parameter size and CLIP's limited understanding of pathology descriptions. However, with 4 or more shots, MSCPT significantly outperforms the other methods. Additionally, when using PLIP as the base model, MSCPT consistently performs better than any other method.

V. CONCLUSION

In this paper, we propose Multi-Scale and Context-focused Prompt Tuning (MSCPT) to solve the Few-shot Weakly-supervised WSI Classification (FSWC) task. MSCPT generates multi-scale pathological visual descriptions using GPT-4, guiding hierarchical prompt tuning and instance aggregation. Experiments on three WSI subtyping datasets and two Vision-Language models (VLMs) show that MSCPT achieves state-of-the-art results in FSWC tasks. Furthermore, MSCPT is applicable to fine-tune any VLM for WSI-level tasks. However, we find that model performance varies significantly across datasets and VLMs. This is because the performance of fine-tuning largely depends on the pre-trained VLM itself. We look forward to the emergence of more comprehensive and powerful pre-trained pathological VLMs, which will significantly promote the development of FSWC tasks and of computational pathology as a whole.

REFERENCES

[1] Z. Shao, H. Bian, Y. Chen, Y. Wang, J. Zhang, X. Ji et al., "Transmil: Transformer based correlated multiple instance learning for whole slide image classification," NeurIPS, vol. 34, pp. 2136–2147, 2021.
[2] S. J. Wagner, D. Reisenbüchler, N. P. West, J. M. Niehues, G. P. Veldhuizen, P. Quirke, H. I. Grabsch, P. A. Brandt, G. G. Hutchins, S. D. Richman et al., "Fully transformer-based biomarker prediction from colorectal cancer histology: a large-scale multicentric study," arXiv preprint arXiv:2301.09617, 2023.
[3] X. Xing, M. Zhu, Z. Chen, and Y. Yuan, "Comprehensive learning and adaptive teaching: Distilling multi-modal knowledge for pathological glioma grading," Med. Image Anal., vol. 91, p. 102990, 2024.
[4] Q. Guo, L. Qu, J. Zhu, H. Li, Y. Wu, S. Wang, M. Yu, J. Wu, H. Wen, X. Ju et al., "Predicting lymph node metastasis from primary cervical squamous cell carcinoma based on deep learning in histopathologic images," Mod. Pathol., vol. 36, no. 12, p. 100316, 2023.
[5] A. K. Glaser, N. P. Reder, Y. Chen, E. F. McCarty, C. Yin, L. Wei, Y. Wang, L. D. True, and J. T. Liu, "Light-sheet microscopy for slide-free non-destructive pathology of large clinical specimens," Nat. Biomed. Eng., vol. 1, no. 7, p. 0084, 2017.
[6] J. A. Ludwig and J. N. Weinstein, "Biomarkers in cancer staging, prognosis and treatment selection," Nat. Rev. Cancer, vol. 5, no. 11, pp. 845–856, 2005.
[7] M. Ilse, J. M. Tomczak, and M. Welling, "Deep multiple instance learning for digital histopathology," in MICCAI. Elsevier, 2020, pp. 521–546.
[8] L. Qu, K. Fu, M. Wang, Z. Song et al., "The rise of ai language pathologists: Exploring two-level prompt learning for few-shot weakly-supervised whole slide image classification," NeurIPS, vol. 36, 2024.
[9] Z. Shao, Y. Chen, H. Bian, J. Zhang, G. Liu, and Y. Zhang, "Hvtsurv: Hierarchical vision transformer for patient-level survival prediction from whole slide image," in AAAI, vol. 37, no. 2, 2023, pp. 2209–2217.
[10] G. Campanella, M. G. Hanna, L. Geneslaw, A. Miraflor, V. Werneck Krauss Silva, K. J. Busam, E. Brogi, V. E. Reuter, D. S. Klimstra, and T. J. Fuchs, "Clinical-grade computational pathology using weakly supervised deep learning on whole slide images," Nat. Med., vol. 25, no. 8, pp. 1301–1309, 2019.
[11] L. Qu, Y. Ma, X. Luo, Q. Guo, M. Wang, and Z. Song, "Rethinking multiple instance learning for whole slide image classification: A good instance classifier is all you need," IEEE Trans. Circuits Syst. Video Technol., 2024.
[12] C. L. Srinidhi, O. Ciga, and A. L. Martel, "Deep neural network models for computational histopathology: A survey," Med. Image Anal., vol. 67, p. 101813, 2021.
[13] A. Shmatko, N. Ghaffari Laleh, M. Gerstung, and J. N. Kather, "Artificial intelligence in histopathology: enhancing cancer research and clinical oncology," Nat. Cancer, vol. 3, no. 9, pp. 1026–1038, 2022.
[14] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark et al., "Learning transferable visual models from natural language supervision," in ICML. PMLR, 2021, pp. 8748–8763.
[15] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "Bert: Pre-training of deep bidirectional transformers for language understanding," arXiv preprint arXiv:1810.04805, 2018.
[16] K. Singhal, T. Tu, J. Gottweis, R. Sayres, E. Wulczyn, L. Hou, K. Clark, S. Pfohl, H. Cole-Lewis, D. Neal et al., "Towards expert-level medical question answering with large language models," arXiv preprint arXiv:2305.09617, 2023.
[17] J. Lee, W. Yoon, S. Kim, D. Kim, S. Kim, C. H. So, and J. Kang, "Biobert: a pre-trained biomedical language representation model for biomedical text mining," Bioinformatics, vol. 36, no. 4, pp. 1234–1240, 2020.
[18] M. U. Khattak, H. Rasheed, M. Maaz, S. Khan, and F. S. Khan, "Maple: Multi-modal prompt learning," in CVPR, 2023, pp. 19113–19122.
[19] J. Li, D. Li, C. Xiong, and S. Hoi, "Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation," in ICML. PMLR, 2022, pp. 12888–12900.
[20] J. Cheng, X. Pan, K. Yang, S. Cao, B. Liu, Q. Yan, and Y. Yuan, "Gexmolgen: Cross-modal generation of hit-like molecules via large language model encoding of gene expression signatures," bioRxiv, 2024. [Online]. Available: https://www.biorxiv.org/content/early/2024/02/19/2023.11.11.566725
[21] M. Y. Lu, B. Chen, A. Zhang, D. F. Williamson, R. J. Chen, T. Ding, L. P. Le, Y.-S. Chuang, and F. Mahmood, "Visual language pretrained multiple instance zero-shot transfer for histopathology images," in CVPR, 2023, pp. 19764–19775.
[22] Z. Huang, F. Bianchi, M. Yuksekgonul, T. J. Montine, and J. Zou, "A visual–language foundation model for pathology image analysis using medical twitter," Nat. Med., vol. 29, no. 9, pp. 2307–2316, 2023.
[23] M. Y. Lu, B. Chen, D. F. Williamson, R. J. Chen, I. Liang, T. Ding, G. Jaume, I. Odintsov, L. P. Le, G. Gerber et al., "A visual-language