
MSCPT: Few-shot Whole Slide Image Classification with Multi-scale and Context-focused Prompt Tuning

Minghao Han, Linhao Qu, Dingkang Yang, Xukun Zhang, Xiaoying Wang, Lihua Zhang

arXiv:2408.11505v1 [cs.CV] 21 Aug 2024

This project was funded by the National Natural Science Foundation of China 82090052. Minghao Han and Linhao Qu are the co-first authors. (Corresponding author: Lihua Zhang.) Minghao Han, Dingkang Yang, Xukun Zhang, and Lihua Zhang are with the Academy for Engineering and Technology, Fudan University, Shanghai 200433, China (e-mail: {mhhan22, zhangxk21}@m.fudan.edu.cn, {dkyang20, lihuazhang}@fudan.edu.cn). Linhao Qu is with the Digital Medical Research Center, School of Basic Medical Science, Fudan University, Shanghai 200032, China (e-mail: [email protected]). Xiaoying Wang is with the Zhongshan Hospital, Fudan University, Shanghai 200032, China (e-mail: [email protected]).

Abstract—Multiple instance learning (MIL) has become a standard paradigm for weakly supervised classification of whole slide images (WSI). However, this paradigm relies on a large number of labelled WSIs for training. The lack of training data and the presence of rare diseases present significant challenges for these methods. Prompt tuning combined with pre-trained Vision-Language models (VLMs) is an effective solution to the Few-shot Weakly Supervised WSI Classification (FSWC) task. Nevertheless, applying prompt tuning methods designed for natural images to WSIs presents three significant challenges: 1) these methods fail to fully leverage the prior knowledge from the VLM's text modality; 2) they overlook the essential multi-scale and contextual information in WSIs, leading to suboptimal results; and 3) they lack exploration of instance aggregation methods. To address these problems, we propose a Multi-Scale and Context-focused Prompt Tuning (MSCPT) method for FSWC tasks. Specifically, MSCPT employs a frozen large language model to generate pathological visual language prior knowledge at multiple scales, guiding hierarchical prompt tuning. Additionally, we design a graph prompt tuning module to learn essential contextual information within the WSI, and finally, a non-parametric cross-guided instance aggregation module is introduced to obtain the WSI-level features. Based on two VLMs, extensive experiments and visualizations on three datasets demonstrate the powerful performance of our MSCPT.

Index Terms—Whole slide image classification, prompt tuning, few-shot learning, multimodal.

Fig. 1. Motivation of our MSCPT. (a) Traditional MIL-based methods mainly focus on instance aggregation and require a large amount of training data. (b) Prompt tuning methods for natural images incorporate a set of trainable parameters into the input space for training, enabling pre-trained VLMs to be applied to downstream tasks. However, those methods are only suitable for single images and are no longer adequate for WSI-level tasks due to the enormous size of WSIs. (c) MSCPT leverages pathological visual descriptions combined with multimodal hierarchical prompt tuning to explore the potential of VLMs. For simplicity, we only depict the data flow for a single scale.

I. INTRODUCTION

Developing automated analysis frameworks using Whole Slide Images (WSIs) is crucial in clinical practice [1]–[4], as WSIs are widely regarded as the "gold standard" for cancer diagnosis, typing, staging, and prognosis analysis [5], [6]. Given the enormous size of WSIs (roughly 40,000 × 40,000 pixels), multiple instance learning (MIL) [7] has become the dominant method. As shown in Fig. 1a, traditional MIL-based methods typically follow a four-step paradigm: patch cutting, feature extraction, feature aggregation, and classification. Most MIL-based methods are conducted under weak supervision at the bag level, as creating instance-level labels is quite labor-intensive [8], [9]. This weak supervision paradigm leads to a problem: a large number of WSIs are required to train an effective model [10], [11]. In clinical practice, patient privacy concerns, rare diseases, and the difficulty of preparing pathology slides make accumulating a large number of WSIs very challenging [12], [13].
Vision-Language models (VLMs) have shown excellent generalization ability to downstream tasks [14]–[20]. Recently, researchers have proposed specialized VLMs for analyzing pathological images, including MI-Zero [21], PLIP [22], and Conch [23]. These VLMs, extensively pre-trained on abundant image-text pairs, contain significant prior knowledge. If the prior knowledge of VLMs can be fully exploited with a few training samples, it can partially alleviate the data scarcity problem in WSI classification tasks. Therefore, we aim to explore a novel "data-efficient" method based on VLMs to improve performance on the Few-shot Weakly Supervised WSI Classification (FSWC) [8] task.

Nevertheless, there is a gap between generally pre-trained VLMs and specific downstream tasks. Under few-shot scenarios, researchers often employ prompt tuning to bridge this gap with the help of a few training samples [14], [18], [24], [25]. As shown in Fig. 1b, prompt tuning aims to learn a set of trainable continuous vectors and incorporate these vectors into the input space for training, effectively adapting the fixed pre-trained VLMs to specific downstream tasks.

However, existing prompt tuning methods for natural images (such as CoOp [25], CoCoOp [24], and MetaPrompt [26]) are only effective for single images (i.e., patch-level). Since each WSI typically contains tens of thousands of patches, these methods are ineffective for WSI-level tasks. Also, studies indicate that the multi-scale information [27] and the contextual information [28] in WSIs play a significant role in cancer analysis, but those methods fail to capture this crucial information. Additionally, in training VLMs, the image-text pairs contain more than just information about the category. They also include more details about the image, such as contextual properties of the object [14] and descriptions of the cellular microenvironment [22], [23]. However, existing prompt tuning methods have primarily focused on image category information, without emphasizing a detailed analysis of image content, which has left the full potential of pathological VLMs underexplored.

To address the aforementioned issues, we propose Multi-Scale and Context-focused Prompt Tuning (MSCPT) for WSI classification in weakly supervised and few-shot scenarios. Our framework fully leverages the characteristic of VLM training with image-text pairs at dual magnification scales: 1) at low magnification, we provide the VLM with pathological visual descriptions at the tissue level (such as the infiltration between tumor tissue and other normal tissues); 2) at high magnification, pathological visual descriptions at the cellular level (such as cell morphology, nuclear changes, and the formation of various organelles) are provided to the VLM. These multi-scale pathological visual descriptions can help the VLM identify regions that are helpful for cancer analysis and achieve optimal results even with limited training samples.

As illustrated in Fig. 1c, the core idea behind developing MSCPT is to incorporate prior knowledge at the tissue and cellular scales into WSI-level tasks. Specifically, we first use a frozen large language model (LLM) to generate multi-scale pathological visual descriptions, leveraging them as prior knowledge.

Secondly, we design a Multi-scale Hierarchical Prompt Tuning (MHPT) module to hierarchically combine pathological visual descriptions from multiple scales to enhance prompt effectiveness. Inspired by Metaprompt [26], a dual-path asymmetric framework is adopted, asymmetrically freezing the image encoder and text encoder at different scales for prompt tuning. This asymmetric framework enables us to freeze half of the encoders to reduce the number of trainable parameters. Specifically, MHPT contains low-level and high-level prompts for both low and high-magnification visual descriptions, as well as global trainable prompts. The MHPT module employs the transformer layers in the text encoder to effectively learn the interactions among the three distinct prompts.

Furthermore, the Image-text Similarity-based Graph Prompt Tuning (ISGPT) module is introduced to extract contextual information. Precisely, we do not follow previous approaches [29], [30] of using patch positions or patch feature similarity to construct graph neural networks (GNNs). We propose to use the similarity between patches and pathological visual descriptions as the basis for building GNNs. We believe that using image-text pairs to build GNNs is more effective for capturing global features than methods relying on patch positions and image feature similarity, and corresponding ablation experiments confirm this hypothesis.

Finally, impressed by the powerful zero-shot capabilities of VLMs [21]–[23], we fully leverage the similarity between patches and pathological visual descriptions to aggregate instances. The Non-Parametric Cross-Guided Pooling (NPCGP) module, utilizing the Top-K algorithm for instance aggregation, is introduced to further reduce the risk of overfitting in few-shot scenarios. Overall, our contributions are summarised as follows:

1) MSCPT demonstrates that high-level concepts from pathological descriptions combined with low-level image representations can enhance few-shot weakly supervised WSI classification.
2) MSCPT achieves excellent performance by introducing only a limited number of trainable parameters (∼0.9% of the pre-trained VLM). Additionally, MSCPT is applicable to fine-tune any VLM for WSI-level tasks.
3) Extensive experiments and visualizations on three datasets and two VLMs have confirmed that MSCPT's performance is state-of-the-art in few-shot scenarios, surpassing other traditional MIL-based and prompt tuning methods.
II. RELATED WORK

A. Multiple Instance Learning in Whole Slide Images

Due to the high resolution of Whole Slide Images (WSIs) and the challenges of detailed labelling, weakly supervised methods based on Multiple Instance Learning (MIL) have emerged as the mainstream for WSI analysis. The MIL-based methods treat a WSI as a bag and all patches as instances, considering a bag positive if it contains at least one positive instance. Within the MIL framework, an aggregation step is required to aggregate all instances into bag features. The most primitive aggregation methods are non-parametric mean pooling and max pooling. However, since disease-related instances are a small fraction [31], those non-parametric aggregation methods treat all instances equally, causing useful information to be overwhelmed by irrelevant data. Subsequently, some attention-based methods (such as ABMIL [32], DSMIL [33] and CLAM [31]) were introduced, assigning different weights to each instance and aggregating them based on the weights. Furthermore, MIL methods based on Graph Neural Networks (GNNs) [29], [30] and Transformers [1], [34] have also been proposed to capture both local and global contextual information of WSIs. Those methods have shown significant improvements in recent years. Still, the cost of enhancing model performance is the increase in parameters, requiring a large amount of data to train a well-performing model. In many cases, training data faces a scarcity issue. Therefore, this paper proposes MSCPT, which leverages Vision-Language models combined with pathological descriptions from large language models to enhance performance in few-shot scenarios.

Fig. 2. We develop MSCPT based on the dual-path asymmetric framework, which inputs patches and pathological visual descriptions from multiple scales to different encoders. MSCPT utilizes a large language model to generate multi-scale pathological visual descriptions. These descriptions are combined using Multi-scale Hierarchical Prompt Tuning (MHPT) to integrate information across multiple scales. Then Image-text Similarity-based Graph Prompt Tuning (ISGPT) is employed to learn context information at each scale. Finally, Non-Parametric Cross-Guided Pooling (NPCGP) aggregates instances guided by pathological visual descriptions to achieve the final Whole Slide Image classification result.

B. Vision-Language Models

Vision-Language models (VLMs) are rapidly developing in various fields. During training, VLMs use contrastive learning to reduce distances between paired image-text pairs and increase distances between unpaired ones. CLIP [14] collected over 400M image-text pairs from the internet and used contrastive learning to align them, resulting in compatibility across various tasks. Compared to natural images, gathering pairs of pathological images and corresponding descriptions is challenging. To address this issue, MI-Zero [21] first pretrains image and text encoders using unpaired data, and then aligns them in a common space using 33,480 pairs of pathological image-text pairs. Huang et al. gathered over 450K pathological image-text pairs from Twitter and LAION [35] and developed PLIP [22]. Lu et al. trained Conch [23] on over 1.17M pathological image-text pairs, and it performs well on downstream tasks. Pretrained VLMs have significant potential, but effective methods to leverage them for WSI-level tasks are lacking. In this paper, we propose using pathological visual descriptions as prior knowledge to unleash the potential of VLMs.
C. Prompt Tuning in Vision-Language Models

Prompt tuning has demonstrated remarkable efficiency and effectiveness, whether in text-only or multimodal settings [18], [24], [25]. CLIP demonstrated remarkable zero-shot performance with hand-crafted prompts, but the results can vary significantly depending on the prompt used, due to its sensitivity to prompt changes. Therefore, CoOp [25] and CoCoOp [24] proposed that the model itself should determine the choice of prompts. Khattak et al. argued that optimizing prompt tuning within a single branch is not ideal. They introduced MaPLe [18] as a solution to enhance the alignment between visual and language representations. Regrettably, these innovative methods are highly applicable to natural images but do not consider the enormous size of WSIs and the crucial multi-scale and contextual information needed for WSI analysis.

To our knowledge, Qu et al. have conducted research on fine-tuning CLIP for FSWC tasks with TOP [8]. Shi et al. also proposed ViLa-MIL [36] based on CLIP, which helps with WSI classification by introducing multi-scale language prior knowledge. These two studies are exceptional, pushing the boundaries of VLM capabilities and boosting model performance in few-shot scenarios. However, these methods are all based on CLIP and do not investigate the performance of models on pathological VLMs. Moreover, due to the large number of patches in a WSI, they have to focus solely on the text and neglect visual prompt tuning. Additionally, they do not consider the crucial contextual information in WSIs. Although ViLa-MIL takes into account multi-scale information, it merely integrates information using a late fusion approach without fully exploring the interactions between these scales.

We validated our proposed MSCPT on both a general VLM (i.e., CLIP) and a pathology-specific VLM (i.e., PLIP). We utilize the zero-shot capability of the VLM to initially select a subset of patches closely related to cancer and then conduct visual prompt tuning on these patches. Additionally, we adopt an intermediate fusion approach to integrate multi-scale pathology prior knowledge, leveraging the transformer layers to hierarchically learn the relationships between the scales. Ultimately, we also utilize image-text similarity to construct GNNs to capture contextual information within the WSI.

III. METHOD

In this section, we introduce our few-shot weakly-supervised WSI classification model, named Multi-scale and Context-focused Prompt Tuning (MSCPT), as illustrated in Fig. 2. MSCPT utilizes a dual-path asymmetric structure as its foundation while conducting hierarchical prompt tuning on both the textual and visual modalities.

A. Problem Formulation

Given a dataset X = {X_1, X_2, ..., X_N} consisting of N WSIs, each WSI is cropped into non-overlapping small patches, named instances. All instances belonging to the same WSI collectively form a bag. In weakly-supervised WSI tasks, only the labels of bags are known. The bag labels Y_i ∈ {0, 1}, i = 1, 2, ..., N, and the instance labels {y_{i,j}, j = 1, 2, ..., M_i} have the following relationship:

Y_i = \begin{cases} 0, & \text{if } \sum_j y_{i,j} = 0, \\ 1, & \text{otherwise}. \end{cases} \quad (1)

B. Review of CLIP and Patch Selection

1) Review of CLIP: CLIP [14] adopts a two-tower structure, including an image encoder and a text encoder. The image encoder F_{img} can be either a ResNet [37] or a ViT [38], which is used to transform images into visual embeddings. The text encoder F_{text} takes a series of words as input and outputs textual embeddings. During the training process, CLIP utilizes a contrastive loss to learn a joint embedding space for the two modalities. During inference, we assume x is the visual embedding, and {w_i}_{i=1}^{K} is a series of textual embeddings generated by F_{text}. Each w_i corresponds to the prompt embedding (such as "an image of {class name}") for a specific image category. Therefore, the predicted probabilities can be obtained by calculating the cosine similarity between x and w_i:

p(y = i \mid x) = \frac{\exp(\cos(x, w_i)/\tau)}{\sum_{j=1}^{K} \exp(\cos(x, w_j)/\tau)}, \quad (2)

where \tau is the temperature coefficient, \cos(\cdot, \cdot) represents the cosine similarity, and K is the number of categories.

2) Patch Selection: Due to the high resolution of WSIs, dividing them into non-overlapping patches results in a large number of patches. However, research has shown that only a few patches contain crucial information [31]. By preliminarily identifying patches closely linked to cancer analysis, we can notably diminish the computational resources demanded by visual prompt tuning. The powerful zero-shot ability of the VLMs allows for the initial screening of cancer-related patches.

Specifically, we utilize F_{img} to extract visual embeddings from patches while leveraging F_{text} to extract textual embeddings from the category prompts. Following this, the similarities between patches and prompts are computed. Then, the top n patches with the highest similarity scores are selected for each category. To enhance the robustness of patch selection, we generated 50 sets of manual category templates and averaged their embeddings following [21]. For a WSI X_i, we choose patches x_{i,j}^{l}, j = 1, 2, ..., n_l at low magnification. Due to our unique architecture, we solely perform patch selection and visual prompt tuning at low magnification.
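To make the selection step above concrete, the sketch below scores pre-extracted patch embeddings against averaged category-template embeddings and keeps the top-n patches per class, in the spirit of Eq. (2) and the procedure of Section III-B. It is a minimal sketch rather than the released implementation: the tensor layout, the default n = 30, and the temperature value are assumptions made for the example.

```python
import torch

def select_patches(patch_emb, template_emb, n_per_class=30, tau=0.07):
    """Zero-shot patch screening sketch for Section III-B.

    patch_emb:    (M, d) visual embeddings of all low-magnification patches.
    template_emb: (K, T, d) text embeddings of T hand-crafted templates per
                  category; they are averaged per class following [21].
    Returns indices of the selected patches and per-patch class probabilities.
    """
    # Average the template embeddings per class and re-normalize.
    class_emb = template_emb.mean(dim=1)
    class_emb = class_emb / class_emb.norm(dim=-1, keepdim=True)
    patch_emb = patch_emb / patch_emb.norm(dim=-1, keepdim=True)

    sim = patch_emb @ class_emb.t()            # (M, K) cosine similarities
    probs = torch.softmax(sim / tau, dim=-1)   # Eq. (2)

    # Keep the top-n patches with the highest similarity for each class.
    n = min(n_per_class, sim.shape[0])
    top_idx = sim.topk(n, dim=0).indices       # (n, K)
    selected = torch.unique(top_idx.flatten())
    return selected, probs
```

In MSCPT this screening is applied only to the 5× patches, and the selected patches are the ones that later undergo visual prompt tuning.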
C. Multi-scale Visual Descriptions Construction

In this part, we aim to generate pathological visual descriptions as pathological language prior knowledge to guide the hierarchical prompt tuning and instance aggregation. To reduce manual workload, large language models (LLMs) are employed to generate descriptions related to different diseases. That is, we enter the question "We are studying {Cancer Category}. Please list C^l visual descriptions at 5× magnification and C^h visual descriptions at 20× magnification observed in H&E-stained histological images of {Cancer Sub-category}." into the LLM. We then obtain the multi-scale visual description sets T^{low} = {T_{k,c}^{low} | 0 ≤ k ≤ K, 0 ≤ c ≤ C^l} and T^{high} = {T_{k,c}^{high} | 0 ≤ k ≤ K, 0 ≤ c ≤ C^h}. K represents the number of WSI categories, and C^l and C^h denote the counts of low-level and high-level descriptions, respectively.

D. Multi-scale Hierarchical Prompt Tuning

Inspired by MetaPrompt [26], a unique dual-path asymmetric framework is employed for multimodal hierarchical prompt tuning, as shown on the left of Fig. 2. Freezing two of the four encoders helps reduce the trainable parameters and alleviates overfitting in few-shot scenarios. Compared to previous works whose encoders process the same inputs, our method adopts a unique strategy: the prompted and frozen encoders take entirely different inputs. Considering the immense size of WSIs and the substantial computational and storage resource requirements for visual prompt tuning, we only conduct visual prompt tuning at the low level. Rather than modifying the visual prompt tuning method from Metaprompt, our emphasis is placed on the text modality. Specifically, the low-level pathological visual descriptions T^{low} are sent into the frozen low-level text encoder F_{text}^{low}, while the high-level pathological visual descriptions T^{high} are sent into the prompted hierarchical high-level text encoder F_{text}^{high}. Simultaneously, patches are also fed into the corresponding image encoders. We aim to integrate the information contained in T^{low} and T^{high}, which can help improve the multi-scale information processing capability of MSCPT. To achieve this purpose, we propose the Multi-scale Hierarchical Prompt Tuning (MHPT) module. The core component of MHPT, the prompted hierarchical high-level text encoder, is drawn in Fig. 3.

Fig. 3. Details of the Prompted Hierarchical High-Level Text Encoder. The Multi-scale Hierarchical Prompt Tuning (MHPT) module utilizes the transformer layer to integrate pathological visual descriptions from different scales.

1) Multi-scale Prompts Construction: For each layer of F_{text}^{high}, we introduce a learnable vector called global prompts p_{glob} to learn and integrate information from the high-level text prompts p_{high} and the low-level text prompts p_{low}. As an example, consider the construction of multi-scale prompts for a high-level pathological visual description T_{k,c}^{high}. After tokenization and embedding, T_{k,c}^{high} is transformed into p_{0}^{high}. The low-level text prompts p_{low} are then obtained based on T^{low}. More specifically, the set of descriptions T^{low} is fed into the frozen F_{text}^{low}, and the last token of each transformer layer is extracted. These tokens are then fed into a prompt generator g, formulated as:

p_{low,i}^{l} = g(d_{low}^{l}), \quad (3)

where d_{low}^{l} is the last token of T_{k,i}^{low} at the l-th layer, and the generator g is a basic multilayer perceptron that aligns vectors of different scales into a common embedding space. These tokens are then concatenated to obtain the low-level text prompts p_{low}^{l}.

2) Hierarchical Prompt Tuning: After obtaining the three prompts, to capture more complex associations between pathological visual descriptions at multiple scales, hierarchical prompt tuning is performed on F_{text}^{high}, which can be expressed as:

[C^{i}, \cdot, \cdot, p_{high}^{i}, E^{i}] = T^{i}([C^{i-1}, p_{glob}^{i-1}, p_{low}^{i-1}, p_{high}^{i-1}, E^{i-1}]), \quad i = 1, 2, ..., L, \quad (4)

where C^{i} and E^{i} represent the class token [CLS] and the last token [EOT] of the i-th transformer layer T^{i}, and L signifies the number of transformer layers. Lastly, by projecting the last token of the last transformer layer through the textual projection head TextProj into the joint embedding space, the final textual representation z_{k,c}^{high} for T_{k,c}^{high} is obtained:

z_{k,c}^{high} = \mathrm{TextProj}(E^{L}). \quad (5)
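To make the per-layer prompt flow of Eqs. (3)-(4) more tangible, the sketch below shows one prompted layer of the high-level text encoder: a small MLP (the prompt generator g) maps the last tokens of the frozen low-level encoder into low-level prompts, which are concatenated with the learnable global prompts, the high-level prompts, and the [CLS]/[EOT] tokens before the frozen transformer layer is applied. This is a schematic sketch under assumed tensor shapes, not the authors' code; `text_layer`, `dim_low`, `dim_high`, the prompt lengths, and the choice to rebuild the global and low-level slots at every layer are interpretations of Eq. (4) made for illustration (the paper sets the global prompt length to 2).

```python
import torch
import torch.nn as nn

class HierarchicalPromptLayer(nn.Module):
    """One layer of the prompted high-level text encoder, Eqs. (3)-(4)."""

    def __init__(self, text_layer, dim_low, dim_high, n_glob=2):
        super().__init__()
        self.text_layer = text_layer                                # frozen layer T^i
        self.p_glob = nn.Parameter(torch.zeros(n_glob, dim_high))   # global prompts
        # Prompt generator g of Eq. (3): aligns low-level tokens to the
        # high-level embedding space.
        self.g = nn.Sequential(nn.Linear(dim_low, dim_high), nn.GELU(),
                               nn.Linear(dim_high, dim_high))

    def forward(self, cls_tok, d_low, p_high, eot_tok):
        # cls_tok: (1, dim_high), d_low: (C_l, dim_low),
        # p_high: (n_h, dim_high), eot_tok: (1, dim_high)
        p_low = self.g(d_low)                                       # Eq. (3)
        seq = torch.cat([cls_tok, self.p_glob, p_low, p_high, eot_tok], dim=0)
        out = self.text_layer(seq.unsqueeze(0)).squeeze(0)          # Eq. (4)
        # Keep [CLS], the updated high-level prompts and [EOT]; the global
        # and low-level slots are rebuilt at the next layer.
        start = 1 + self.p_glob.shape[0] + p_low.shape[0]
        return out[:1], out[start:-1], out[-1:]

# Stand-in for one frozen text-transformer layer (illustrative only):
# layer = nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True)
# block = HierarchicalPromptLayer(layer, dim_low=512, dim_high=512)
```

After the last layer, projecting the [EOT] slot through the text projection head yields the description embedding of Eq. (5).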
E. Image-text Similarity-based Graph Prompt Tuning

Some studies have shown that the interactions between different areas of a WSI and their structural information play a crucial role in cancer analysis [28]. However, current prompt tuning methods are unable to capture this information. To address this, we propose the Image-text Similarity-based Graph Prompt Tuning (ISGPT) module. More specifically, we deviate from conventional methods that utilize patch coordinates or patch feature similarity in constructing graph neural networks (GNNs) [29], [30]. Our approach utilizes the similarity between patches and pathological visual descriptions as the foundation for building GNNs. We treat the patches as nodes and construct the adjacency matrix A by calculating the semantic similarity S between the patch embeddings and the description embeddings. Specifically, after patches and descriptions have passed through the encoders from Section III-D, patch embeddings P ∈ R^{M×d} and description embeddings Z ∈ R^{KC×d} are obtained, respectively. The semantic similarity S ∈ R^{M×KC} is computed as:

s_{i,j} = \frac{\exp(\cos(P_i, Z_j)/\tau)}{\sum_{m=1}^{K \times C} \exp(\cos(P_i, Z_m)/\tau)}, \quad (6)

where \tau is the temperature coefficient and \cos(\cdot, \cdot) represents the cosine similarity. K represents the number of WSI categories and d is the embedding dimensionality. C and M denote the number of pathological descriptions and patches at a given scale, respectively. Subsequently, the adjacency matrix A ∈ R^{M×M} is computed as:

a_{i,j} = \frac{\exp(\cos(S_i, S_j)/\tau)}{\sum_{m=1}^{M} \exp(\cos(S_i, S_m)/\tau)}, \quad (7)

where S_i ∈ R^{KC} represents the semantic similarity between the i-th patch embedding and all description embeddings (i.e., the i-th row of S). We avoid constructing A based on patch coordinates or patch feature similarity, as such approaches might overlook few but significant patches when focusing only on Euclidean distance or patch feature similarity. Subsequent experimental results demonstrate the superior performance of our method for constructing A. We choose the Graph Convolutional Network (GCN) [39] as the graph learning model. The GCN operation in the l-th GCN layer is defined as:

F_{GCN}(A, H^{(l)}) = \sigma(\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} H^{(l)} W^{(l)}). \quad (8)

Here \tilde{A} = A + I, where I is the identity matrix, \sigma(\cdot) denotes an activation function, \tilde{D}_{i,i} = \sum_j \tilde{A}_{i,j}, and W^{(l)} is a layer-specific trainable weight matrix. H^{(l)} ∈ R^{M×d} contains the input embeddings of all nodes. Therefore, the patch embeddings after graph prompt tuning at the high and low scales are represented as:

\tilde{P}^{high} = F_{GCN}(A^{high}, P^{high}), \quad (9)

\tilde{P}^{low} = F_{GCN}(A^{low}, P^{low}). \quad (10)
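The graph construction of Eqs. (6)-(8) can be summarized in a few lines. The sketch below builds the similarity matrix S, the adjacency matrix A, and a single GCN layer; it assumes pre-computed patch and description embeddings at one scale, and the temperature value is illustrative rather than taken from the released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def cosine(a, b):
    """Pairwise cosine similarity between the rows of a and b."""
    return F.normalize(a, dim=-1) @ F.normalize(b, dim=-1).t()

def build_adjacency(patch_emb, desc_emb, tau=0.07):
    """Eqs. (6)-(7): image-text similarity-based adjacency matrix."""
    S = torch.softmax(cosine(patch_emb, desc_emb) / tau, dim=-1)   # (M, K*C)
    A = torch.softmax(cosine(S, S) / tau, dim=-1)                  # (M, M)
    return A, S

class GCNLayer(nn.Module):
    """One GCN layer, Eq. (8): sigma(D^-1/2 (A + I) D^-1/2 H W)."""

    def __init__(self, dim_in, dim_out):
        super().__init__()
        self.W = nn.Linear(dim_in, dim_out, bias=False)

    def forward(self, A, H):
        A_tilde = A + torch.eye(A.shape[0], device=A.device)       # A + I
        d_inv_sqrt = A_tilde.sum(dim=-1).rsqrt()                    # D^{-1/2}
        A_hat = d_inv_sqrt[:, None] * A_tilde * d_inv_sqrt[None, :]
        return torch.relu(self.W(A_hat @ H))
```

Applying such a layer at each scale produces the graph-tuned patch embeddings of Eqs. (9)-(10).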
F. Non-Parametric Cross-Guided Pooling

Impressed by the powerful zero-shot capability of pre-trained VLMs, we pondered the possibility of employing a similar non-parametric approach for instance aggregation. We propose Non-Parametric Cross-Guided Pooling (NPCGP) to aggregate instances into bag features. In NPCGP, we compute semantic similarities between the patch embeddings \tilde{P} after graph tuning and the pathological visual description embeddings Z, both at the same scale and across scales. The reason for calculating similarities both within the same scale and across scales is our concern that the pathological visual descriptions provided by the LLM may contain scale-related inaccuracies. Hence, this procedure serves to bolster the robustness of the feature aggregation strategy. Lastly, the bag-level unnormalized probability distribution Logits is obtained through the top-K max-pooling operator h_{topK}:

Logits_{high} = h_{topK}(\tilde{P}_{high} \cdot Z_{high}^{T}) + h_{topK}(\tilde{P}_{high} \cdot Z_{low}^{T}), \quad (11)

Logits_{low} = h_{topK}(\tilde{P}_{low} \cdot Z_{low}^{T}) + h_{topK}(\tilde{P}_{low} \cdot Z_{high}^{T}), \quad (12)

Logits_{overall} = \frac{1}{2}(Logits_{high} + Logits_{low}). \quad (13)

Following previous work [26], we use cross-entropy loss to optimize the three distributions Logits_{overall}, Logits_{high}, and Logits_{low}, but only Logits_{overall} is used during model inference.
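A minimal sketch of the cross-guided pooling of Eqs. (11)-(13) is given below. The text does not spell out the exact granularity of the top-K operator, so this sketch makes one plausible choice: for each class, the K highest patch-description similarities among that class's descriptions are averaged. Embeddings are assumed L2-normalized and class-major ordered, and the value of `k` is illustrative.

```python
import torch

def topk_pool(sim, num_classes, k=20):
    """h_topK of Eqs. (11)-(13): per-class top-K max pooling over patches.

    sim: (M, num_classes * C) patch-description similarities at one scale,
    with descriptions grouped class by class (an assumed layout).
    """
    M = sim.shape[0]
    per_class = sim.view(M, num_classes, -1).permute(1, 0, 2).reshape(num_classes, -1)
    k = min(k, per_class.shape[1])
    return per_class.topk(k, dim=-1).values.mean(dim=-1)            # (num_classes,)

def npcgp(p_high, p_low, z_high, z_low, num_classes, k=20):
    """Non-Parametric Cross-Guided Pooling: same-scale plus cross-scale terms."""
    logits_high = (topk_pool(p_high @ z_high.t(), num_classes, k)
                   + topk_pool(p_high @ z_low.t(), num_classes, k))  # Eq. (11)
    logits_low = (topk_pool(p_low @ z_low.t(), num_classes, k)
                  + topk_pool(p_low @ z_high.t(), num_classes, k))   # Eq. (12)
    logits_overall = 0.5 * (logits_high + logits_low)                # Eq. (13)
    return logits_overall, logits_high, logits_low
```

During training, the three logits are optimized with cross-entropy as described above; only Logits_overall is used at inference.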
IV. EXPERIMENTAL RESULTS

A. Experimental Settings

1) Datasets: To comprehensively assess the performance of our Multi-Scale and Context-focused Prompt Tuning (MSCPT), three real datasets from the Cancer Genome Atlas (TCGA) Data Portal were used: TCGA-NSCLC, TCGA-BRCA, and TCGA-RCC.

TCGA-NSCLC is a dataset of 1041 non-small cell lung cancer (NSCLC) WSIs, including 530 lung adenocarcinoma (LUAD) and 511 lung squamous cell carcinoma (LUSC) slides. 20% of the dataset (209 slides) is used for training, and the remaining 80% (832 slides) is used for testing.

TCGA-BRCA is a dataset comprising 1056 slides of breast invasive carcinoma (BRCA) WSIs. This dataset includes 845 slides of invasive ductal carcinoma (IDC) and 211 slides of invasive lobular carcinoma (ILC). 20% of them (223 slides) are randomly selected as the training set, and the remaining 80% (833 slides) are used as the testing set.

TCGA-RCC is a renal cell carcinoma (RCC) WSI dataset containing 873 slides. Precisely, it consists of 121 slides of chromophobe renal cell carcinoma (CHRCC), 455 slides of clear-cell renal cell carcinoma (CCRCC), and 297 slides of papillary renal cell carcinoma (PRCC). Likewise, 20% of the dataset (175 slides) is randomly taken out for training, while 698 slides are reserved for testing.

2) Evaluation Metrics: For all datasets, we leverage Accuracy (ACC), Area Under Curve (AUC), and macro F1-score (F1) to evaluate model performance. To reduce the impact of the data split on model evaluation, we follow ViLa-MIL [36] and employ five fixed seeds to perform five rounds of dataset splitting, model training, and testing. We report the mean and standard deviation of the metrics over the five seeds.

3) Model Zoo: Thirteen influential approaches were employed for comparison, including traditional MIL-based methods: Mean pooling, Max pooling, ABMIL [32], CLAM [31], TransMIL [1], DSMIL [33] and RRT-MIL [40]; prompt tuning methods for natural images: CoOp [25], CoCoOp [24] and Metaprompt [26]; and prompt tuning methods for WSIs: TOP [8] and ViLa-MIL [36]. To adapt them to WSI-level tasks, we integrated an attention-based instance aggregation module [32] into the prompt tuning methods designed for natural images, such as CoOp, CoCoOp, and Metaprompt.

4) Implementation Details: Following CLAM [31], the original WSIs were initially processed using the Otsu thresholding algorithm to remove the background parts. Subsequently, the WSIs were segmented into multiple non-overlapping patches of 256 × 256 pixels at 5× and 20× magnification levels. We applied our MSCPT to CLIP [14] and PLIP [22], both of which use ViT-B/16 [38] as their visual tower. Apart from MSCPT, Metaprompt, and DSMIL, which utilized inputs of both 5× and 20× magnification patches, the remaining methods solely relied on 20× magnification patches as inputs.

For all methods, the Adam optimizer was employed with a learning rate of 1e-4, a weight decay of 1e-5, and a batch size of 1. All methods were trained for a fixed number of epochs (100 for CLIP and 50 for PLIP) with early stopping. We chose GPT-4 [41] to generate pathological visual descriptions, providing 10 low-level visual descriptions and 30 high-level visual descriptions for each category of WSIs (i.e., C^l = 10 and C^h = 30). For MSCPT and Metaprompt, we utilized the zero-shot capability of the VLMs to select 30 patches for each category at 5× magnification. The lengths of the global prompts p_{glob} in both the image and text encoders were uniformly set to 2. In this paper, unless explicitly stated otherwise, all experiments are conducted with 16 training samples per category. All the work was conducted using the PyTorch library on a workstation with eight NVIDIA A800 GPUs. All codes and details are released at https://github.com/Hanminghao/MSCPT.
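For readers reproducing the evaluation protocol above (five fixed seeds, with AUC, macro F1, and ACC reported as mean ± standard deviation), a hedged sketch using scikit-learn is given below; the one-vs-rest convention for multi-class AUC is an assumption, as the paper does not state which variant it uses.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

def evaluate_split(y_true, y_prob):
    """AUC, macro F1 and ACC for one train/test split."""
    y_pred = y_prob.argmax(axis=1)
    if y_prob.shape[1] == 2:                   # binary task (e.g., NSCLC, BRCA)
        auc = roc_auc_score(y_true, y_prob[:, 1])
    else:                                      # multi-class task (e.g., RCC)
        auc = roc_auc_score(y_true, y_prob, multi_class="ovr")
    return auc, f1_score(y_true, y_pred, average="macro"), accuracy_score(y_true, y_pred)

def summarize(per_seed_results):
    """Mean and standard deviation over the fixed seeds."""
    arr = np.asarray(per_seed_results)         # shape (n_seeds, 3)
    return arr.mean(axis=0), arr.std(axis=0)
```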
B. Comparisons with State-of-the-Art

The experimental results under the 16-shot setting are displayed in Table I. We observed some intriguing insights, such as that complex and parameter-heavy methods like TransMIL and RRT-MIL underperformed despite their strong performance with full-data training. Conversely, less parameterized methods such as ABMIL and CLAM exhibited slightly better performance. This is because traditional MIL-based methods require a lot of WSIs for training, and the more parameters they have, the more training data is needed. Furthermore, after adapting the prompt tuning methods designed for natural images (i.e., CoOp, CoCoOp, and Metaprompt) to tasks at the WSI level, these methods outperform traditional MIL-based methods when based on CLIP and achieve comparable performance when using PLIP. Relatively few parameters contribute to this result. Additionally, we found that Metaprompt outperforms CoOp across most metrics, thanks to its integration of visual prompt tuning and multi-scale information. This result motivates us to pursue visual prompt tuning and develop more effective multi-scale information integration modules. Despite prompt tuning methods designed for Few-shot Weakly Supervised WSI Classification tasks having a relatively higher number of parameters, they exhibit the best performance. This is because the VLMs' prior knowledge is effectively exploited under the guidance of visual descriptive text prompts, alleviating the demand for extensive training data.

TABLE I. Cancer sub-typing results on TCGA-NSCLC, TCGA-BRCA, and TCGA-RCC. The highest performance is in bold, and the second-best performance is underlined. We provide mean and standard deviation results under five random seeds.

ImageNet Pretrained CLIP:
Methods | Trainable Param | NSCLC AUC | NSCLC F1 | NSCLC ACC | BRCA AUC | BRCA F1 | BRCA ACC | RCC AUC | RCC F1 | RCC ACC
Max-pooling | 197K | 63.80±6.84 | 60.40±4.76 | 60.70±4.75 | 60.42±4.35 | 56.40±3.58 | 68.552±6.54 | 84.51±3.21 | 65.83±2.72 | 69.26±2.33
Mean-pooling | 197K | 69.53±4.74 | 63.76±5.77 | 64.69±4.31 | 66.64±2.41 | 60.70±2.78 | 71.73±3.59 | 93.31±0.66 | 78.64±0.74 | 81.29±1.08
ABMIL [32] | 461K | 66.95±4.31 | 62.60±3.75 | 62.96±3.65 | 67.92±3.90 | 61.72±3.60 | 72.77±3.15 | 93.41±1.41 | 79.80±1.56 | 82.47±1.46
CLAM-SB [31] | 660K | 67.49±5.94 | 62.86±4.19 | 63.51±4.19 | 67.80±5.14 | 60.51±5.07 | 72.46±4.36 | 93.85±1.52 | 79.87±3.17 | 83.21±2.67
CLAM-MB [31] | 660K | 69.65±3.61 | 64.52±3.22 | 65.14±2.69 | 67.98±4.86 | 60.68±6.47 | 74.09±3.52 | 93.59±1.16 | 78.72±2.18 | 81.03±2.06
TransMIL [1] | 2.54M | 64.82±8.01 | 59.17±10.87 | 62.00±5.18 | 65.31±6.02 | 57.72±2.48 | 68.12±4.11 | 94.17±1.23 | 79.63±1.52 | 81.86±1.41
DSMIL [33] | 462K | 66.00±9.23 | 63.87±7.00 | 64.11±6.65 | 66.18±10.08 | 59.35±8.01 | 67.52±11.56 | 91.53±5.17 | 78.38±6.56 | 80.69±6.47
RRT-MIL [40] | 2.63M | 66.47±6.73 | 62.10±6.17 | 63.20±5.24 | 66.33±4.30 | 61.14±5.93 | 71.21±8.94 | 93.89±1.91 | 81.04±2.11 | 83.30±2.24
CoOp [25] | 337K | 69.06±4.06 | 63.87±3.77 | 64.27±3.55 | 68.86±3.45 | 61.64±2.40 | 72.10±3.22 | 94.18±1.72 | 79.88±2.40 | 82.15±1.96
CoCoOp [24] | 370K | 64.37±2.28 | 60.95±1.55 | 61.37±1.36 | 66.50±3.02 | 59.64±2.90 | 71.07±4.93 | 85.68±2.66 | 67.72±3.49 | 71.00±2.90
Metaprompt [26] | 360K | 75.94±3.01 | 70.35±3.09 | 70.41±3.09 | 69.12±4.12 | 63.39±4.28 | 74.65±7.20 | 94.18±1.56 | 80.03±2.06 | 82.52±2.15
TOP [8] | 2.11M | 73.56±3.14 | 68.19±1.22 | 68.77±2.53 | 69.75±4.66 | 61.32±6.12 | 71.68±2.56 | 93.56±1.22 | 79.66±1.97 | 80.79±1.05
ViLa-MIL [36] | 2.77M | 74.85±7.62 | 68.74±5.86 | 68.87±5.97 | 70.13±3.86 | 62.04±2.28 | 71.93±2.31 | 93.34±1.49 | 79.40±1.13 | 81.81±0.92
MSCPT (ours) | 1.35M | 78.67±3.93 | 72.47±3.13 | 72.67±2.96 | 74.56±4.54 | 65.59±1.85 | 75.82±2.38 | 95.04±1.31 | 83.78±2.19 | 85.62±2.14

Pathology Pretrained PLIP:
Methods | Trainable Param | NSCLC AUC | NSCLC F1 | NSCLC ACC | BRCA AUC | BRCA F1 | BRCA ACC | RCC AUC | RCC F1 | RCC ACC
Max-pooling | 197K | 71.78±4.13 | 66.40±3.51 | 66.66±3.42 | 66.66±2.36 | 60.32±2.24 | 71.57±4.83 | 95.18±0.63 | 81.63±0.92 | 84.30±1.30
Mean-pooling | 197K | 70.55±6.64 | 65.32±5.60 | 65.50±5.55 | 71.62±2.41 | 64.62±2.96 | 74.45±2.49 | 94.75±0.51 | 82.22±0.67 | 85.24±0.80
ABMIL [32] | 461K | 78.54±4.29 | 72.06±3.79 | 72.12±3.78 | 72.18±1.28 | 64.49±1.74 | 74.63±1.31 | 96.51±0.63 | 85.66±1.97 | 87.94±1.92
CLAM-SB [31] | 660K | 80.56±4.57 | 73.15±4.05 | 73.27±3.97 | 73.49±2.12 | 65.22±2.61 | 75.05±3.88 | 96.41±0.36 | 84.71±1.60 | 87.25±1.34
CLAM-MB [31] | 660K | 80.68±3.63 | 73.15±3.00 | 73.32±2.83 | 74.33±1.76 | 66.11±1.94 | 76.11±2.03 | 96.58±0.59 | 85.20±0.83 | 87.85±0.79
TransMIL [1] | 2.54M | 73.40±10.33 | 66.92±7.94 | 67.21±7.63 | 70.52±2.45 | 62.06±1.67 | 70.14±2.77 | 96.35±0.54 | 83.70±0.80 | 86.33±0.46
DSMIL [33] | 462K | 77.75±7.22 | 72.84±6.31 | 73.08±6.00 | 70.14±4.11 | 63.01±2.78 | 71.48±5.37 | 93.01±6.05 | 79.58±9.16 | 82.87±7.32
RRT-MIL [40] | 2.63M | 76.30±10.01 | 70.86±7.47 | 71.01±7.44 | 72.77±2.20 | 65.74±2.34 | 74.38±4.01 | 96.09±1.06 | 83.94±2.05 | 86.56±2.28
CoOp [25] | 337K | 77.92±5.48 | 71.58±4.74 | 71.63±4.75 | 73.77±2.83 | 64.88±1.26 | 74.14±3.38 | 95.76±0.80 | 83.23±2.07 | 85.90±1.63
CoCoOp [24] | 370K | 72.62±8.45 | 66.63±5.83 | 66.97±5.85 | 71.21±4.20 | 62.95±3.95 | 73.57±6.31 | 95.81±0.42 | 83.18±1.35 | 86.02±1.03
Metaprompt [26] | 360K | 78.31±5.66 | 72.03±4.60 | 71.86±4.61 | 73.98±2.15 | 65.50±2.05 | 75.56±4.58 | 95.75±0.48 | 83.52±1.46 | 86.62±1.43
TOP [8] | 2.11M | 78.91±3.79 | 72.33±4.89 | 72.91±4.61 | 74.06±2.66 | 65.17±2.16 | 76.51±1.79 | 95.06±0.51 | 82.86±1.35 | 86.14±0.98
ViLa-MIL [36] | 2.77M | 80.98±2.52 | 73.81±3.64 | 73.94±3.56 | 74.86±2.45 | 66.03±1.81 | 77.35±1.63 | 95.72±0.60 | 83.85±1.10 | 86.53±1.03
MSCPT (ours) | 1.35M | 84.29±3.97 | 76.39±5.69 | 76.54±5.49 | 75.55±5.25 | 67.46±2.43 | 79.14±2.63 | 96.94±0.36 | 87.01±1.51 | 89.28±1.22

Compared to other methods, our proposed MSCPT exhibits significant improvements in all evaluation metrics across the three datasets and two VLMs. Compared to the top-performing traditional MIL-based methods, MSCPT shows improvements of 0.3-13.0% in AUC, 2.0-12.3% in F1, and 1.5-11.6% in ACC across three datasets and two VLMs. Overall, MSCPT shows greater performance improvements when based on CLIP compared to PLIP. This is attributed to the specialized pre-training of PLIP on pathological images, which enhances its encoding capabilities for patches and thereby reduces the reliance on textual descriptions. Compared to the top-performing prompt tuning method suitable for natural images, MSCPT improved the AUC, F1, and ACC by 1.0-8.2%, 2.8-6.7%, and 1.6-6.9%.

Prompt tuning methods explicitly designed for WSIs exhibit superior performance. This is attributed to their incorporation of priors into pre-trained VLMs and leveraging those priors to guide prompt tuning. Additionally, ViLa-MIL introduces multi-scale information compared to TOP, positioning it as the second-best overall performer. In comparison to ViLa-MIL, MSCPT shows improvements across all datasets, with AUC increasing by 0.9-5.7%, F1 by 2.2-6.2%, and ACC by 2.3-5.5%. This is because MSCPT performs prompt tuning on both the textual and visual modalities. Furthermore, MSCPT takes into account both the multi-scale and contextual information of WSIs. Unlike the late fusion approach in ViLa-MIL, MSCPT employs an intermediate fusion method, leveraging the transformer layer and trainable global prompts to integrate pathological visual descriptions from both high and low levels.

C. Ablation Experiment

1) Effects of Each Component in MSCPT: To verify the effectiveness of the three core components, ablation experiments were conducted on the TCGA-NSCLC dataset based on PLIP; the experimental results are presented in Table II. When all modules were removed, MSCPT regresses to the baseline (i.e., Metaprompt). All metrics showed significant improvement (2.2%-2.9%) after adding the Multi-scale Hierarchical Prompt Tuning (MHPT) module to the baseline. This is because the MHPT module utilizes transformer layers to integrate pathological visual descriptions across different scales, enhancing the model's information aggregation capabilities. Building upon this, we introduced the Image-text Similarity-based Graph Prompt Tuning (ISGPT) module, which led to improvements in all metrics (2.4%-2.7%). This demonstrated that utilizing ISGPT for contextual learning also enhances model performance, reaffirming the importance of contextual information for WSI analysis. When we added both the MHPT and the Non-Parametric Cross-Guided Pooling (NPCGP) modules to the baseline, in comparison to solely adding MHPT, the metrics improved by 1.7%-1.8%. This indicates that the NPCGP module, compared to attention-based pooling, is more effective in identifying important patches within the WSI, resulting in better instance aggregation. When all modules work together, the baseline is transformed into MSCPT. MSCPT shows improvements of 7.6% in AUC, 6.1% in F1, and 6.5% in ACC compared to the baseline.

TABLE II. Core components ablation experiment on the TCGA-NSCLC dataset based on PLIP.
MHPT | ISGPT | NPCGP | AUC | F1 | ACC
- | - | - | 78.31±5.66 | 72.03±4.60 | 71.86±4.61
✓ | - | - | 80.57±4.17 | 73.62±2.41 | 73.77±2.48
✓ | ✓ | - | 82.75±5.73 | 75.48±4.70 | 75.53±4.72
✓ | - | ✓ | 81.92±5.03 | 74.96±4.68 | 75.10±4.64
✓ | ✓ | ✓ | 84.29±3.97 | 76.39±5.69 | 76.54±5.49

2) Effects of Graph Construction: To validate the effectiveness of building adjacency matrices based on image-text similarity, we used K-Nearest Neighbor (KNN) to create adjacency matrices based on patch coordinates or visual features, as in studies [29], [30]. Additionally, we tested the effectiveness of the GCN by comparing it with GAT [46] and GraphSAGE [47]. Experimental results on TCGA-RCC using CLIP are shown in Table III, where KNN(Coord.) and KNN(Feat.) refer to using KNN for constructing adjacency matrices based on patch coordinates and patch features, and Sim. signifies constructing adjacency matrices using image-text similarity. Switching to KNN to construct the adjacency matrix led to decreased performance across all metrics, whether based on coordinates or patch visual features. This decline may be attributed to the limited scope of connectivity in these adjacency matrix construction methods. Connecting only nearby patches based on coordinates restricts the GNN to the local context, while connecting visually similar patches based on their features may lack global information about interactions between different types of tissue organization. In contrast, our ISGPT module connects patches related to specific cancer types, overcoming local or visual similarity connection limitations and enabling a more comprehensive contextual understanding. When we replaced the GCN with GAT or GraphSAGE, the model's performance also experienced varying degrees of decline. We believe complex graph neural networks are unsuitable for few-shot scenarios.

TABLE III. Ablation experiment of different graph construction and training methods on the TCGA-RCC dataset based on CLIP.
Methods | Trainable Param | AUC | F1 | ACC
GCN+KNN(Coord.) | 1.35M | 92.85±2.43 | 79.29±3.98 | 81.63±3.74
GCN+KNN(Feat.) | 1.35M | 93.92±2.66 | 80.46±4.33 | 82.41±4.26
GAT+Sim. | 1.35M | 93.14±1.78 | 80.82±4.25 | 82.41±3.85
GraphSAGE+Sim. | 2.40M | 93.60±2.72 | 80.40±3.89 | 82.66±3.69
GCN+Sim. (ours) | 1.35M | 95.04±1.31 | 82.59±2.14 | 85.62±2.14

3) Effects of Instance Aggregation: To validate the effectiveness of our instance aggregation method, we compared Non-Parametric Cross-Guided Pooling (NPCGP) with other aggregation methods (i.e., Mean Pooling, Max Pooling, and Attention-based Pooling). The experimental results based on PLIP on TCGA-NSCLC are presented in Table IV. When we replaced NPCGP with other methods, the performance of the models decreased to varying degrees. This implies that our NPCGP can discern more impactful patches and aggregate them into bag features. The visualization results in Section IV-D support this point. We also conducted an ablation study on cross-guidance. Instead of computing cross-scale cosine similarity during feature aggregation, we only calculated cosine similarity between patch and description embeddings at the same scale. Removing cross-guidance led to a drop in performance across all metrics, likely because LLMs may produce descriptions with incorrect scales.

TABLE IV. Ablation experiment of different instance aggregation methods on the TCGA-NSCLC dataset based on PLIP.
Methods | AUC | F1 | ACC
Mean Pooling | 78.88±4.61 | 73.60±3.38 | 73.39±3.68
Max Pooling | 82.63±4.58 | 75.68±3.51 | 75.89±3.54
Attention-based Pooling | 82.75±5.73 | 75.48±4.70 | 75.53±4.72
NPCGP w/o cross-guidance | 83.61±5.68 | 75.98±5.35 | 75.81±5.21
NPCGP (ours) | 84.29±3.97 | 76.39±5.69 | 76.54±5.49

4) Effects of Large Language Models: To verify the impact of different LLMs on model performance, we compared the performance of MSCPT when using descriptions generated by different LLMs (i.e., Gemini-1.5-pro [42], Claude-3 [43], Llama-3 [44], GPT-3.5 [45] and GPT-4 [41]). The results obtained using CLIP on TCGA-RCC are presented in Table V. When generating descriptions using Claude-3, MSCPT performs comparably to the baseline. However, MSCPT outperforms the baseline when using the other LLMs. This demonstrates MSCPT's robustness across different LLMs and highlights that more accurate pathological visual descriptions lead to better model performance.

TABLE V. Results of different large language models on the TCGA-RCC dataset based on CLIP.
LLM | AUC | F1 | ACC
Gemini-1.5-pro [42] | 94.14±1.79 | 81.61±2.69 | 83.67±3.63
Claude-3 [43] | 93.97±2.35 | 80.61±4.05 | 82.72±3.59
Llama-3 [44] | 94.63±1.42 | 82.36±1.81 | 84.38±2.08
GPT-3.5 [45] | 94.52±2.16 | 82.09±3.07 | 84.27±3.61
GPT-4 [41] | 95.04±1.31 | 83.78±2.19 | 85.62±2.14
Fig. 4. Visualization of the original WSI, the similarity score map for patch selection, heatmaps generated using MSCPT, attention heatmaps of the baseline (i.e., Metaprompt) and the best-performing traditional MIL-based method, the selected high-similarity patches, and the patches with the highest similarity scores using MSCPT. The area surrounded by the red line in the original WSI is the tumor area.

D. Visualization

As shown in Fig. 4, we have visualized a case of TCGA-RCC based on PLIP and a case of TCGA-BRCA based on CLIP. As depicted in Fig. 4a, during patch selection, CLIP assigned high similarity scores not only to tumor regions but also to non-tumor areas. This outcome arose because CLIP was not specifically designed for pathological images, resulting in a less-than-optimal zero-shot capability for this type of imagery. However, after prompt tuning using MSCPT, the model correctly assigned high scores to the actual tumor regions, while the regions that originally received high scores dropped to lower score ranges (red arrows in Fig. 4a). Meanwhile, CLAM-MB struggled to differentiate tumor and non-tumor. Similarly, Metaprompt assigned high attention weights to certain non-tumor tissues (red arrows in Fig. 4a).

During patch selection using PLIP, the model could roughly identify tumor regions but also assigned high scores to a small number of non-tumor areas. However, this issue was mitigated with MSCPT (red arrows in Fig. 4b). While ABMIL could also determine instance importance, it tended to assign higher scores to certain non-tumor regions compared to MSCPT (yellow arrows in Fig. 4b). Due to PLIP's improved ability to represent pathological images, Metaprompt produced visualization results comparable to MSCPT.

E. Results with Fewer Training Samples

To further validate MSCPT's performance, we conducted experiments on TCGA-BRCA with 16, 8, 4, 2, and 1-shot settings. Based on the results in Table I, we selected several well-performing models (i.e., CLAM-MB, CoOp, Metaprompt, ViLa-MIL, and MSCPT) for these experiments. It is also worth noting that with limited training samples, sample selection significantly impacts model performance [8]. To address this, we conducted dataset splitting, model training, and testing using ten different seeds, excluding the two best and two worst results to calculate the average. Due to the sample imbalance in TCGA-BRCA, we report only AUC and macro F1-score, as shown in Fig. 5. When using CLIP as the base model, MSCPT underperforms compared to CLAM-MB and Metaprompt in the 1- and 2-shot settings, likely due to MSCPT's larger parameter size and CLIP's limited understanding of pathology descriptions. However, with 4 or more shots, MSCPT significantly outperforms the other methods. Additionally, when using PLIP as the base model, MSCPT consistently performs better than any other method.

Fig. 5. Experiments on TCGA-BRCA with 16, 8, 4, 2, and 1-shot settings. (a) (b) are CLIP-based results, while (c) (d) are PLIP-based results.

V. CONCLUSION

In this paper, we propose Multi-Scale and Context-focused Prompt Tuning (MSCPT) to solve the Few-shot Weakly-supervised WSI Classification (FSWC) task. MSCPT generates multi-scale pathological visual descriptions using GPT-4, guiding hierarchical prompt tuning and instance aggregation. Experiments on three WSI subtyping datasets and two Vision-Language models (VLMs) show that MSCPT achieves state-of-the-art results in FSWC tasks. Furthermore, MSCPT is applicable to fine-tune any VLM for WSI-level tasks. However, we find that the model performance varies significantly across different datasets and VLMs. That is because the performance of model fine-tuning largely depends on the pre-trained VLM itself. We look forward to the emergence of more comprehensive and powerful pre-trained pathological VLMs, which will significantly promote the development of FSWC tasks and even all of computational pathology.
REFERENCES

[1] Z. Shao, H. Bian, Y. Chen, Y. Wang, J. Zhang, X. Ji et al., "Transmil: Transformer based correlated multiple instance learning for whole slide image classification," NeurIPS, vol. 34, pp. 2136–2147, 2021.
[2] S. J. Wagner, D. Reisenbüchler, N. P. West, J. M. Niehues, G. P. Veldhuizen, P. Quirke, H. I. Grabsch, P. A. Brandt, G. G. Hutchins, S. D. Richman et al., "Fully transformer-based biomarker prediction from colorectal cancer histology: a large-scale multicentric study," arXiv preprint arXiv:2301.09617, 2023.
[3] X. Xing, M. Zhu, Z. Chen, and Y. Yuan, "Comprehensive learning and adaptive teaching: Distilling multi-modal knowledge for pathological glioma grading," Med. Image Anal., vol. 91, p. 102990, 2024.
[4] Q. Guo, L. Qu, J. Zhu, H. Li, Y. Wu, S. Wang, M. Yu, J. Wu, H. Wen, X. Ju et al., "Predicting lymph node metastasis from primary cervical squamous cell carcinoma based on deep learning in histopathologic images," Mod. Pathol., vol. 36, no. 12, p. 100316, 2023.
[5] A. K. Glaser, N. P. Reder, Y. Chen, E. F. McCarty, C. Yin, L. Wei, Y. Wang, L. D. True, and J. T. Liu, "Light-sheet microscopy for slide-free non-destructive pathology of large clinical specimens," Nat. Biomed. Eng., vol. 1, no. 7, p. 0084, 2017.
[6] J. A. Ludwig and J. N. Weinstein, "Biomarkers in cancer staging, prognosis and treatment selection," Nat. Rev. Cancer, vol. 5, no. 11, pp. 845–856, 2005.
[7] M. Ilse, J. M. Tomczak, and M. Welling, "Deep multiple instance learning for digital histopathology," in MICCAI. Elsevier, 2020, pp. 521–546.
[8] L. Qu, K. Fu, M. Wang, Z. Song et al., "The rise of ai language pathologists: Exploring two-level prompt learning for few-shot weakly-supervised whole slide image classification," NeurIPS, vol. 36, 2024.
[9] Z. Shao, Y. Chen, H. Bian, J. Zhang, G. Liu, and Y. Zhang, "Hvtsurv: Hierarchical vision transformer for patient-level survival prediction from whole slide image," in AAAI, vol. 37, no. 2, 2023, pp. 2209–2217.
[10] G. Campanella, M. G. Hanna, L. Geneslaw, A. Miraflor, V. Werneck Krauss Silva, K. J. Busam, E. Brogi, V. E. Reuter, D. S. Klimstra, and T. J. Fuchs, "Clinical-grade computational pathology using weakly supervised deep learning on whole slide images," Nat. Med., vol. 25, no. 8, pp. 1301–1309, 2019.
[11] L. Qu, Y. Ma, X. Luo, Q. Guo, M. Wang, and Z. Song, "Rethinking multiple instance learning for whole slide image classification: A good instance classifier is all you need," IEEE Trans. Circuits Syst. Video Technol., 2024.
[12] C. L. Srinidhi, O. Ciga, and A. L. Martel, "Deep neural network models for computational histopathology: A survey," Med. Image Anal., vol. 67, p. 101813, 2021.
[13] A. Shmatko, N. Ghaffari Laleh, M. Gerstung, and J. N. Kather, "Artificial intelligence in histopathology: enhancing cancer research and clinical oncology," Nat. Cancer, vol. 3, no. 9, pp. 1026–1038, 2022.
[14] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark et al., "Learning transferable visual models from natural language supervision," in ICML. PMLR, 2021, pp. 8748–8763.
[15] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "Bert: Pre-training of deep bidirectional transformers for language understanding," arXiv preprint arXiv:1810.04805, 2018.
[16] K. Singhal, T. Tu, J. Gottweis, R. Sayres, E. Wulczyn, L. Hou, K. Clark, S. Pfohl, H. Cole-Lewis, D. Neal et al., "Towards expert-level medical question answering with large language models," arXiv preprint arXiv:2305.09617, 2023.
[17] J. Lee, W. Yoon, S. Kim, D. Kim, S. Kim, C. H. So, and J. Kang, "Biobert: a pre-trained biomedical language representation model for biomedical text mining," Bioinformatics, vol. 36, no. 4, pp. 1234–1240, 2020.
[18] M. U. Khattak, H. Rasheed, M. Maaz, S. Khan, and F. S. Khan, "Maple: Multi-modal prompt learning," in CVPR, 2023, pp. 19113–19122.
[19] J. Li, D. Li, C. Xiong, and S. Hoi, "Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation," in ICML. PMLR, 2022, pp. 12888–12900.
[20] J. Cheng, X. Pan, K. Yang, S. Cao, B. Liu, Q. Yan, and Y. Yuan, "Gexmolgen: Cross-modal generation of hit-like molecules via large language model encoding of gene expression signatures," bioRxiv, 2024. [Online]. Available: https://www.biorxiv.org/content/early/2024/02/19/2023.11.11.566725
[21] M. Y. Lu, B. Chen, A. Zhang, D. F. Williamson, R. J. Chen, T. Ding, L. P. Le, Y.-S. Chuang, and F. Mahmood, "Visual language pretrained multiple instance zero-shot transfer for histopathology images," in CVPR, 2023, pp. 19764–19775.
[22] Z. Huang, F. Bianchi, M. Yuksekgonul, T. J. Montine, and J. Zou, "A visual–language foundation model for pathology image analysis using medical twitter," Nat. Med., vol. 29, no. 9, pp. 2307–2316, 2023.
[23] M. Y. Lu, B. Chen, D. F. Williamson, R. J. Chen, I. Liang, T. Ding, G. Jaume, I. Odintsov, L. P. Le, G. Gerber et al., "A visual-language foundation model for computational pathology," Nat. Med., vol. 30, pp. 863–874, 2024.
[24] K. Zhou, J. Yang, C. C. Loy, and Z. Liu, "Conditional prompt learning for vision-language models," in CVPR, 2022, pp. 16816–16825.
[25] K. Zhou, J. Yang, C. C. Loy, and Z. Liu, "Learning to prompt for vision-language models," Int. J. Comput. Vis., vol. 130, no. 9, pp. 2337–2348, 2022.
[26] C. Zhao, Y. Wang, X. Jiang, Y. Shen, K. Song, D. Li, and D. Miao, "Learning domain invariant prompt for vision-language models," IEEE Trans. Image Process., 2024.
[27] R. J. Chen, C. Chen, Y. Li, T. Y. Chen, A. D. Trister, R. G. Krishnan, and F. Mahmood, "Scaling vision transformers to gigapixel images via hierarchical self-supervised learning," in CVPR, June 2022, pp. 16144–16155.
[28] W. Shao, Y. Zuo, Y. Shi, Y. Wu, J. Tang, J. Zhao, L. Sun, Z. Lu, J. Sheng, Q. Zhu et al., "Characterizing the survival-associated interactions between tumor-infiltrating lymphocytes and tumors from pathological images and multi-omics data," IEEE Trans. Med. Imaging, 2023.
[29] R. J. Chen, M. Y. Lu, M. Shaban, C. Chen, T. Y. Chen, D. F. Williamson, and F. Mahmood, "Whole slide images are 2d point clouds: Context-aware survival prediction using patch-based graph convolutional networks," in MICCAI. Springer International Publishing, 2021, pp. 339–349.
[30] M. Han, X. Zhang, D. Yang, T. Liu, H. Kuang, J. Feng, and L. Zhang, "Multi-scale heterogeneity-aware hypergraph representation for histopathology whole slide images," 2024.
[31] M. Y. Lu, D. F. Williamson, T. Y. Chen, R. J. Chen, M. Barbieri, and F. Mahmood, "Data-efficient and weakly supervised computational pathology on whole-slide images," Nat. Biomed. Eng., vol. 5, no. 6, pp. 555–570, 2021.
[32] M. Ilse, J. Tomczak, and M. Welling, "Attention-based deep multiple instance learning," in ICML. PMLR, 2018, pp. 2127–2136.
[33] B. Li, Y. Li, and K. W. Eliceiri, "Dual-stream multiple instance learning network for whole slide image classification with self-supervised contrastive learning," in CVPR, 2021, pp. 14318–14328.
[34] Y. Zheng, R. H. Gindra, E. J. Green, E. J. Burks, M. Betke, J. E. Beane, and V. B. Kolachalama, "A graph-transformer for whole slide image classification," IEEE Trans. Med. Imaging, vol. 41, no. 11, pp. 3003–3015, 2022.
[35] C. Schuhmann, R. Beaumont, R. Vencu, C. Gordon, R. Wightman, M. Cherti, T. Coombes, A. Katta, C. Mullis, M. Wortsman et al., "Laion-5b: An open large-scale dataset for training next generation image-text models," NeurIPS, vol. 35, pp. 25278–25294, 2022.
[36] J. Shi, C. Li, T. Gong, Y. Zheng, and H. Fu, "Vila-mil: Dual-scale vision-language multiple instance learning for whole slide image classification," in CVPR, 2024, pp. 11248–11258.
[37] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in CVPR, 2016, pp. 770–778.
[38] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly et al., "An image is worth 16x16 words: Transformers for image recognition at scale," in ICLR, 2021.
[39] T. N. Kipf and M. Welling, "Semi-supervised classification with graph convolutional networks," arXiv preprint arXiv:1609.02907, 2016.
[40] W. Tang, F. Zhou, S. Huang, X. Zhu, Y. Zhang, and B. Liu, "Feature re-embedding: Towards foundation model-level performance in computational pathology," in CVPR, June 2024, pp. 11343–11352.
[41] J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat et al., "Gpt-4 technical report," arXiv preprint arXiv:2303.08774, 2023.
[42] G. Team, R. Anil, S. Borgeaud, Y. Wu, J.-B. Alayrac, J. Yu, R. Soricut, J. Schalkwyk, A. M. Dai, A. Hauth et al., "Gemini: a family of highly capable multimodal models," arXiv preprint arXiv:2312.11805, 2023.
[43] Anthropic, "Claude 3 haiku: our fastest model yet," 2024. [Online]. Available: https://www.anthropic.com/news/claude-3-haiku
[44] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar et al., "Llama: Open and efficient foundation language models," arXiv preprint arXiv:2302.13971, 2023.
[45] T. B. Brown, "Language models are few-shot learners," arXiv preprint arXiv:2005.14165, 2020.
[46] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Lio, and Y. Bengio, "Graph attention networks," arXiv preprint arXiv:1710.10903, 2017.
[47] W. Hamilton, Z. Ying, and J. Leskovec, "Inductive representation learning on large graphs," NeurIPS, vol. 30, 2017.
