¹ School of Computer Science, The University of Sydney
² School of Software Engineering, South China University of Technology
³ Department of Data Science & AI, Monash University
⁴ School of Computer Science, Northwestern Polytechnical University
⁵ Department of Computer Science, University of Rochester
Email: {wenxi.yue, jing.zhang1, kun.hu, zhiyong.wang}@sydney.edu.au, [email protected], [email protected], [email protected], [email protected]

SurgicalPart-SAM: Part-to-Whole Collaborative Prompting for Surgical Instrument Segmentation

Wenxi Yue¹, Jing Zhang¹ (Corresponding Author), Kun Hu¹, Qiuxia Wu², Zongyuan Ge³, Yong Xia⁴, Jiebo Luo⁵, Zhiyong Wang¹

Abstract

The Segment Anything Model (SAM) exhibits promise in generic object segmentation and offers potential for various applications. Existing methods have applied SAM to surgical instrument segmentation (SIS) by tuning SAM-based frameworks with surgical data. However, they fall short in two crucial aspects: (1) Straightforward model tuning with instrument masks treats each instrument as a single entity, neglecting their complex structures and fine-grained details; and (2) Instrument category-based prompts are not flexible and informative enough to describe instrument structures. To address these problems, in this paper, we investigate text promptable SIS and propose SurgicalPart-SAM (SP-SAM), a novel SAM efficient-tuning approach that explicitly integrates instrument structure knowledge with SAM’s generic knowledge, guided by expert knowledge on instrument part compositions. Specifically, we achieve this by proposing (1) Collaborative Prompts that describe instrument structures via collaborating category-level and part-level texts; (2) Cross-Modal Prompt Encoder that encodes text prompts jointly with visual embeddings into discriminative part-level representations; and (3) Part-to-Whole Adaptive Fusion and Hierarchical Decoding that adaptively fuse the part-level representations into a whole for accurate instrument segmentation in surgical scenarios. Built upon them, SP-SAM acquires a better capability to comprehend surgical instruments in terms of both overall structure and part-level details. Extensive experiments on both the EndoVis2018 and EndoVis2017 datasets demonstrate SP-SAM’s state-of-the-art performance with minimal tunable parameters. The code will be available at https://github.com/wenxi-yue/SurgicalPart-SAM.

Keywords: Segment Anything Model · Surgical Instrument Segmentation · Efficient-Tuning

1 Introduction

Figure 1: SP-SAM with Collaborative Prompts incorporates the knowledge of surgical instrument structures. Subfigure (e) is partially excerpted from [2].

Surgical instrument segmentation (SIS) aims to accurately identify and delineate surgical instruments in operative scenes. It plays a foundational role for many downstream applications, such as surgical planning [7], robotic navigation [43], and skill assessment [24]. We identify two primary problems with the existing methods for this task (Fig. 1(a)). First, they often develop specialist models [32, 20, 46, 15, 47, 5, 3, 48] that require training a large number of parameters, leading to high development costs. Second, current methods lack the capability of human-computer interaction that is highly desired in surgical practice [6, 4].

The Segment Anything Model (SAM) [22] is a pioneering foundation model for promptable segmentation. It holds great potential for addressing the above problems owing to its rich pre-trained knowledge and interactivity. However, employing SAM for surgical instrument segmentation in a zero-shot manner (Fig. 1(b)) poses significant challenges. Firstly, zero-shot frameworks of SAM, including detection-based (MaskTrack-RCNN [39]/Mask2Former [9] + SAM), tracking-based (TrackAnything [38]), and reference-based (PerSAM [45]) frameworks, have demonstrated inferior generalisation on surgical instruments [42]. This deficiency is mainly due to the insufficient surgical data in SAM pre-training and the notable domain disparity between natural objects and surgical instruments. Specifically, compared to generic objects, surgical instruments present more intricate structures and fine-grained details, exacerbating the challenge of generalising SAM to this specialised domain. Secondly, SAM’s reliance on point-or-box prompts is impractical in surgical settings, where it is infeasible for surgeons to provide such prompts for every instrument in each frame.

Initial attempts have been made to address these problems. Yue et al. [42] propose SurgicalSAM (Fig. 1(c)), an instrument category-prompted SAM framework efficiently tuned with surgical data. Additionally, Wang et al. [34] propose an efficient-tuning approach for SAM for SIS employing fixed default prompt embeddings. However, these methods suffer from two crucial problems. First, their straightforward tuning approach using whole instrument masks treats each instrument as a single entity and cannot explicitly handle the complex structures and details of instruments. Despite well-established expert knowledge on instrument structure compositions, they fail to incorporate these insights during tuning. Second, they depend on instrument category prompts or fixed default prompts, which lack flexibility and intuitiveness for surgeon-computer interaction and fail to provide informative descriptions of instrument structures. Instead, more flexible and informative prompts such as text are preferred.

In this paper, we explore text promptable surgical instrument segmentation and propose a novel framework, SurgicalPart-SAM (SP-SAM) (Fig. 1(d)), to address the above problems. Specifically, we recognise the well-established expert knowledge regarding the compositions of surgical instrument parts, e.g., Large Needle Driver is composed of shaft, wrist, and tip, Monopolar Curved Scissors is composed of shaft and tip, etc. In SP-SAM, we aim to harness this expert knowledge to guide the tuning of SAM to improve its capability to comprehend instrument structures and identify subtle details.

To integrate part-level information, we first introduce a new form of text prompt, namely Collaborative Prompts, which utilises a text description set: {[part name] of [instrument category name]} for all parts of an instrument category, collaborating category-level and part-level text descriptions. Contrasted with prompting solely with instrument category names, Collaborative Prompts effectively enables the integration of more precise and fine-grained instrument part information (Fig. 1(e)). Next, to correlate the Collaborative Prompts with the instrument parts in the image, we introduce a Cross-Modal Prompt Encoder to learn part-level representations via interaction between the Collaborative Prompts and the image embedding. This enables focused learning of fine-grained features for each instrument part, thereby enhancing the segmentation of subtle details. Finally, we propose Part-to-Whole Adaptive Fusion and Hierarchical Decoding to fuse representations of all instrument parts into a whole and decode them into segmentation masks, capturing both the global structure and the compositional components.

Note that part-to-whole fusion is non-trivial due to two inherent challenges in surgical scenarios: (1) the varying part compositions across instrument categories, and (2) the frequent occlusions of instruments. These challenges necessitate adaptive fusion of different parts for each instrument in the surgical scene. Therefore, within the Part-to-Whole Adaptive Fusion module, we propose Category Part Attention and Image Part Attention. The former adapts category-specific part weightings to accommodate diverse part compositions across categories, while the latter learns adaptive image-specific part weightings to handle occluded or out-of-view parts in the image. By integrating all components, SP-SAM exhibits a strong capability to adaptively comprehend surgical instrument structures, identify subtle details, and discriminate between fine-grained categories. In summary, our contributions are:

• We introduce a novel SAM efficient-tuning approach, SurgicalPart-SAM (SP-SAM), for text promptable surgical instrument segmentation. SP-SAM utilises well-established expert knowledge of surgical instrument part compositions to guide SAM tuning, explicitly addressing the structural complexity and subtle details of surgical instruments, thereby enhancing generalisability.
• We introduce Collaborative Prompts, Cross-Modal Prompt Encoder, and Part-to-Whole Adaptive Fusion and Hierarchical Decoding, to achieve multi-modal embedding learning at both the part level and the category level. These designs enhance comprehension of instrument structures and details during SAM tuning.
• We propose Category Part Attention and Image Part Attention to integrate category-specific and image-specific weights for adaptively fusing instrument part representations. These mechanisms respectively address two critical challenges in surgical scenarios: the varying part compositions across instrument categories and the frequent occlusions of instruments.
• We conduct extensive experiments on the challenging EndoVis2018 and EndoVis2017 datasets and show that SP-SAM achieves state-of-the-art performance with only a small number of training parameters.

2 Related Work

2.1 Surgical Instrument Segmentation

Most surgical instrument segmentation methods focus on developing specialist models. Early research adopts a semantic segmentation pipeline with the pioneering work TernausNet introducing a U-Net based encoder-decoder model [32]. Subsequent developments include feature pyramid attention [29] and flow-based temporal priors [20, 46]. An alternative strategy to semantic segmentation is instance segmentation. ISINet adopts Mask-RCNN [15, 16] for this task, which is later enhanced by Baby et al. [5] with a specialised classification module. In addition, TraSeTR utilises a track-to-segment transformer with tracking cues [47] and MATIS employs Mask2Former with a temporal consistency module [3, 9]. Recently, Zhou et al. [48] introduce TP-SIS, a text promptable framework exploiting the pre-trained vision-language model CLIP [30]. Despite the variety of specialist models, they all involve fully training a complete set of model parameters, resulting in high development costs.

To enhance model generalisability and reduce training costs, there is a growing interest in adapting pre-trained foundation models for SIS. SurgicalSAM is proposed as a category-prompted SAM framework efficiently tuned with surgical data [42], while Wang et al. [34] propose an efficient-tuning method for SAM using fixed default embeddings as prompts. However, these approaches rely on less informative prompts and overlook the intricate structures and subtle details of surgical instruments during SAM tuning. In contrast, our SP-SAM employs more informative Collaborative Prompts in text form to explicitly leverage expert knowledge of instrument part compositions to guide the tuning of SAM, enhancing SAM’s comprehension of surgical instruments compared to [42, 34].

2.2 Text Promptable Segmentation

In contrast to traditional segmentation that solely relies on pre-defined class labels, text promptable segmentation uses natural language as prompts that can offer richer contextual information, improved generalisation, and greater flexibility. Early works are primarily based on Convolutional Neural Networks and Recurrent Neural Networks and propose attention mechanisms for extracting and relating visual and textual features [18, 23, 31, 41]. More recent approaches utilise Transformers to perform feature extraction and multi-modal feature fusion [14, 36, 21, 40, 12]. Recently, to leverage the rich knowledge from large-scale pre-training, large vision-language models such as CLIP [30] are utilised for this task [35, 25]. Zhou et al. [48] employ CLIP [30] and introduce TP-SIS, the first text promptable framework for SIS. However, in TP-SIS [48], instrument part masks are used straightforwardly as supervisory signals, neglecting the structural dependencies associated with the parts. Moreover, TP-SIS requires fine-tuning the entire CLIP Image Encoder, resulting in high training costs. In contrast, our SP-SAM explicitly explores category-specific and image-specific part dependencies by incorporating expert knowledge on instrument structures and requires only a very small number of training parameters.

2.3 Segment Anything Model

SAM is recognised as the pioneering foundation model for image segmentation. Owing to extensive pre-training on large-scale data, SAM exhibits impressive generalisation capabilities on various downstream tasks [22]. However, its zero-shot performance in medical contexts tends to fall short due to the significant disparity between natural and medical subjects [11, 17, 27, 19, 10, 42]. Moreover, SAM’s reliance on precise per-frame point-or-box prompts for segmentation [10, 33] requires extensive manual input, infeasible in many medical scenarios, e.g. during surgery. To mitigate the gap between natural and medical domains, some studies have fine-tuned SAM with domain-specific data. However, these methods either have limited interactivity [44, 8, 34], require labour-intensive per-frame points or bounding boxes for prompting [26, 37], or rely on inflexible category IDs [42]. In contrast to these approaches, in SP-SAM we propose Collaborative Prompts that integrate category-level and part-level texts. This method offers a more intuitive and flexible approach for surgeon-computer interaction, enables informative descriptions of instrument structures, and introduces additional cues to SAM from the language modality.

3 Method

In this paper, we address the task of text promptable surgical instrument segmentation. Instrument category names are suboptimal as text prompts due to their coarse nature and lack of structural cues. Therefore, we introduce Collaborative Prompts that combine both category and part information of surgical instruments. To maximise the potential of these Collaborative Prompts and integrate instrument structure information with SAM’s generic knowledge, we propose a part-to-whole collaborative prompting pipeline based on SAM, namely SP-SAM. Given a surgical image $I\in\mathbb{R}^{H\times W\times 3}$ of size $H\times W$ and Collaborative Prompts $T^{(c)}$ for an instrument category $c$ , SP-SAM predicts the binary mask $M^{(c)}\in\{0,1\}^{H\times W}$ for the instrument.

With the Collaborative Prompts, instrument structure information can be easily integrated by establishing a category-part relation matrix $\mathcal{D}_{CP}\in\{0,1\}^{C\times P}$ , where $C$ and $P$ denote the numbers of surgical instrument categories and instrument parts, respectively, and each element $d_{cp}$ in $\mathcal{D}_{CP}$ indicates the presence of part $p$ in category $c$ . For instance, Monopolar Curved Scissors (Fig. 1(e) middle instrument), composed of shaft and tip parts, would have 1s for these parts and 0s for absent parts like the wrist. SP-SAM leverages the expert knowledge on instrument structure $\mathcal{D}_{CP}$ for accurate surgical segmentation.
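As a minimal sketch of how such a relation matrix could be assembled, the snippet below encodes the two compositions mentioned in the text (Large Needle Driver and Monopolar Curved Scissors). The three-part vocabulary, the dictionary layout, and the helper name `build_relation_matrix` are illustrative assumptions, not the datasets' full part lists or the authors' code.

```python
# Illustrative part vocabulary: only the three parts named in the text.
# The real datasets annotate more parts; this list is an assumption.
PARTS = ["shaft", "wrist", "tip"]

# Expert knowledge on part compositions, as stated in the paper's examples.
COMPOSITIONS = {
    "Large Needle Driver": ["shaft", "wrist", "tip"],
    "Monopolar Curved Scissors": ["shaft", "tip"],
}


def build_relation_matrix(compositions, parts):
    """Return (categories, D_CP), where D_CP[c][p] is 1 if part p occurs in category c."""
    categories = list(compositions)
    matrix = [
        [1 if part in compositions[category] else 0 for part in parts]
        for category in categories
    ]
    return categories, matrix


categories, d_cp = build_relation_matrix(COMPOSITIONS, PARTS)
for category, row in zip(categories, d_cp):
    print(category, row)
```

Each row of `d_cp` plays the role of $\mathcal{D}_{c*}$ for one category; the wrist entry for Monopolar Curved Scissors is 0 because that part is absent.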

As shown in Fig. 2, SP-SAM consists of four key components: (1) a frozen SAM Image Encoder that extracts image embeddings from the given image, (2) a Cross-Modal Prompt Encoder (Sec. 3.1) that extracts part embeddings from Collaborative Prompts and generates part sparse and dense embeddings through cross-modal interaction, (3) a Part-to-Whole Adaptive Fusion module (Sec. 3.2) that combines part sparse and dense embeddings into whole sparse and dense embeddings through Category Part Attention and Image Part Attention, considering category-specific and image-specific part contributions, respectively, and (4) a SAM Decoder for Hierarchical Decoding (Sec. 3.3) that decodes these embeddings into masks, thereby enhancing the model’s comprehension of instruments both as a whole and at the part level.

3.1 Cross-Modal Prompt Encoder

The Cross-Modal Prompt Encoder takes the Collaborative Prompts and image embedding as input and performs cross-modal interaction between them via spatial attention, generating part sparse embeddings and part dense embeddings. As shown in Fig. 3, this process can be divided into two steps: feature extraction of Collaborative Prompts and part-level cross-modal encoding.

Feature Extraction of Collaborative Prompts. We introduce a new type of text prompt for surgical instruments that collaboratively integrates both category and part information, namely Collaborative Prompts. Specifically, the Collaborative Prompts for an instrument of category $c$ are formulated into a set of texts containing all $P$ parts: $T^{(c)}=\{[part_{p}]\text{ of }[instrument\text{ }category_{c}]\}_{p=1}^{P}$ , where $instrument\text{ }category_{c}$ and $part_{p}$ represent the names in text for instrument category $c$ and part $p$ , respectively. Next, $T^{(c)}$ is encoded by the CLIP Text Encoder [30] into text-based CLIP part embeddings $\mathcal{T}^{part}_{clip}\in\mathbb{R}^{P\times d_{clip}}$ . A challenge here is the inherent distribution mismatch between the embedding spaces of SAM and CLIP. To transfer the CLIP text embeddings into SAM’s embedding space, a tunable Transfer MLP is devised and applied to $\mathcal{T}^{part}_{clip}$ , leading to the transferred embeddings for the parts, namely part embeddings $\mathcal{T}^{part}\in\mathbb{R}^{P\times d}$ , where $d$ matches the number of embedding channels of SAM’s image features.
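The prompt template above can be illustrated with a few lines of Python. The function name `collaborative_prompts` and the three-part list are hypothetical; the text encoding and the Transfer MLP are deliberately left out of this sketch.

```python
def collaborative_prompts(category, parts):
    """Build the text set {"[part] of [instrument category]"} over all P parts."""
    return [f"{part} of {category}" for part in parts]


# All P parts are included, even ones the category does not contain;
# absent parts are handled later through the category-part relation matrix.
prompts = collaborative_prompts(
    "Monopolar Curved Scissors", ["shaft", "wrist", "tip"]
)
for text in prompts:
    print(text)
```

In the full pipeline each of these strings would be passed to the CLIP Text Encoder, giving one embedding per part.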

Part-Level Cross-Modal Encoding. In this step, the part embeddings $\mathcal{T}^{part}$ interact with the image embedding via spatial attention, and the obtained part-activated features are used to generate part sparse and dense embeddings. Specifically, the SAM Image Encoder extracts the image embedding $\mathcal{F}_{I}\in\mathbb{R}^{h\times w\times d}$ , where $h\times w$ is the feature size. We then design a spatial attention mechanism by computing a similarity map for each part, leading to $\mathcal{S}=\mathcal{T}^{part}\times\mathcal{F}_{I}^{\top}\in\mathbb{R}^{P\times h\times w}$ , where $\top$ denotes a transpose operator. These similarity maps serve as part-aware spatial attention to activate the image embedding, augmenting $\mathcal{F}_{I}$ into $\mathcal{F}^{\prime}_{I}=\mathcal{S}\circ\mathcal{F}_{I}+\mathcal{F}_{I}\in\mathbb{R}^{P\times h\times w\times d}$ , where $\mathcal{F}_{I}$ and $\mathcal{S}$ are broadcast to the same size and $\circ$ denotes the Hadamard product. The part-activated features $\mathcal{F}^{\prime}_{I}$ , containing information of both the image and the Collaborative Prompts, are used to compute part sparse embeddings $\mathcal{F}_{S}^{part}\in\mathbb{R}^{P\times n\times d}$ and part dense embeddings $\mathcal{F}_{D}^{part}\in\mathbb{R}^{P\times h\times w\times d}$ with a two-layer MLP and a three-layer CNN, respectively. Here $n$ represents the number of sparse tokens for each part. These embeddings are then fed into the SAM Decoder to segment the corresponding instrument parts.
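The shapes involved in the spatial attention step are easier to follow in code. The NumPy sketch below computes the similarity maps and the part-activated features for random stand-in tensors; the toy dimensions and the function name are assumptions, and the subsequent two-layer MLP and three-layer CNN are omitted.

```python
import numpy as np


def part_activated_features(part_emb, image_emb):
    """part_emb: (P, d), image_emb: (h, w, d).

    Returns the similarity maps S of shape (P, h, w) and the
    part-activated features F'_I of shape (P, h, w, d).
    """
    # S[p, i, j] is the dot product between part p and the feature at (i, j).
    sim = np.einsum("pd,hwd->phw", part_emb, image_emb)
    # Broadcast S over channels and F_I over parts, then add the residual F_I.
    activated = sim[..., None] * image_emb[None] + image_emb[None]
    return sim, activated


rng = np.random.default_rng(0)
P, h, w, d = 3, 4, 5, 8  # toy sizes chosen for readability only
T_part = rng.standard_normal((P, d))
F_I = rng.standard_normal((h, w, d))

S, F_act = part_activated_features(T_part, F_I)
print(S.shape, F_act.shape)
```

Because of the residual term, each part-specific feature map equals the image embedding scaled pixel-wise by one plus the similarity for that part.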

3.2 Part-to-Whole Adaptive Fusion

In the Part-to-Whole Adaptive Fusion module, the sparse and dense embeddings for all parts are adaptively fused to form the whole sparse and dense embeddings, $\{\mathcal{F}_{S},\mathcal{F}_{D}\}$ , for the segmentation of the entire instrument. The adaptive fusion is achieved through Category Part Attention and Image Part Attention, as shown in Fig. 4. Specifically, the part sparse and dense embeddings consist of the prompt embeddings of all $P$ parts. However, as established in $\mathcal{D}_{CP}$ , instruments of different categories encompass different part compositions. Therefore, we propose a Category Part Attention that utilises the part weights for the prompted category $c$ in $\mathcal{D}_{CP}$ , i.e., $\mathcal{D}_{c*}=\{d_{cp}\}_{p=1}^{P}\in\mathbb{R}^{1\times P}$ , as the weights to fuse the sparse and dense embeddings from the part level to the whole level. Note that $\mathcal{D}_{CP}$ is initialised with 0s and 1s but is updated dynamically during model training.

While the Category Part Attention provides category-specific part weights, the presence and contribution of each part to an instrument can vary significantly across images due to different fields of view and occlusion conditions. Therefore, it is necessary to adapt the part-to-whole fusion to the condition of each image. Accordingly, we propose Image Part Attention to compute image-specific part weights by learning a global descriptor of the image and computing its similarity with the part embeddings. Particularly, the global descriptor $\mathcal{F}_{G}\in\mathbb{R}^{1\times d}$ is learned from image embedding $\mathcal{F}_{I}$ with a Global CNN that consists of three convolutional layers and a linear layer. Then, image-specific part weights are computed as: $\mathcal{W}=\mathcal{F}_{G}\times\mathcal{T}^{part\top}\in\mathbb{R}^{1\times P}$ .

Finally, given category-specific part weights $\mathcal{D}_{c*}$ and image-specific part weights $\mathcal{W}$ , we fuse the sparse and dense embeddings $\{\mathcal{F}_{S}^{part},\mathcal{F}_{D}^{part}\}$ of the parts into the sparse and dense embeddings $\{\mathcal{F}_{S},\mathcal{F}_{D}\}$ of the whole instrument. Note that the matrices are all broadcast to the same size prior to the Hadamard product.

$\mathcal{F}_{S}=\mathcal{F}^{part}_{S}\circ\mathrm{ReLU}(\mathcal{D}_{c*})\in\mathbb{R}^{P\times n\times d},$ (1)
$\mathcal{F}^{\prime}_{D}=\mathcal{F}^{part}_{D}\circ\mathcal{D}_{c*}\circ\mathcal{W}\in\mathbb{R}^{P\times h\times w\times d},$ (2)
$\mathcal{F}_{D}=\sum_{p=1}^{P}\mathcal{F}^{\prime}_{D}\in\mathbb{R}^{h\times w\times d}.$ (3)
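Eqs. (1) to (3), together with the image-specific weights $\mathcal{W}=\mathcal{F}_{G}\times\mathcal{T}^{part\top}$, can be sketched in NumPy as follows. All tensors are random stand-ins, the function name `fuse_parts` is hypothetical, and the Global CNN that would produce $\mathcal{F}_{G}$ is not modelled here.

```python
import numpy as np


def fuse_parts(F_S_part, F_D_part, d_c, F_G, T_part):
    """Fuse part embeddings into whole-instrument embeddings.

    F_S_part: (P, n, d), F_D_part: (P, h, w, d), d_c: (P,),
    F_G: (1, d), T_part: (P, d).
    """
    # Image Part Attention: one weight per part from the global descriptor.
    W = (F_G @ T_part.T).reshape(-1)                         # (P,)
    # Eq. (1): sparse embeddings weighted by ReLU of the category weights.
    F_S = F_S_part * np.maximum(d_c, 0.0)[:, None, None]     # (P, n, d)
    # Eq. (2): dense embeddings weighted by category and image weights.
    F_D_prime = F_D_part * (d_c * W)[:, None, None, None]    # (P, h, w, d)
    # Eq. (3): sum over the part axis.
    F_D = F_D_prime.sum(axis=0)                              # (h, w, d)
    return F_S, F_D, W


rng = np.random.default_rng(0)
P, n, h, w, d = 3, 2, 4, 5, 8  # toy sizes, not the paper's settings
F_S_part = rng.standard_normal((P, n, d))
F_D_part = rng.standard_normal((P, h, w, d))
T_part = rng.standard_normal((P, d))
F_G = rng.standard_normal((1, d))
d_c = np.array([1.0, 0.0, 1.0])  # e.g. shaft and tip present, wrist absent

F_S, F_D, W = fuse_parts(F_S_part, F_D_part, d_c, F_G, T_part)
print(F_S.shape, F_D.shape, W.shape)
```

With the initial binary weights, the absent part contributes nothing to either embedding; once $\mathcal{D}_{CP}$ is updated during training, its entries are no longer restricted to 0 and 1.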

3.3 Hierarchical Decoding

The sparse and dense embeddings at both the whole level and the part level are fed into the SAM Decoder for hierarchical decoding. This explicitly directs the model to learn both the overall structures and the part characteristics of surgical instruments. The final loss function thus comprises the Dice losses $\mathcal{L}_{D}$ [28] for both the whole segmentation mask and the part segmentation masks:

$\mathcal{L}=\mathcal{L}_{D}(M^{(c)},G^{(c)})+\sum_{p=1}^{P}d_{cp}\mathcal{L}_{D}(M^{(c)}_{p},G^{(c)}_{p}),$ (4)
$\mathcal{L}_{D}(M,G)=\frac{2\sum_{i=1}^{HW}m_{i}g_{i}}{\sum_{i=1}^{HW}m_{i}^{2}+\sum_{i=1}^{HW}g_{i}^{2}},$ (5)

where $m_{i}$ and $g_{i}$ denote the predicted logit value and the ground-truth binary value at pixel $i$ , respectively. $M^{(c)}$ and $\{M^{(c)}_{p}\}_{p=1}^{P}$ are the predicted masks of the instrument and its parts, respectively. $G^{(c)}$ and $\{G^{(c)}_{p}\}_{p=1}^{P}$ are the ground-truth masks of the instrument and its parts, respectively.
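The following plain-Python sketch evaluates Eqs. (4) and (5) literally on flattened toy masks. The function names and the zero-denominator guard are assumptions. Note also that Eq. (5) as written is an overlap ratio that equals 1 for a perfect match; implementations commonly minimise one minus this ratio, and which form is used in training is not stated here.

```python
def dice_term(pred, target):
    """Eq. (5) as written: 2 * sum(m * g) / (sum(m^2) + sum(g^2))."""
    intersection = sum(m * g for m, g in zip(pred, target))
    denominator = sum(m * m for m in pred) + sum(g * g for g in target)
    # Guard against empty masks; this convention is an assumption.
    return 2.0 * intersection / denominator if denominator else 0.0


def hierarchical_objective(whole_pred, whole_gt, part_preds, part_gts, d_c):
    """Eq. (4): whole-instrument term plus part terms weighted by d_cp."""
    total = dice_term(whole_pred, whole_gt)
    for weight, pred, gt in zip(d_c, part_preds, part_gts):
        total += weight * dice_term(pred, gt)
    return total


# Toy 4-pixel example with two parts; the second part is absent (d_cp = 0).
whole_gt = [1, 1, 0, 0]
part_gts = [[1, 1, 0, 0], [0, 0, 0, 0]]
value = hierarchical_objective(whole_gt, whole_gt, part_gts, part_gts, [1, 0])
print(value)
```

The weighting by $d_{cp}$ means that parts which do not belong to the prompted category are excluded from the part-level supervision.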

4 Experiments

4.1 Datasets and Evaluation Metrics

The effectiveness of SP-SAM is validated using the EndoVis2018 [1] and EndoVis2017 [2] datasets. EndoVis2018 is composed of 11 training videos and four validation videos each with 149 frames, on which we follow the standard experiment and evaluation protocols defined in [32] and [15] to ensure a fair comparison with existing methods. EndoVis2017 contains eight training videos each with 255 frames and ten testing sequences with 900 frames in total. We adopt two evaluation protocols on EndoVis2017 for a fair comparison with different works: (1) average results of four-fold cross-validation, as per [32]; (2) training on the training set and reporting results on the test set, following the official code of [48]. EndoVis2018 and EndoVis2017 offer annotations for five and four instrument parts, respectively, and both datasets include seven instrument categories.

We adopt the standard evaluation metrics used in all existing works [15, 5, 3, 42, 48, 20, 46, 47, 32]: Challenge IoU [2], IoU [15], and mean class IoU (mc IoU). Challenge IoU is computed only for the classes present in an image, whereas IoU considers all classes. We also report the IoU for each instrument category.
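To make the difference between the two per-image metrics concrete, here is a hedged sketch on flattened label maps. It assumes one common reading: Challenge IoU averages over classes present in the ground truth, while IoU averages over every class that appears in either the prediction or the ground truth (classes absent from both are skipped because their IoU is undefined). The function names are hypothetical and the official evaluation scripts may differ in detail.

```python
def class_iou(pred, gt, cls):
    """IoU of one class on flattened label maps; None if the class is absent from both."""
    inter = sum(1 for p, g in zip(pred, gt) if p == cls and g == cls)
    union = sum(1 for p, g in zip(pred, gt) if p == cls or g == cls)
    return inter / union if union else None


def challenge_iou(pred, gt, classes):
    """Mean IoU over the classes present in the ground truth of this image."""
    scores = [class_iou(pred, gt, c) for c in classes if c in gt]
    return sum(scores) / len(scores) if scores else None


def image_iou(pred, gt, classes):
    """Mean IoU over all classes found in the prediction or the ground truth."""
    scores = [class_iou(pred, gt, c) for c in classes]
    scores = [s for s in scores if s is not None]
    return sum(scores) / len(scores) if scores else None


# Toy image: class 1 is segmented perfectly, class 2 is a false positive.
gt = [1, 1, 0, 0]
pred = [1, 1, 2, 0]
print(challenge_iou(pred, gt, [1, 2]), image_iou(pred, gt, [1, 2]))
```

Under this reading, false-positive categories lower IoU but leave Challenge IoU unchanged, which is why the two columns can differ for the same method.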

4.2 Implementation Details

Images from EndoVis2017 and EndoVis2018 are processed to a size of 1024 $\times$ 1280, as per [32]. Data augmentation strategies are adopted following [32, 3], which include random flipping, random scale and crop, random rotation, and colour jitter. For Transfer MLP, Sparse MLP, Dense CNN, and Global CNN, their feature dimensions are set to 512, 256, 256, and 256, respectively. The number of sparse tokens per part $n$ is set to 2. In terms of training, we initialise SAM Image Encoder and SAM Decoder with SAM’s pre-trained weights of the ViT-H version [13]. We adopt CLIP Text Encoder of version ViT-L/14@336px, following [22]. Our model keeps SAM Image Encoder, CLIP Text Encoder, and the output MLPs of SAM Decoder frozen, while updating the remaining weights using an Adam optimiser with a learning rate of 0.0001. To reduce computational load, we utilise pre-computed image embeddings, employing a batch size of 8. In practice, inspired by [42], we implement our model by inputting all categories into the model and differentiating the positive category (i.e., the prompted category) with negative categories via the positive and negative sparse embeddings of SAM. SP-SAM is trained and evaluated on an Nvidia Tesla V100 16GB GPU.
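For reference, the hyperparameters stated above can be gathered into a single configuration object. This is only a convenience sketch: the class and field names are hypothetical, and only the values are taken from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingConfig:
    """Settings reported in the implementation details (field names are illustrative)."""
    sam_backbone: str = "ViT-H"
    clip_text_encoder: str = "ViT-L/14@336px"
    transfer_mlp_dim: int = 512
    sparse_mlp_dim: int = 256
    dense_cnn_dim: int = 256
    global_cnn_dim: int = 256
    sparse_tokens_per_part: int = 2
    optimiser: str = "Adam"
    learning_rate: float = 1e-4
    batch_size: int = 8
    # Modules kept frozen during tuning, as listed in the text.
    frozen_modules: tuple = (
        "sam_image_encoder",
        "clip_text_encoder",
        "sam_decoder_output_mlps",
    )


config = TrainingConfig()
print(config)
```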

4.3 Main Results

We compare the performance of SP-SAM against existing methods on the EndoVis2018 and EndoVis2017 datasets, detailed in Table 1 and Table 2, respectively. A visual comparison of the predictions is shown in Fig. 5 (more visualisations are provided in the Supplementary Materials). The instrument categories include Bipolar Forceps (BF), Prograsp Forceps (PF), Large Needle Driver (LND), Suction Instrument (SI), Vessel Sealer (VS), Clip Applier (CA), Grasping Retractor (GR), Monopolar Curved Scissors (MCS), and Ultrasound Probe (UP). In our comparison, we divide existing methods into two categories: specialist models and SAM-based models. Notably, SP-SAM surpasses both existing fully-trained specialist models and efficient-tuning approaches based on SAM, yet at a substantially lower training cost in terms of tunable parameters.

| Method Category | Method | Challenge IoU | IoU | mc IoU | BF | PF | LND | SI | CA | MCS | UP | #T-Params |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Specialist Model | TernausNet [32] | 46.22 | 39.87 | 14.19 | 44.20 | 4.67 | 0.00 | 0.00 | 0.00 | 50.44 | 0.00 | 32.20M |
| | MF-TAPNet [20] | 67.87 | 39.14 | 24.68 | 69.23 | 6.10 | 11.68 | 14.00 | 0.91 | 70.24 | 0.57 | 37.73M |
| | Dual-MF [46] | 70.40 | - | 35.09 | 74.10 | 6.80 | 46.00 | 30.10 | 7.60 | 80.90 | 0.10 | 203.80M |
| | ISINet [15] | 73.03 | 70.94 | 40.21 | 73.83 | 48.61 | 30.98 | 37.68 | 0.00 | 88.16 | 2.16 | 162.52M |
| | TraSeTr [47] | 76.20 | - | 47.71 | 76.30 | 53.30 | 46.50 | 40.60 | 13.90 | 86.20 | 17.15 | - |
| | S3Net [5] | 75.81 | 74.02 | 42.58 | 77.22 | 50.87 | 19.83 | 50.59 | 0.00 | 92.12 | 7.44 | 68.41M |
| | MATIS Frame [3] | 82.37 | 77.01 | 48.65 | 83.35 | 38.82 | 40.19 | 64.49 | 4.32 | 93.18 | 16.17 | 68.72M |
| | TP-SIS [48] | 84.92 | 83.61 | 65.44 | 84.28 | 73.18 | 78.88 | 92.20 | 23.73 | 66.67 | 39.12 | 131.08M |
| SAM-based Model | MaskTrack-RCNN [39] + SAM | 78.49 | 78.49 | 56.07 | 79.83 | 74.86 | 43.12 | 62.88 | 16.74 | 91.62 | 23.45 | 57.67M |
| | Mask2Former [9] + SAM | 78.72 | 78.72 | 52.50 | 85.95 | 82.31 | 44.08 | 0.00 | 49.80 | 92.17 | 13.18 | 68.72M |
| | TrackAnything (1 Point) [38] | 40.36 | 38.38 | 20.62 | 30.20 | 12.87 | 24.46 | 9.17 | 0.19 | 55.03 | 12.41 | - |
| | TrackAnything (5 Points) [38] | 65.72 | 60.88 | 38.60 | 72.90 | 31.07 | 64.73 | 10.24 | 12.28 | 61.05 | 17.93 | - |
| | PerSAM (Zero-Shot) [45] | 49.21 | 49.21 | 34.55 | 51.26 | 34.40 | 46.75 | 16.45 | 15.07 | 52.28 | 25.62 | - |
| | PerSAM (Fine-Tune) [45] | 52.21 | 52.21 | 37.24 | 57.19 | 36.13 | 53.86 | 14.34 | 25.94 | 54.66 | 18.57 | 2 |
| | SurgicalSAM [42] | 80.33 | 80.33 | 58.87 | 83.66 | 65.63 | 58.75 | 54.48 | 39.78 | 88.56 | 21.23 | 4.65M |
| | SP-SAM (Ours) | 84.24 | 84.24 | 65.71 | 87.60 | 65.07 | 61.95 | 58.30 | 59.96 | 92.08 | 34.99 | 8.62M |
| | GT Centroid + SAM | 60.26 | 60.26 | 63.34 | 44.35 | 65.92 | 30.99 | 87.14 | 69.69 | 80.04 | 65.26 | - |
| | GT Bbox + SAM | 88.04 | 88.04 | 84.23 | 87.10 | 86.81 | 72.23 | 91.21 | 75.91 | 93.08 | 83.24 | - |

Table 1: Comparison of results on the EndoVis2018 dataset. #T-Params denotes the number of tunable parameters.

| Protocol | Method Category | Method | Challenge IoU | IoU | mc IoU | BF | PF | LND | VS | GR | MCS | UP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Cross-Fold Average Results | Specialist Model | TernausNet [32] | 35.27 | 12.67 | 10.17 | 13.45 | 12.39 | 20.51 | 5.97 | 1.08 | 1.00 | 16.76 |
| | | MF-TAPNet [20] | 37.25 | 13.49 | 10.77 | 16.39 | 14.11 | 19.01 | 8.11 | 0.31 | 4.09 | 13.40 |
| | | Dual-MF [46] | 45.80 | - | 26.40 | 34.40 | 21.50 | 64.30 | 24.10 | 0.80 | 17.90 | 21.80 |
| | | ISINet [15] | 55.62 | 52.20 | 28.96 | 38.70 | 38.50 | 50.09 | 27.43 | 2.10 | 28.72 | 12.56 |
| | | TraSeTr [47] | 60.40 | - | 32.56 | 45.20 | 56.70 | 55.80 | 38.90 | 11.40 | 31.30 | 18.20 |
| | | S3Net [5] | 72.54 | 71.99 | 46.55 | 75.08 | 54.32 | 61.84 | 35.50 | 27.47 | 43.23 | 28.38 |
| | | MATIS Frame [3] | 68.79 | 62.74 | 37.30 | 66.18 | 50.99 | 52.23 | 32.84 | 15.71 | 19.27 | 23.90 |
| | | TP-SIS [48] | 63.37 | 63.37 | 52.74 | 66.42 | 45.46 | 75.20 | 73.44 | 29.95 | 44.02 | 34.67 |
| | SAM-based Model | Mask2Former [9] + SAM | 66.21 | 66.21 | 55.26 | 66.84 | 55.36 | 83.29 | 73.52 | 26.24 | 36.26 | 45.34 |
| | | TrackAnything (1 Point) [38] | 54.90 | 52.46 | 55.35 | 47.59 | 28.71 | 43.27 | 82.75 | 63.10 | 66.46 | 55.54 |
| | | TrackAnything (5 Points) [38] | 67.41 | 64.50 | 62.97 | 55.42 | 44.46 | 62.43 | 83.68 | 62.59 | 67.03 | 65.17 |
| | | PerSAM (Zero-Shot) [45] | 42.47 | 42.47 | 41.80 | 53.99 | 25.89 | 50.17 | 52.87 | 24.24 | 47.33 | 38.16 |
| | | PerSAM (Fine-Tune) [45] | 41.90 | 41.90 | 39.78 | 46.21 | 28.22 | 53.12 | 57.98 | 12.76 | 41.19 | 38.99 |
| | | SurgicalSAM [42] | 69.94 | 69.94 | 67.03 | 68.30 | 51.77 | 75.52 | 68.24 | 57.63 | 86.95 | 60.80 |
| | | SP-SAM (Ours) | 73.94 | 73.94 | 71.06 | 68.89 | 53.16 | 83.80 | 73.20 | 72.40 | 84.91 | 61.05 |
| | | GT Centroid + SAM | 44.42 | 44.42 | 54.41 | 63.42 | 36.03 | 22.57 | 54.21 | 75.18 | 70.17 | 59.25 |
| | | GT Bbox + SAM | 76.31 | 76.31 | 81.18 | 89.36 | 73.44 | 67.67 | 90.04 | 87.79 | 94.03 | 65.91 |
| Test Set Results | Specialist Model | TP-SIS [48] | 79.90 | 77.83 | 56.22 | 68.58 | 73.52 | 92.74 | 83.90 | 0.13 | 74.70 | 0.00 |
| | SAM-based Model | SP-SAM (Ours) | 82.01 | 82.01 | 56.00 | 81.64 | 74.06 | 91.42 | 72.00 | 0.84 | 72.06 | 0.00 |

Table 2: Comparison of results on the EndoVis2017 dataset.

In terms of zero-shot performance of SAM, the methods using bounding boxes as prompts (MaskTrack-RCNN [39]/Mask2Former [9] + SAM) in general outperform those using point prompts (TrackAnything [38]) and image prompts (PerSAM [45]). However, their performances are still inferior to tuning-based methods. Additionally, they rely on a well-trained detector for bounding box prediction, resulting in a considerable increase in the number of training parameters and higher pipeline complexity. In contrast, tuning-based methods for SAM, including both SurgicalSAM [42] and SP-SAM, adopt an end-to-end efficient tuning pipeline with minimal training parameters, boosting both training efficiency and segmentation performance.

Our method demonstrates considerable enhancement over the existing efficient-tuning approach, SurgicalSAM [42], on both datasets. It shows an improvement of 3.91 and 4.00 in terms of Challenge IoU on EndoVis2018 and EndoVis2017, respectively. Unlike SurgicalSAM, which relies on category ID prompting, our approach is prompted by text, leveraging the extensive information in natural language expressions and pre-trained language models. Additionally, our method significantly outperforms SurgicalSAM in mean class IoU (mc IoU), by a gain of 6.84 for EndoVis2018 and 4.03 for EndoVis2017, indicating superior discrimination of instruments across different categories. This enhancement is largely attributed to our part-to-whole collaborative prompting mechanism that explicitly directs the model to identify the internal structures of instruments and concentrate on the part-level details, in contrast to SurgicalSAM which treats each instrument as a single entity.

In addition, we compare SP-SAM with two oracle scenarios that employ ground-truth centroids and bounding boxes as prompts for SAM. Remarkably, SP-SAM achieves significantly better results than those obtained with ground-truth centroids. Moreover, SP-SAM’s performance closely approaches that of the oracle setting with ground-truth bounding boxes, with only a gap of 3.80 and 2.37 in terms of Challenge IoU for EndoVis2018 and EndoVis2017, respectively, yet our method requires significantly less prompting effort without the need for manual per-frame bounding box guidance.

Finally, SP-SAM achieves superior or competitive performance compared to SOTA specialist models while using significantly fewer tunable parameters. On EndoVis2018, our method surpasses TP-SIS [48] in both IoU and mc IoU, even though TP-SIS uses about 15 times as many tunable parameters (131.08M for TP-SIS vs. 8.62M for SP-SAM). On EndoVis2017, for cross-fold averages, we compare with the results of TP-SIS reproduced using the official code; for test set results, we compare with the results reported in [48]. Our method improves on TP-SIS in both settings, with a larger gain in cross-fold averages, which is the more robust evaluation protocol. In contrast to TP-SIS, which uses instrument part masks straightforwardly as training supervisory signals, our method more effectively integrates instrument structure knowledge and explicitly addresses category-part and image-part relationships. This allows for a more accurate comprehension of instruments, encompassing both their structures and finer details. Similarly, the comparison between SP-SAM and its variant without Part-to-Whole Adaptive Fusion (model C in Table 3) in our ablation study also highlights the improvement our method offers over the straightforward use of instrument part masks.
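As a quick check of the parameter comparison above, using only the two tunable-parameter counts quoted in the text, the ratio and its inverse work out to roughly:

$$
\frac{131.08\,\text{M}}{8.62\,\text{M}} \approx 15.2, \qquad \frac{8.62\,\text{M}}{131.08\,\text{M}} \approx 6.6\%
$$

That is, SP-SAM tunes about 6.6% as many parameters as TP-SIS.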

Figure 5 showcases a visual comparison of predicted masks by different methods, highlighting that SP-SAM clearly outperforms existing SAM-based methods in segmentation quality. Notably, SP-SAM outperforms SurgicalSAM, which tends to misidentify instrument categories (Fig. 5(a)), predict incomplete masks with missed critical parts like instrument tips (Fig. 5(b)), and generate rugged edges (Fig. 5(c) and (d)). Owing to the part-to-whole collaborative prompting mechanism, SP-SAM precisely captures fine-grained details such as edges and challenging areas. This precision, especially in identifying edges and tips, is crucial for ensuring safety in surgical settings.

4.4 Ablation Study

Ablation Study of Key Components. We conduct an ablation study on both EndoVis2018 and EndoVis2017 to investigate the effect of the proposed Collaborative Prompts, Part-to-Whole Adaptive Fusion, and Hierarchical Decoding. The results are reported in Table 3. Model A, the baseline, uses text prompts of category names, which are encoded by the Cross-Modal Prompt Encoder into sparse and dense embeddings for the SAM Decoder to generate instrument masks. Model A is then progressively augmented with our proposed modules. In model B, Collaborative Prompts are encoded by the Cross-Modal Prompt Encoder into sparse and dense embeddings for individual parts. These part embeddings are then combined into whole sparse and dense embeddings in a straightforward manner: the sparse embeddings of all parts are concatenated and the dense embeddings of all parts are summed. Subsequently, these whole sparse and dense embeddings are decoded by the SAM Decoder into instrument masks. Building upon model B, model C additionally incorporates Hierarchical Decoding, which decodes both the whole and the part sparse and dense embeddings into their respective masks. Following model C, we investigate the impact of Category Part Attention and Image Part Attention in the Part-to-Whole Adaptive Fusion module. In model D, we set the image-specific part weights $\mathcal{W}$ to an all-one vector, while in model E, we fix $\mathcal{D}_{CP}$ as an all-one matrix. Finally, model F represents our proposed SP-SAM.
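The straightforward fusion used in model B (concatenating part sparse embeddings, summing part dense embeddings) can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the function names, tensor shapes, and the `weighted_dense_sum` helper are assumptions introduced here. The helper only illustrates why fixing part weights to all ones, as in the model D and E ablations, switches off the corresponding weighting; the exact form of the paper's adaptive fusion is not reproduced.

```python
import numpy as np

def naive_part_to_whole(sparse_parts, dense_parts):
    """Fuse per-part embeddings without adaptive weighting (as described for model B).

    sparse_parts: list of arrays, each (n_tokens, dim)      -- assumed shape
    dense_parts:  list of arrays, each (dim, height, width) -- assumed shape
    """
    # Sparse embeddings of all parts are concatenated along the token axis.
    whole_sparse = np.concatenate(sparse_parts, axis=0)
    # Dense embeddings of all parts are summed element-wise.
    whole_dense = np.sum(np.stack(dense_parts, axis=0), axis=0)
    return whole_sparse, whole_dense

def weighted_dense_sum(dense_parts, part_weights):
    """Hypothetical weighted sum over parts; an illustrative stand-in only."""
    stacked = np.stack(dense_parts, axis=0)  # (num_parts, dim, height, width)
    weights = np.asarray(part_weights, dtype=stacked.dtype).reshape(-1, 1, 1, 1)
    return (weights * stacked).sum(axis=0)

rng = np.random.default_rng(0)
num_parts, n_tokens, dim, height, width = 3, 2, 8, 4, 4
sparse_parts = [rng.normal(size=(n_tokens, dim)) for _ in range(num_parts)]
dense_parts = [rng.normal(size=(dim, height, width)) for _ in range(num_parts)]

whole_sparse, whole_dense = naive_part_to_whole(sparse_parts, dense_parts)
all_one_fused = weighted_dense_sum(dense_parts, np.ones(num_parts))
```

With all-one weights, `all_one_fused` equals `whole_dense`, so under this simplified view an all-one weight vector reduces a weighted fusion to the plain sum.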

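| Model | Collab. Prompts | Part-to-Whole Fusion: Category Att. | Part-to-Whole Fusion: Image Att. | Hier. Decod. | EndoVis2018 Challenge IoU | EndoVis2018 mc IoU | EndoVis2017 Challenge IoU | EndoVis2017 mc IoU |
|---|---|---|---|---|---|---|---|---|
| A | | | | | 81.08 | 61.14 | 67.69 | 63.59 |
| B | ✓ | | | | 81.90 | 62.23 | 68.97 | 65.30 |
| C | ✓ | | | ✓ | 82.36 | 61.26 | 71.64 | 66.21 |
| D | ✓ | ✓ | | ✓ | 82.64 | 63.24 | 72.55 | 67.87 |
| E | ✓ | | ✓ | ✓ | 82.98 | 65.37 | 72.48 | 68.21 |
| F | ✓ | ✓ | ✓ | ✓ | 84.24 | 65.71 | 73.94 | 71.06 |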

Table 3: Ablation study of the key components of SP-SAM.

Overall, the proposed modules generally improve the Challenge IoU and mc IoU scores, with the most significant improvements observed when all modules are integrated. In particular, on EndoVis2018, after incorporating Collaborative Prompts (model B), adding Hierarchical Decoding (model C) yields only a marginal improvement in Challenge IoU and a decrease in mc IoU. This indicates that employing instrument part labels as auxiliary signals in a straightforward multi-task learning manner does not yield substantial enhancements. The largest gains come from the proposed Part-to-Whole Adaptive Fusion in conjunction with Hierarchical Decoding (models D, E, and F). A visual comparison of SP-SAM with the baseline of category name prompting is presented in Fig. 6. Prompting without part information results in a substantial loss of details in the predicted masks. In contrast, SP-SAM excels in recognising all instrument parts, particularly intricate elements like tips (Fig. 6(a) and (b)) and areas with varying materials (Fig. 6(c) and (d)).
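To make the per-model changes easier to read off, the short script below computes score differences between any two rows of Table 3. The numbers are copied from the table; the dictionary layout and the `delta` helper are illustrative names introduced here.

```python
# Scores from Table 3, in the order:
# (EndoVis2018 Challenge IoU, EndoVis2018 mc IoU,
#  EndoVis2017 Challenge IoU, EndoVis2017 mc IoU)
table3 = {
    "A": (81.08, 61.14, 67.69, 63.59),
    "B": (81.90, 62.23, 68.97, 65.30),
    "C": (82.36, 61.26, 71.64, 66.21),
    "D": (82.64, 63.24, 72.55, 67.87),
    "E": (82.98, 65.37, 72.48, 68.21),
    "F": (84.24, 65.71, 73.94, 71.06),
}

def delta(model, ref="A"):
    """Return per-metric differences (model minus ref), rounded to 2 decimals."""
    return tuple(round(m - r, 2) for m, r in zip(table3[model], table3[ref]))

full_gain = delta("F", "A")          # full SP-SAM over the category-name baseline
hier_decoding_only = delta("C", "B") # effect of adding Hierarchical Decoding to model B

for name in "BCDEF":
    print(name, delta(name))
```

Here `full_gain` shows the overall improvement of model F over model A on all four metrics, while the second entry of `hier_decoding_only` is negative, which corresponds to the mc IoU decrease on EndoVis2018 noted above.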

Ablation Study of Sparse and Dense Embeddings. We conduct an ablation study on sparse and dense embeddings on both EndoVis2018 and EndoVis2017 datasets, as shown in Table 4. The study evaluates the model’s performance when using only sparse or only dense embeddings. In the sparse-only model, dense embeddings are replaced with the no-mask embeddings from SAM’s pre-trained weights, and in the dense-only model, sparse embeddings are set to be empty tensors. Table 4 reveals that removing either dense or sparse embeddings decreases performance, with the removal of dense embeddings causing a more significant decline. This suggests that dense embeddings are more crucial than sparse embeddings. This aligns with our expectations, since in SAM dense embeddings function as masks, holding more information than the point-based sparse embeddings, thus guiding more accurate decoding. Our method optimally leverages both sparse and dense embeddings and achieves the best results.

| Method | EndoVis2018 Challenge IoU | EndoVis2018 mc IoU | EndoVis2017 Challenge IoU | EndoVis2017 mc IoU |
|---|---|---|---|---|
| Sparse-Only | 76.71 | 57.17 | 67.15 | 64.18 |
| Dense-Only | 83.09 | 63.81 | 72.43 | 66.61 |
| SP-SAM (Ours) | 84.24 | 65.71 | 73.94 | 71.06 |

Table 4: Ablation study of sparse and dense embeddings.
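The two ablated settings described above can be sketched as a small selection function. This is a hedged NumPy illustration rather than the authors' code: `select_embeddings`, its `mode` argument, and the tensor shapes are assumptions, and `no_mask_embed` stands in for the learned no-mask vector that SAM's prompt encoder broadcasts over the spatial grid when no mask prompt is given.

```python
import numpy as np

def select_embeddings(sparse, dense, no_mask_embed, mode="both"):
    """Choose decoder inputs for the sparse/dense ablation.

    sparse:        (n_tokens, dim)      -- assumed shape
    dense:         (dim, height, width) -- assumed shape
    no_mask_embed: (dim,) vector standing in for SAM's pre-trained no-mask embedding
    """
    if mode == "sparse_only":
        # Replace dense embeddings with the no-mask embedding at every location.
        dim, height, width = dense.shape
        dense = np.broadcast_to(
            no_mask_embed.reshape(dim, 1, 1), (dim, height, width)
        ).copy()
    elif mode == "dense_only":
        # Replace sparse embeddings with an empty tensor (zero tokens).
        sparse = np.empty((0, sparse.shape[1]), dtype=sparse.dtype)
    elif mode != "both":
        raise ValueError(f"unknown mode: {mode}")
    return sparse, dense

rng = np.random.default_rng(0)
sparse = rng.normal(size=(4, 8))
dense = rng.normal(size=(8, 16, 16))
no_mask_embed = rng.normal(size=(8,))

sparse_s, dense_s = select_embeddings(sparse, dense, no_mask_embed, "sparse_only")
sparse_d, dense_d = select_embeddings(sparse, dense, no_mask_embed, "dense_only")
```

In the sparse-only case every spatial position of `dense_s` carries the same vector, so the decoder receives no location-specific dense prompt, which is consistent with the larger drop reported for that setting in Table 4.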

Impact of Number of Tokens. Details are in Supplementary Materials.

5 Conclusion

In this paper, we present SP-SAM, an efficient-tuning approach for adapting SAM to text promptable surgical instrument segmentation. It leverages a part-to-whole collaborative prompting mechanism to address the challenge of complex structures and fine-grained details in surgical instruments. Specifically, Collaborative Prompts are devised to describe surgical instruments at both category and part levels. Moreover, the proposed Cross-Modal Prompt Encoder, Part-to-Whole Adaptive Fusion, and Hierarchical Decoding modules learn discriminative representations of instrument parts and adaptively assemble them for accurate instrument segmentation. Experiments on both the EndoVis2018 and EndoVis2017 datasets demonstrate that SP-SAM outperforms both specialist methods and SAM-based methods while tuning only a small number of parameters. Our method demonstrates the great potential of efficiently adapting foundation models for highly specialised tasks and offers valuable insights into the segmentation of challenging targets. In the future, our method can be further improved by exploring other forms of text prompts, incorporating temporal cues, and tackling background targets such as human tissues.

References

[1] Allan, M., Kondo, S., Bodenstedt, S., Leger, S., Kadkhodamohammadi, R., Luengo, I., Fuentes, F., Flouty, E., Mohammed, A., Pedersen, M., et al.: 2018 robotic scene segmentation challenge. arXiv preprint arXiv:2001.11190 (2020)
[2] Allan, M., Shvets, A., Kurmann, T., Zhang, Z., Duggal, R., Su, Y.H., Rieke, N., Laina, I., Kalavakonda, N., Bodenstedt, S., et al.: 2017 robotic instrument segmentation challenge. arXiv preprint arXiv:1902.06426 (2019)
[3] Ayobi, N., Pérez-Rondón, A., Rodríguez, S., Arbeláez, P.: MATIS: Masked-attention transformers for surgical instrument segmentation. In: ISBI. pp. 1–5 (2023)
[4] Azofeifa, J.D., Noguez, J., Ruiz, S., Molina-Espinosa, J.M., Magana, A.J., Benes, B.: Systematic review of multimodal human–computer interaction. In: Informatics. vol. 9, p. 13. MDPI (2022)
[5] Baby, B., Thapar, D., Chasmai, M., Banerjee, T., Dargan, K., Suri, A., Banerjee, S., Arora, C.: From forks to forceps: A new framework for instance segmentation of surgical instruments. In: WACV. pp. 6180–6190. IEEE (2023)
[6] Birlo, M., Edwards, P.E., Clarkson, M., Stoyanov, D.: Utility of optical see-through head mounted displays in augmented reality-assisted surgery: A systematic review. Medical Image Analysis 77, 102361 (2022)
[7] Cao, C., Cerfolio, R.J.: Virtual or augmented reality to enhance surgical education and surgical planning. Thoracic Surgery Clinics 29(3), 329–337 (2019)
[8] Chen, T., Zhu, L., Deng, C., Cao, R., Wang, Y., Zhang, S., Li, Z., Sun, L., Zang, Y., Mao, P.: SAM-Adapter: Adapting segment anything in underperformed scenes. In: ICCV Workshop. pp. 3367–3375 (2023)
[9] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR. pp. 1290–1299 (2022)
[10] Cheng, D., Qin, Z., Jiang, Z., Zhang, S., Lao, Q., Li, K.: SAM on medical images: A comprehensive study on three prompt modes. arXiv preprint arXiv:2305.00035 (2023)
[11] Deng, R., Cui, C., Liu, Q., Yao, T., Remedios, L.W., Bao, S., Landman, B.A., Tang, Y., Wheless, L.E., Coburn, L.A., et al.: Segment anything model (SAM) for digital pathology: Assess zero-shot segmentation on whole slide imaging. In: Medical Imaging with Deep Learning, Short Paper Track (2023)
[12] Ding, H., Liu, C., Wang, S., Jiang, X.: Vision-language transformer and query generation for referring segmentation. In: ICCV. pp. 16321–16330 (2021)
[13] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. In: ICLR (2020)
[14] Feng, G., Hu, Z., Zhang, L., Lu, H.: Encoder fusion network with co-attention embedding for referring image segmentation. In: CVPR. pp. 15506–15515 (2021)
[15] González, C., Bravo-Sánchez, L., Arbelaez, P.: ISINet: An instance-based approach for surgical instrument segmentation. In: MICCAI. pp. 595–605. Springer (2020)
[16] He, K., Gkioxari, G., Dollár, P., Girshick, R.: Mask R-CNN. In: ICCV. pp. 2961–2969 (2017)
[17] He, S., Bao, R., Li, J., Stout, J., Bjornerud, A., Grant, P.E., Ou, Y.: Computer-vision benchmark segment-anything model (SAM) in medical images: Accuracy in 12 datasets. arXiv preprint arXiv:2304.09324 (2023)
[18] Hu, R., Rohrbach, M., Darrell, T.: Segmentation from natural language expressions. In: ECCV. pp. 108–124. Springer (2016)
[19] Huang, Y., Yang, X., Liu, L., Zhou, H., Chang, A., Zhou, X., Chen, R., Yu, J., Chen, J., Chen, C., et al.: Segment anything model for medical images? Medical Image Analysis 92, 103061 (2024)
[20] Jin, Y., Cheng, K., Dou, Q., Heng, P.A.: Incorporating temporal prior from motion flow for instrument segmentation in minimally invasive surgery video. In: MICCAI. pp. 440–448. Springer (2019)
[21] Kim, N., Kim, D., Lan, C., Zeng, W., Kwak, S.: ReSTR: Convolution-free referring image segmentation using transformers. In: CVPR. pp. 18145–18154 (2022)
[22] Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.Y., Dollar, P., Girshick, R.: Segment anything. In: ICCV. pp. 4015–4026 (2023)
[23] Li, R., Li, K., Kuo, Y.C., Shu, M., Qi, X., Shen, X., Jia, J.: Referring image segmentation via recurrent refinement networks. In: CVPR. pp. 5745–5753 (2018)
[24] Liu, D., Li, Q., Jiang, T., Wang, Y., Miao, R., Shan, F., Li, Z.: Towards unified surgical skill assessment. In: CVPR. pp. 9522–9531 (2021)
[25] Lüddecke, T., Ecker, A.: Image segmentation using text and image prompts. In: CVPR. pp. 7086–7096 (2022)
[26] Ma, J., He, Y., Li, F., Han, L., You, C., Wang, B.: Segment anything in medical images. Nature Communications 15(1), 654 (2024)
[27] Mazurowski, M.A., Dong, H., Gu, H., Yang, J., Konz, N., Zhang, Y.: Segment anything model for medical image analysis: An experimental study. Medical Image Analysis p. 102918 (2023)
[28] Milletari, F., Navab, N., Ahmadi, S.A.: V-Net: Fully convolutional neural networks for volumetric medical image segmentation. In: 3DV. pp. 565–571. IEEE (2016)
[29] Ni, Z.L., Bian, G.B., Wang, G.A., Zhou, X.H., Hou, Z.G., Chen, H.B., Xie, X.L.: Pyramid attention aggregation network for semantic segmentation of surgical instruments. In: AAAI. vol. 34, pp. 11782–11790 (2020)
[30] Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: ICML. pp. 8748–8763. PMLR (2021)
[31] Shi, H., Li, H., Meng, F., Wu, Q.: Key-word-aware network for referring expression image segmentation. In: ECCV. pp. 38–54 (2018)
[32] Shvets, A.A., Rakhlin, A., Kalinin, A.A., Iglovikov, V.I.: Automatic instrument segmentation in robot-assisted surgery using deep learning. In: ICMLA. pp. 624–628. IEEE (2018)
[33] Wald, T., Roy, S., Koehler, G., Disch, N., Rokuss, M.R., Holzschuh, J., Zimmerer, D., Maier-Hein, K.: SAM.MD: Zero-shot medical image segmentation capabilities of the segment anything model. In: Medical Imaging with Deep Learning, Short Paper Track (2023)
[34] Wang, A., Islam, M., Xu, M., Zhang, Y., Ren, H.: SAM meets robotic surgery: An empirical study on generalization, robustness and adaptation. In: MICCAI Workshop. pp. 234–244 (2023)
[35] Wang, Z., Lu, Y., Li, Q., Tao, X., Guo, Y., Gong, M., Liu, T.: CRIS: CLIP-driven referring image segmentation. In: CVPR. pp. 11686–11695 (2022)
[36] Wu, J., Li, X., Li, X., Ding, H., Tong, Y., Tao, D.: Towards robust referring image segmentation. arXiv preprint arXiv:2209.09554 (2022)
[37] Wu, J., Fu, R., Fang, H., Liu, Y., Wang, Z., Xu, Y., Jin, Y., Arbel, T.: Medical SAM adapter: Adapting segment anything model for medical image segmentation. arXiv preprint arXiv:2304.12620 (2023)
[38] Yang, J., Gao, M., Li, Z., Gao, S., Wang, F., Zheng, F.: Track anything: Segment anything meets videos. arXiv preprint arXiv:2304.11968 (2023)
[39] Yang, L., Fan, Y., Xu, N.: Video instance segmentation. In: ICCV. pp. 5188–5197 (2019)
[40] Yang, Z., Wang, J., Tang, Y., Chen, K., Zhao, H., Torr, P.H.: LAVT: Language-aware vision transformer for referring image segmentation. In: CVPR. pp. 18155–18165 (2022)
[41] Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: CVPR. pp. 10502–10511 (2019)
[42] Yue, W., Zhang, J., Hu, K., Xia, Y., Luo, J., Wang, Z.: SurgicalSAM: Efficient class promptable surgical instrument segmentation. In: AAAI (2024)
[43] Zang, D., Bian, G.B., Wang, Y., Li, Z.: An extremely fast and precise convolutional neural network for recognition and localization of cataract surgical tools. In: MICCAI. pp. 56–64. Springer (2019)
[44] Zhang, K., Liu, D.: Customized segment anything model for medical image segmentation. arXiv preprint arXiv:2304.13785 (2023)
[45] Zhang, R., Jiang, Z., Guo, Z., Yan, S., Pan, J., Dong, H., Qiao, Y., Gao, P., Li, H.: Personalize segment anything model with one shot. In: ICLR (2024)
[46] Zhao, Z., Jin, Y., Gao, X., Dou, Q., Heng, P.A.: Learning motion flows for semi-supervised instrument segmentation from robotic surgical video. In: MICCAI. pp. 679–689. Springer (2020)
[47] Zhao, Z., Jin, Y., Heng, P.A.: TraSeTr: Track-to-segment transformer with contrastive query for instance-level instrument segmentation in robotic surgery. In: ICRA. pp. 11186–11193. IEEE (2022)
[48] Zhou, Z., Alabi, O., Wei, M., Vercauteren, T., Shi, M.: Text promptable surgical instrument segmentation with vision-language models. In: NeurIPS (2023)