
License: arXiv.org perpetual non-exclusive license
arXiv:2312.14481v2 [cs.CV] 23 Mar 2024
1 School of Computer Science, The University of Sydney
2 School of Software Engineering, South China University of Technology
3 Department of Data Science & AI, Monash University
4 School of Computer Science, Northwestern Polytechnical University
5 Department of Computer Science, University of Rochester
Email: {wenxi.yue, jing.zhang1, kun.hu, zhiyong.wang}@sydney.edu.au, [email protected], [email protected], [email protected], [email protected]

SurgicalPart-SAM: Part-to-Whole Collaborative Prompting for Surgical Instrument Segmentation

Wenxi Yue 1, Jing Zhang 1 (Corresponding Author), Kun Hu 1, Qiuxia Wu 2, Zongyuan Ge 3, Yong Xia 4, Jiebo Luo 5, Zhiyong Wang 1
Abstract

The Segment Anything Model (SAM) exhibits promise in generic object segmentation and offers potential for various applications. Existing methods have applied SAM to surgical instrument segmentation (SIS) by tuning SAM-based frameworks with surgical data. However, they fall short in two crucial aspects: (1) Straightforward model tuning with instrument masks treats each instrument as a single entity, neglecting their complex structures and fine-grained details; and (2) Instrument category-based prompts are not flexible and informative enough to describe instrument structures. To address these problems, in this paper, we investigate text promptable SIS and propose SurgicalPart-SAM (SP-SAM), a novel SAM efficient-tuning approach that explicitly integrates instrument structure knowledge with SAM’s generic knowledge, guided by expert knowledge on instrument part compositions. Specifically, we achieve this by proposing (1) Collaborative Prompts that describe instrument structures via collaborating category-level and part-level texts; (2) Cross-Modal Prompt Encoder that encodes text prompts jointly with visual embeddings into discriminative part-level representations; and (3) Part-to-Whole Adaptive Fusion and Hierarchical Decoding that adaptively fuse the part-level representations into a whole for accurate instrument segmentation in surgical scenarios. Built upon them, SP-SAM acquires a better capability to comprehend surgical instruments in terms of both overall structure and part-level details. Extensive experiments on both the EndoVis2018 and EndoVis2017 datasets demonstrate SP-SAM’s state-of-the-art performance with minimal tunable parameters. The code will be available at https://github.com/wenxi-yue/SurgicalPart-SAM.

Keywords: Segment Anything Model · Surgical Instrument Segmentation · Efficient-Tuning

1 Introduction

Figure 1: SP-SAM with Collaborative Prompts incorporates the knowledge of surgical instrument structures. Subfigure (e) is partially excerpted from [2].

Surgical instrument segmentation (SIS) aims to accurately identify and delineate surgical instruments in operative scenes. It plays a foundational role for many downstream applications, such as surgical planning [7], robotic navigation [43], and skill assessment [24]. We identify two primary problems with the existing methods for this task (Fig. 1(a)). First, they often develop specialist models [32, 20, 46, 15, 47, 5, 3, 48] that require training a large number of parameters, leading to high development costs. Second, current methods lack the capability of human-computer interaction that is highly desired in surgical practice [6, 4].

The Segment Anything Model (SAM) [22] is a pioneering foundation model for promptable segmentation. It holds great potential for addressing the above problems owing to its rich pre-trained knowledge and interactivity. However, employing SAM for surgical instrument segmentation in a zero-shot manner (Fig. 1(b)) poses significant challenges. Firstly, zero-shot frameworks of SAM, including detection-based (MaskTrack-RCNN [39]/Mask2Former [9] + SAM), tracking-based (TrackAnything [38]), and reference-based (PerSAM [45]) frameworks, have demonstrated inferior generalisation on surgical instruments [42]. This deficiency is mainly due to the insufficient surgical data in SAM pre-training and the notable domain disparity between natural objects and surgical instruments. Specifically, compared to generic objects, surgical instruments present more intricate structures and fine-grained details, exacerbating the challenge of generalising SAM to this specialised domain. Secondly, SAM’s reliance on point-or-box prompts is impractical in surgical settings, where it is infeasible for surgeons to provide such prompts for every instrument in each frame.

Initial attempts have been made to address these problems. Yue et al. [42] propose SurgicalSAM (Fig. 1(c)), an instrument category-prompted SAM framework efficiently tuned with surgical data. Additionally, Wang et al. [34] propose an efficient-tuning approach for SAM for SIS employing fixed default prompt embeddings. However, these methods suffer from two crucial problems. First, their straightforward tuning approach using whole instrument masks treats each instrument as a single entity and cannot explicitly handle the complex structures and details of instruments. Despite well-established expert knowledge on instrument structure compositions, they fail to incorporate these insights during tuning. Second, they depend on instrument category prompts or fixed default prompts, which lack flexibility and intuitiveness for surgeon-computer interaction and fail to provide informative descriptions of instrument structures. Instead, more flexible and informative prompts such as text are preferred.

In this paper, we explore text promptable surgical instrument segmentation and propose a novel framework, SurgicalPart-SAM (SP-SAM) (Fig. 1(d)), to address the above problems. Specifically, we recognise the well-established expert knowledge regarding the compositions of surgical instrument parts, e.g., Large Needle Driver is composed of shaft, wrist, and tip, Monopolar Curved Scissors is composed of shaft and tip, etc. In SP-SAM, we aim to harness this expert knowledge to guide the tuning of SAM to improve its capability to comprehend instrument structures and identify subtle details.

To integrate part-level information, we first introduce a new form of text prompt, namely Collaborative Prompts, which utilises a text description set: {[part name] of [instrument category name]} for all parts of an instrument category, collaborating category-level and part-level text descriptions. Contrasted with prompting solely with instrument category names, Collaborative Prompts effectively enables the integration of more precise and fine-grained instrument part information (Fig. 1(e)). Next, to correlate the Collaborative Prompts with the instrument parts in the image, we introduce a Cross-Modal Prompt Encoder to learn part-level representations via interaction between the Collaborative Prompts and the image embedding. This enables focused learning of fine-grained features for each instrument part, thereby enhancing the segmentation of subtle details. Finally, we propose Part-to-Whole Adaptive Fusion and Hierarchical Decoding to fuse representations of all instrument parts into a whole and decode them into segmentation masks, capturing both the global structure and the compositional components.

Note that part-to-whole fusion is non-trivial due to two inherent challenges in surgical scenarios: (1) the varying part compositions across instrument categories, and (2) the frequent occlusions of instruments. These challenges necessitate adaptive fusion of different parts for each instrument in the surgical scene. Therefore, within the Part-to-Whole Adaptive Fusion module, we propose Category Part Attention and Image Part Attention. The former adapts category-specific part weightings to accommodate diverse part compositions across categories, while the latter learns adaptive image-specific part weightings to handle occluded or out-of-view parts in the image. By integrating all components, SP-SAM exhibits a strong capability to adaptively comprehend surgical instrument structures, identify subtle details, and discriminate between fine-grained categories. In summary, our contributions are:

  • We introduce a novel SAM efficient-tuning approach, SurgicalPart-SAM (SP-SAM), for text promptable surgical instrument segmentation. SP-SAM utilises well-established expert knowledge of surgical instrument part compositions to guide SAM tuning, explicitly addressing the structural complexity and subtle details of surgical instruments, thereby enhancing generalisability.

  • We introduce Collaborative Prompts, Cross-Modal Prompt Encoder, and Part-to-Whole Adaptive Fusion and Hierarchical Decoding, to achieve multi-modal embedding learning at both the part level and the category level. These designs enhance comprehension of instrument structures and details during SAM tuning.

  • We propose Category Part Attention and Image Part Attention to integrate category-specific and image-specific weights for adaptively fusing instrument part representations. These mechanisms respectively address two critical challenges in surgical scenarios: the varying part compositions across instrument categories and the frequent occlusions of instruments.

  • We conduct extensive experiments on the challenging EndoVis2018 and EndoVis2017 datasets and show that SP-SAM achieves state-of-the-art performance with only a small number of training parameters.

2 Related Work

2.1 Surgical Instrument Segmentation

Most surgical instrument segmentation methods focus on developing specialist models. Early research adopts a semantic segmentation pipeline with the pioneering work TernausNet introducing a U-Net based encoder-decoder model [32]. Subsequent developments include feature pyramid attention [29] and flow-based temporal priors [20, 46]. An alternative strategy to semantic segmentation is instance segmentation. ISINet adopts Mask-RCNN [15, 16] for this task, which is later enhanced by Baby et al. [5] with a specialised classification module. In addition, TraSeTR utilises a track-to-segment transformer with tracking cues [47] and MATIS employs Mask2Former with a temporal consistency module [3, 9]. Recently, Zhou et al. [48] introduce TP-SIS, a text promptable framework exploiting the pre-trained vision-language model CLIP [30]. Despite the variety of specialist models, they all involve fully training a complete set of model parameters, resulting in high development costs.

To enhance model generalisability and reduce training costs, there is a growing interest in adapting pre-trained foundation models for SIS. SurgicalSAM is proposed as a category-prompted SAM framework efficiently tuned with surgical data [42], while Wang et al. [34] propose an efficient-tuning method for SAM using fixed default embeddings as prompts. However, these approaches rely on less informative prompts and overlook the intricate structures and subtle details of surgical instruments during SAM tuning. In contrast, our SP-SAM employs more informative Collaborative Prompts in text form to explicitly leverage expert knowledge of instrument part compositions to guide the tuning of SAM, enhancing SAM’s comprehension of surgical instruments compared to [42, 34].

2.2 Text Promptable Segmentation

In contrast to traditional segmentation that solely relies on pre-defined class labels, text promptable segmentation uses natural language as prompts that can offer richer contextual information, improved generalisation, and greater flexibility. Early works are primarily based on Convolutional Neural Networks and Recurrent Neural Networks and propose attention mechanisms for extracting and relating visual and textual features [18, 23, 31, 41]. More recent approaches utilise Transformers to perform feature extraction and multi-modal feature fusion [14, 36, 21, 40, 12]. Recently, to leverage the rich knowledge from large-scale pre-training, large vision-language models such as CLIP [30] are utilised for this task [35, 25]. Zhou et al. [48] employ CLIP [30] and introduce TP-SIS, the first text promptable framework for SIS. However, in TP-SIS [48], instrument part masks are used straightforwardly as supervisory signals, neglecting the structural dependencies associated with the parts. Moreover, TP-SIS requires fine-tuning the entire CLIP Image Encoder, resulting in high training costs. In contrast, our SP-SAM explicitly explores category-specific and image-specific part dependencies by incorporating expert knowledge on instrument structures and requires only a very small number of training parameters.

2.3 Segment Anything Model

SAM is recognised as the pioneering foundation model for image segmentation. Owing to extensive pre-training on large-scale data, SAM exhibits impressive generalisation capabilities on various downstream tasks [22]. However, its zero-shot performance in medical contexts tends to fall short due to the significant disparity between natural and medical subjects [11, 17, 27, 19, 10, 42]. Moreover, SAM’s reliance on precise per-frame point-or-box prompts for segmentation [10, 33] requires extensive manual input, infeasible in many medical scenarios, e.g., during surgery. To mitigate the gap between natural and medical domains, some studies have fine-tuned SAM with domain-specific data. However, these methods either have limited interactivity [44, 8, 34], require labour-intensive per-frame points or bounding boxes for prompting [26, 37], or rely on inflexible category IDs [42]. In contrast to these approaches, in SP-SAM we propose Collaborative Prompts that integrate category-level and part-level texts. This method offers a more intuitive and flexible approach for surgeon-computer interaction, enables informative descriptions of instrument structures, and introduces additional cues to SAM from the language modality.

3 Method

Figure 2: Overview of SP-SAM. SP-SAM consists of four main components: a SAM Image Encoder, a Cross-Modal Prompt Encoder, a Part-to-Whole Adaptive Fusion module, and a SAM Decoder. The SAM Image Encoder, CLIP Text Encoder (within the Cross-Modal Prompt Encoder), and output MLPs in SAM Decoder are frozen and the remaining weights are tuned.

In this paper, we address the task of text promptable surgical instrument segmentation. Instrument category names are suboptimal as text prompts due to their coarse nature and lack of structural cues. Therefore, we introduce Collaborative Prompts that combine both category and part information of surgical instruments. To maximise the potential of these Collaborative Prompts and integrate instrument structure information with SAM’s generic knowledge, we propose a part-to-whole collaborative prompting pipeline based on SAM, namely SP-SAM. Given a surgical image $I \in \mathbb{R}^{H \times W \times 3}$ of size $H \times W$ and Collaborative Prompts $T^{(c)}$ for an instrument category $c$, SP-SAM predicts the binary mask $M^{(c)} \in \{0,1\}^{H \times W}$ for the instrument.

With the Collaborative Prompts, instrument structure information can be easily integrated by establishing a category-part relation matrix $\mathcal{D}_{CP} \in \{0,1\}^{C \times P}$, where $C$ and $P$ denote the numbers of surgical instrument categories and instrument parts, respectively, and each element $d_{cp}$ in $\mathcal{D}_{CP}$ indicates the presence of part $p$ in category $c$. For instance, Monopolar Curved Scissors (Fig. 1(e) middle instrument), composed of shaft and tip parts, would have 1s for these parts and 0s for absent parts like the wrist. SP-SAM leverages the expert knowledge on instrument structure $\mathcal{D}_{CP}$ for accurate surgical segmentation.
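
As a small illustration of the category-part relation matrix, the sketch below builds it for the two instrument categories whose compositions are stated in this paper. The part vocabulary (shaft, wrist, tip), the restriction to two categories, and the helper name `build_relation_matrix` are assumptions made for the example, not the released implementation.

```python
import numpy as np

# Part vocabulary and categories taken from the examples given in the text;
# a real dataset would list every instrument category.
PARTS = ["shaft", "wrist", "tip"]
COMPOSITION = {
    "Large Needle Driver": ["shaft", "wrist", "tip"],
    "Monopolar Curved Scissors": ["shaft", "tip"],
}

def build_relation_matrix(composition, parts):
    """Return a C x P binary matrix with d_cp = 1 iff part p belongs to category c."""
    matrix = np.zeros((len(composition), len(parts)), dtype=np.int64)
    for c, category_parts in enumerate(composition.values()):
        for name in category_parts:
            matrix[c, parts.index(name)] = 1
    return matrix

D_CP = build_relation_matrix(COMPOSITION, PARTS)
print(D_CP)
```

The second row has a 0 in the wrist column, matching the Monopolar Curved Scissors example above.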

As shown in Fig. 2, SP-SAM consists of four key components: (1) a frozen SAM Image Encoder that extracts image embeddings from the given image, (2) a Cross-Modal Prompt Encoder (Sec. 3.1) that extracts part embeddings from Collaborative Prompts and generates part sparse and dense embeddings through cross-modal interaction, (3) a Part-to-Whole Adaptive Fusion module (Sec. 3.2) that combines part sparse and dense embeddings into whole sparse and dense embeddings through Category Part Attention and Image Part Attention, considering category-specific and image-specific part contributions, respectively, and (4) a SAM Decoder for Hierarchical Decoding (Sec. 3.3) that decodes these embeddings into masks, thereby enhancing the model’s comprehension of instruments both as a whole and at the part level.

3.1 Cross-Modal Prompt Encoder

Figure 3: Cross-Modal Prompt Encoder consists of feature extraction of Collaborative Prompts and part-level cross-modal encoding.

The Cross-Modal Prompt Encoder takes the Collaborative Prompts and image embedding as input and performs cross-modal interaction between them via spatial attention, generating part sparse embeddings and part dense embeddings. As shown in Fig. 3, this process can be divided into two steps: feature extraction of Collaborative Prompts and part-level cross-modal encoding.

Feature Extraction of Collaborative Prompts. We introduce a new type of text prompt for surgical instruments that collaboratively integrates both category and part information, namely Collaborative Prompts. Specifically, the Collaborative Prompts for an instrument of category $c$ are formulated into a set of texts containing all $P$ parts: $T^{(c)} = \{[\textit{part}_p] \text{ of } [\textit{instrument category}_c]\}_{p=1}^{P}$, where $\textit{instrument category}_c$ and $\textit{part}_p$ represent the names in text for instrument category $c$ and part $p$, respectively. Next, $T^{(c)}$ is encoded by the CLIP Text Encoder [30] into text-based CLIP part embeddings $\mathcal{T}^{part}_{clip} \in \mathbb{R}^{P \times d_{clip}}$. A challenge here is the inherent distribution mismatch between the embedding spaces of SAM and CLIP. To transfer the CLIP text embeddings into SAM’s embedding space, a tunable Transfer MLP is devised and applied to $\mathcal{T}^{part}_{clip}$, leading to the transferred embeddings for the parts, namely part embeddings $\mathcal{T}^{part} \in \mathbb{R}^{P \times d}$, where $d$ matches the number of embedding channels of SAM’s image features.
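
The two operations in this step can be sketched as follows. The prompt template follows the text; everything else is an assumption for illustration: the CLIP Text Encoder output is replaced by random vectors, the Transfer MLP is taken to be a two-layer ReLU network, and the sizes `d_clip`, `d_hidden`, and `d` are placeholder values rather than settings reported in this section.

```python
import numpy as np

def collaborative_prompts(category, parts):
    """One text per part, following the template '[part] of [instrument category]'."""
    return [f"{part} of {category}" for part in parts]

def transfer_mlp(t_clip, w1, b1, w2, b2):
    """Hypothetical two-layer MLP (ReLU in between) mapping d_clip -> d."""
    hidden = np.maximum(t_clip @ w1 + b1, 0.0)
    return hidden @ w2 + b2

rng = np.random.default_rng(0)
parts = ["shaft", "wrist", "tip"]
prompts = collaborative_prompts("Large Needle Driver", parts)

d_clip, d_hidden, d = 512, 512, 256   # assumed sizes, not taken from the paper
# Random stand-in for the frozen CLIP Text Encoder output (P x d_clip).
t_clip = rng.standard_normal((len(prompts), d_clip))
w1 = rng.standard_normal((d_clip, d_hidden)) * 0.02
b1 = np.zeros(d_hidden)
w2 = rng.standard_normal((d_hidden, d)) * 0.02
b2 = np.zeros(d)

t_part = transfer_mlp(t_clip, w1, b1, w2, b2)   # part embeddings, P x d
print(prompts)
print(t_part.shape)
```

Only the MLP weights would be tuned in this picture; the text encoder itself stays frozen, as stated in the Figure 2 caption.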

Part-Level Cross-Modal Encoding. In this step, the part embeddings $\mathcal{T}^{part}$ interact with the image embedding via spatial attention, and the obtained part-activated features are used to generate part sparse and dense embeddings. Specifically, the SAM Image Encoder extracts the image embedding $\mathcal{F}_I \in \mathbb{R}^{h \times w \times d}$, where $h \times w$ is the feature size. We then design a spatial attention mechanism by computing a similarity map for each part, leading to $\mathcal{S} = \mathcal{T}^{part} \times \mathcal{F}_I^{\top} \in \mathbb{R}^{P \times h \times w}$, where $\top$ denotes a transpose operator. These similarity maps serve as part-aware spatial attention to activate the image embedding, augmenting $\mathcal{F}_I$ into $\mathcal{F}'_I = \mathcal{S} \circ \mathcal{F}_I + \mathcal{F}_I \in \mathbb{R}^{P \times h \times w \times d}$, where $\mathcal{F}_I$ and $\mathcal{S}$ are broadcast to the same size and $\circ$ denotes the Hadamard product. The part-activated features $\mathcal{F}'_I$, containing information of both the image and the Collaborative Prompts, are used to compute part sparse embeddings $\mathcal{F}_S^{part} \in \mathbb{R}^{P \times n \times d}$ and part dense embeddings $\mathcal{F}_D^{part} \in \mathbb{R}^{P \times h \times w \times d}$ with a two-layer MLP and a three-layer CNN, respectively. Here $n$ represents the number of sparse tokens for each part. These embeddings are then fed into the SAM Decoder to segment the corresponding instrument parts.
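
The spatial attention described above reduces to a per-part dot product followed by a broadcast multiply-and-add. The following NumPy sketch shows only the tensor shapes and broadcasting; the toy sizes and random inputs are assumptions, and the subsequent two-layer MLP and three-layer CNN are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
P, h, w, d = 3, 4, 4, 8          # toy sizes chosen for illustration

t_part = rng.standard_normal((P, d))       # part embeddings T^part
f_img = rng.standard_normal((h, w, d))     # image embedding F_I

# Similarity map per part: sim[p, i, j] = <t_part[p], f_img[i, j]>
sim = np.einsum("pd,hwd->phw", t_part, f_img)

# Part-activated features F'_I = S * F_I + F_I, broadcast to P x h x w x d
f_act = sim[..., None] * f_img[None, ...] + f_img[None, ...]

print(sim.shape, f_act.shape)
```

Each part thus receives its own re-weighted copy of the image embedding, while the residual term keeps the original features available where the similarity is close to zero.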

3.2 Part-to-Whole Adaptive Fusion

In the Part-to-Whole Adaptive Fusion module, the sparse and dense embeddings for all parts are adaptively fused to form the whole sparse and dense embeddings, $\{\mathcal{F}_S, \mathcal{F}_D\}$, for the segmentation of the entire instrument. The adaptive fusion is achieved through Category Part Attention and Image Part Attention, as shown in Fig. 4. Specifically, the part sparse and dense embeddings consist of the prompt embeddings of all $P$ parts. However, as established in $\mathcal{D}_{CP}$, instruments of different categories encompass different part compositions. Therefore, we propose a Category Part Attention that utilises the part weights for the prompted category $c$ in $\mathcal{D}_{CP}$, i.e., $\mathcal{D}_{c*} = \{d_{cp}\}_{p=1}^{P} \in \mathbb{R}^{1 \times P}$, as the weights to fuse the sparse and dense embeddings from the part level to the whole level. Note that $\mathcal{D}_{CP}$ is initialised with 0s and 1s but is updated dynamically during model training.

Figure 4: Part-to-Whole Adaptive Fusion Module adaptively assembles the sparse and dense embeddings of the parts into the sparse and dense embeddings of the whole instrument via Category Part Attention and Image Part Attention.

While the Category Part Attention provides category-specific part weights, the presence and contribution of each part to an instrument can vary significantly across images due to different fields of view and occlusion conditions. Therefore, it is necessary to adapt the part-to-whole fusion to the condition of each image. Accordingly, we propose Image Part Attention to compute image-specific part weights by learning a global descriptor of the image and computing its similarity with the part embeddings. Particularly, the global descriptor $\mathcal{F}_G \in \mathbb{R}^{1 \times d}$ is learned from image embedding $\mathcal{F}_I$ with a Global CNN that consists of three convolutional layers and a linear layer. Then, image-specific part weights are computed as: $\mathcal{W} = \mathcal{F}_G \times \mathcal{T}^{part\top} \in \mathbb{R}^{1 \times P}$.
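
A minimal sketch of the image-specific weights is given below. To keep it short, the Global CNN (three convolutional layers and a linear layer) is replaced by a stand-in consisting of global average pooling and a single linear map; this substitution, the toy sizes, and the random inputs are assumptions, so only the final similarity step mirrors the formula above.

```python
import numpy as np

rng = np.random.default_rng(0)
P, h, w, d = 3, 4, 4, 8          # toy sizes chosen for illustration

t_part = rng.standard_normal((P, d))       # part embeddings T^part
f_img = rng.standard_normal((h, w, d))     # image embedding F_I

# Stand-in for the Global CNN: average pooling plus one linear layer.
w_lin = rng.standard_normal((d, d)) * 0.1
f_global = f_img.mean(axis=(0, 1))[None, :] @ w_lin   # global descriptor, 1 x d

# Image-specific part weights: W = F_G x (T^part)^T, shape 1 x P
weights = f_global @ t_part.T
print(weights.shape)
```

A part whose embedding aligns poorly with the global descriptor, for example because it is occluded or out of view, receives a smaller weight in the fusion that follows.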

Finally, given category-specific part weights $\mathcal{D}_{c*}$ and image-specific part weights $\mathcal{W}$, we fuse the sparse and dense embeddings $\{\mathcal{F}_S^{part}, \mathcal{F}_D^{part}\}$ of the parts into the sparse and dense embeddings $\{\mathcal{F}_S, \mathcal{F}_D\}$ of the whole instrument. Note that the matrices are all broadcast to the same size prior to the Hadamard product.

$$\mathcal{F}_S = \mathcal{F}^{part}_S \circ \mathrm{ReLU}(\mathcal{D}_{c*}) \in \mathbb{R}^{P \times n \times d}, \tag{1}$$
$$\mathcal{F}'_D = \mathcal{F}^{part}_D \circ \mathcal{D}_{c*} \circ \mathcal{W} \in \mathbb{R}^{P \times h \times w \times d}, \tag{2}$$
$$\mathcal{F}_D = \sum_{p=1}^{P} \mathcal{F}'_D \in \mathbb{R}^{h \times w \times d}. \tag{3}$$
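
Eqs. (1)-(3) can be traced with a few lines of NumPy. In this sketch the function name `fuse_parts`, the toy sizes, and the constant inputs are assumptions chosen so the result can be checked by hand; the category weights are kept at their initial binary values, whereas in the method they are updated during training.

```python
import numpy as np

def fuse_parts(f_s_part, f_d_part, d_c, weights):
    """Fuse part embeddings into whole embeddings.

    f_s_part: P x n x d, f_d_part: P x h x w x d,
    d_c: category-specific weights (P,), weights: image-specific weights (P,).
    """
    f_s = f_s_part * np.maximum(d_c, 0.0)[:, None, None]            # Eq. (1)
    f_d_weighted = f_d_part * (d_c * weights)[:, None, None, None]  # Eq. (2)
    f_d = f_d_weighted.sum(axis=0)                                  # Eq. (3)
    return f_s, f_d

P, n, h, w, d = 3, 2, 4, 4, 8
f_s_part = np.ones((P, n, d))
f_d_part = np.ones((P, h, w, d))
d_c = np.array([1.0, 0.0, 1.0])        # shaft and tip present, wrist absent
weights = np.array([0.5, 2.0, 1.5])    # illustrative image-specific weights

f_s, f_d = fuse_parts(f_s_part, f_d_part, d_c, weights)
print(f_s.shape, f_d.shape)
```

With these inputs the wrist tokens in the sparse embedding are zeroed out, and every entry of the fused dense embedding equals 0.5 + 1.5 = 2.0, since the wrist contribution is gated off regardless of its image-specific weight.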

3.3 Hierarchical Decoding

The sparse and dense embeddings at both the whole level and the part level are fed into SAM Decoder for hierarchical decoding. This explicitly directs the model to learn both the overall structures as well as the part characteristics of surgical instruments. The final loss function thus comprises of the dice losses Dsubscript𝐷\mathcal{L}_{D}caligraphic_L start_POSTSUBSCRIPT italic_D end_POSTSUBSCRIPT [28] for both the whole segmentation mask and the part segmentation masks:

\[ \mathcal{L} = \mathcal{L}_{D}(M^{(c)}, G^{(c)}) + \sum_{p=1}^{P} d_{cp}\, \mathcal{L}_{D}(M^{(c)}_{p}, G^{(c)}_{p}), \tag{4} \]
\[ \mathcal{L}_{D}(M, G) = \frac{2\sum_{i}^{HW} m_{i} g_{i}}{\sum_{i}^{HW} m_{i}^{2} + \sum_{i}^{HW} g_{i}^{2}}, \tag{5} \]

where $m_{i}$ and $g_{i}$ denote the predicted logit value and the ground-truth binary value at pixel $i$, respectively. $M^{(c)}$ and $\{M^{(c)}_{p}\}_{p=1}^{P}$ are the predicted masks of the instrument and its parts, respectively. $G^{(c)}$ and $\{G^{(c)}_{p}\}_{p=1}^{P}$ are the ground-truth masks of the instrument and its parts, respectively.
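To make Eqs. (4) and (5) concrete, the following is a minimal pure-Python sketch, not the authors' implementation. The function names, the flattened-list mask representation, the small `eps` stabiliser, and the `as_loss` switch are all assumptions introduced for illustration. Note that Eq. (5) as written is the Dice overlap term of V-Net [28], which grows with better overlap; the sketch therefore assumes that the quantity minimised in practice is one minus this term. The weights $d_{cp}$ are treated here as given per-part scalars for the prompted category.

```python
from typing import Sequence

Mask = Sequence[float]  # a mask flattened to H*W values


def dice_term(pred: Mask, target: Mask, eps: float = 1e-8) -> float:
    """Eq. (5) as written: 2*sum(m*g) / (sum(m^2) + sum(g^2)).

    `eps` is an added assumption that avoids division by zero.
    """
    intersection = sum(m * g for m, g in zip(pred, target))
    denominator = sum(m * m for m in pred) + sum(g * g for g in target)
    return 2.0 * intersection / (denominator + eps)


def hierarchical_loss(
    whole_pred: Mask,
    whole_gt: Mask,
    part_preds: Sequence[Mask],
    part_gts: Sequence[Mask],
    part_weights: Sequence[float],
    as_loss: bool = True,
) -> float:
    """Eq. (4): whole-level term plus part-level terms weighted by d_cp.

    With as_loss=True each term is 1 - dice_term (an assumption, see text);
    with as_loss=False the raw expression of Eq. (5) is summed instead.
    """
    def term(pred: Mask, gt: Mask) -> float:
        value = dice_term(pred, gt)
        return 1.0 - value if as_loss else value

    total = term(whole_pred, whole_gt)
    for weight, pred, gt in zip(part_weights, part_preds, part_gts):
        total += weight * term(pred, gt)
    return total
```

Setting a part weight to zero removes that part's term from the sum, which is how a part that does not belong to the prompted category would be excluded under this reading of $d_{cp}$.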

4 Experiments

4.1 Datasets and Evaluation Metrics

The effectiveness of SP-SAM is validated using the EndoVis2018 [1] and EndoVis2017 [2] datasets. EndoVis2018 is composed of 11 training videos and four validation videos each with 149 frames, on which we follow the standard experiment and evaluation protocols defined in [32] and [15] to ensure a fair comparison with existing methods. EndoVis2017 contains eight training videos each with 255 frames and ten testing sequences with 900 frames in total. We adopt two evaluation protocols on EndoVis2017 for a fair comparison with different works: (1) average results of four-fold cross-validation, as per [32]; (2) training on the training set and reporting results on the test set, following the official code of [48]. EndoVis2018 and EndoVis2017 offer annotations for five and four instrument parts, respectively, and both datasets include seven instrument categories.

We adopt the standard evaluation metrics used in all existing works [15, 5, 3, 42, 48, 20, 46, 47, 32]: Challenge IoU [2], IoU [15], and mean class IoU (mc IoU). Challenge IoU is computed only for the classes present in an image, whereas IoU considers all classes. We also report the IoU for each instrument category.
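The difference between the three metrics can be illustrated with a small sketch. It follows only what is stated above (Challenge IoU uses the classes present in an image, IoU considers all classes) and fills in the remaining details with assumptions: classes absent from both prediction and ground truth are skipped, per-image scores are averaged over images, and mc IoU is taken as the mean of per-class average IoUs. The official evaluation code of [2] and [15] may differ in these details, and all names below are illustrative.

```python
from typing import Dict, List, Optional, Sequence, Tuple

LabelMap = Sequence[int]  # flattened per-pixel class ids, 0 = background


def class_iou(pred: LabelMap, gt: LabelMap, cls: int) -> Optional[float]:
    """IoU of one class in one image; None if the class is absent from both."""
    inter = sum(1 for p, g in zip(pred, gt) if p == cls and g == cls)
    union = sum(1 for p, g in zip(pred, gt) if p == cls or g == cls)
    return inter / union if union else None


def _mean(values: List[float]) -> float:
    return sum(values) / len(values) if values else 0.0


def evaluate(samples: Sequence[Tuple[LabelMap, LabelMap]], num_classes: int) -> Dict[str, float]:
    """samples holds (prediction, ground truth) pairs, one per image."""
    challenge_scores: List[float] = []
    iou_scores: List[float] = []
    per_class: Dict[int, List[float]] = {c: [] for c in range(1, num_classes + 1)}

    for pred, gt in samples:
        present_in_gt: List[float] = []
        considered: List[float] = []
        for cls in range(1, num_classes + 1):
            score = class_iou(pred, gt, cls)
            if score is None:
                continue  # assumption: skip classes absent from both maps
            considered.append(score)
            per_class[cls].append(score)
            if cls in gt:
                present_in_gt.append(score)  # Challenge IoU: ground-truth classes only
        if present_in_gt:
            challenge_scores.append(_mean(present_in_gt))
        if considered:
            iou_scores.append(_mean(considered))

    class_means = [_mean(scores) for scores in per_class.values() if scores]
    return {
        "challenge_iou": _mean(challenge_scores),
        "iou": _mean(iou_scores),
        "mc_iou": _mean(class_means),
    }
```

In this sketch a false-positive class lowers IoU but not Challenge IoU, which is why the two scores can diverge for methods that hallucinate absent categories.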

4.2 Implementation Details

Images from EndoVis2017 and EndoVis2018 are processed to a size of 1024×1280, as per [32]. Data augmentation strategies are adopted following [32, 3], which include random flipping, random scale and crop, random rotation, and colour jitter. For Transfer MLP, Sparse MLP, Dense CNN, and Global CNN, their feature dimensions are set to 512, 256, 256, and 256, respectively. The number of sparse tokens per part $n$ is set to 2. In terms of training, we initialise SAM Image Encoder and SAM Decoder with SAM’s pre-trained weights of the ViT-H version [13]. We adopt CLIP Text Encoder of version ViT-L/14@336px, following [22]. Our model keeps SAM Image Encoder, CLIP Text Encoder, and the output MLPs of SAM Decoder frozen, while updating the remaining weights using an Adam optimiser with a learning rate of 0.0001. To reduce computational load, we utilise pre-computed image embeddings, employing a batch size of 8. In practice, inspired by [42], we implement our model by inputting all categories into the model and differentiating the positive category (i.e., the prompted category) with negative categories via the positive and negative sparse embeddings of SAM. SP-SAM is trained and evaluated on an Nvidia Tesla V100 16GB GPU.
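For reference, the hyperparameters listed above can be gathered into a single configuration object. The sketch below is one possible arrangement and is not taken from the released code: the class name, field names, and the parameter-name prefixes used to mark frozen modules are hypothetical, while the values are those stated in the text.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class SPSAMTrainConfig:
    # Values below are quoted from the implementation details in the text.
    image_size: Tuple[int, int] = (1024, 1280)  # order as written in the text
    transfer_mlp_dim: int = 512
    sparse_mlp_dim: int = 256
    dense_cnn_dim: int = 256
    global_cnn_dim: int = 256
    sparse_tokens_per_part: int = 2  # n
    sam_backbone: str = "ViT-H"
    clip_text_encoder: str = "ViT-L/14@336px"
    optimizer: str = "Adam"
    learning_rate: float = 1e-4
    batch_size: int = 8
    # Hypothetical parameter-name prefixes for the modules kept frozen.
    frozen_prefixes: Tuple[str, ...] = (
        "sam_image_encoder.",
        "clip_text_encoder.",
        "sam_decoder.output_mlps.",
    )


def is_trainable(param_name: str, cfg: SPSAMTrainConfig) -> bool:
    """True if a parameter is updated, i.e. it is outside every frozen module."""
    return not param_name.startswith(cfg.frozen_prefixes)
```

Under these assumed names, a parameter such as `sam_decoder.transformer.layers.0.weight` would be updated, whereas anything under `sam_image_encoder.` would stay at its pre-trained value.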

4.3 Main Results

We compare the performance of SP-SAM against existing methods on the EndoVis2018 and EndoVis2017 datasets, detailed in Table 1 and Table 2, respectively. A visual comparison of the predictions is shown in Fig. 5 (more visualisations are provided in the Supplementary Materials). The instrument categories include Bipolar Forceps (BF), Prograsp Forceps (PF), Large Needle Driver (LND), Suction Instrument (SI), Vessel Sealer (VS), Clip Applier (CA), Grasping Retractor (GR), Monopolar Curved Scissors (MCS), and Ultrasound Probe (UP). In our comparison, we divide existing methods into two categories: specialist models and SAM-based models. Notably, SP-SAM surpasses both existing fully-trained specialist models and efficient-tuning approaches based on SAM, yet at a substantially lower training cost in terms of tunable parameters.

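Instrument Categories (per-category IoU): BF, PF, LND, SI, CA, MCS, UP.

| Method Category | Method | Challenge IoU | IoU | mc IoU | BF | PF | LND | SI | CA | MCS | UP | #T-Params |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Specialist Model | TernausNet [32] | 46.22 | 39.87 | 14.19 | 44.20 | 4.67 | 0.00 | 0.00 | 0.00 | 50.44 | 0.00 | 32.20M |
| Specialist Model | MF-TAPNet [20] | 67.87 | 39.14 | 24.68 | 69.23 | 6.10 | 11.68 | 14.00 | 0.91 | 70.24 | 0.57 | 37.73M |
| Specialist Model | Dual-MF [46] | 70.40 | - | 35.09 | 74.10 | 6.80 | 46.00 | 30.10 | 7.60 | 80.90 | 0.10 | 203.80M |
| Specialist Model | ISINet [15] | 73.03 | 70.94 | 40.21 | 73.83 | 48.61 | 30.98 | 37.68 | 0.00 | 88.16 | 2.16 | 162.52M |
| Specialist Model | TraSeTr [47] | 76.20 | - | 47.71 | 76.30 | 53.30 | 46.50 | 40.60 | 13.90 | 86.20 | 17.15 | - |
| Specialist Model | S3Net [5] | 75.81 | 74.02 | 42.58 | 77.22 | 50.87 | 19.83 | 50.59 | 0.00 | 92.12 | 7.44 | 68.41M |
| Specialist Model | MATIS Frame [3] | 82.37 | 77.01 | 48.65 | 83.35 | 38.82 | 40.19 | 64.49 | 4.32 | 93.18 | 16.17 | 68.72M |
| Specialist Model | TP-SIS [48] | 84.92 | 83.61 | 65.44 | 84.28 | 73.18 | 78.88 | 92.20 | 23.73 | 66.67 | 39.12 | 131.08M |
| SAM-based Model | MaskTrack-RCNN [39] + SAM | 78.49 | 78.49 | 56.07 | 79.83 | 74.86 | 43.12 | 62.88 | 16.74 | 91.62 | 23.45 | 57.67M |
| SAM-based Model | Mask2Former [9] + SAM | 78.72 | 78.72 | 52.50 | 85.95 | 82.31 | 44.08 | 0.00 | 49.80 | 92.17 | 13.18 | 68.72M |
| SAM-based Model | TrackAnything (1 Point) [38] | 40.36 | 38.38 | 20.62 | 30.20 | 12.87 | 24.46 | 9.17 | 0.19 | 55.03 | 12.41 | - |
| SAM-based Model | TrackAnything (5 Points) [38] | 65.72 | 60.88 | 38.60 | 72.90 | 31.07 | 64.73 | 10.24 | 12.28 | 61.05 | 17.93 | - |
| SAM-based Model | PerSAM (Zero-Shot) [45] | 49.21 | 49.21 | 34.55 | 51.26 | 34.40 | 46.75 | 16.45 | 15.07 | 52.28 | 25.62 | - |
| SAM-based Model | PerSAM (Fine-Tune) [45] | 52.21 | 52.21 | 37.24 | 57.19 | 36.13 | 53.86 | 14.34 | 25.94 | 54.66 | 18.57 | 2 |
| SAM-based Model | SurgicalSAM [42] | 80.33 | 80.33 | 58.87 | 83.66 | 65.63 | 58.75 | 54.48 | 39.78 | 88.56 | 21.23 | 4.65M |
| SAM-based Model | SP-SAM (Ours) | 84.24 | 84.24 | 65.71 | 87.60 | 65.07 | 61.95 | 58.30 | 59.96 | 92.08 | 34.99 | 8.62M |
| SAM-based Model | GT Centroid + SAM | 60.26 | 60.26 | 63.34 | 44.35 | 65.92 | 30.99 | 87.14 | 69.69 | 80.04 | 65.26 | - |
| SAM-based Model | GT Bbox + SAM | 88.04 | 88.04 | 84.23 | 87.10 | 86.81 | 72.23 | 91.21 | 75.91 | 93.08 | 83.24 | - |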

Table 1: Comparison of results on the EndoVis2018 dataset. #T-Params denotes the number of tunable parameters.

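Instrument Categories (per-category IoU): BF, PF, LND, VS, GR, MCS, UP.

| Protocol | Method Category | Method | Challenge IoU | IoU | mc IoU | BF | PF | LND | VS | GR | MCS | UP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Cross-Fold Average Results | Specialist Model | TernausNet [32] | 35.27 | 12.67 | 10.17 | 13.45 | 12.39 | 20.51 | 5.97 | 1.08 | 1.00 | 16.76 |
| Cross-Fold Average Results | Specialist Model | MF-TAPNet [20] | 37.25 | 13.49 | 10.77 | 16.39 | 14.11 | 19.01 | 8.11 | 0.31 | 4.09 | 13.40 |
| Cross-Fold Average Results | Specialist Model | Dual-MF [46] | 45.80 | - | 26.40 | 34.40 | 21.50 | 64.30 | 24.10 | 0.80 | 17.90 | 21.80 |
| Cross-Fold Average Results | Specialist Model | ISINet [15] | 55.62 | 52.20 | 28.96 | 38.70 | 38.50 | 50.09 | 27.43 | 2.10 | 28.72 | 12.56 |
| Cross-Fold Average Results | Specialist Model | TraSeTr [47] | 60.40 | - | 32.56 | 45.20 | 56.70 | 55.80 | 38.90 | 11.40 | 31.30 | 18.20 |
| Cross-Fold Average Results | Specialist Model | S3Net [5] | 72.54 | 71.99 | 46.55 | 75.08 | 54.32 | 61.84 | 35.50 | 27.47 | 43.23 | 28.38 |
| Cross-Fold Average Results | Specialist Model | MATIS Frame [3] | 68.79 | 62.74 | 37.30 | 66.18 | 50.99 | 52.23 | 32.84 | 15.71 | 19.27 | 23.90 |
| Cross-Fold Average Results | Specialist Model | TP-SIS [48] | 63.37 | 63.37 | 52.74 | 66.42 | 45.46 | 75.20 | 73.44 | 29.95 | 44.02 | 34.67 |
| Cross-Fold Average Results | SAM-based Model | Mask2Former [9] + SAM | 66.21 | 66.21 | 55.26 | 66.84 | 55.36 | 83.29 | 73.52 | 26.24 | 36.26 | 45.34 |
| Cross-Fold Average Results | SAM-based Model | TrackAnything (1 Point) [38] | 54.90 | 52.46 | 55.35 | 47.59 | 28.71 | 43.27 | 82.75 | 63.10 | 66.46 | 55.54 |
| Cross-Fold Average Results | SAM-based Model | TrackAnything (5 Points) [38] | 67.41 | 64.50 | 62.97 | 55.42 | 44.46 | 62.43 | 83.68 | 62.59 | 67.03 | 65.17 |
| Cross-Fold Average Results | SAM-based Model | PerSAM (Zero-Shot) [45] | 42.47 | 42.47 | 41.80 | 53.99 | 25.89 | 50.17 | 52.87 | 24.24 | 47.33 | 38.16 |
| Cross-Fold Average Results | SAM-based Model | PerSAM (Fine-Tune) [45] | 41.90 | 41.90 | 39.78 | 46.21 | 28.22 | 53.12 | 57.98 | 12.76 | 41.19 | 38.99 |
| Cross-Fold Average Results | SAM-based Model | SurgicalSAM [42] | 69.94 | 69.94 | 67.03 | 68.30 | 51.77 | 75.52 | 68.24 | 57.63 | 86.95 | 60.80 |
| Cross-Fold Average Results | SAM-based Model | SP-SAM (Ours) | 73.94 | 73.94 | 71.06 | 68.89 | 53.16 | 83.80 | 73.20 | 72.40 | 84.91 | 61.05 |
| Cross-Fold Average Results | SAM-based Model | GT Centroid + SAM | 44.42 | 44.42 | 54.41 | 63.42 | 36.03 | 22.57 | 54.21 | 75.18 | 70.17 | 59.25 |
| Cross-Fold Average Results | SAM-based Model | GT Bbox + SAM | 76.31 | 76.31 | 81.18 | 89.36 | 73.44 | 67.67 | 90.04 | 87.79 | 94.03 | 65.91 |
| Test Set Results | Specialist Model | TP-SIS [48] | 79.90 | 77.83 | 56.22 | 68.58 | 73.52 | 92.74 | 83.90 | 0.13 | 74.70 | 0.00 |
| Test Set Results | SAM-based Model | SP-SAM (Ours) | 82.01 | 82.01 | 56.00 | 81.64 | 74.06 | 91.42 | 72.00 | 0.84 | 72.06 | 0.00 |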

Table 2: Comparison of results on the EndoVis2017 dataset.

In terms of zero-shot performance of SAM, the methods using bounding boxes as prompts (MaskTrack-RCNN [39]/Mask2Former [9] + SAM) in general outperform those using point prompts (TrackAnything [38]) and image prompts (PerSAM [45]). However, their performances are still inferior to tuning-based methods. Additionally, they rely on a well-trained detector for bounding box prediction, resulting in a considerable increase in the number of training parameters and higher pipeline complexity. In contrast, tuning-based methods for SAM, including both SurgicalSAM [42] and SP-SAM, adopt an end-to-end efficient tuning pipeline with minimal training parameters, boosting both training efficiency and segmentation performance.

Our method demonstrates considerable enhancement over the existing efficient-tuning approach, SurgicalSAM [42], on both datasets. It shows an improvement of 3.91 and 4.00 in terms of Challenge IoU on EndoVis2018 and EndoVis2017, respectively. Unlike SurgicalSAM, which relies on category ID prompting, our approach is prompted by text, leveraging the extensive information in natural language expressions and pre-trained language models. Additionally, our method significantly outperforms SurgicalSAM in mean class IoU (mc IoU), by a gain of 6.84 for EndoVis2018 and 4.03 for EndoVis2017, indicating superior discrimination of instruments across different categories. This enhancement is largely attributed to our part-to-whole collaborative prompting mechanism that explicitly directs the model to identify the internal structures of instruments and concentrate on the part-level details, in contrast to SurgicalSAM which treats each instrument as a single entity.

In addition, we compare SP-SAM with two oracle scenarios that employ ground-truth centroids and bounding boxes as prompts for SAM. Remarkably, SP-SAM achieves significantly better results than those obtained with ground-truth centroids. Moreover, SP-SAM’s performance closely approaches that of the oracle setting with ground-truth bounding boxes, with only a gap of 3.80 and 2.37 in terms of Challenge IoU for EndoVis2018 and EndoVis2017, respectively, yet our method requires significantly less prompting effort without the need for manual per-frame bounding box guidance.
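The gains over SurgicalSAM and the gaps to the ground-truth bounding-box oracle quoted above follow directly from Tables 1 and 2. As a quick arithmetic check, the short script below recomputes them from the reported scores (cross-fold averages for EndoVis2017). The dictionary layout and names are illustrative only; `Decimal` is used so that two-decimal values subtract exactly.

```python
from decimal import Decimal

# Scores copied from Table 1 (EndoVis2018) and Table 2 (EndoVis2017, cross-fold).
CHALLENGE_IOU = {
    "EndoVis2018": {"SurgicalSAM": "80.33", "SP-SAM": "84.24", "GT Bbox + SAM": "88.04"},
    "EndoVis2017": {"SurgicalSAM": "69.94", "SP-SAM": "73.94", "GT Bbox + SAM": "76.31"},
}
MC_IOU = {
    "EndoVis2018": {"SurgicalSAM": "58.87", "SP-SAM": "65.71"},
    "EndoVis2017": {"SurgicalSAM": "67.03", "SP-SAM": "71.06"},
}


def difference(a: str, b: str) -> Decimal:
    """Exact difference a - b between two reported scores."""
    return Decimal(a) - Decimal(b)


summary = {}
for dataset, challenge in CHALLENGE_IOU.items():
    mc = MC_IOU[dataset]
    summary[dataset] = {
        "challenge_gain": difference(challenge["SP-SAM"], challenge["SurgicalSAM"]),
        "mc_gain": difference(mc["SP-SAM"], mc["SurgicalSAM"]),
        "oracle_gap": difference(challenge["GT Bbox + SAM"], challenge["SP-SAM"]),
    }

for dataset, row in summary.items():
    print(dataset, {name: str(value) for name, value in row.items()})
```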

Finally, SP-SAM achieves superior or competitive performance compared to SOTA specialist models while using significantly fewer tunable parameters. On EndoVis2018, our method surpasses TP-SIS [48] in both IoU and mc IoU, even though TP-SIS has about 15 times as many tunable parameters (131.08M for TP-SIS vs. 8.62M for SP-SAM). On EndoVis2017, for cross-fold averages, we compare with the results of TP-SIS reproduced using the official code; for test set results, we compare with the results reported in [48]. Our method improves Challenge IoU and IoU in both settings, with a more substantial gain in cross-fold averages, which is the more robust evaluation protocol. In contrast to TP-SIS, which uses instrument part masks straightforwardly as training supervisory signals, our method more effectively integrates instrument structure knowledge and explicitly addresses category-part and image-part relationships. This allows for a more accurate comprehension of instruments, encompassing both their structures and finer details. Similarly, the comparison between SP-SAM and its variant without Part-to-Whole Adaptive Fusion (model C in Table 3) in our ablation study also highlights the improvement our method offers over the straightforward use of instrument part masks.

Figure 5 showcases a visual comparison of predicted masks by different methods, highlighting that SP-SAM clearly outperforms existing SAM-based methods in segmentation quality. Notably, SP-SAM outperforms SurgicalSAM, which tends to misidentify instrument categories (Fig. 5(a)), predict incomplete masks with missed critical parts like instrument tips (Fig. 5(b)), and generate rugged edges (Fig. 5(c) and (d)). Owing to the part-to-whole collaborative prompting mechanism, SP-SAM precisely captures fine-grained details such as edges and challenging areas. This precision, especially in identifying edges and tips, is crucial for ensuring safety in surgical settings.

Figure 5: Visual comparison of predicted masks by different methods.

4.4 Ablation Study

Ablation Study of Key Components. We conduct an ablation study on both EndoVis2018 and EndoVis2017 to investigate the effect of the proposed Collaborative Prompts, Part-to-Whole Adaptive Fusion, and Hierarchical Decoding. The results are reported in Table 3. Model A, the baseline, utilises text prompts of category names, which are encoded by the Cross-Modal Prompt Encoder into sparse and dense embeddings for the SAM Decoder to generate instrument masks. Model A is then progressively augmented with our proposed modules. In model B, Collaborative Prompts are encoded by the Cross-Modal Prompt Encoder into sparse and dense embeddings for individual parts. These embeddings of different parts are then combined into whole sparse and dense embeddings in a straightforward manner, where the sparse embeddings of all parts are concatenated and the dense embeddings of all parts are summed. Subsequently, these whole sparse and dense embeddings are decoded by the SAM Decoder into instrument masks. Building upon model B, model C additionally incorporates Hierarchical Decoding, enabling the decoding of both whole and part sparse and dense embeddings into their respective masks. Following model C, we investigate the impact of Category Part Attention and Image Part Attention in the Part-to-Whole Adaptive Fusion module. In model D, we set image-specific part weights $\mathcal{W}$ to an all-one vector, while in model E, we fix $\mathcal{D}_{CP}$ as an all-one matrix. Finally, model F represents our proposed SP-SAM.
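The straightforward combination used in model B (concatenating the sparse embeddings of all parts and summing their dense embeddings) can be sketched in a few lines. This is an illustrative sketch on plain Python lists, not the authors' implementation. The optional `part_weights` argument is a hypothetical stand-in for the weights produced by Part-to-Whole Adaptive Fusion, and applying it multiplicatively to both sparse and dense embeddings is an assumption; the only grounded property is that all-one weights reduce to the plain combination, mirroring how the ablated models replace learned weights with ones.

```python
from typing import List, Optional, Sequence, Tuple

Vector = List[float]


def combine_parts(
    part_sparse: Sequence[Sequence[Vector]],
    part_dense: Sequence[Sequence[float]],
    part_weights: Optional[Sequence[float]] = None,
) -> Tuple[List[Vector], List[float]]:
    """Merge part-level embeddings into whole-level embeddings.

    part_sparse[p] holds the n sparse token vectors of part p.
    part_dense[p] holds the dense embedding of part p, flattened to one list.
    part_weights[p] is a hypothetical per-part scalar; None means all ones.
    """
    num_parts = len(part_sparse)
    if len(part_dense) != num_parts:
        raise ValueError("sparse and dense inputs must cover the same parts")
    if part_weights is None:
        part_weights = [1.0] * num_parts
    if len(part_weights) != num_parts:
        raise ValueError("one weight per part is required")

    # Sparse embeddings: concatenate the tokens of all parts.
    whole_sparse: List[Vector] = []
    for weight, tokens in zip(part_weights, part_sparse):
        whole_sparse.extend([weight * x for x in token] for token in tokens)

    # Dense embeddings: element-wise sum over parts.
    length = len(part_dense[0])
    whole_dense = [
        sum(weight * dense[i] for weight, dense in zip(part_weights, part_dense))
        for i in range(length)
    ]
    return whole_sparse, whole_dense
```

With P parts and n tokens per part, the concatenated sparse output has P·n tokens, while the dense output keeps the shape of a single part's dense embedding.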

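| Model | Collab. Prompts | Part-to-Whole Fusion: Category Att. | Part-to-Whole Fusion: Image Att. | Hier. Decod. | EndoVis2018 Challenge IoU | EndoVis2018 mc IoU | EndoVis2017 Challenge IoU | EndoVis2017 mc IoU |
|---|---|---|---|---|---|---|---|---|
| A | | | | | 81.08 | 61.14 | 67.69 | 63.59 |
| B | ✓ | | | | 81.90 | 62.23 | 68.97 | 65.30 |
| C | ✓ | | | ✓ | 82.36 | 61.26 | 71.64 | 66.21 |
| D | ✓ | ✓ | | ✓ | 82.64 | 63.24 | 72.55 | 67.87 |
| E | ✓ | | ✓ | ✓ | 82.98 | 65.37 | 72.48 | 68.21 |
| F | ✓ | ✓ | ✓ | ✓ | 84.24 | 65.71 | 73.94 | 71.06 |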

Table 3: Ablation study of the key components of SP-SAM.
Figure 6: Visual comparison: SP-SAM vs. instrument category name prompting.

In general, each module enhances the Challenge IoU and mc IoU scores, with the most significant improvements observed when all modules are integrated. One exception appears on EndoVis2018: after incorporating Collaborative Prompts (model B), adding Hierarchical Decoding (model C) yields only a marginal improvement in Challenge IoU (81.90 to 82.36) and a decrease in mc IoU (62.23 to 61.26). This indicates that employing surgical part labels as auxiliary signals in a straightforward multi-task learning manner does not yield substantial enhancements. The real breakthrough comes with the proposed Part-to-Whole Adaptive Fusion in conjunction with Hierarchical Decoding (models D, E, and F), which unlocks the full potential of part-level supervision. A visual comparison of SP-SAM with the baseline of category name prompting is presented in Fig. 6. Prompting without part information results in a substantial loss of details in the predicted masks. In contrast, SP-SAM excels in recognising all instrument parts, particularly intricate elements like tips (Fig. 6(a) and (b)) and areas with varying materials (Fig. 6(c) and (d)).

Ablation Study of Sparse and Dense Embeddings. We conduct an ablation study on sparse and dense embeddings on both EndoVis2018 and EndoVis2017 datasets, as shown in Table 4. The study evaluates the model’s performance when using only sparse or only dense embeddings. In the sparse-only model, dense embeddings are replaced with the no-mask embeddings from SAM’s pre-trained weights, and in the dense-only model, sparse embeddings are set to be empty tensors. Table 4 reveals that removing either dense or sparse embeddings decreases performance, with the removal of dense embeddings causing a more significant decline. This suggests that dense embeddings are more crucial than sparse embeddings. This aligns with our expectations, since in SAM dense embeddings function as masks, holding more information than the point-based sparse embeddings, thus guiding more accurate decoding. Our method optimally leverages both sparse and dense embeddings and achieves the best results.

| Method | EndoVis2018 Challenge IoU | EndoVis2018 mc IoU | EndoVis2017 Challenge IoU | EndoVis2017 mc IoU |
|---|---|---|---|---|
| Sparse-Only | 76.71 | 57.17 | 67.15 | 64.18 |
| Dense-Only | 83.09 | 63.81 | 72.43 | 66.61 |
| SP-SAM (Ours) | 84.24 | 65.71 | 73.94 | 71.06 |

Table 4: Ablation study of sparse and dense embeddings.

Impact of Number of Tokens. Details are in Supplementary Materials.

5 Conclusion

In this paper, we present SP-SAM, an efficient-tuning approach of SAM for text promptable surgical instrument segmentation. It leverages a part-to-whole collaborative prompting mechanism to address the challenge of complex structures and fine-grained details in surgical instruments. Specifically, Collaborative Prompts are devised to describe surgical instruments at both category and part levels. Moreover, the proposed Cross-Modal Prompt Encoder, Part-to-Whole Adaptive Fusion, and Hierarchical Decoding modules learn discriminative representations of instrument parts and adaptively assemble them for accurate instrument segmentation. Experiments on both EndoVis2018 and EndoVis2017 datasets demonstrate that SP-SAM outperforms both specialist methods and SAM-based methods while only tuning a small number of parameters. Our method demonstrates the great potential of efficiently adapting foundation models for highly specialised tasks and offers valuable insights into the segmentation of challenging targets. In the future, our method can be further improved by exploring other forms of text prompts, incorporating temporal cues, and tackling background targets such as human tissues.

References

  • [1] Allan, M., Kondo, S., Bodenstedt, S., Leger, S., Kadkhodamohammadi, R., Luengo, I., Fuentes, F., Flouty, E., Mohammed, A., Pedersen, M., et al.: 2018 robotic scene segmentation challenge. arXiv preprint arXiv:2001.11190 (2020)
  • [2] Allan, M., Shvets, A., Kurmann, T., Zhang, Z., Duggal, R., Su, Y.H., Rieke, N., Laina, I., Kalavakonda, N., Bodenstedt, S., et al.: 2017 robotic instrument segmentation challenge. arXiv preprint arXiv:1902.06426 (2019)
  • [3] Ayobi, N., Pérez-Rondón, A., Rodríguez, S., Arbeláez, P.: MATIS: Masked-attention transformers for surgical instrument segmentation. In: ISBI. pp. 1–5 (2023)
  • [4] Azofeifa, J.D., Noguez, J., Ruiz, S., Molina-Espinosa, J.M., Magana, A.J., Benes, B.: Systematic review of multimodal human–computer interaction. In: Informatics. vol. 9, p. 13. MDPI (2022)
  • [5] Baby, B., Thapar, D., Chasmai, M., Banerjee, T., Dargan, K., Suri, A., Banerjee, S., Arora, C.: From forks to forceps: A new framework for instance segmentation of surgical instruments. In: WACV. pp. 6180–6190. IEEE (2023)
  • [6] Birlo, M., Edwards, P.E., Clarkson, M., Stoyanov, D.: Utility of optical see-through head mounted displays in augmented reality-assisted surgery: A systematic review. Medical Image Analysis 77, 102361 (2022)
  • [7] Cao, C., Cerfolio, R.J.: Virtual or augmented reality to enhance surgical education and surgical planning. Thoracic Surgery Clinics 29(3), 329–337 (2019)
  • [8] Chen, T., Zhu, L., Deng, C., Cao, R., Wang, Y., Zhang, S., Li, Z., Sun, L., Zang, Y., Mao, P.: SAM-Adapter: Adapting segment anything in underperformed scenes. In: ICCV Workshop. pp. 3367–3375 (2023)
  • [9] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR. pp. 1290–1299 (2022)
  • [10] Cheng, D., Qin, Z., Jiang, Z., Zhang, S., Lao, Q., Li, K.: SAM on medical images: A comprehensive study on three prompt modes. arXiv preprint arXiv:2305.00035 (2023)
  • [11] Deng, R., Cui, C., Liu, Q., Yao, T., Remedios, L.W., Bao, S., Landman, B.A., Tang, Y., Wheless, L.E., Coburn, L.A., et al.: Segment anything model (SAM) for digital pathology: Assess zero-shot segmentation on whole slide imaging. In: Medical Imaging with Deep Learning, Short Paper Track (2023)
  • [12] Ding, H., Liu, C., Wang, S., Jiang, X.: Vision-language transformer and query generation for referring segmentation. In: ICCV. pp. 16321–16330 (2021)
  • [13] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. In: ICLR (2020)
  • [14] Feng, G., Hu, Z., Zhang, L., Lu, H.: Encoder fusion network with co-attention embedding for referring image segmentation. In: CVPR. pp. 15506–15515 (2021)
  • [15] González, C., Bravo-Sánchez, L., Arbelaez, P.: ISINet: An instance-based approach for surgical instrument segmentation. In: MICCAI. pp. 595–605. Springer (2020)
  • [16] He, K., Gkioxari, G., Dollár, P., Girshick, R.: Mask R-CNN. In: ICCV. pp. 2961–2969 (2017)
  • [17] He, S., Bao, R., Li, J., Stout, J., Bjornerud, A., Grant, P.E., Ou, Y.: Computer-vision benchmark segment-anything model (SAM) in medical images: Accuracy in 12 datasets. arXiv preprint arXiv:2304.09324 (2023)
  • [18] Hu, R., Rohrbach, M., Darrell, T.: Segmentation from natural language expressions. In: ECCV. pp. 108–124. Springer (2016)
  • [19] Huang, Y., Yang, X., Liu, L., Zhou, H., Chang, A., Zhou, X., Chen, R., Yu, J., Chen, J., Chen, C., et al.: Segment anything model for medical images? Medical Image Analysis 92, 103061 (2024)
  • [20] Jin, Y., Cheng, K., Dou, Q., Heng, P.A.: Incorporating temporal prior from motion flow for instrument segmentation in minimally invasive surgery video. In: MICCAI. pp. 440–448. Springer (2019)
  • [21] Kim, N., Kim, D., Lan, C., Zeng, W., Kwak, S.: ReSTR: Convolution-free referring image segmentation using transformers. In: CVPR. pp. 18145–18154 (2022)
  • [22] Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.Y., Dollar, P., Girshick, R.: Segment anything. In: ICCV. pp. 4015–4026 (2023)
  • [23] Li, R., Li, K., Kuo, Y.C., Shu, M., Qi, X., Shen, X., Jia, J.: Referring image segmentation via recurrent refinement networks. In: CVPR. pp. 5745–5753 (2018)
  • [24] Liu, D., Li, Q., Jiang, T., Wang, Y., Miao, R., Shan, F., Li, Z.: Towards unified surgical skill assessment. In: CVPR. pp. 9522–9531 (2021)
  • [25] Lüddecke, T., Ecker, A.: Image segmentation using text and image prompts. In: CVPR. pp. 7086–7096 (2022)
  • [26] Ma, J., He, Y., Li, F., Han, L., You, C., Wang, B.: Segment anything in medical images. Nature Communications 15(1), 654 (2024)
  • [27] Mazurowski, M.A., Dong, H., Gu, H., Yang, J., Konz, N., Zhang, Y.: Segment anything model for medical image analysis: An experimental study. Medical Image Analysis p. 102918 (2023)
  • [28] Milletari, F., Navab, N., Ahmadi, S.A.: V-Net: Fully convolutional neural networks for volumetric medical image segmentation. In: 3DV. pp. 565–571. IEEE (2016)
  • [29] Ni, Z.L., Bian, G.B., Wang, G.A., Zhou, X.H., Hou, Z.G., Chen, H.B., Xie, X.L.: Pyramid attention aggregation network for semantic segmentation of surgical instruments. In: AAAI. vol. 34, pp. 11782–11790 (2020)
  • [30] Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: ICML. pp. 8748–8763. PMLR (2021)
  • [31] Shi, H., Li, H., Meng, F., Wu, Q.: Key-word-aware network for referring expression image segmentation. In: ECCV. pp. 38–54 (2018)
  • [32] Shvets, A.A., Rakhlin, A., Kalinin, A.A., Iglovikov, V.I.: Automatic instrument segmentation in robot-assisted surgery using deep learning. In: ICMLA. pp. 624–628. IEEE (2018)
  • [33] Wald, T., Roy, S., Koehler, G., Disch, N., Rokuss, M.R., Holzschuh, J., Zimmerer, D., Maier-Hein, K.: SAM.MD: Zero-shot medical image segmentation capabilities of the segment anything model. In: Medical Imaging with Deep Learning, Short Paper Track (2023)
  • [34] Wang, A., Islam, M., Xu, M., Zhang, Y., Ren, H.: SAM meets robotic surgery: An empirical study on generalization, robustness and adaptation. In: MICCAI Workshop. pp. 234–244 (2023)
  • [35] Wang, Z., Lu, Y., Li, Q., Tao, X., Guo, Y., Gong, M., Liu, T.: CRIS: CLIP-driven referring image segmentation. In: CVPR. pp. 11686–11695 (2022)
  • [36] Wu, J., Li, X., Li, X., Ding, H., Tong, Y., Tao, D.: Towards robust referring image segmentation. arXiv preprint arXiv:2209.09554 (2022)
  • [37] Wu, J., Fu, R., Fang, H., Liu, Y., Wang, Z., Xu, Y., Jin, Y., Arbel, T.: Medical SAM adapter: Adapting segment anything model for medical image segmentation. arXiv preprint arXiv:2304.12620 (2023)
  • [38] Yang, J., Gao, M., Li, Z., Gao, S., Wang, F., Zheng, F.: Track anything: Segment anything meets videos. arXiv preprint arXiv:2304.11968 (2023)
  • [39] Yang, L., Fan, Y., Xu, N.: Video instance segmentation. In: ICCV. pp. 5188–5197 (2019)
  • [40] Yang, Z., Wang, J., Tang, Y., Chen, K., Zhao, H., Torr, P.H.: LAVT: Language-aware vision transformer for referring image segmentation. In: CVPR. pp. 18155–18165 (2022)
  • [41] Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for referring image segmentation. In: CVPR. pp. 10502–10511 (2019)
  • [42] Yue, W., Zhang, J., Hu, K., Xia, Y., Luo, J., Wang, Z.: SurgicalSAM: Efficient class promptable surgical instrument segmentation. In: AAAI (2024)
  • [43] Zang, D., Bian, G.B., Wang, Y., Li, Z.: An extremely fast and precise convolutional neural network for recognition and localization of cataract surgical tools. In: MICCAI. pp. 56–64. Springer (2019)
  • [44] Zhang, K., Liu, D.: Customized segment anything model for medical image segmentation. arXiv preprint arXiv:2304.13785 (2023)
  • [45] Zhang, R., Jiang, Z., Guo, Z., Yan, S., Pan, J., Dong, H., Qiao, Y., Gao, P., Li, H.: Personalize segment anything model with one shot. In: ICLR (2024)
  • [46] Zhao, Z., Jin, Y., Gao, X., Dou, Q., Heng, P.A.: Learning motion flows for semi-supervised instrument segmentation from robotic surgical video. In: MICCAI. pp. 679–689. Springer (2020)
  • [47] Zhao, Z., Jin, Y., Heng, P.A.: TraSeTr: Track-to-segment transformer with contrastive query for instance-level instrument segmentation in robotic surgery. In: ICRA. pp. 11186–11193. IEEE (2022)
  • [48] Zhou, Z., Alabi, O., Wei, M., Vercauteren, T., Shi, M.: Text promptable surgical instrument segmentation with vision-language models. In: NeurIPS (2023)