Graph Relation Distillation For Efficient Biomedical Instance Segmentation
Abstract—Instance-aware embeddings predicted by deep neural networks have revolutionized biomedical instance segmentation, but their resource requirements are substantial. Knowledge distillation offers a solution by transferring distilled knowledge
is of paramount importance, particularly in the case of 3D CNNs. Consequently, there exists a trade-off between model simplification for speed and maintaining optimal performance, as the simplified models often compromise their performance.
arXiv:2401.06370v1 [cs.CV] 12 Jan 2024
structural information. Incorporating these global relations becomes instrumental in constructing a well-structured feature space and attaining more precise instance segmentation results.
In this paper, we propose a novel graph relation distillation method tailored for biomedical instance segmentation, which addresses the challenges faced by existing distillation methods. To tackle the first challenge, we introduce two distillation schemes to extract crucial knowledge for instance segmentation, including instance-level features, instance relations, and pixel-level boundaries. The first scheme, instance graph distillation (IGD), constructs an instance graph that uses the central embeddings of the instances as nodes and the feature similarity between nodes as edges. By enforcing graph consistency, IGD effectively transfers knowledge of instance features and relations. The second scheme, affinity graph distillation (AGD), regards each pixel embedding as a graph node and converts pixel embeddings into a structured affinity graph by calculating the distance between pixel embeddings, which is used to mitigate boundary ambiguity in the lightweight student network. AGD ensures affinity graph consistency between teacher and student networks, facilitating boundary-related knowledge transfer. By employing the above two distillation schemes, our approach enhances the performance of student networks in biomedical instance segmentation.
To address the second challenge, we extend the IGD and AGD schemes to capture global structural information at the inter-image level. Specifically, we construct instance graphs and affinity graphs by considering relations between instances and pixel embeddings from different input images, respectively. To fully explore the graph relations between different input images, we need to increase the batch size of the network by including as many input samples as possible. Under the constraint of limited GPU memory, we introduce a memory bank mechanism to store as many previously predicted feature maps as possible. This enables us to calculate relations between the current input image and sampled images from the memory bank, effectively capturing long-range inter-image relations. Overall, our approach offers a practical solution for distilling knowledge vital to biomedical instance segmentation, addressing limitations and improving performance.
Extensive experimental results show that our knowledge distillation approach greatly benefits the lightweight network, leading to significant improvements in performance while maintaining efficiency during inference. The student networks trained using our approach achieve promising performance with less than 1% of parameters and less than 10% of inference time compared to the teacher networks.
The contributions of this paper are summarized as follows:
• We propose a graph relation distillation method tailored for biomedical instance segmentation to obtain efficient and high-performance networks.
• We propose an IGD scheme to force the student network to mimic instance-level features and instance relations of the teacher network via an instance graph, along with an AGD scheme for pixel-level boundary knowledge transfer.
• We extend the IGD and AGD schemes from the intra-image level to the inter-image level by introducing a memory bank mechanism to capture the global relation across different input images.
• The superiority of our knowledge distillation approach over existing methods is demonstrated on three 2D biomedical datasets and two 3D biomedical datasets.
This work is a substantial extension of our preliminary work [18] in the following aspects:
• We distill global relation information across different input images using an inter-image instance graph and inter-image affinity. This enables the student network to learn from a broader range of instance features and boundary structures and improve its ability to handle complex instances.
• We verify the effectiveness of our proposed method for more teacher-student network pairs with different architectures, demonstrating its versatility and robustness.
• We conduct extensive experiments on biomedical instance segmentation datasets with different modalities and multiple categories, further demonstrating the effectiveness and generalizability of our approach.
II. RELATED WORKS
A. Biomedical Instance Segmentation
Deep learning-based instance segmentation methods for biomedical images can be classified into two main categories: proposal-free and proposal-based methods.
Proposal-based methods [5]–[7], [19]–[21] utilize object detection [22]–[25] and object segmentation heads to predict bounding boxes and foreground masks for each object, respectively. However, these methods heavily rely on accurate bounding box predictions, which may fail to differentiate adjacent instances because of their overlap. Moreover, instances may exceed the receptive field of the model, making it challenging to locate complete instances using bounding boxes.
The proposal-free methods [9], [11], [26], [27] predict specially designed instance-aware features and morphology properties which can encode morphology characteristics [28], structures, and spatial arrangement, and then cluster the predicted mid-representations into instances using a post-processing algorithm [29]–[32]. Pixel embedding-based methods [33]–[37] excel in encoding each pixel of an image into a high-dimensional feature space, facilitating the grouping of similar pixels to form distinct instance regions. These methods exhibit exceptional performance in tackling complex scenes with overlapping and crowded objects, making them a popular choice for biomedical instance segmentation and other applications.
However, the computational and memory demands of pixel embedding-based methods hinder their practical use. Knowledge distillation offers a solution by transferring knowledge from a large, complex model (the teacher) to a smaller, simpler model (the student). This compression process effectively preserves the teacher network’s knowledge in the student network, maintaining comparable performance. By applying knowledge
[Figure 1 panel labels: Teacher Model, Student Model, Memory Bank, Intra-Image Affinity Map (T), Inter-Image Affinity Map (T), GroundTruth Affinity Map, Graph Distillation Loss, Distillation Loss]
Fig. 1. Workflow of our proposed graph relation distillation method for biomedical instance segmentation, which includes two schemes. The instance graph
distillation (IGD) scheme constructs instance graphs from embeddings of the teacher-student network pair and enforces the consistency of graphs constructed
by the teacher, while the affinity graph distillation (AGD) scheme converts pixel embeddings into pixel affinities that encode structured information of instance
boundaries and enforces the student model to generate affinities similar to its teacher model. These two schemes take charge of the knowledge distillation
mechanism and are carried out at both intra-image and inter-image levels for global instance relations. The symbol ⊙ represents dot product operation. The
red arrow indicates the loss function.
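The AGD idea summarized in the caption can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the exact affinity definition is not given in this part of the text, so the sketch assumes the affinity is the cosine similarity (a normalized dot product) between a pixel embedding and its neighbor at a fixed offset, and that teacher and student affinities are matched with a mean squared error. The function names `affinity_map` and `agd_loss` and the two default offsets are hypothetical.

```python
import numpy as np


def affinity_map(embedding, offset, eps=1e-8):
    """Cosine affinity between each pixel and its neighbor at `offset`.

    embedding: (D, H, W) array of pixel embeddings.
    offset: (dy, dx) with dy, dx >= 0 (assumed neighborhood definition).
    Returns an (H - dy, W - dx) array of affinities in [-1, 1].
    """
    dy, dx = offset
    _, h, w = embedding.shape
    # Normalize every pixel embedding to unit length.
    unit = embedding / (np.linalg.norm(embedding, axis=0, keepdims=True) + eps)
    here = unit[:, : h - dy, : w - dx]
    neighbor = unit[:, dy:, dx:]
    # Dot product along the embedding dimension.
    return (here * neighbor).sum(axis=0)


def agd_loss(student_emb, teacher_emb, offsets=((0, 1), (1, 0))):
    """Mean squared difference between student and teacher affinities."""
    losses = [
        np.mean((affinity_map(student_emb, o) - affinity_map(teacher_emb, o)) ** 2)
        for o in offsets
    ]
    return float(np.mean(losses))
```

Because the affinities depend only on relations between neighboring embeddings, the student and teacher embedding maps do not need to live in the same coordinate system for this loss to be meaningful.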
distillation to pixel embedding-based segmentation models, we can significantly reduce their computational and storage requirements while preserving their exceptional segmentation performance.
B. Knowledge Distillation
The goal of knowledge distillation is to transfer knowledge from a computationally expensive but powerful teacher network to a lightweight student network, thereby enhancing its performance while preserving its efficiency. Related work on knowledge distillation can be broadly classified into several categories, depending on the specific methods used for distillation.
One approach is to distill knowledge from a larger teacher network to a smaller student network. This can be done by directly transferring the output probabilities of the teacher network to the student network [12], or by transferring intermediate representations from the teacher network to the student network [13]. Other methods use attention mechanisms or gating functions to allow the student network to selectively focus on the most important information from the teacher network [15]. Another approach is to use ensemble models as teachers to transfer knowledge to a single student network. This can be done by using the logits or probabilities output by the ensemble as a soft label for the student network [38], [39], or by transferring feature maps or attention maps from the ensemble to the student network [40]. In addition to these methods for general tasks, there are also methods for task-specific knowledge distillation, such as for image classification [41]–[43], semantic segmentation [16], [44], object detection [45], and speech recognition [46].
However, there is currently no knowledge distillation method tailored for biomedical instance segmentation, which is a challenging task due to the complexity and heterogeneity of biomedical images with instances varying significantly in size, shape, and distribution. Therefore, the proposed method is a novel application of knowledge distillation to address the challenges of biomedical instance segmentation.
Existing works [41], [45] consider relation distillation for image classification and object detection by constructing a graph. In this paper, we extend this idea to biomedical instance segmentation. Specifically, we construct an instance graph and an affinity graph based on the predicted pixel embeddings and incorporate cross-image relations by leveraging relevant domain knowledge. By doing so, we aim to improve the performance of biomedical instance segmentation and enhance the understanding of inter-instance relations and pixel relations within the biomedical domain.
III. METHODOLOGY
The workflow of our proposed distillation method is presented in Fig. 1 and can be applied to both 2D and 3D networks for images and volumes. We illustrate it using a 2D image example for easy visualization and description. The method involves a heavy teacher network T and a lightweight student network S, which both predict a set of feature maps,
i.e., embedding map $E \in \mathbb{R}^{D \times H \times W}$ for an input image of size $H \times W$. The embedding vector of a pixel $p$ is denoted as $e_p \in \mathbb{R}^{D}$ and can be clustered into instances through post-processing, and $D$ is the dimension of the embedding vectors. Given $M$ training images, the segmentation network can extract $M$ embedding maps $\{E_m \in \mathbb{R}^{D \times H \times W}\}_{m=1}^{M}$. Two specially designed schemes, instance graph distillation (IGD) and affinity graph distillation (AGD), are employed to distill effective knowledge from embedding maps at both intra-image and inter-image levels. More details of our proposed method are provided below.
A. Instance Graph Distillation
Embeddings of pixels $p \in S^i$ belonging to the same instance $i$ and located in the corresponding area $S^i$ exhibit similarity, while the embeddings of pixels belonging to different instances demonstrate dissimilarity. This ensures that the different instances $i \in I$ of an input image are distinguished in the feature space. Therefore, the distribution of embeddings in the feature space contains valuable knowledge related to instance-level features and instance relations. To transfer this key knowledge, we propose an instance graph distillation scheme.
1) Intra-Image Distillation: To effectively distill this knowledge at the intra-image level, we construct an intra-image instance graph that encodes the knowledge of instance-level features and instance relations by nodes and edges, respectively. The nodes are extracted from pixel embeddings of the embedding map with the guidance of labeled instance masks which provide precise areas to calculate instance central features, denoted as
$$v_i = \frac{1}{|S^i|} \sum_{p \in S^i} e_p. \qquad (1)$$
2) Inter-Image Distillation: With the goal of transferring the instance relations across different input images, we extend the above-mentioned instance graph distillation scheme to the inter-image level. Given the limitation of GPU memory, we follow [47], [48] to introduce a shared online feature map queue between the student and teacher networks, which stores a vast quantity of feature maps in a memory bank generated from the predictions of the teacher network in previous iterations. It allows us to retrieve abundant feature maps efficiently. The feature map queue can store $K$ feature maps, and is notated as $\{E_k \in \mathbb{R}^{H \times W \times d}\}_{k=1}^{K}$. In each training iteration, we enqueue batch-size $B$ feature maps to the memory bank and randomly sample $L$ feature maps from it. Given the $m$th input image from the training images, the segmentation network can predict the embedding map $E_m \in \mathbb{R}^{D \times H \times W}$. Meanwhile, we can obtain $L$ embedding maps $\{E_l \in \mathbb{R}^{D \times H \times W}\}_{l=1}^{L}$ sampled from the memory bank.
We then calculate the corresponding node features $v_i^m$ and $v_j^l$ extracted from these two feature maps $E_m$ and $E_l$, where $i \in I_m$ and $j \in I_l$ represent different instances from the $m$th input image and the $l$th sampled image from the memory bank, respectively. The edge feature between two nodes $v_i^m$ and $v_j^l$ is calculated as $\varepsilon_{ij}^{ml}$ by the above-mentioned cosine distance. Given that the relations between instances within an image have been leveraged by intra-image distillation, we build the inter-image instance graph by only connecting nodes from different input images. We enforce consistency between two inter-image graphs respectively constructed from the student network and the teacher network, by using an MSE loss function. It is formulated as follows:
$$\mathcal{L}_{Edge}^{Inter} = \frac{1}{L} \sum_{l=1}^{L} \frac{1}{|I_m|\,|I_l|} \sum_{i \in I_m} \sum_{j \in I_l} \left\| (\varepsilon_{ij}^{ml})^{S} - (\varepsilon_{ij}^{ml})^{T} \right\|_{2}. \qquad (4)$$
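The node, edge, and inter-image loss computations described above can be sketched with NumPy as follows. This is a minimal illustration, not the authors' implementation. Assumptions: instance masks are given as an integer label map with 0 as background; edges are cosine similarities between node vectors; the memory bank is a fixed-capacity FIFO queue holding teacher-predicted embedding maps together with their label maps, shared by the student and teacher graphs; and the per-edge difference is squared, following the text's description of an MSE loss. All names (`FeatureQueue`, `instance_nodes`, `cosine_edges`, `inter_edge_loss`) are hypothetical.

```python
import random
from collections import deque

import numpy as np


class FeatureQueue:
    """Fixed-capacity FIFO memory bank of (embedding map, label map) pairs."""

    def __init__(self, capacity):
        self.items = deque(maxlen=capacity)  # oldest entries drop out first

    def enqueue(self, embedding, labels):
        self.items.append((embedding, labels))

    def sample(self, count, rng=random):
        # Draw up to `count` stored entries without replacement.
        return rng.sample(list(self.items), min(count, len(self.items)))


def instance_nodes(embedding, labels):
    """Eq. (1): mean embedding of each labeled instance.

    embedding: (D, H, W) array; labels: (H, W) int array, 0 = background.
    Returns an (I, D) array with one row per instance id.
    """
    ids = [i for i in np.unique(labels) if i != 0]
    return np.stack([embedding[:, labels == i].mean(axis=1) for i in ids])


def cosine_edges(nodes_a, nodes_b, eps=1e-8):
    """Pairwise cosine similarity between two node sets, shape (Ia, Ib)."""
    a = nodes_a / (np.linalg.norm(nodes_a, axis=1, keepdims=True) + eps)
    b = nodes_b / (np.linalg.norm(nodes_b, axis=1, keepdims=True) + eps)
    return a @ b.T


def inter_edge_loss(student_nodes, teacher_nodes, bank_nodes):
    """Eq. (4) sketch: student/teacher edge mismatch averaged over L samples.

    student_nodes, teacher_nodes: (I_m, D) nodes of the current image.
    bank_nodes: list of (I_l, D) node arrays from sampled bank entries.
    """
    per_sample = []
    for nodes_l in bank_nodes:
        diff = cosine_edges(student_nodes, nodes_l) - cosine_edges(teacher_nodes, nodes_l)
        per_sample.append(np.mean(diff ** 2))  # average over |I_m| * |I_l| edges
    return float(np.mean(per_sample))  # average over the L sampled maps
```

In a training loop one would enqueue the teacher's embedding maps of the current batch, call `sample(L)`, convert each sampled entry to nodes with `instance_nodes`, and pass them to `inter_edge_loss`. Only edges between the current image and sampled images are formed, matching the statement that the inter-image graph connects nodes from different input images.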
their teacher networks. We also conduct more detailed ablation experiments in Ablation Study V-D4 to explore the distillation performance of a series of small models obtained by reducing the width of each network layer in different ratios.
2) We employ the well-established lightweight network MobileNetV2 [54] as the student network. MobileNetV2 utilizes depth-wise and point-wise convolutions, resulting in reduced parameters and computation compared to traditional convolutions. This network architecture is widely adopted in mobile and embedded vision applications. It is worth noting that MobileNetV2 differs significantly in its network structure from the teacher networks ResUNet and NestedUNet, which further highlights the distinctiveness of our approach.
IV. EXPERIMENTS
A. Datasets and Metrics
1) CVPPP: The CVPPP A1 dataset [55] is a well-established plant phenotype dataset, aiming to reveal the relationship between plant phenotypes and genotypes, thus helping to understand the genetic characteristics and genetic mechanisms in biomedical research. The dataset contains images of leaves with complex shapes and significant occlusions and serves as a benchmark dataset for a highly regarded biological instance segmentation task. Each image has a resolution of 530 × 500 pixels. In this study, we randomly select 108 images from the dataset for training and 20 images for testing. To evaluate the quality of the segmentation results, we use two widely adopted metrics: symmetric best dice (SBD) and absolute difference in counting (|DiC|). SBD measures the similarity between the predicted and ground truth segmentation masks, while |DiC| counts the absolute difference between the predicted and ground truth number of objects in the image. These metrics are commonly utilized to assess the accuracy of instance segmentation results in computer vision tasks.
2) BBBC039V1: The BBBC039V1 dataset [56] consists of 200 Fluorescence Microscopy (FM) images, each with a resolution of 696 × 520 pixels. These images capture U2OS cells exhibiting diverse shapes and densities. We follow the official data split, employing 100 images for training, 50 for validation, and the remaining 50 for testing. To quantitatively evaluate the segmentation results, we adopt four widely used metrics for cell segmentation in FM images. Aggregated Jaccard Index (AJI) [57] measures the similarity between the ground truth and predicted segmentation. Object-level F1 score (F1) [58] measures the accuracy of predicted segmentation at the level of individual cells. Panoptic Quality (PQ) [20] measures the number of correctly segmented instances and the accuracy of the semantic labeling. The pixel-level Dice score (Dice) measures the similarity between the ground truth segmentation and the predicted segmentation at the pixel level.
3) C.elegans: The C.elegans dataset [56] is a challenging dataset for image analysis with a large number of organisms in each image. C.elegans itself has a slender shape and often appears in complex overlapping poses, making it difficult to accurately segment individual organisms. The dataset consists of 100 grayscale images, each with a resolution of 696 × 520 pixels. We partition the dataset into 50 training images and 50 test images using a random split, ensuring that both sets represent the complete dataset adequately. We use the same four metrics as those used in the BBBC039V1 dataset for quantitative evaluation.
4) AC3/4: AC3 and AC4 are two labeled subsets extracted from the mouse somatosensory cortex dataset [59], a widely used electron microscope (EM) dataset for 3D instance segmentation of individual neurons in 2D image sequences. These sequences were acquired at a resolution of 3 × 3 × 29 nm. The AC3 dataset consists of 256 sequential images, while the AC4 dataset contains 100 sequential images. For evaluating our proposed method, we partition the data as follows: we use the top 80 sections of AC4 for training, the remaining 20 sections for validation, and the top 100 sections of AC3 for testing. We adopt two widely used metrics to quantitatively evaluate the segmentation results: the variation of information (VOI) and the adapted Rand error (ARAND). VOI [60] measures the distance between two segmentation masks, taking into account both the over-merge and over-segmentation errors. ARAND [57] is a variation of the Rand Index that takes into account the uneven distribution of object sizes in EM image segmentation. Note that lower values of these two metrics indicate better segmentation performance.
5) CREMI: The CREMI dataset [61], which is imaged from adult Drosophila melanogaster brain tissue at a resolution of 4 × 4 × 40 nm, is another EM dataset used for 3D neuron segmentation. It is composed of three sub-volumes (CREMI-A/B/C) that correspond to different neuron types, with each sub-volume consisting of 125 consecutive images. Each sub-volume is split into 50 sections for training, 25 sections for validation, and 50 sections for testing. We adopt the same quantitative metrics (VOI and ARAND) as those used for the AC3/4 to evaluate the results on the CREMI dataset.
B. Implementation Details
Throughout our experiments, we conduct our computations within a well-defined environment comprising PyTorch 1.0.1, CUDA 9.0, and Python 3.7.4. To optimize model training, we utilize the Adam optimizer with $\beta_1 = 0.9$ and $\beta_2 = 0.99$, a learning rate of $10^{-4}$, and a batch size of 2. These choices ensure efficient and effective training processes. We utilize a single NVIDIA TitanXP GPU for training, and conduct 300K iterations for each model. To address GPU memory limitations, we follow [36] to set the embedding dimension of the last layer to 16. Additionally, we compute affinities by considering adjacent pixel embeddings within N = 1 voxel stride for 3D networks, and within N = 27 pixel strides for 2D networks. The hyper-parameters K and L of the memory bank mechanism are set as 32 and 12, respectively.
V. EXPERIMENTAL RESULTS
A. Baseline Methods
We perform a comparative analysis between our proposed method and three state-of-the-art knowledge distillation methods that are widely used for feature maps, which include:
1) Attention Transferring (AT) [15]: This method involves the transfer of attention maps from a teacher network to a
TABLE II
QUANTITATIVE COMPARISON OF DIFFERENT KNOWLEDGE DISTILLATION METHODS ON 2D BIOMEDICAL INSTANCE SEGMENTATION DATASETS. WE CONDUCT EXPERIMENTS ON FOUR SETS OF TEACHER-STUDENT NETWORK PAIRS CONSISTING OF TWO TEACHER NETWORKS AND TWO STUDENT NETWORKS. A BOLD SCORE REPRESENTS THE BEST PERFORMANCE ON THE CORRESPONDING DATASET.
T1 & S1 + AT [15] 83.9 1.60 0.749 0.875 0.904 0.731 0.875 0.937 0.959 0.854
T1 & S1 + SPKD [14] 85.2 1.30 0.740 0.853 0.904 0.709 0.881 0.940 0.962 0.858
T1 & S1 + ReKD [17] 85.6 1.25 0.708 0.859 0.883 0.723 0.879 0.945 0.961 0.861
T1 & S1 + BISKD [18] 86.4 1.15 0.760 0.865 0.912 0.734 0.872 0.934 0.959 0.851
T1 & S1 + Ours 87.0 1.15 0.765 0.884 0.918 0.759 0.884 0.946 0.961 0.868
T1 & S2 + AT [15] 82.8 1.40 0.545 0.627 0.817 0.451 0.760 0.931 0.893 0.740
T1 & S2 + SPKD [14] 73.8 3.95 0.552 0.612 0.819 0.444 0.749 0.917 0.890 0.725
T1 & S2 + ReKD [17] 81.5 1.60 0.595 0.732 0.828 0.528 0.757 0.924 0.894 0.737
T1 & S2 + BISKD [18] 84.7 1.40 0.655 0.799 0.851 0.611 0.766 0.930 0.895 0.746
T1 & S2 + Ours 86.0 1.10 0.672 0.839 0.857 0.645 0.771 0.938 0.896 0.753
T2 & S1 + AT [15] 84.7 1.25 0.750 0.869 0.908 0.727 0.877 0.940 0.960 0.855
T2 & S1 + SPKD [14] 84.0 1.60 0.739 0.844 0.912 0.702 0.878 0.942 0.959 0.859
T2 & S1 + ReKD [17] 85.1 1.45 0.749 0.861 0.904 0.726 0.880 0.942 0.959 0.857
T2 & S1 + BISKD [18] 83.6 1.20 0.702 0.853 0.882 0.713 0.883 0.947 0.961 0.865
T2 & S1 + Ours 85.8 1.10 0.751 0.874 0.912 0.746 0.884 0.948 0.963 0.870
T2 & S2 + AT [15] 84.4 1.00 0.549 0.630 0.816 0.454 0.763 0.930 0.892 0.738
T2 & S2 + SPKD [14] 72.9 4.60 0.558 0.625 0.822 0.454 0.751 0.912 0.889 0.724
T2 & S2 + ReKD [17] 79.7 2.65 0.563 0.652 0.816 0.478 0.750 0.919 0.890 0.727
T2 & S2 + BISKD [18] 84.9 1.25 0.679 0.834 0.864 0.645 0.768 0.933 0.897 0.749
T2 & S2 + Ours 85.3 1.15 0.697 0.871 0.867 0.679 0.776 0.939 0.890 0.757
[Figure 2 panels. Rows: CVPPP, C.elegans, BBBC039V1. Columns: Raw Image, GroundTruth, Teacher, Student, Student+AT, Student+SPKD, Student+ReKD, Student+BISKD, Student+Ours]
Fig. 2. Visual comparisons on three 2D datasets. We use networks ResUNet (T1) and MobileNet (S2) as the teacher and student networks, respectively.
Over-merge and over-segmentation in the results of the student network are highlighted by red and white boxes, respectively.
student network. These attention maps highlight the most relevant regions of the input image for the task at hand, providing guidance to the student network during training.
2) Similarity Preserving Knowledge Distillation (SPKD) [14]: This method focuses on maintaining similarity between the intermediate feature maps of the teacher and student networks. By minimizing the difference between these feature maps, the student network is encouraged to produce similar results to the teacher network.
3) Review Knowledge Distillation (ReKD) [17]: This method adopts a novel review mechanism for knowledge distillation, which utilizes the multi-level information from the
TABLE III
QUANTITATIVE COMPARISON OF DIFFERENT KNOWLEDGE DISTILLATION METHODS ON 3D BIOMEDICAL INSTANCE DATASETS, WHERE WE USE THE 3D U-Net MALA AND ITS CORRESPONDING TINY VERSION AS THE TEACHER-STUDENT NETWORK PAIR. TWO POST-PROCESSING ALGORITHMS (WATERZ [49] AND LMC [29]) ARE ADOPTED TO GENERATE FINAL SEGMENTATION RESULTS. VOI/ARAND ARE ADOPTED AS METRICS.
T: MALA 1.296 / 0.115 1.261 / 0.110 0.853 / 0.132 0.846 / 0.132 1.653 / 0.129 1.503 / 0.091 1.522 / 0.123 1.618 / 0.205
S: MALA-tiny 1.649 / 0.122 1.565 / 0.122 1.098 / 0.182 0.961 / 0.147 2.037 / 0.171 1.782 / 0.120 2.085 / 0.241 1.733 / 0.203
AT [15] 1.496 / 0.119 1.469 / 0.115 1.068 / 0.176 0.905 / 0.132 1.961 / 0.165 1.774 / 0.155 1.805 / 0.151 1.691 / 0.226
SPKD [14] 1.463 / 0.115 1.444 / 0.113 0.962 / 0.150 0.895 / 0.140 1.785 / 0.150 1.716 / 0.117 1.750 / 0.163 1.674 / 0.227
ReKD [17] 1.428 / 0.115 1.385 / 0.109 0.932 / 0.149 0.879 / 0.135 1.887 / 0.148 1.655 / 0.115 1.649 / 0.126 1.684 / 0.199
BISKD [18] 1.384 / 0.120 1.334 / 0.116 0.892 / 0.139 0.856 / 0.136 1.739 / 0.140 1.598 / 0.113 1.595 / 0.119 1.567 / 0.159
Ours 1.320 / 0.108 1.279 / 0.103 0.853 / 0.138 0.821 / 0.135 1.524 / 0.100 1.542 / 0.127 1.568 / 0.125 1.470 / 0.102
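As a reading aid for the table above, the sketch below computes two derived quantities from the first VOI entry of the T, S, and Ours rows. Both formulas are assumptions, since the text does not define how its percentages are obtained: "relative improvement" is taken as the fractional error reduction of the distilled student over the plain student, and "gap closed" as the fraction of the teacher-student gap recovered by distillation. Results obtained this way may therefore differ slightly from the percentages quoted in the text.

```python
def relative_improvement(student, distilled):
    """Fractional error reduction of the distilled student (lower VOI is better)."""
    return (student - distilled) / student


def gap_closed(teacher, student, distilled):
    """Fraction of the teacher-student error gap recovered by distillation."""
    return (student - distilled) / (student - teacher)


# First VOI entry of rows "T: MALA", "S: MALA-tiny", and "Ours".
teacher_voi, student_voi, ours_voi = 1.296, 1.649, 1.320

print(f"relative improvement: {relative_improvement(student_voi, ours_voi):.1%}")
print(f"gap closed: {gap_closed(teacher_voi, student_voi, ours_voi):.1%}")
```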
[Figure 3 panels. Rows: CREMI-C, AC3/4. Columns: Raw Image, GroundTruth, Teacher, Student, Student+AT, Student+SPKD, Student+ReKD, Student+BISKD, Student+Ours]
Fig. 3. 2D visual comparisons of segmentation results on the CREMI-C and AC3/4 dataset.
teacher network to guide the one-level feature learning of the student network.
4) Biomedical Instance Segmentation Knowledge Distillation (BISKD): This is our preliminary work [18] tailored for biomedical instance segmentation.
B. Results on 2D Datasets
We demonstrate the effectiveness of our knowledge distillation method on three 2D biomedical datasets CVPPP, C.elegans, and BBBC039V1. From the results in Table II, we can observe that:
(1) Our proposed method consistently outperforms existing distillation methods and significantly reduces the performance gap between student and teacher networks in various experimental results. Compared to the second best distillation methods, on the CVPPP dataset, the ResUNet-tiny and MobileNet student networks achieve improvements of 5.9% and 19.6% for the SBD metric. On the C.elegans dataset, the improvements for the (AJI, Dice, F1, PQ) metrics are (4.8%, 5.6%, 1.9%, 11.0%) and (27.1%, 33.1%, 6.8%, 53.6%) respectively. On the BBBC039V1 dataset, the improvements are (2.2%, 1.1%, 0.4%, 3.6%) and (5.3%, 3.4%, 1.0%, 6.8%) respectively. Additionally, our method reduces the performance gap between ResUNet and ResUNet-tiny networks by 71.6% and 84.4% for the SBD metric on the CVPPP dataset, and (40.7%, 59.5%, 56.7%, 68.2%) and (46.3%, 74.5%, 37.8%, 57.7%) for the (AJI, Dice, F1, PQ) metrics on the C.elegans dataset. On the BBBC039V1 dataset, the improvements are (37.5%, 82.1%, 66.7%, 73.7%) and (21.0%, 64.7%, 12.0%, 26.0%).
(2) Our knowledge distillation method proves to be highly effective even when dealing with teacher-student network pairs that have significantly different network structures, such as experimental settings with MobileNet as the student network. This highlights the versatility of our method and demonstrates its ability to reduce the performance gap between such teacher and student networks.
(3) Baseline methods AT, SPKD, and ReKD ignore the key knowledge of instance-level features and instance relations, which hinders their ability to guide the student network in enlarging the difference between adjacent instances and reducing the feature variance of pixels within the same instance. This limitation often leads to significant over-merging and over-segmentation. Furthermore, these baseline methods neglect the importance of instance boundary structure knowledge, which leads to additional segmentation errors and coarse boundaries.
(4) Our preliminary work BISKD only focuses on individual input images and neglects inter-image semantic instance relations. This limits the effectiveness of the knowledge transfer process and leads to suboptimal segmentation results.
In addition to the quantitative results, we conduct visual comparisons between the segmentation results of our proposed distillation method and those of the baseline methods on challenging cases, as depicted in Fig. 2. These visual comparisons clearly demonstrate the superiority of our distillation method in terms of segmentation performance. Notably, our method
[Figure 4 panels. Rows: CREMI-C, AC3/4]
Fig. 4. 3D visual comparisons on the CREMI-C and AC3/4 dataset. Red and black arrows indicate over-segmentation and over-merge, respectively.
C. Results on 3D Datasets
We compare various knowledge distillation methods for the 3D U-Net MALA on the AC3/4 dataset and three CREMI subvolumes, as presented in Table III. Our proposed method consistently outperforms the competing distillation methods, exhibiting statistically significant improvements in the majority of experiments. Specifically, our method achieves a remarkable reduction in the performance gap between the student and teacher networks, with reductions exceeding 93.3% for the key VOI metric on the AC3/4 dataset and over 72.9% for the VOI metric on the CREMI datasets. The student network demonstrates substantial improvements of 20.0%, 22.7%, 25.5%, and 24.9% for the VOI metric on the AC3/4, CREMI-A, CREMI-B, and CREMI-C datasets, respectively. We present the 2D visual comparison in Fig. 3, showcasing the superior performance of our method in enabling the student network to accurately distinguish instances and address over-segmentation and over-merge errors. Additionally, the 3D visual comparison in Fig. 4 highlights the distinct advantage of our proposed method in preserving the accuracy of neuron structures compared to existing methods.
1) Analysis on visualized embeddings: To facilitate a comprehensive analysis of the functionality of the proposed knowledge distillation method, we present a visualization of the embeddings generated by the student networks, which have been distilled using various distillation methods. To achieve this, we employ the PCA technique to project the embeddings from a high-dimensional space onto a 3-dimensional RGB color space in Fig. 5. Based on the visual results, we have made three observations:
(1) The embeddings predicted by the student model may not adequately capture the relation between adjacent instances, leading to embeddings of neighboring instances having similar RGB color, i.e., similar feature representation. Furthermore, the instance boundary regions in the embedding map appear blurred and lack accurate structural information.
(2) When compared to baseline methods, the visualized embeddings obtained from the student network using our knowledge distillation method exhibit more distinct color differences among adjacent instances. Additionally, the instance boundary regions demonstrate clear and accurate structures. These observations indicate that our proposed IGD and AGD schemes effectively facilitate the student network in learning instance relations in the feature space and capturing pixel-level boundary structure information.
(3) In comparison to the visualized embeddings from the ‘Student+BISKD’ approach, the embeddings from the ‘Student+Ours’ method exhibit purer colors within each instance area. This observation confirms the importance of considering cross-image relations.
[Figure 5 panel labels (partially recovered): Student+SPKD, Student+ReKD, Student+BISKD, Student+Ours]
Fig. 5. A visual example of different embedding maps predicted by student networks distilled with different knowledge distillation methods. We use networks ResUNet (T1) and MobileNet (S2) as the teacher and student networks, respectively.
D. Ablation Study
1) Effectiveness of different distillation components: To verify the effectiveness of the distillation components of our method, we conduct the ablation study on the proposed IGD
Fig. 6. Ablation study on the queue size K and sampling number L. Experiments are performed for the teacher-student network pair of ResUNet and MobileNet networks on the CVPPP dataset. ‘Memory Cost’ denotes the occupied GPU memory size (MB).

TABLE IV
An ablation study is conducted on the CVPPP dataset to [...] ResUNet and MobileNet networks. The check mark and cross mark indicate the usage and non-usage of this component, respectively.

L^Intra_Node   L^Intra_Edge   L^Intra_AGD   L^Inter_Edge   L^Inter_AGD   SBD ↑   |DiC| ↓
✗              ✗              ✗             ✗              ✗             71.9    5.00
✓              ✗              ✗             ✗              ✗             76.5    2.45
✓              ✓              ✗             ✗              ✗             79.4    2.50
✓              ✓              ✓             ✗              ✗             84.7    1.40
✓              ✓              ✓             ✓              ✗             85.2    1.35
✓              ✓              ✓             ✓              ✓             86.0    1.10
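To make the contribution of each component in Table IV easier to read, the per-step SBD gain can be computed directly from the reported rows. The snippet below is only a worked arithmetic sketch: the numbers are copied from Table IV, while the row labels and the helper name `incremental_gains` are shorthand introduced here.

```python
# (label, SBD, |DiC|) copied from Table IV, top to bottom.
# Labels are shorthand for "the loss term added in this row".
rows = [
    ("no distillation loss", 71.9, 5.00),
    ("+ L^Intra_Node", 76.5, 2.45),
    ("+ L^Intra_Edge", 79.4, 2.50),
    ("+ L^Intra_AGD", 84.7, 1.40),
    ("+ L^Inter_Edge", 85.2, 1.35),
    ("+ L^Inter_AGD", 86.0, 1.10),
]


def incremental_gains(table_rows):
    """Return (label, SBD gain over the previous row) for each added component."""
    gains = []
    for (_, prev_sbd, _), (label, sbd, _) in zip(table_rows, table_rows[1:]):
        gains.append((label, round(sbd - prev_sbd, 1)))
    return gains


if __name__ == "__main__":
    for label, gain in incremental_gains(rows):
        print(f"{label:>16}: {gain:+.1f} SBD")
    total = round(rows[-1][1] - rows[0][1], 1)
    print(f"{'total':>16}: {total:+.1f} SBD")
```

Read this way, the three intra-image terms account for 4.6, 2.9 and 5.3 SBD points, and the two inter-image terms add a further 0.5 and 0.8 points, for a total gain of 14.1 points over the first row.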
TABLE V
Ablation study on loss weight hyperparameters on the CVPPP dataset. We adopt the same hyperparameters for all experimental settings and use networks ResUNet (T1) and MobileNet (S2) as the teacher and student networks for analysis.

λ1     λ2     λ3    λ4    λ5    SBD ↑   |DiC| ↓
0.1    0.1    10    1     1     86.0    1.10
1      0.1    10    1     1     85.3    1.10
0.01   0.1    10    1     1     85.6    1.30
0.1    1      10    1     1     85.8    1.35
0.1    0.01   10    1     1     85.2    1.35
0.1    0.1    100   1     1     84.9    1.20
0.1    0.1    1     1     1     84.7    1.25
0.1    0.1    10    0.1   1     84.9    1.15
0.1    0.1    10    10    1     84.8    1.45
0.1    0.1    10    1     0.1   85.3    0.80
0.1    0.1    10    1     10    84.9    1.30

TABLE VI
Ablation study of the distillation performance on a series of small models obtained by reducing the number of channels of each network layer in different ratios. S1/N represents the student networks obtained by reducing the number of channels of the teacher network ResUNet by 1/N.

ResUNet         SBD ↑   |DiC| ↓   #Params (M)   FLOPs (GMAC)
S1/20 w/o KD    74.6    3.95      0.07          1.50
S1/20 w/ KD     80.9    2.25      0.07          1.50
S1/15 w/o KD    78.7    3.10      0.17          3.28
S1/15 w/ KD     84.2    1.85      0.17          3.28
S1/10 w/o KD    81.9    2.15      0.30          5.76
S1/10 w/ KD     87.0    1.15      0.30          5.76
S1/5 w/o KD     85.1    1.50      0.90          17.31
S1/5 w/ KD      87.6    1.25      0.90          17.31
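The effect of distillation across student widths in Table VI can be summarized by the SBD gain of each student with and without KD. The following is a small worked sketch using only the SBD values reported in Table VI; the dictionary layout and the helper name `kd_gains` are assumptions made for illustration.

```python
# width -> (SBD without KD, SBD with KD), copied from Table VI.
table_vi = {
    "S1/20": (74.6, 80.9),
    "S1/15": (78.7, 84.2),
    "S1/10": (81.9, 87.0),
    "S1/5": (85.1, 87.6),
}


def kd_gains(table):
    """Return the SBD improvement brought by distillation for each student."""
    return {name: round(with_kd - without_kd, 1)
            for name, (without_kd, with_kd) in table.items()}


if __name__ == "__main__":
    for name, gain in kd_gains(table_vi).items():
        print(f"{name:>6}: {gain:+.1f} SBD")
```

The gains come out to 6.3, 5.5, 5.1 and 2.5 SBD points from the narrowest to the widest student, so the student that is already strongest without KD benefits the least.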
4) Student networks with different widths: We conduct an ablation study on reduced-size models to evaluate the effectiveness of distillation. The models are created by reducing the number of channels in each layer of the networks. Specifically, we generate student networks with widths reduced to approximately 1/20, 1/15, 1/10, and 1/5 of the original width. The experiments utilize ResUNet network pairs on the CVPPP dataset. The results in Tab. VI demonstrate that our knowledge distillation method improves the performance of all student networks, even when they have very few parameters. However, it is important to note that the effectiveness of knowledge distillation depends on the initial performance gap between the teacher and student networks. When this gap is minimal, achieving significant improvements becomes challenging.

VI. CONCLUSION

In this paper, we propose a novel graph relation distillation approach for biomedical instance segmentation that effectively transfers instance-level features, instance relations, and pixel-level boundaries from a heavy teacher network to a lightweight student network, through a unique combination of instance graph distillation and affinity graph distillation schemes. Furthermore, we extend these two schemes beyond the intra-image level to the inter-image level by incorporating a memory bank mechanism, which captures the global relation information across different input images. Experimental results on both 2D and 3D biomedical datasets demonstrate that our method surpasses existing distillation methods and effectively bridges the performance gap between the heavy teacher networks and their corresponding lightweight student networks.

REFERENCES

[1] H. Chen, X. Qi, L. Yu, and P.-A. Heng, “Dcan: deep contour-aware networks for accurate gland segmentation,” in CVPR, 2016.
[2] M. Li, C. Chen, X. Liu, W. Huang, Y. Zhang, and Z. Xiong, “Advanced deep networks for 3d mitochondria instance segmentation,” in ISBI. IEEE, 2022, pp. 1–5.
[3] N. Kumar, R. Verma, S. Sharma, S. Bhargava, A. Vahadane, and A. Sethi, “A dataset and a technique for generalized nuclear segmentation for computational pathology,” IEEE Trans. Med. Imag., vol. 36, no. 7, pp. 1550–1560, 2017.
[4] Z. Song, P. Wang, J. Zhou, Z. Yang, Y. Yang, Z. Gong, and N. Zheng, “Muscleparsenet: a novel framework for parsing muscles of drosophila larva in light-sheet fluorescence microscopy images,” IEEE Trans. Circuits Syst. Video Technol., 2023.
[5] K. He, G. Gkioxari, P. Dollár, and R. Girshick, “Mask r-cnn,” in Proc. Int. Conf. Comput. Vis., 2017, pp. 2961–2969.
[6] D. Liu, D. Zhang, Y. Song, C. Zhang, F. Zhang, L. O’Donnell, and W. Cai, “Nuclei segmentation via a deep panoptic model with semantic feature fusion,” in IJCAI, 2019, pp. 861–868.
[7] D. Zhang, Y. Song, D. Liu, H. Jia, S. Liu, Y. Xia, H. Huang, and W. Cai, “Panoptic segmentation with an end-to-end cell r-cnn for pathology image analysis,” in MICCAI. Springer, 2018, pp. 237–244.
[8] L. Chen, M. Strauch, and D. Merhof, “Instance segmentation of biomedical images with an object-aware embedding learned with local constraints,” in MICCAI. Springer, 2019, pp. 451–459.
[9] V. Kulikov and V. Lempitsky, “Instance segmentation of biological images using harmonic embeddings,” in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2020, pp. 3843–3851.
[10] K. Lee, R. Lu, K. Luther, and H. S. Seung, “Learning and segmenting dense voxel embeddings for 3d neuron reconstruction,” IEEE Trans. Med. Imag., vol. 40, no. 12, pp. 3801–3811, 2021.
[11] C. Payer, D. Štern, T. Neff, H. Bischof, and M. Urschler, “Instance segmentation and tracking with cosine embeddings and recurrent hourglass networks,” in MICCAI. Springer, 2018, pp. 3–11.
[12] G. Hinton, O. Vinyals, and J. Dean, “Distilling the knowledge in a neural network,” arXiv preprint arXiv:1503.02531, 2015.
[13] A. Romero, N. Ballas, S. E. Kahou, A. Chassang, C. Gatta, and Y. Bengio, “Fitnets: Hints for thin deep nets,” arXiv preprint arXiv:1412.6550, 2014.
[14] F. Tung and G. Mori, “Similarity-preserving knowledge distillation,” in Proc. Int. Conf. Comput. Vis., 2019, pp. 1365–1374.
[15] S. Zagoruyko and N. Komodakis, “Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer,” arXiv preprint arXiv:1612.03928, 2016.
[16] D. Qin, J.-J. Bu, Z. Liu, X. Shen, S. Zhou, J.-J. Gu, Z.-H. Wang, L. Wu, and H.-F. Dai, “Efficient medical image segmentation based on knowledge distillation,” IEEE Trans. Med. Imag., vol. 40, no. 12, pp. 3820–3831, 2021.
[17] P. Chen, S. Liu, H. Zhao, and J. Jia, “Distilling knowledge via knowledge review,” in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2021, pp. 5008–5017.
[18] X. Liu, B. Hu, W. Huang, Y. Zhang, and Z. Xiong, “Efficient biomedical instance segmentation via knowledge distillation,” in MICCAI. Springer, 2022, pp. 14–24.
[19] D. Liu, D. Zhang, Y. Song, H. Huang, and W. Cai, “Panoptic feature fusion net: a novel instance segmentation paradigm for biomedical and biological images,” IEEE Trans. Image Process., vol. 30, pp. 2045–2059, 2021.
[20] A. Kirillov, R. Girshick, K. He, and P. Dollár, “Panoptic feature pyramid networks,” in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2019, pp. 6399–6408.
[21] J. Yi, P. Wu, M. Jiang, Q. Huang, D. J. Hoeppner, and D. N. Metaxas, “Attentive neural cell instance segmentation,” Medical image analysis, vol. 55, pp. 228–240, 2019.
[22] R. Girshick, “Fast r-cnn,” in Proc. Int. Conf. Comput. Vis., 2015, pp. 1440–1448.
[23] X. Zhang, H. Li, F. Meng, Z. Song, and L. Xu, “Segmenting beyond the bounding box for instance segmentation,” IEEE Trans. Circuits Syst. Video Technol., vol. 32, no. 2, pp. 704–714, 2021.
[24] H. Zhang, Y. Tian, K. Wang, W. Zhang, and F.-Y. Wang, “Mask ssd: An effective single-stage approach to object instance segmentation,” IEEE Trans. Circuits Syst. Video Technol., vol. 29, pp. 2078–2093, 2019.
[25] L. Yang, H. Li, F. Meng, Q. Wu, and K. N. Ngan, “Task-specific loss for robust instance segmentation with noisy class labels,” IEEE Trans. Circuits Syst. Video Technol., 2021.
[26] B. De Brabandere, D. Neven, and L. Van Gool, “Semantic instance segmentation with a discriminative loss function,” arXiv preprint arXiv:1708.02551, 2017.
[27] M. Lalit, P. Tomancak, and F. Jug, “Embedseg: Embedding-based instance segmentation for biomedical microscopy data,” Medical image analysis, vol. 81, p. 102523, 2022.
[28] J.-H. Shi, Q. Zhang, Y.-H. Tang, and Z.-Q. Zhang, “Polyp-mixer: An efficient context-aware mlp-based paradigm for polyp segmentation,” IEEE Trans. Circuits Syst. Video Technol., vol. 33, no. 1, pp. 30–42, 2022.
[29] T. Beier, C. Pape, N. Rahaman, T. Prange, S. Berg, D. D. Bock, A. Cardona, G. W. Knott, S. M. Plaza, L. K. Scheffer et al., “Multicut brings automated neurite segmentation closer to human performance,” Nature methods, vol. 14, no. 2, pp. 101–102, 2017.
[30] K. Fukunaga and L. Hostetler, “The estimation of the gradient of a density function, with applications in pattern recognition,” IEEE Trans. Inf. Theory, vol. 21, no. 1, pp. 32–40, 1975.
[31] D. Comaniciu and P. Meer, “Mean shift: A robust approach toward feature space analysis,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 24, no. 5, pp. 603–619, 2002.
[32] R. J. Campello, D. Moulavi, and J. Sander, “Density-based clustering based on hierarchical density estimates,” in Pacific-Asia conference on knowledge discovery and data mining. Springer, 2013, pp. 160–172.
[33] A. Wolny, Q. Yu, C. Pape, and A. Kreshuk, “Sparse object-level supervision for instance segmentation with pixel embeddings,” in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2022, pp. 4402–4411.
[34] M. Lalit, P. Tomancak, and F. Jug, “Embedding-based instance segmentation in microscopy,” in MIDL, 2021.
[35] C. Payer, D. Štern, M. Feiner, H. Bischof, and M. Urschler, “Segmenting and tracking cell instances with cosine embeddings and recurrent hourglass networks,” Medical image analysis, vol. 57, pp. 106–119, 2019.
[36] W. Huang, S. Deng, C. Chen, X. Fu, and Z. Xiong, “Learning to model pixel-embedded affinity for homogeneous instance segmentation,” in AAAI, vol. 36, no. 1, 2022, pp. 1007–1015.
[37] X. Liu, W. Huang, Y. Zhang, and Z. Xiong, “Biological instance segmentation with a superpixel-guided graph,” in IJCAI, 2022.