Graph Relation Distillation for Efficient Biomedical Instance Segmentation

Xiaoyu Liu, Yueyi Zhang, Zhiwei Xiong, Wei Huang, Bo Hu, Xiaoyan Sun and Feng Wu

arXiv:2401.06370v1 [cs.CV] 12 Jan 2024

The authors are with the University of Science and Technology of China and the Institute of Artificial Intelligence, Hefei Comprehensive National Science Center (e-mail: {liuxyu, weih527, hubosist}@mail.ustc.edu.cn; {zhyuey, zwxiong, sunxiaoyan, fengwu}@ustc.edu.cn).

Abstract—Instance-aware embeddings predicted by deep neural networks have revolutionized biomedical instance segmentation, but their resource requirements are substantial. Knowledge distillation offers a solution by transferring distilled knowledge from heavy teacher networks to lightweight yet high-performance student networks. However, existing knowledge distillation methods struggle to extract knowledge for distinguishing instances and overlook global relation information. To address these challenges, we propose a graph relation distillation approach for efficient biomedical instance segmentation, which considers three essential types of knowledge: instance-level features, instance relations, and pixel-level boundaries. We introduce two graph distillation schemes deployed at both the intra-image level and the inter-image level: instance graph distillation (IGD) and affinity graph distillation (AGD). IGD constructs a graph representing instance features and relations, transferring these two types of knowledge by enforcing instance graph consistency. AGD constructs an affinity graph representing pixel relations to capture structured knowledge of instance boundaries, transferring boundary-related knowledge by ensuring pixel affinity consistency. Experimental results on a number of biomedical datasets validate the effectiveness of our approach, enabling student models with less than 1% of the parameters and less than 10% of the inference time of their teacher models while achieving promising performance. The code is available at https://github.com/liuxy1103/GRDBIS.

Index Terms—Biomedical Instance Segmentation, Pixel Embeddings, Knowledge Distillation

I. INTRODUCTION

BIOMEDICAL instance segmentation is a crucial and highly challenging task in the field of biomedical image analysis. Its objective is to assign a unique identification to each instance in the entire image. The two most common families of approaches are based on either semantic segmentation of instance contours [1]–[4] or object detection [5]–[7]. These methods often struggle when faced with unclear boundaries, densely distributed instances, and significant occlusions. Recent advancements [8]–[11] in this field employ convolutional neural networks (CNNs) to predict instance-aware embedding vectors that are unaffected by object morphology; post-processing algorithms are then employed to cluster pixel embeddings into instances. Nonetheless, these methods heavily rely on complex models that demand substantial computational resources to capture detailed instance-aware features for pixel-level dense estimation. Such resource-intensive models are impractical for real-world scenarios. Therefore, the development of simplified networks is of paramount importance, particularly in the case of 3D CNNs. Consequently, there exists a trade-off between model simplification for speed and maintaining optimal performance, as simplified models often compromise accuracy.

Knowledge distillation has emerged as a highly promising approach to reduce computational costs while maintaining satisfactory performance [12]–[15]. Through knowledge distillation, it becomes possible to train a lightweight student network that achieves high performance by leveraging effective knowledge distilled from a well-trained and computationally intensive teacher network. Existing knowledge distillation methods have been primarily designed for image-level classification and semantic segmentation tasks. For instance, feature distillation is used in [15], which transfers attention maps distilled from mid-level feature maps of the teacher network to the student network. Logit distillation proposed in [16] distills both the output logits and semantic region information from the teacher network to guide the student network. These advancements demonstrate the potential of knowledge distillation to enhance various aspects of model training and to transfer valuable insights from heavy models to lightweight counterparts.

However, applying existing knowledge distillation methods directly to instance segmentation encounters two main challenges. Firstly, instance-level features and the relations between instances play a crucial role in distinguishing between different instances based on their feature space distances. Unfortunately, current methods often overlook this vital information. Moreover, unlike the straightforward logit distillation employed in semantic segmentation, instance segmentation demands more sophisticated techniques to distill structural information of instance boundaries from feature maps. While a general distillation method with a review mechanism involving multiple feature maps is proposed in [17] for multiple vision tasks, it encounters difficulties in learning knowledge that contains redundant information, primarily due to the limited ability of lightweight networks to adequately attend to features at each pixel location. As a result, knowledge distillation methods tailored to the unique challenges of biomedical instance segmentation remain rarely explored.

Secondly, existing knowledge distillation methods primarily focus on extracting valuable knowledge from individual input images to guide a student network. Unfortunately, they tend to neglect the essential inter-image instance relations at both the pixel level and the instance level, hindering effective knowledge transfer. It is important to highlight that global relations across different input images encompass valuable instance
structural information. Incorporating these global relations becomes instrumental in constructing a well-structured feature space and attaining more precise instance segmentation results.

In this paper, we propose a novel graph relation distillation method tailored for biomedical instance segmentation, which addresses the challenges faced by existing distillation methods. To tackle the first challenge, we introduce two distillation schemes to extract crucial knowledge for instance segmentation, including instance-level features, instance relations, and pixel-level boundaries. The first scheme, instance graph distillation (IGD), constructs an instance graph using the central embeddings of the corresponding instances as nodes and the feature similarities between nodes as edges. By enforcing graph consistency, IGD effectively transfers knowledge of instance features and relations. The second scheme, affinity graph distillation (AGD), regards each pixel embedding as a graph node and converts pixel embeddings into a structured affinity graph by calculating the distances between pixel embeddings, which is used to mitigate boundary ambiguity in the lightweight student network. AGD ensures affinity graph consistency between the teacher and student networks, facilitating boundary-related knowledge transfer. By employing the above two distillation schemes, our approach enhances the performance of student networks in biomedical instance segmentation.

To address the second challenge, we extend the IGD and AGD schemes to capture global structural information at the inter-image level. Specifically, we construct instance graphs and affinity graphs by considering relations between instances and pixel embeddings from different input images, respectively. To fully explore the graph relations between different input images, we need to increase the batch size of the network by including as many input samples as possible. Under the constraint of limited GPU memory, we introduce a memory bank mechanism to store as many past predicted feature maps as possible. This enables us to calculate relations between the current input image and images sampled from the memory bank, effectively capturing long-range inter-image relations. Overall, our approach offers a practical solution for distilling knowledge vital to biomedical instance segmentation, addressing the above limitations and improving performance.

Extensive experimental results show that our knowledge distillation approach greatly benefits the lightweight network, leading to significant improvements in performance while maintaining efficiency during inference. The student networks trained using our approach achieve promising performance with less than 1% of the parameters and less than 10% of the inference time compared to the teacher networks.

The contributions of this paper are summarized as follows:
• We propose a graph relation distillation method tailored for biomedical instance segmentation to obtain efficient and high-performance networks.
• We propose an IGD scheme to force the student network to mimic the instance-level features and instance relations of the teacher network via an instance graph, along with an AGD scheme for pixel-level boundary knowledge transfer.
• We extend the IGD and AGD schemes from the intra-image level to the inter-image level by introducing a memory bank mechanism to capture the global relations across different input images.
• The superiority of our knowledge distillation approach over existing methods is demonstrated on three 2D biomedical datasets and two 3D biomedical datasets.

This work is a substantial extension of our preliminary work [18] in the following aspects:
• We distill global relation information across different input images using an inter-image instance graph and inter-image affinity. This enables the student network to learn from a broader range of instance features and boundary structures and improves its ability to handle complex instances.
• We verify the effectiveness of our proposed method for more teacher-student network pairs with different architectures, demonstrating its versatility and robustness.
• We conduct extensive experiments on biomedical instance segmentation datasets with different modalities and multiple categories, further demonstrating the effectiveness and generalizability of our approach.

II. RELATED WORKS

A. Biomedical Instance Segmentation

Deep learning-based instance segmentation methods for biomedical images can be classified into two main categories: proposal-free and proposal-based methods.

Proposal-based methods [5]–[7], [19]–[21] utilize object detection [22]–[25] and object segmentation heads to predict bounding boxes and foreground masks for each object, respectively. However, these methods heavily rely on accurate bounding box predictions, which may fail to differentiate adjacent instances due to their overlap. Moreover, the sizes of instances in images may exceed the receptive field of the model, making it challenging to locate complete instances using bounding boxes.

Proposal-free methods [9], [11], [26], [27] predict specially designed instance-aware features and morphology properties that can encode morphology characteristics [28], structures, and spatial arrangement, and then cluster the predicted mid-representations into instances using a post-processing algorithm [29]–[32]. Pixel embedding-based methods [33]–[37] excel in encoding each pixel of an image into a high-dimensional feature space, facilitating the grouping of similar pixels to form distinct instance regions. These methods exhibit exceptional performance in tackling complex scenes with overlapping and crowded objects, making them a popular choice for biomedical instance segmentation and other applications.

However, the computational and memory demands of pixel embedding-based methods hinder their practical use. Knowledge distillation offers a solution by transferring knowledge from a large, complex model (the teacher) to a smaller, simpler model (the student). This compression process effectively preserves the teacher network's knowledge in the student network, maintaining comparable performance.
Fig. 1. Workflow of our proposed graph relation distillation method for biomedical instance segmentation, which includes two schemes. The instance graph distillation (IGD) scheme constructs instance graphs from embeddings of the teacher-student network pair and enforces consistency between the graphs constructed by the teacher and the student, while the affinity graph distillation (AGD) scheme converts pixel embeddings into pixel affinities that encode structured information of instance boundaries and enforces the student model to generate affinities similar to those of its teacher model. These two schemes take charge of the knowledge distillation mechanism and are carried out at both the intra-image and inter-image levels to capture global instance relations. The symbol ⊙ represents the dot product operation. The red arrow indicates the loss function.

By applying knowledge distillation to pixel embedding-based segmentation models, we can significantly reduce their computational and storage requirements while preserving their exceptional segmentation performance.

B. Knowledge Distillation

The goal of knowledge distillation is to transfer knowledge from a computationally expensive but powerful teacher network to a lightweight student network, thereby enhancing the latter's performance while preserving its efficiency. Related work on knowledge distillation can be broadly classified into several categories, depending on the specific methods used for distillation.

One approach is to distill knowledge from a larger teacher network to a smaller student network. This can be done by directly transferring the output probabilities of the teacher network to the student network [12], or by transferring intermediate representations from the teacher network to the student network [13]. Other methods use attention mechanisms or gating functions to allow the student network to selectively focus on the most important information from the teacher network [15]. Another approach is to use ensemble models as teachers to transfer knowledge to a single student network. This can be done by using the logits or probabilities output by the ensemble as soft labels for the student network [38], [39], or by transferring feature maps or attention maps from the ensemble to the student network [40]. In addition to these methods for general tasks, there are also methods for task-specific knowledge distillation, such as for image classification [41]–[43], semantic segmentation [16], [44], object detection [45], and speech recognition [46].

However, there is currently no knowledge distillation method tailored for biomedical instance segmentation, which is a challenging task due to the complexity and heterogeneity of biomedical images, with instances varying significantly in size, shape, and distribution. The proposed method is therefore a novel application of knowledge distillation to address the challenges of biomedical instance segmentation. Existing works [41], [45] consider relation distillation for image classification and object detection by constructing a graph. In this paper, we extend this idea to biomedical instance segmentation: we construct an instance graph and an affinity graph from the predicted pixel embeddings and incorporate cross-image relations by leveraging relevant domain knowledge. By doing so, we aim to improve the performance of biomedical instance segmentation and enhance the understanding of inter-instance relations and pixel relations within the biomedical domain.

III. METHODOLOGY

The workflow of our proposed distillation method is presented in Fig. 1 and can be applied to both 2D and 3D networks for images and volumes. We illustrate it using a 2D image example for easy visualization and description. The method involves a heavy teacher network T and a lightweight student network S, which both predict a set of feature maps,
i.e., an embedding map E ∈ R^{D×H×W} for an input image of size H × W, where D is the dimension of the embedding vectors. The embedding vector of a pixel p is denoted as e_p ∈ R^D and can be clustered into instances through post-processing. Given M training images, the segmentation network can extract M embedding maps {E_m ∈ R^{D×H×W}}_{m=1}^{M}. Two specially designed schemes, instance graph distillation (IGD) and affinity graph distillation (AGD), are employed to distill effective knowledge from the embedding maps at both the intra-image and inter-image levels. More details of our proposed method are provided below.

A. Instance Graph Distillation

Embeddings of pixels p ∈ S^i belonging to the same instance i and located in the corresponding area S^i exhibit similarity, while the embeddings of pixels belonging to different instances demonstrate dissimilarity. This ensures that the different instances i ∈ I of an input image are distinguished in the feature space. Therefore, the distribution of embeddings in the feature space contains valuable knowledge related to instance-level features and instance relations. To transfer this key knowledge, we propose an instance graph distillation scheme.

1) Intra-Image Distillation: To effectively distill this knowledge at the intra-image level, we construct an intra-image instance graph that encodes the knowledge of instance-level features and instance relations by nodes and edges, respectively. The nodes are extracted from pixel embeddings of the embedding map with the guidance of labeled instance masks, which provide precise areas to calculate the instance central features, denoted as

  v_i = \frac{1}{|S^i|} \sum_{p \in S^i} e_p.   (1)

The edges are defined as the cosine distance between two nodes, denoted as

  \varepsilon_{ij} = \mathrm{Cos}(v_i, v_j) = \frac{v_i^\top \cdot v_j}{\|v_i\| \, \|v_j\|},   (2)

where instances i, j ∈ I, i ≠ j, ⊤ represents the vector transpose operation, and Cos is a function calculating the cosine similarity between two vectors.

We then force the instance graph of the student network to be consistent with the instance graph of the teacher network. The distillation loss of this scheme, L_{IGD}^{Intra}, can be divided into two parts, respectively related to nodes and edges:

  L_{IGD}^{Intra} = \lambda_1 L_{Node}^{Intra} + \lambda_2 L_{Edge}^{Intra},
  L_{Node}^{Intra} = \frac{1}{|I|} \sum_{i \in I} \left\| (v_i)^S - (v_i)^T \right\|_2,   (3)
  L_{Edge}^{Intra} = \frac{1}{|I|^2} \sum_{i \in I} \sum_{j \in I} \left\| (\varepsilon_{ij})^S - (\varepsilon_{ij})^T \right\|_2,

where λ_1 and λ_2 are weighting coefficients to balance the two terms, and the superscripts T and S represent the teacher and student networks.
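To make the intra-image IGD scheme concrete, below is a minimal PyTorch sketch of Eqs. (1)–(3). The helper names (`instance_graph`, `igd_intra_loss`) and tensor layouts are our own illustration under stated assumptions, not the released implementation.

```python
import torch
import torch.nn.functional as F

def instance_graph(emb, masks):
    """Build graph nodes and edges from one embedding map (Eqs. 1-2).

    emb:   (D, H, W) pixel embeddings of one image.
    masks: (I, H, W) boolean masks, one per labeled instance.
    """
    D = emb.shape[0]
    flat = emb.reshape(D, -1)                         # (D, H*W)
    m = masks.reshape(masks.shape[0], -1).float()     # (I, H*W)
    # Eq. (1): node v_i = mean embedding over the instance area S^i.
    nodes = (m @ flat.t()) / m.sum(dim=1, keepdim=True).clamp(min=1)
    # Eq. (2): edge eps_ij = cosine similarity between node features.
    n = F.normalize(nodes, dim=1)
    edges = n @ n.t()                                 # (I, I)
    return nodes, edges

def igd_intra_loss(emb_s, emb_t, masks, lam1=0.1, lam2=0.1):
    """Eq. (3): node and edge consistency between student and teacher."""
    v_s, e_s = instance_graph(emb_s, masks)
    v_t, e_t = instance_graph(emb_t.detach(), masks)  # teacher is frozen
    return lam1 * F.mse_loss(v_s, v_t) + lam2 * F.mse_loss(e_s, e_t)
```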

2) Inter-Image Distillation: With the goal of transferring the instance relations across different input images, we extend the above-mentioned instance graph distillation scheme to the inter-image level. Given the limitation of GPU memory, we follow [47], [48] to introduce a shared online feature map queue between the student and teacher networks, which stores a vast quantity of feature maps in a memory bank generated from the predictions of the teacher network in previous iterations. It allows us to retrieve abundant feature maps efficiently. The feature map queue can store K feature maps and is notated as {E_k ∈ R^{H×W×d}}_{k=1}^{K}. In each training iteration, we enqueue a batch of B feature maps to the memory bank and randomly sample L feature maps from it. Given the m-th input image from the training images, the segmentation network can predict the embedding map E_m ∈ R^{D×H×W}. Meanwhile, we can obtain L embedding maps {E_l ∈ R^{D×H×W}}_{l=1}^{L} sampled from the memory bank.

We then calculate the corresponding node features v_i^m and v_j^l extracted from these two feature maps E_m and E_l, where i ∈ I_m and j ∈ I_l represent different instances from the m-th input image and the l-th sampled image from the memory bank, respectively. The edge feature ε_{ij}^{ml} between two nodes v_i^m and v_j^l is calculated by the above-mentioned cosine distance. Given that the relations between instances within an image have already been leveraged by intra-image distillation, we build the inter-image instance graph by only connecting nodes from different input images. We enforce consistency between the two inter-image graphs respectively constructed from the student network and the teacher network by using an MSE loss function, formulated as follows:

  L_{Edge}^{Inter} = \frac{1}{L} \sum_{l=1}^{L} \frac{1}{|I_m| |I_l|} \sum_{i \in I_m} \sum_{j \in I_l} \left\| (\varepsilon_{ij}^{ml})^S - (\varepsilon_{ij}^{ml})^T \right\|_2.   (4)
1 X
LIntra
N ode = (vi )S − (vi )T 2 , E is converted into the affinity map A ∈ RN ×H×W , where
|I| (3) N represents the adjacent relations within N different pixel
i∈I
Intra 1 XX strides.
LEdge = 2 (εij )S − (εij )T 2
,
|I| i∈I j∈I To distill instance structure information at the intra-image
level from the teacher network to the student network, we
where λ1 and λ2 are weighting coefficients to balance the two align the affinity maps generated by the teacher and student
terms, and the superscripts T and S represent the teacher and networks. It can improve the ability of students to capture the
student networks. structure of instance boundaries. We denote the teacher and
student affinity maps as A^T and A^S, respectively, and then force A^S to be consistent with A^T:

  L_{AGD}^{Intra} = \left\| A^S - A^T \right\|_2 = \frac{1}{N \times H \times W} \sum_{n=1}^{N} \sum_{p=1}^{H \times W} \left\| a_{n,p}^S - a_{n,p}^T \right\|_2.   (5)

The affinity map converted from the embedding map of the last layer of the student network is also supervised by the affinity label \hat{A} derived from the ground-truth segmentation; we formulate this loss as

  L_{aff} = \left\| A^S - \hat{A} \right\|_2.   (6)
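A PyTorch sketch of the affinity conversion and of Eqs. (5)–(6) follows; it is our own simplified version in which the offset set is restricted to axis-aligned shifts and the wrap-around of `torch.roll` at image borders is ignored for brevity.

```python
import torch
import torch.nn.functional as F

def affinity_map(emb, strides=(1, 2, 4)):
    """Convert a (D, H, W) embedding map into an (N, H, W) affinity map.

    Each channel holds the cosine affinity a_{n,p} = Cos(e_p, e_{p+n})
    between a pixel and a neighbor shifted by n pixels along H or W.
    """
    e = F.normalize(emb, dim=0)              # unit-norm pixel embeddings
    affs = []
    for n in strides:
        for dim in (1, 2):                   # shift along H, then along W
            shifted = torch.roll(e, shifts=-n, dims=dim)
            affs.append((e * shifted).sum(dim=0))   # per-pixel cosine
    return torch.stack(affs)                 # (N, H, W)

def agd_intra_loss(emb_s, emb_t):
    """Eq. (5): align student and teacher intra-image affinity maps."""
    return F.mse_loss(affinity_map(emb_s), affinity_map(emb_t.detach()))

def affinity_supervision(emb_s, aff_gt):
    """Eq. (6): supervise student affinities with ground-truth affinities."""
    return F.mse_loss(affinity_map(emb_s), aff_gt)
```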
2) Inter-Image Distillation: To distill global instance structure information across different input images, we define inter-image affinity as the similarity between pixel embeddings from different input images. By comparing the pixel embeddings from different input images, we can infer the similarities and differences between the instances present in each image. This allows the student network to extract and analyze the global instance structure information. Similar to the inter-image graph distillation, we calculate the inter-image affinity between pixel embeddings e_i^m ∈ E_m and e_j^l ∈ E_l as a_{ij}^{ml} = Cos(e_i^m, e_j^l). The inter-image affinity map, denoted as A^{ml} ∈ R^{HW×HW}, is calculated by taking the dot product of the pixel embeddings of the two images, i.e., A^{ml} = E_m^⊤ E_l. This calculation captures inter-image pair-wise relations among all pixel embeddings. We guide the inter-image affinity map (A^{ml})^S produced by the student network to be consistent with (A^{ml})^T from the teacher network. It is formulated as

  L_{AGD}^{Inter} = \frac{1}{L} \sum_{l=1}^{L} \left\| (A^{ml})^S - (A^{ml})^T \right\|_2 = \frac{1}{L \times (H \times W)^2} \sum_{l=1}^{L} \sum_{i,j=1}^{H \times W} \left\| (a_{ij}^{ml})^S - (a_{ij}^{ml})^T \right\|_2.   (7)
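Eq. (7) compares full HW × HW cross-image affinity matrices. A direct sketch is given below (our naming; embeddings are L2-normalized so that the dot product equals the cosine affinity). For realistic image sizes the full matrix is extremely large, so in practice one would subsample pixels or operate on downsampled maps.

```python
import torch
import torch.nn.functional as F

def agd_inter_loss(emb_s, emb_t, bank_maps):
    """Eq. (7): match cross-image affinities A^{ml} = E_m^T E_l.

    emb_s / emb_t: (D, H, W) current-image embeddings (student/teacher).
    bank_maps:     list of (D, H, W) teacher embeddings from the queue.
    """
    es = F.normalize(emb_s.flatten(1), dim=0)           # (D, HW), unit columns
    et = F.normalize(emb_t.detach().flatten(1), dim=0)
    loss = 0.0
    for e_l in bank_maps:
        el = F.normalize(e_l.flatten(1), dim=0)
        a_s = es.t() @ el                               # (HW, HW) affinities
        a_t = et.t() @ el
        loss = loss + F.mse_loss(a_s, a_t)
    return loss / max(len(bank_maps), 1)
```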
C. Overall Optimization

Given a well-trained teacher network, the objective function integrates all knowledge distillation schemes mentioned above for training the student network. The total loss is formulated as

  L_{total} = L_{aff} + \underbrace{\lambda_1 L_{Node}^{Intra} + \lambda_2 L_{Edge}^{Intra}}_{L_{IGD}^{Intra}} + \lambda_3 L_{AGD}^{Intra} + \lambda_4 L_{Edge}^{Inter} + \lambda_5 L_{AGD}^{Inter},   (8)

where λ_1, λ_2, λ_3, λ_4, and λ_5 are empirically set to 0.1, 0.1, 10, 1, and 1 through experiments to balance these terms. During the inference phase, we employ standard post-processing algorithms to generate the ultimate instance-level segmentation results from the predicted embedding maps, following [36]. For 3D networks, we utilize well-established post-processing algorithms such as Waterz [49] and LMC [29]. For 2D networks, we rely on the Mutex algorithm [50] to generate the segmentation results.
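Putting the pieces together, a sketch of the total objective in Eq. (8) is given below, reusing the helper functions from the sketches above with the loss weights reported in the text, (λ_1, ..., λ_5) = (0.1, 0.1, 10, 1, 1); `aff_gt` is assumed to match the layout produced by `affinity_map`.

```python
import torch.nn.functional as F

def total_loss(emb_s, emb_t, masks, aff_gt, bank_nodes, bank_maps):
    """Eq. (8): supervised affinity loss plus the four distillation terms."""
    v_s, e_s = instance_graph(emb_s, masks)             # IGD sketch above
    v_t, e_t = instance_graph(emb_t.detach(), masks)
    loss = affinity_supervision(emb_s, aff_gt)          # L_aff, Eq. (6)
    loss = loss + 0.1 * F.mse_loss(v_s, v_t)            # intra-image node
    loss = loss + 0.1 * F.mse_loss(e_s, e_t)            # intra-image edge
    loss = loss + 10.0 * agd_intra_loss(emb_s, emb_t)   # intra-image AGD
    loss = loss + 1.0 * igd_inter_loss(v_s, v_t, bank_nodes)      # inter edge
    loss = loss + 1.0 * agd_inter_loss(emb_s, emb_t, bank_maps)   # inter AGD
    return loss
```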
TABLE I
Model complexity and inference time on test datasets for different teacher and student networks. The FLOPs and inference time are estimated with an input size of 544 × 544 on the CVPPP dataset for 2D backbones and with an input size of 84 × 268 × 268 on the CREMI dataset for 3D UNet-MALA backbones. The inference time is calculated as the sum over all input samples in the entire test set. The symbols 'T' and 'S' represent the teacher network and the student network, respectively.

  Models           #Params (M)   FLOPs (GMAC)   Infer Time (s)
  T: ResUNet       33.61         582.97         2.0 ± 0.1
  T: NestedUNet    36.36         624.14         2.1 ± 0.1
  T: MALA          84.02         367.04         53.5 ± 2.0
  S: ResUNet-tiny  0.30          5.76           0.20 ± 0.03
  S: MobileNet     4.79          25.82          0.78 ± 0.05
  S: MALA-tiny     0.37          22.01          2.7 ± 0.2

D. Network Structure

1) Teacher networks: Table I provides an overview of the model complexity and inference time on the test datasets for the different teacher and student networks. We adopt two state-of-the-art heavy networks as teachers, namely 3D U-Net MALA [49] for the 3D datasets, and 2D ResUNet [51] and NestedUNet [52] for the 2D datasets, to demonstrate the effectiveness of our knowledge distillation method.

U-Net MALA is a modified 3D version of U-Net [53] for 3D EM image segmentation. It has 4 levels with at least one convolution pass per level, using max pooling for downsampling and transposed convolution for upsampling. The resulting maps are concatenated with feature maps from the downsampling pass of the same level, and cropped to account for context loss. ResUNet and NestedUNet are both variants of U-Net with impressive performance on biomedical image segmentation tasks. ResUNet uses residual blocks to avoid vanishing gradients, while NestedUNet has multiple nested skip connections for multi-level feature access.

2) Student networks: To fully validate the effectiveness and versatility of the proposed distillation methods, we conduct experiments for two kinds of student networks, described as follows.

1) The dimension of pixel embeddings plays a crucial role in capturing instance features accurately. Recognizing that the high-dimensional feature maps in the teacher network contain redundancy, we adopt a straightforward approach: we proportionally reduce the width of each layer in the teacher networks to obtain the corresponding lightweight student networks. This allows the student networks to maintain the same architecture as the teacher networks, facilitating the learning process as they absorb the distilled knowledge from their teachers. As shown in Table I, the student networks 'MALA-tiny' and 'ResUNet-tiny' reduce the width of each layer in the two teacher networks to approximately 1/10 and 1/5 of the original width, respectively. These two student networks have only 0.4% and 0.9% of the parameters of their corresponding teacher networks, 'MALA' and 'ResUNet', and consume only 5% and 10% of the inference time required by
their teacher networks. We also conduct more detailed ablation experiments in the ablation study (Sec. V-D4) to explore the distillation performance of a series of small models obtained by reducing the width of each network layer in different ratios; this width-scaling construction is sketched below.

2) We employ the well-established lightweight network MobileNetV2 [54] as the student network. MobileNetV2 utilizes depth-wise and point-wise convolutions, resulting in reduced parameters and computation compared to traditional convolutions. This network architecture is widely adopted in mobile and embedded vision applications. It is worth noting that MobileNetV2 differs significantly in its network structure from the teacher networks ResUNet and NestedUNet, which further highlights the distinctiveness of our approach.
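To make the width reduction in item 1) concrete, a hedged sketch of the construction follows; the channel lists and the plain convolutional block are illustrative placeholders, not the exact ResUNet or MALA configuration.

```python
import torch.nn as nn

def scaled_channels(base_channels, ratio):
    """Shrink every layer width by `ratio`, keeping at least one channel."""
    return [max(1, round(c * ratio)) for c in base_channels]

# Illustrative teacher widths; ratio=0.1 yields a 'tiny' student whose
# layers have roughly 1/10 of the teacher channels (cf. ResUNet-tiny).
teacher_widths = [64, 128, 256, 512]
student_widths = scaled_channels(teacher_widths, ratio=0.1)

def conv_block(c_in, c_out):
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),
        nn.BatchNorm2d(c_out),
        nn.ReLU(inplace=True),
    )

# A thin encoder sharing the teacher's layer layout but narrower widths.
encoder = nn.ModuleList(
    [conv_block(c_in, c_out)
     for c_in, c_out in zip([1] + student_widths[:-1], student_widths)]
)
```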
IV. EXPERIMENTS

A. Datasets and Metrics

1) CVPPP: The CVPPP A1 dataset [55] is a well-established plant phenotyping dataset, aiming to reveal the relationship between plant phenotypes and genotypes and thus helping to understand genetic characteristics and mechanisms in biomedical research. The dataset contains images of leaves with complex shapes and significant occlusions, and it serves as a benchmark for a highly regarded biological instance segmentation task. Each image has a resolution of 530 × 500 pixels. In this study, we randomly select 108 images from the dataset for training and 20 images for testing. To evaluate the quality of the segmentation results, we use two widely adopted metrics: symmetric best dice (SBD) and absolute difference in counting (|DiC|). SBD measures the similarity between the predicted and ground truth segmentation masks, while |DiC| counts the absolute difference between the predicted and ground truth number of objects in the image. These metrics are commonly utilized to assess the accuracy of instance segmentation results in computer vision tasks.

2) BBBC039V1: The BBBC039V1 dataset [56] consists of 200 fluorescence microscopy (FM) images, each with a resolution of 696 × 520 pixels. These images capture U2OS cells exhibiting diverse shapes and densities. We follow the official data split, employing 100 images for training, 50 for validation, and the remaining 50 for testing. To quantitatively evaluate the segmentation results, we adopt four widely used metrics for cell segmentation in FM images. The Aggregated Jaccard Index (AJI) [57] measures the similarity between the ground truth and predicted segmentation. The object-level F1 score (F1) [58] measures the accuracy of the predicted segmentation at the level of individual cells. Panoptic Quality (PQ) [20] measures the number of correctly segmented instances and the accuracy of the semantic labeling. The pixel-level Dice score (Dice) measures the similarity between the ground truth segmentation and the predicted segmentation at the pixel level.

3) C.elegans: The C.elegans dataset [56] is a challenging dataset for image analysis with a large number of organisms in each image. C.elegans itself has a slender shape and often appears in complex overlapping poses, making it difficult to accurately segment individual organisms. The dataset consists of 100 grayscale images, each with a resolution of 696 × 520 pixels. We partition the dataset into 50 training images and 50 test images using a random split, ensuring that both sets represent the complete dataset adequately. We use the same four metrics as those used for the BBBC039V1 dataset for quantitative evaluation.

4) AC3/4: AC3 and AC4 are two labeled subsets extracted from the mouse somatosensory cortex dataset [59], a widely used electron microscopy (EM) dataset for 3D instance segmentation of individual neurons in 2D image sequences. These sequences were acquired at a resolution of 3 × 3 × 29 nm. The AC3 dataset consists of 256 sequential images, while the AC4 dataset contains 100 sequential images. For evaluating our proposed method, we partition the data as follows: we use the top 80 sections of AC4 for training, the remaining 20 sections for validation, and the top 100 sections of AC3 for testing. We adopt two widely used metrics to quantitatively evaluate the segmentation results: the Variation of Information (VOI) and the adapted Rand error (ARAND). VOI [60] measures the distance between two segmentation masks, taking into account both over-merge and over-segmentation errors. ARAND [57] is a variation of the Rand Index that takes into account the uneven distribution of object sizes in EM image segmentation. Note that lower values of these two metrics indicate better segmentation performance.

5) CREMI: The CREMI dataset [61], which is imaged from adult Drosophila melanogaster brain tissue at a resolution of 4 × 4 × 40 nm, is another EM dataset used for 3D neuron segmentation. It is composed of three sub-volumes (CREMI-A/B/C) that correspond to different neuron types, with each sub-volume consisting of 125 consecutive images. Each sub-volume is split into 50 sections for training, 25 sections for validation, and 50 sections for testing. We adopt the same quantitative metrics (VOI and ARAND) as those used for AC3/4 to evaluate the results on the CREMI dataset.

B. Implementation Details

Throughout our experiments, we conduct our computations within a well-defined environment comprising PyTorch 1.0.1, CUDA 9.0, and Python 3.7.4. To optimize model training, we utilize the Adam optimizer with β1 = 0.9 and β2 = 0.99, a learning rate of 10^{-4}, and a batch size of 2. These choices ensure efficient and effective training. We utilize a single NVIDIA TitanXP GPU for training and conduct 300K iterations for each model. To address GPU memory limitations, we follow [36] to set the embedding dimension of the last layer to 16. Additionally, we compute affinities by considering adjacent pixel embeddings within N = 1 voxel stride for 3D networks and within N = 27 pixel strides for 2D networks. The hyper-parameters K and L of the memory bank mechanism are set to 32 and 12, respectively.
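For reference, the reported optimization settings translate directly into a standard PyTorch setup; the `student` module below is only a stand-in placeholder for the actual student networks.

```python
import torch
import torch.nn as nn

student = nn.Conv2d(1, 16, kernel_size=3, padding=1)   # placeholder model

# Settings from this section: Adam with beta1 = 0.9, beta2 = 0.99 and a
# learning rate of 1e-4 (batch size 2, 300K iterations per model).
optimizer = torch.optim.Adam(student.parameters(),
                             lr=1e-4, betas=(0.9, 0.99))
```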
V. E XPERIMENTAL RESULTS
3) C.elegans: The C.elegans dataset [56] is a challenging
dataset for image analysis with a large number of organisms A. Baseline Methods
in each image. C.elegans itself has a slender shape and often We perform a comparative analysis between our proposed
appears in complex overlapping poses, making it difficult to method and three state-of-the-art knowledge distillation meth-
accurately segment individual organisms. The dataset consists ods that are widely used for feature maps, which include:
of 100 grayscale images, each with a resolution of 696 × 520 1) Attention Transferring (AT) [15]: This method involves
pixels. We partition the dataset into 50 training images and the transfer of attention maps from a teacher network to a
1) Attention Transferring (AT) [15]: This method involves the transfer of attention maps from a teacher network to a student network. These attention maps highlight the most relevant regions of the input image for the task at hand, providing guidance to the student network during training.

2) Similarity Preserving Knowledge Distillation (SPKD) [14]: This method focuses on maintaining similarity between the intermediate feature maps of the teacher and student networks. By minimizing the difference between these feature maps, the student network is encouraged to produce results similar to those of the teacher network.

3) Review Knowledge Distillation (ReKD) [17]: This method adopts a novel review mechanism for knowledge distillation, which utilizes the multi-level information from the teacher network to guide the one-level feature learning of the student network.

4) Biomedical Instance Segmentation Knowledge Distillation (BISKD) [18]: This is our preliminary work tailored for biomedical instance segmentation.

TABLE II
Quantitative comparison of different knowledge distillation methods on 2D biomedical instance segmentation datasets. We conduct experiments on four sets of teacher-student network pairs consisting of two teacher networks and two student networks. A bold score represents the best performance on the corresponding dataset.

                       |    CVPPP     |         C.elegans         |        BBBC039V1
  Methods              | SBD↑  |DiC|↓ | AJI↑  Dice↑  F1↑   PQ↑    | AJI↑  Dice↑  F1↑   PQ↑
  T1: ResUNet          | 88.6  1.15   | 0.816 0.916  0.931 0.794  | 0.899 0.956  0.962 0.878
  T2: NestedUNet       | 88.4  1.10   | 0.810 0.902  0.928 0.782  | 0.900 0.957  0.962 0.879
  S1: ResUNet-tiny     | 81.9  2.15   | 0.730 0.837  0.901 0.684  | 0.865 0.928  0.959 0.840
  S2: MobileNet        | 71.9  5.00   | 0.548 0.614  0.812 0.442  | 0.737 0.905  0.887 0.709
  T1 & S1 + AT [15]    | 83.9  1.60   | 0.749 0.875  0.904 0.731  | 0.875 0.937  0.959 0.854
  T1 & S1 + SPKD [14]  | 85.2  1.30   | 0.740 0.853  0.904 0.709  | 0.881 0.940  0.962 0.858
  T1 & S1 + ReKD [17]  | 85.6  1.25   | 0.708 0.859  0.883 0.723  | 0.879 0.945  0.961 0.861
  T1 & S1 + BISKD [18] | 86.4  1.15   | 0.760 0.865  0.912 0.734  | 0.872 0.934  0.959 0.851
  T1 & S1 + Ours       | 87.0  1.15   | 0.765 0.884  0.918 0.759  | 0.884 0.946  0.961 0.868
  T1 & S2 + AT [15]    | 82.8  1.40   | 0.545 0.627  0.817 0.451  | 0.760 0.931  0.893 0.740
  T1 & S2 + SPKD [14]  | 73.8  3.95   | 0.552 0.612  0.819 0.444  | 0.749 0.917  0.890 0.725
  T1 & S2 + ReKD [17]  | 81.5  1.60   | 0.595 0.732  0.828 0.528  | 0.757 0.924  0.894 0.737
  T1 & S2 + BISKD [18] | 84.7  1.40   | 0.655 0.799  0.851 0.611  | 0.766 0.930  0.895 0.746
  T1 & S2 + Ours       | 86.0  1.10   | 0.672 0.839  0.857 0.645  | 0.771 0.938  0.896 0.753
  T2 & S1 + AT [15]    | 84.7  1.25   | 0.750 0.869  0.908 0.727  | 0.877 0.940  0.960 0.855
  T2 & S1 + SPKD [14]  | 84.0  1.60   | 0.739 0.844  0.912 0.702  | 0.878 0.942  0.959 0.859
  T2 & S1 + ReKD [17]  | 85.1  1.45   | 0.749 0.861  0.904 0.726  | 0.880 0.942  0.959 0.857
  T2 & S1 + BISKD [18] | 83.6  1.20   | 0.702 0.853  0.882 0.713  | 0.883 0.947  0.961 0.865
  T2 & S1 + Ours       | 85.8  1.10   | 0.751 0.874  0.912 0.746  | 0.884 0.948  0.963 0.870
  T2 & S2 + AT [15]    | 84.4  1.00   | 0.549 0.630  0.816 0.454  | 0.763 0.930  0.892 0.738
  T2 & S2 + SPKD [14]  | 72.9  4.60   | 0.558 0.625  0.822 0.454  | 0.751 0.912  0.889 0.724
  T2 & S2 + ReKD [17]  | 79.7  2.65   | 0.563 0.652  0.816 0.478  | 0.750 0.919  0.890 0.727
  T2 & S2 + BISKD [18] | 84.9  1.25   | 0.679 0.834  0.864 0.645  | 0.768 0.933  0.897 0.749
  T2 & S2 + Ours       | 85.3  1.15   | 0.697 0.871  0.867 0.679  | 0.776 0.939  0.890 0.757

Fig. 2. Visual comparisons on three 2D datasets (CVPPP, C.elegans, BBBC039V1). We use ResUNet (T1) and MobileNet (S2) as the teacher and student networks, respectively. Over-merge and over-segmentation in the results of the student network are highlighted by red and white boxes, respectively.
B. Results on 2D Datasets

We demonstrate the effectiveness of our knowledge distillation method on three 2D biomedical datasets: CVPPP, C.elegans, and BBBC039V1. From the results in Table II, we can observe the following:

(1) Our proposed method consistently outperforms existing distillation methods and significantly reduces the performance gap between the student and teacher networks across the experimental settings. Compared to the second-best distillation methods, on the CVPPP dataset the ResUNet-tiny and MobileNet student networks achieve improvements of 5.9% and 19.6% in the SBD metric. On the C.elegans dataset, the improvements in the (AJI, Dice, F1, PQ) metrics are (4.8%, 5.6%, 1.9%, 11.0%) and (27.1%, 33.1%, 6.8%, 53.6%), respectively. On the BBBC039V1 dataset, the improvements are (2.2%, 1.1%, 0.4%, 3.6%) and (5.3%, 3.4%, 1.0%, 6.8%), respectively. Additionally, our method reduces the performance gap between the ResUNet and ResUNet-tiny networks by 71.6% and 84.4% in the SBD metric on the CVPPP dataset, and by (40.7%, 59.5%, 56.7%, 68.2%) and (46.3%, 74.5%, 37.8%, 57.7%) in the (AJI, Dice, F1, PQ) metrics on the C.elegans dataset. On the BBBC039V1 dataset, the corresponding reductions are (37.5%, 82.1%, 66.7%, 73.7%) and (21.0%, 64.7%, 12.0%, 26.0%).

(2) Our knowledge distillation method proves to be highly effective even when dealing with teacher-student network pairs that have significantly different network structures, such as the experimental settings with MobileNet as the student network. This highlights the versatility of our method and demonstrates its ability to reduce the performance gap between such teacher and student networks.

(3) The baseline methods AT, SPKD, and ReKD ignore the key knowledge of instance-level features and instance relations, which hinders their ability to guide the student network in enlarging the difference between adjacent instances and reducing the feature variance of pixels within the same instance. This limitation often leads to significant over-merging and over-segmentation. Furthermore, these baseline methods neglect the importance of instance boundary structure knowledge, which leads to additional segmentation errors and coarse boundaries.

(4) Our preliminary work BISKD only focuses on individual input images and neglects inter-image semantic instance relations. This limits the effectiveness of the knowledge transfer process and leads to suboptimal segmentation results.

In addition to the quantitative results, we conduct visual comparisons between the segmentation results of our proposed distillation method and those of the baseline methods on challenging cases, as depicted in Fig. 2. These visual comparisons clearly demonstrate the superiority of our distillation method in terms of segmentation performance. Notably, our method
exhibits a stronger ability to differentiate adjacent instances and predict more precise instance boundaries, effectively mitigating issues related to over-merging and over-segmentation. This capability is particularly crucial for challenging cases involving objects with complex shapes and distributions, where the baseline methods tend to struggle with over-merging and over-segmentation errors. Our distillation method effectively overcomes these limitations, resulting in more accurate and refined segmentation results.

C. Results on 3D Datasets

We compare various knowledge distillation methods for the 3D U-Net MALA on the AC3/4 dataset and the three CREMI subvolumes, as presented in Table III. Our proposed method consistently outperforms the competing distillation methods, exhibiting statistically significant improvements in the majority of experiments. Specifically, our method achieves a remarkable reduction in the performance gap between the student and teacher networks, with reductions exceeding 93.3% for the key VOI metric on the AC3/4 dataset and over 72.9% for the VOI metric on the CREMI datasets. The student network demonstrates substantial improvements of 20.0%, 22.7%, 25.5%, and 24.9% in the VOI metric on the AC3/4, CREMI-A, CREMI-B, and CREMI-C datasets, respectively. We present the 2D visual comparison in Fig. 3, showcasing the superior performance of our method in enabling the student network to accurately distinguish instances and address over-segmentation and over-merge errors. Additionally, the 3D visual comparison in Fig. 4 highlights the distinct advantage of our proposed method in preserving the accuracy of neuron structures compared to existing methods.

TABLE III
Quantitative comparison of different knowledge distillation methods on 3D biomedical instance datasets, where we use the 3D UNet MALA and its corresponding tiny version as the teacher-student network pair. Two post-processing algorithms (Waterz [49] and LMC [29]) are adopted to generate the final segmentation results. VOI/ARAND are adopted as metrics (lower is better).

  U-Net MALA    |  AC3/4: Waterz | LMC          |  CREMI-A: Waterz | LMC        |  CREMI-B: Waterz | LMC        |  CREMI-C: Waterz | LMC
  T: MALA       |  1.296/0.115 | 1.261/0.110    |  0.853/0.132 | 0.846/0.132    |  1.653/0.129 | 1.503/0.091    |  1.522/0.123 | 1.618/0.205
  S: MALA-tiny  |  1.649/0.122 | 1.565/0.122    |  1.098/0.182 | 0.961/0.147    |  2.037/0.171 | 1.782/0.120    |  2.085/0.241 | 1.733/0.203
  AT [15]       |  1.496/0.119 | 1.469/0.115    |  1.068/0.176 | 0.905/0.132    |  1.961/0.165 | 1.774/0.155    |  1.805/0.151 | 1.691/0.226
  SPKD [14]     |  1.463/0.115 | 1.444/0.113    |  0.962/0.150 | 0.895/0.140    |  1.785/0.150 | 1.716/0.117    |  1.750/0.163 | 1.674/0.227
  ReKD [17]     |  1.428/0.115 | 1.385/0.109    |  0.932/0.149 | 0.879/0.135    |  1.887/0.148 | 1.655/0.115    |  1.649/0.126 | 1.684/0.199
  BISKD [18]    |  1.384/0.120 | 1.334/0.116    |  0.892/0.139 | 0.856/0.136    |  1.739/0.140 | 1.598/0.113    |  1.595/0.119 | 1.567/0.159
  Ours          |  1.320/0.108 | 1.279/0.103    |  0.853/0.138 | 0.821/0.135    |  1.524/0.100 | 1.542/0.127    |  1.568/0.125 | 1.470/0.102

Fig. 3. 2D visual comparisons of segmentation results on the CREMI-C and AC3/4 datasets.

Fig. 4. 3D visual comparisons on the CREMI-C and AC3/4 datasets. Red and black arrows indicate over-segmentation and over-merge, respectively.

1) Analysis on visualized embeddings: To facilitate a comprehensive analysis of the functionality of the proposed knowledge distillation method, we present a visualization of the embeddings generated by the student networks, which have been distilled using various distillation methods. To achieve this, we employ the PCA technique to project the embeddings from a high-dimensional space onto a 3-dimensional RGB color space in Fig. 5. Based on the visual results, we make three observations:

Fig. 5. A visual example of the embedding maps predicted by student networks distilled with different knowledge distillation methods. We use ResUNet (T1) and MobileNet (S2) as the teacher and student networks, respectively.

(1) The embeddings predicted by the undistilled student model may not adequately capture the relation between adjacent instances, leading to embeddings of neighboring instances having similar RGB colors, i.e., similar feature representations. Furthermore, the instance boundary regions in the embedding map appear blurry and lack accurate structural information.

(2) When compared to the baseline methods, the visualized embeddings obtained from the student network using our knowledge distillation method exhibit more distinct color differences among adjacent instances. Additionally, the instance boundary regions demonstrate clear and accurate structures. These observations indicate that our proposed IGD and AGD schemes effectively facilitate the student network in learning instance relations in the feature space and capturing pixel-level boundary structure information.

(3) In comparison to the visualized embeddings from the 'Student+BISKD' approach, the embeddings from the 'Student+Ours' method exhibit purer colors within each instance area. This observation confirms the importance of considering cross-image relations.
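The PCA-to-RGB projection used for Fig. 5 can be reproduced along the following lines (our own sketch using scikit-learn; the per-channel rescaling to [0, 1] is an illustrative display choice):

```python
import numpy as np
from sklearn.decomposition import PCA

def embeddings_to_rgb(emb):
    """Project a (D, H, W) embedding map onto an (H, W, 3) RGB image."""
    d, h, w = emb.shape
    x = emb.reshape(d, -1).T                 # (H*W, D) pixel vectors
    rgb = PCA(n_components=3).fit_transform(x)
    rgb -= rgb.min(axis=0)                   # rescale each channel
    rgb /= rgb.max(axis=0) + 1e-8            # into [0, 1] for display
    return rgb.reshape(h, w, 3)
```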
D. Ablation Study

1) Effectiveness of different distillation components: To verify the effectiveness of the distillation components of our method, we conduct an ablation study on the proposed IGD
(divided into edge and node parts) and AGD schemes. These two schemes work at both the intra-image level and the inter-image level. Thus, we validate these loss terms in our distillation method, including L_{Node}^{Intra}, L_{Edge}^{Intra}, L_{AGD}^{Intra}, L_{Edge}^{Inter}, and L_{AGD}^{Inter}. The results presented in Table IV demonstrate the positive impact of each component on enhancing the performance of the student network and narrowing the performance gap with the teacher network. Notably, the final row of the table highlights that our architecture, incorporating all loss components, achieves the highest performance among the evaluated configurations.

TABLE IV
An ablation study conducted on the CVPPP dataset to evaluate the performance of different components of our distillation method in the teacher-student network pair of ResUNet and MobileNet networks. The check mark and cross mark indicate the usage and non-usage of a component, respectively.

  L_Node^Intra | L_Edge^Intra | L_AGD^Intra | L_Edge^Inter | L_AGD^Inter | SBD↑ | |DiC|↓
  ✗            | ✗            | ✗           | ✗            | ✗           | 71.9 | 5.00
  ✓            | ✗            | ✗           | ✗            | ✗           | 76.5 | 2.45
  ✓            | ✓            | ✗           | ✗            | ✗           | 79.4 | 2.50
  ✓            | ✓            | ✓           | ✗            | ✗           | 84.7 | 1.40
  ✓            | ✓            | ✓           | ✓            | ✗           | 85.2 | 1.35
  ✓            | ✓            | ✓           | ✓            | ✓           | 86.0 | 1.10

The IGD scheme at the intra-image level, consisting of L_{Node}^{Intra} and L_{Edge}^{Intra}, improves the undistilled student model by 10.4% according to the SBD metric. This improvement further increases to 17.8% when the AGD scheme at the intra-image level is also incorporated. Extending these schemes to the inter-image level boosts the improvement to 19.6%. The IGD and AGD schemes demonstrate complementary effects, as do the intra-image level and inter-image level schemes.

2) Sensitivity experiments on hyperparameters: Given the importance of the hyperparameters in the distillation method's loss terms for achieving optimal performance, we conduct a sensitivity analysis on these hyperparameters to effectively balance the different loss functions. Table V presents the results of our analysis. Based on the findings, it is observed that λ1 and λ2 have a relatively minor impact on the performance metrics within the considered ranges. However, λ3, λ4, and λ5 emerge as critical hyperparameters for achieving optimal performance.

TABLE V
Ablation study on loss weight hyperparameters on the CVPPP dataset. We adopt the same hyperparameters for all experimental settings and use networks ResUNet (T1) and MobileNet (S2) as the teacher and student networks for analysis.

  λ1   | λ2   | λ3  | λ4  | λ5  | SBD↑ | |DiC|↓
  0.1  | 0.1  | 10  | 1   | 1   | 86.0 | 1.10
  1    | 0.1  | 10  | 1   | 1   | 85.3 | 1.10
  0.01 | 0.1  | 10  | 1   | 1   | 85.6 | 1.30
  0.1  | 1    | 10  | 1   | 1   | 85.8 | 1.35
  0.1  | 0.01 | 10  | 1   | 1   | 85.2 | 1.35
  0.1  | 0.1  | 100 | 1   | 1   | 84.9 | 1.20
  0.1  | 0.1  | 1   | 1   | 1   | 84.7 | 1.25
  0.1  | 0.1  | 10  | 0.1 | 1   | 84.9 | 1.15
  0.1  | 0.1  | 10  | 10  | 1   | 84.8 | 1.45
  0.1  | 0.1  | 10  | 1   | 0.1 | 85.3 | 0.80
  0.1  | 0.1  | 10  | 1   | 10  | 84.9 | 1.30

3) Impact of the queue size and sampling number: We perform an ablation study to examine the effect of the queue size K and sampling number L of the memory bank mechanism on the distillation performance, as shown in Fig. 6. It can be observed that the larger the values of these two hyper-parameters, the higher the GPU memory overhead. The distillation performance improves as the queue size K and sampling number L increase. This can be attributed to the fact that larger queues provide a more diverse and abundant range of features from different input images, which capture long-range relations more effectively. However, it is also noted that the distillation performance may saturate at a certain memory capacity: increasing the queue size beyond a certain point may not yield additional benefits to the model. Therefore, it is essential to strike a balance between these two hyper-parameters and the computational and memory resources available to ensure optimal performance. In addition, compared with the sampling number L, the queue size K has a greater impact on GPU memory occupation.

Fig. 6. Ablation study on the queue size K and sampling number L. Experiments are performed for the teacher-student network pair of ResUNet and MobileNet networks on the CVPPP dataset. 'Memory Cost' denotes the occupied GPU memory size (MB).

TABLE VI
Ablation study of the distillation performance on a series of small models obtained by reducing the number of channels of each network layer in different ratios. S1/N represents the student networks obtained by reducing the number of channels of the teacher network ResUNet by 1/N.

  ResUNet       | SBD↑ | |DiC|↓ | #Params (M) | FLOPs (GMAC)
  S1/20 w/o KD  | 74.6 | 3.95   | 0.07        | 1.50
  S1/20 w/ KD   | 80.9 | 2.25   | 0.07        | 1.50
  S1/15 w/o KD  | 78.7 | 3.10   | 0.17        | 3.28
  S1/15 w/ KD   | 84.2 | 1.85   | 0.17        | 3.28
  S1/10 w/o KD  | 81.9 | 2.15   | 0.30        | 5.76
  S1/10 w/ KD   | 87.0 | 1.15   | 0.30        | 5.76
  S1/5 w/o KD   | 85.1 | 1.50   | 0.90        | 17.31
  S1/5 w/ KD    | 87.6 | 1.25   | 0.90        | 17.31
4) Student networks with different widths: We conduct an ablation study on reduced-size models to evaluate the effectiveness of distillation. The models are created by reducing the number of channels in each layer of the networks. Specifically, we generate student networks with width reductions of approximately 1/20, 1/15, 1/10, and 1/5 compared to the original width. The experiments utilize ResUNet network pairs on the CVPPP dataset. The results in Tab. VI demonstrate that our knowledge distillation method improves the performance of all student networks, even when they have very few parameters. However, it is important to note that the effectiveness of knowledge distillation depends on the initial performance gap between the teacher and student networks. When this gap is minimal, achieving significant improvements becomes challenging.
VI. CONCLUSION

In this paper, we propose a novel graph relation distillation approach for biomedical instance segmentation that effectively transfers instance-level features, instance relations, and pixel-level boundaries from a heavy teacher network to a lightweight student network, through a unique combination of instance graph distillation and affinity graph distillation schemes. Furthermore, we extend these two schemes beyond the intra-image level to the inter-image level by incorporating a memory bank mechanism, which captures the global relation information across different input images. Experimental results on both 2D and 3D biomedical datasets demonstrate that our method surpasses existing distillation methods and effectively bridges the performance gap between the heavy teacher networks and their corresponding lightweight student networks.
R EFERENCES segmentation with a discriminative loss function,” arXiv preprint
arXiv:1708.02551, 2017.
[1] H. Chen, X. Qi, L. Yu, and P.-A. Heng, “Dcan: deep contour-aware [27] M. Lalit, P. Tomancak, and F. Jug, “Embedseg: Embedding-based
networks for accurate gland segmentation,” in CVPR, 2016. instance segmentation for biomedical microscopy data,” Medical image
[2] M. Li, C. Chen, X. Liu, W. Huang, Y. Zhang, and Z. Xiong, “Advanced analysis, vol. 81, p. 102523, 2022.
deep networks for 3d mitochondria instance segmentation,” in ISBI. [28] J.-H. Shi, Q. Zhang, Y.-H. Tang, and Z.-Q. Zhang, “Polyp-mixer: An
IEEE, 2022, pp. 1–5. efficient context-aware mlp-based paradigm for polyp segmentation,”
[3] N. Kumar, R. Verma, S. Sharma, S. Bhargava, A. Vahadane, and IEEE Trans. Circuits Syst. Video Technol., vol. 33, no. 1, pp. 30–42,
A. Sethi, “A dataset and a technique for generalized nuclear segmen- 2022.
tation for computational pathology,” IEEE Trans. Med. Imag., vol. 36, [29] T. Beier, C. Pape, N. Rahaman, T. Prange, S. Berg, D. D. Bock,
no. 7, pp. 1550–1560, 2017. A. Cardona, G. W. Knott, S. M. Plaza, L. K. Scheffer et al., “Multicut
[4] Z. Song, P. Wang, J. Zhou, Z. Yang, Y. Yang, Z. Gong, and N. Zheng, brings automated neurite segmentation closer to human performance,”
“Muscleparsenet: a novel framework for parsing muscles of drosophila Nature methods, vol. 14, no. 2, pp. 101–102, 2017.
larva in light-sheet fluorescence microscopy images,” IEEE Trans. [30] K. Fukunaga and L. Hostetler, “The estimation of the gradient of a
Circuits Syst. Video Technol., 2023. density function, with applications in pattern recognition,” IEEE Trans.
[5] K. He, G. Gkioxari, P. Dollár, and R. Girshick, “Mask r-cnn,” in Proc. Inf. Theory, vol. 21, no. 1, pp. 32–40, 1975.
Int. Conf. Comput. Vis., 2017, pp. 2961–2969. [31] D. Comaniciu and P. Meer, “Mean shift: A robust approach toward
[6] D. Liu, D. Zhang, Y. Song, C. Zhang, F. Zhang, L. O’Donnell, and feature space analysis,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 24,
W. Cai, “Nuclei segmentation via a deep panoptic model with semantic no. 5, pp. 603–619, 2002.
feature fusion.” in IJCAI, 2019, pp. 861–868. [32] R. J. Campello, D. Moulavi, and J. Sander, “Density-based clustering
[7] D. Zhang, Y. Song, D. Liu, H. Jia, S. Liu, Y. Xia, H. Huang, and W. Cai, based on hierarchical density estimates,” in Pacific-Asia conference on
“Panoptic segmentation with an end-to-end cell r-cnn for pathology knowledge discovery and data mining. Springer, 2013, pp. 160–172.
image analysis,” in MICCAI. Springer, 2018, pp. 237–244. [33] A. Wolny, Q. Yu, C. Pape, and A. Kreshuk, “Sparse object-level
[8] L. Chen, M. Strauch, and D. Merhof, “Instance segmentation of supervision for instance segmentation with pixel embeddings,” in Proc.
biomedical images with an object-aware embedding learned with local IEEE Conf. Comput. Vis. Pattern Recog., 2022, pp. 4402–4411.
constraints,” in MICCAI. Springer, 2019, pp. 451–459. [34] M. Lalit, P. Tomancak, and F. Jug, “Embedding-based instance segmen-
[9] V. Kulikov and V. Lempitsky, “Instance segmentation of biological tation in microscopy,” in MIDL, 2021.
images using harmonic embeddings,” in Proc. IEEE Conf. Comput. Vis. [35] C. Payer, D. Štern, M. Feiner, H. Bischof, and M. Urschler, “Segmenting
Pattern Recog., 2020, pp. 3843–3851. and tracking cell instances with cosine embeddings and recurrent hour-
[10] K. Lee, R. Lu, K. Luther, and H. S. Seung, “Learning and segmenting glass networks,” Medical image analysis, vol. 57, pp. 106–119, 2019.
dense voxel embeddings for 3d neuron reconstruction,” IEEE Trans. [36] W. Huang, S. Deng, C. Chen, X. Fu, and Z. Xiong, “Learning to model
Med. Imag., vol. 40, no. 12, pp. 3801–3811, 2021. pixel-embedded affinity for homogeneous instance segmentation,” in
[11] C. Payer, D. Štern, T. Neff, H. Bischof, and M. Urschler, “Instance seg- AAAI, vol. 36, no. 1, 2022, pp. 1007–1015.
mentation and tracking with cosine embeddings and recurrent hourglass [37] X. Liu, W. Huang, Y. Zhang, and Z. Xiong, “Biological instance
networks,” in MICCAI. Springer, 2018, pp. 3–11. segmentation with a superpixel-guided graph.” in IJCAI, 2022.
[38] T. Fukuda, M. Suzuki, G. Kurata, S. Thomas, J. Cui, and B. Ramabhadran, “Efficient knowledge distillation from an ensemble of teachers,” in Interspeech, 2017, pp. 3697–3701.
[39] C.-H. Chao, B.-W. Cheng, and C.-Y. Lee, “Rethinking ensemble-
distillation for semantic segmentation based unsupervised domain adap-
tion,” in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2021, pp. 2610–
2620.
[40] J. M. Noothout, N. Lessmann, M. C. Van Eede, L. D. van Harten,
E. Sogancioglu, F. G. Heslinga, M. Veta, B. van Ginneken, and I. Išgum,
“Knowledge distillation with ensembles of convolutional neural net-
works for medical image segmentation,” Journal of Medical Imaging,
vol. 9, no. 5, pp. 052407–052407, 2022.
[41] Y. Liu, J. Cao, B. Li, C. Yuan, W. Hu, Y. Li, and Y. Duan, “Knowledge
distillation via instance relationship graph,” in Proc. IEEE Conf. Comput.
Vis. Pattern Recog., 2019, pp. 7096–7104.
[42] C. Li, G. Cheng, and J. Han, “Boosting knowledge distillation via intra-
class logit distribution smoothing,” IEEE Trans. Circuits Syst. Video
Technol., 2023.
[43] L. Xu, J. Ren, Z. Huang, W. Zheng, and Y. Chen, “Improving knowledge
distillation via head and tail categories,” IEEE Trans. Circuits Syst. Video
Technol., 2023.
[44] Y. Wen, L. Chen, S. Xi, Y. Deng, X. Tang, and C. Zhou, “Towards
efficient medical image segmentation via boundary-guided knowledge
distillation,” in ICME. IEEE, 2021, pp. 1–6.
[45] Y. Chen, P. Chen, S. Liu, L. Wang, and J. Jia, “Deep structured instance
graph for distilling object detectors,” in Proc. Int. Conf. Comput. Vis.,
2021, pp. 4359–4368.
[46] Y. Fu, Y. Feng, and J. P. Cunningham, “Paraphrase generation with latent
bag of words,” Adv. Neural Inform. Process. Syst., vol. 32, 2019.
[47] Y. Tian, D. Krishnan, and P. Isola, “Contrastive multiview coding,” in
Proc. Eur. Conf. Comput. Vis. Springer, 2020, pp. 776–794.
[48] C. Yang, H. Zhou, Z. An, X. Jiang, Y. Xu, and Q. Zhang, “Cross-image
relational knowledge distillation for semantic segmentation,” in Proc.
IEEE Conf. Comput. Vis. Pattern Recog., 2022, pp. 12319–12328.
[49] J. Funke, F. Tschopp, W. Grisaitis, A. Sheridan, C. Singh, S. Saalfeld,
and S. C. Turaga, “Large scale image segmentation with structured
loss based deep learning for connectome reconstruction,” IEEE Trans.
Pattern Anal. Mach. Intell., vol. 41, no. 7, pp. 1669–1680, 2018.
[50] S. Wolf, A. Bailoni, C. Pape, N. Rahaman, A. Kreshuk, U. Köthe, and
F. A. Hamprecht, “The mutex watershed and its objective: Efficient,
parameter-free graph partitioning,” IEEE Trans. Pattern Anal. Mach.
Intell., vol. 43, no. 10, pp. 3724–3738, 2020.
[51] E. M. A. Anas, S. Nouranian, S. S. Mahdavi, I. Spadinger, W. J.
Morris, S. E. Salcudean, P. Mousavi, and P. Abolmaesumi, “Clinical
target-volume delineation in prostate brachytherapy using residual neural
networks,” in MICCAI. Springer, 2017, pp. 365–373.
[52] Z. Zhou, M. M. Rahman Siddiquee, N. Tajbakhsh, and J. Liang,
“Unet++: A nested u-net architecture for medical image segmentation,”
in MICCAI workshops. Springer, 2018, pp. 3–11.
[53] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks
for biomedical image segmentation,” in MICCAI. Springer, 2015, pp.
234–241.
[54] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen,
“Mobilenetv2: Inverted residuals and linear bottlenecks,” in Proc. IEEE
Conf. Comput. Vis. Pattern Recog., 2018, pp. 4510–4520.
[55] H. Scharr, M. Minervini, A. Fischbach, and S. A. Tsaftaris, “Annotated
image datasets of rosette plants,” in ECCV, 2014, pp. 6–12.
[56] V. Ljosa, K. L. Sokolnicki, and A. E. Carpenter, “Annotated high-
throughput microscopy image sets for validation.” Nature methods,
vol. 9, no. 7, pp. 637–637, 2012.
[57] W. M. Rand, “Objective criteria for the evaluation of clustering meth-
ods,” Journal of the American Statistical Association, vol. 66, no. 336,
pp. 846–850, 1971.
[58] H. Chen, X. Qi, L. Yu, Q. Dou, J. Qin, and P.-A. Heng, “Dcan: Deep
contour-aware networks for object instance segmentation from histology
images,” Medical image analysis, vol. 36, pp. 135–146, 2017.
[59] N. Kasthuri, K. J. Hayworth, D. R. Berger, R. L. Schalek, J. A.
Conchello, S. Knowles-Barley, D. Lee, A. Vázquez-Reina, V. Kaynig,
T. R. Jones et al., “Saturated reconstruction of a volume of neocortex,”
Cell, vol. 162, no. 3, pp. 648–661, 2015.
[60] M. Meilă, “Comparing clusterings by the variation of information,” in
LTKM workshop. Springer, 2003, pp. 173–187.
[61] CREMI, “MICCAI challenge on circuit reconstruction from electron microscopy images,” https://cremi.org/, 2016.