Digital Object Identifier 10.1109/ACCESS.2021.3063716
ABSTRACT Computer-aided detection, localisation, and segmentation methods can help improve
colonoscopy procedures. Even though many methods have been built to tackle automatic detection and
segmentation of polyps, benchmarking of state-of-the-art methods still remains an open problem. This
is due to the increasing number of researched computer vision methods that can be applied to polyp
datasets. Benchmarking of novel methods can provide a direction to the development of automated polyp
detection and segmentation tasks. Furthermore, it ensures that the produced results in the community are
reproducible and provide a fair comparison of developed methods. In this paper, we benchmark several
recent state-of-the-art methods using Kvasir-SEG, an open-access dataset of colonoscopy images for polyp
detection, localisation, and segmentation, evaluating both method accuracy and speed. Whilst most methods in the literature have competitive accuracy, we show that the proposed ColonSegNet achieved a better trade-off, with an average precision of 0.8000 and mean IoU of 0.8100, and the fastest speed of 180 frames per second for the detection and localisation task. Likewise, the proposed ColonSegNet achieved a competitive dice coefficient of 0.8206 and the best average speed of 182.38 frames per second for the segmentation task. Our comprehensive comparison with various state-of-the-art methods reveals the importance of benchmarking deep learning methods for automated real-time polyp identification and delineation, which can potentially transform current clinical practices and minimise miss-detection rates.
INDEX TERMS Medical image segmentation, ColonSegNet, colonoscopy, polyps, deep learning, detection,
localisation, benchmarking, Kvasir-SEG.
The associate editor coordinating the review of this manuscript and approving it for publication was Alberto Cano.

I. INTRODUCTION
Colorectal Cancer (CRC) has the third highest mortality rate among all cancers. The overall five-year survival rate of colon cancer is around 68%, and that of stomach cancer is only around 44% [1]. Searching for and removing precancerous anomalies is one of the most effective ways to avoid CRC-related mortality. Among these abnormalities, polyps in the colon are important to detect because they can develop into CRC at a late stage. Thus, early detection of CRC is crucial for survival.

Besides lifestyle modification, the main prevention of CRC is regular screening of the colon. Different research studies suggest that population-wide screening advances the prognosis and can even reduce the incidence of CRC [2]. Colonoscopy is an invasive medical procedure where an endoscopist examines and operates on the colon using a flexible endoscope. It is considered to be the best
diagnostic tool for colon examination for early detection and removal of polyps. Therefore, colonoscopic screening is the most preferred technique among gastroenterologists.

Polyps are abnormal growths of tissue protruding from the mucous membrane. They can occur anywhere in the gastrointestinal (GI) tract but are mostly found in the colorectal area and are often considered a precursor of CRC [3], [4]. Polyps may be pedunculated (having a well-defined stalk) or sessile (without a defined stalk). Colorectal polyps can be categorised into two classes: non-neoplastic and neoplastic. Non-neoplastic polyps are further sub-categorised into hyperplastic, inflammatory, and hamartomatous polyps. These types of polyps are non-cancerous and not harmful. Neoplastic polyps are further sub-categorised into adenomas and serrated polyps. These polyps carry a risk of developing into cancer. Based on their size, colorectal polyps can be categorised into three classes, namely, diminutive (≤5 mm), small (6 to 9 mm), and advanced (large) (≥10 mm) [5]. Usually, larger polyps can be detected and resected.

There exists a significant risk with small and diminutive colorectal polyps [6]. A polypectomy is a technique for the removal of small and diminutive polyps. There are five different polypectomy techniques for resection of diminutive polyps, namely, cold forceps polypectomy, hot forceps polypectomy, cold snare polypectomy, hot snare polypectomy, and endoscopic mucosal resection [5]. Among these techniques, cold snare polypectomy is considered the best polypectomy technique for resecting small colorectal polyps [7].

Colonoscopy is an invasive procedure that requires high-quality bowel preparation as well as air insufflation during examination [8]. It is both an expensive and time-demanding procedure. Nevertheless, on average, 20% of polyps are missed during examinations. The risk of getting cancer therefore relates to the individual endoscopist's ability to detect polyps [9]. Recent studies have shown that new endoscopic devices and diagnostic tools have improved the adenoma detection rate and polyp detection rate [10], [11]. However, the problem of overlooked polyps remains the same.

The colonoscopy videos recorded at clinical centers store a significant amount of colonoscopy data. However, the collected data are not used efficiently, as reviewing them is labour-intensive for endoscopists [12]. Thus, a second review of the videos is often not done. This may largely lead to missed detection at an early stage. Automated data curation and annotation of video data is a prerequisite for building reliable Computer Aided Diagnosis (CADx) systems that can help to assess clinical endoscopy more thoroughly [13]. A fraction of the collected colonoscopy data can be curated to develop computer-aided systems for automated detection and delineation of polyps either during the clinical procedure or after the reporting. At the same time, to build a robust system, it is vital to incorporate data variability related to patients, endoscopic procedures, and endoscope manufacturers. Even though recent developments in computer vision and system designs have enabled us to build accurate and efficient systems, these largely depend on data availability, as most recent methods are data voracious. The lack of availability of public datasets [14] is a critical bottleneck to accelerating algorithm development in this realm.

In general, curating medical datasets is challenging, and it requires domain expertise. Reaching a consensus on ground truth labels from different experts on the same dataset is another obstacle. Typically, in colonoscopy, smaller polyps or flat/sessile polyps that are usually missed during a procedure can be difficult to observe even during manual labeling. Other challenges include patient variability and the presence of different sizes, shapes, textures, colors, and orientations of these polyps [3]. Therefore, during polyp data curation and the development of automated systems for colonoscopy, it is vital that the various challenges that often accompany routine colonoscopy are taken into consideration.

Automatic polyp detection and segmentation systems based on Deep Learning (DL) have a high overall performance in both colonoscopy images and colonoscopy videos [15], [16]. Ideally, automatic CADx systems for polyp detection, localisation, and segmentation should have: 1) consistent performance and improved robustness to patient variability, i.e., the system should be able to produce reliable outputs, 2) high overall performance surpassing the set bar for algorithms, 3) real-time performance required for clinical applicability, and 4) an easy-to-use design that can provide clinically interpretable outputs. Scaling this to a population-sized cohort is also very resource-demanding and incurs enormous costs. As a first step, we therefore target the detection, localisation, and segmentation of colorectal polyps known as precursors of CRC. The reason for starting with this scenario is that most colon cancers arise from benign adenomatous polyps (around 20%) containing dysplastic cells. Detection and removal of polyps prevent the development of cancer, and the risk of getting CRC in the following 60 months after a colonoscopy depends largely on the endoscopist's ability to detect polyps [9].

Detection and localisation of polyps are usually critical during routine surveillance and for measuring the polyp load of the patient at the end of the surveillance, while pixel-wise segmentation becomes vital to automate polyp boundary delineation during surgical procedures or radio-frequency ablations. In this paper, we evaluate DL methods for both detection (with localisation referring to bounding box detection) and segmentation (pixel-wise classification or semantic segmentation), benchmarking SOTA methods on the Kvasir-SEG dataset [17] to provide a comprehensive benchmark for colonoscopy images. The main aim of the paper is to establish a new strong benchmark with existing successful computer vision approaches. Our contributions can be summarised as follows:
• We propose ColonSegNet, an encoder-decoder architecture for segmentation of colonoscopic images. The architecture is very efficient in terms of processing speed
(i.e., it produces segmentation of colonoscopic polyps in real-time) and competitive in terms of performance.
• A comprehensive comparison of state-of-the-art computer vision baseline methods on the Kvasir-SEG dataset is presented. The best approaches show real-time performance for polyp detection, localisation, and segmentation.
• We have established a strong benchmark for detection and localisation on the Kvasir-SEG dataset. Additionally, we have extended the segmentation baseline as compared to [3], [17], [18]. These benchmarks can be useful to develop reliable and clinically applicable methods.
• Detection, localisation, and semantic segmentation performances are evaluated on standard computer vision metrics.
• A detailed analysis is presented with a specific focus on the best and worst performing cases, which allows dissecting method success and failure modes required to accelerate algorithm development.

The rest of the paper is organized as follows: In Section II, we present related work in the field. In Section III, we present the materials. Section IV presents the detection, localisation, and segmentation methods. Results are presented in Section V. A discussion of the best performing detection, localisation, and semantic segmentation approaches is presented in Section VI, and finally a conclusion is provided in Section VII.

II. RELATED WORK
Automated polyp detection has been an active research topic for the last two decades, and considerable work has been done to develop efficient methods and algorithms. Earlier works were especially focused on polyp color and texture, using handcrafted descriptor-based feature learning [27], [28]. More recently, methods based on Convolutional Neural Networks (CNNs) have received significant attention [29], [30], and have been the go-to approach for those competing in public challenges [31], [32].

Wang et al. [33] designed algorithms and developed software modules for fast polyp edge detection and polyp shot detection, including a polyp alert software system. Shin et al. [34] used a region-based CNN for automatic polyp detection in colonoscopy videos and images. They used Inception ResNet as a transfer learning approach and post-processing techniques for reliable polyp detection in colonoscopy. Later on, Shin et al. [14] used a generative adversarial network [35], where they showed that the generated polyp images are not qualitatively realistic; however, they can help to improve the detection performance. Lee et al. [15] used YOLO-v2 [36], [37] for the development of a polyp detection and localisation algorithm. The algorithm produced high sensitivity and near real-time performance. Yamada et al. [38] developed an artificial intelligence system that can automatically detect the sign of CRC during colonoscopy with high sensitivity and specificity. They claimed that their system could aid endoscopists in real-time detection to avoid abnormalities and enable early disease detection.

In addition to the work related to automatic detection and localisation, pixel-wise classification (segmentation) of the disease provides an exact polyp boundary and hence is also of high significance for clinical surveillance and procedures. Bernal et al. [31] presented the results of the automatic polyp detection subchallenge, which was part of the endoscopic vision challenge at the Medical Image Computing and Computer Assisted Intervention (MICCAI) 2015 conference. This work compared the performance of eight teams and provided an analysis of various detection methods applied on the provided polyp challenge data. Wang et al. [16] proposed a DL-based SegNet [39] that had a real-time performance with an inference of more than 25 frames per second. Guo and Matuszewski [40] used fully convolutional dilation networks on the Gastrointestinal Image ANAlysis (GIANA) polyp segmentation dataset. Jha et al. [3] proposed ResUNet++, demonstrating a 10% improvement compared to the widely used UNet baseline on the Kvasir-SEG dataset. They further applied the trained model on the CVC-ClinicDB [23] dataset, showing more than 15% improvement over UNet. Ali et al. [32] did a comprehensive evaluation of both detection and segmentation approaches for the artifacts present in clinical endoscopy, including colonoscopy data [41]. Wang et al. [42] proposed a boundary-aware neural network (BA-Net) for medical image segmentation. BA-Net is an encoder-decoder network that is capable of capturing the high-level context and preserving the spatial information. Later on, Jha et al. [43] proposed DoubleUNet for segmentation, which was applied to four biomedical imaging datasets. The proposed DoubleUNet is a combination of two UNets stacked on top of each other with some additional blocks. Experimental results on the CVC-ClinicDB and ETIS-Larib polyp datasets show state-of-the-art (SOTA) performance. In addition to the related work on polyp segmentation, there are studies on segmentation approaches in general [44]–[47].

Datasets have been instrumental for medical research. Table 1 shows the list of the available endoscopic image and video datasets. Kvasir-SEG, ETIS-Larib, and CVC-ClinicDB contain colonoscopy images, whereas Kvasir, Nerthus, and HyperKvasir contain images from the whole GI tract. Kvasir-Capsule contains images from video capsule endoscopy. All the datasets contain images acquired with the conventional White Light (WL) imaging technique except the EDD dataset, which contains images from both WL imaging and Narrow Band Imaging (NBI) techniques. All of these datasets contain at least a polyp class. Out of the nine available datasets, Kvasir-SEG [17], ETIS-Larib [22], and CVC-ClinicDB [23] have manually labeled ground truth masks. Among them, Kvasir-SEG offers the largest number of annotated samples, providing both ground truth masks and bounding boxes, and thereby supports detection, localisation, and segmentation tasks. All of the datasets are publicly available.
IV. METHOD
Detection methods aim to predict the object class and regress bounding boxes for localisation, while segmentation methods aim to classify the object class for each pixel in an image. In Figure 1, ground truth masks for the segmentation task are shown in the 2nd column, while the corresponding bounding boxes for the detection task are in the 3rd column. This section describes the baseline methods for detection, localisation, and segmentation used for the automated detection and segmentation of polyps in the Kvasir-SEG dataset.
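Since Kvasir-SEG provides pixel-precise ground truth masks, the bounding boxes used for the detection and localisation task can be derived directly from the masks. A minimal sketch of such a conversion in Python/NumPy is given below (our illustration only; the dataset also ships ready-made bounding boxes, and frames with multiple polyps would additionally need connected-component labelling):

    import numpy as np

    def mask_to_bbox(mask: np.ndarray):
        """Tight (x_min, y_min, x_max, y_max) box around a binary polyp mask."""
        ys, xs = np.where(mask > 0)      # coordinates of foreground pixels
        if xs.size == 0:                 # no polyp present in this frame
            return None
        return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())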
The decoder starts with a transpose convolution, where the first decoder uses a stride value of 4, which increases the feature map spatial dimensions by 4. Similarly, the second decoder uses a stride value of 2, increasing the spatial dimensions by 2. Then, the network follows a simple concatenation and a residual block. Next, it is concatenated with the second skip connection and again followed by a residual block. The output of the last decoder block passes through a 1 × 1 convolution and a sigmoid activation function, generating the binary segmentation mask.
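The ColonSegNet implementation used in our experiments is in PyTorch; a minimal sketch of one such decoder block, written from the description above, is given below. The channel widths and the residual-block internals are assumptions made for illustration, not the released code:

    import torch
    import torch.nn as nn

    class ResidualBlock(nn.Module):
        # Assumed internals: two 3x3 convolutions with BN/ReLU and a 1x1 shortcut.
        def __init__(self, in_c, out_c):
            super().__init__()
            self.body = nn.Sequential(
                nn.Conv2d(in_c, out_c, 3, padding=1), nn.BatchNorm2d(out_c),
                nn.ReLU(inplace=True),
                nn.Conv2d(out_c, out_c, 3, padding=1), nn.BatchNorm2d(out_c))
            self.skip = nn.Conv2d(in_c, out_c, 1)  # match channels for the shortcut
            self.relu = nn.ReLU(inplace=True)

        def forward(self, x):
            return self.relu(self.body(x) + self.skip(x))

    class DecoderBlock(nn.Module):
        """Upsample by `stride` (4 in the first decoder, 2 in the second),
        concatenate a skip connection, then refine with a residual block."""
        def __init__(self, in_c, skip_c, out_c, stride):
            super().__init__()
            self.up = nn.ConvTranspose2d(in_c, out_c, kernel_size=stride, stride=stride)
            self.res = ResidualBlock(out_c + skip_c, out_c)

        def forward(self, x, skip):
            x = self.up(x)                   # transpose convolution
            x = torch.cat([x, skip], dim=1)  # simple concatenation
            return self.res(x)

    # Final head as described: 1x1 convolution followed by a sigmoid.
    head = nn.Sequential(nn.Conv2d(32, 1, kernel_size=1), nn.Sigmoid())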
1) DATA AUGMENTATION
Supervised learning methods are data voracious and require a large amount of data to obtain reliable and well-performing models. Acquiring such training data through data collection, curation, and annotation is a manual process that needs significant resources and man-hours from both clinical experts and computational scientists.

Data augmentation is a common technique to computationally increase the number of training samples in a dataset. For our DL models, we use basic augmentation techniques such as horizontal flipping, vertical flipping, random rotation, random scaling, and random cropping. The images used in all the experiments undergo normalization and are resized to a fixed size of 512 × 512. For the normalization, we subtract the mean from the image and divide it by the standard deviation.
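For illustration, such a pipeline could be assembled with the albumentations library, which applies identical spatial transforms to an image and its mask; the library choice, probabilities, rotation limit, and scale range below are assumptions, not the paper's exact configuration:

    import albumentations as A
    import numpy as np

    train_transform = A.Compose([
        A.HorizontalFlip(p=0.5),
        A.VerticalFlip(p=0.5),
        A.Rotate(limit=90, p=0.5),              # random rotation
        A.RandomScale(scale_limit=0.2, p=0.5),  # random scaling
        A.PadIfNeeded(min_height=512, min_width=512),
        A.RandomCrop(height=512, width=512),    # random crop to the fixed size
        A.Normalize(),                          # (image - mean) / std
    ])

    # Dummy inputs; the same spatial ops are applied to the image and the mask.
    image = np.random.randint(0, 256, (576, 720, 3), dtype=np.uint8)
    mask = np.zeros((576, 720), dtype=np.uint8)
    out = train_transform(image=image, mask=mask)
    image_aug, mask_aug = out["image"], out["mask"]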
V. RESULTS
In this section, we first present our evaluation metrics and experimental setup. Then, we present both quantitative and qualitative results.

A. EVALUATION METRICS
We have used standard computer vision metrics to evaluate the polyp detection and localisation methods and the semantic segmentation methods on the Kvasir-SEG dataset.

1) DETECTION AND LOCALISATION TASK
For the object detection and localisation task, the commonly used Average Precision (AP) and IoU have been used [68], [69].
• IoU: This metric measures the overlap between two bounding boxes A and B as the ratio of the overlapped area to the area of their union:

    IoU(A, B) = |A ∩ B| / |A ∪ B|    (1)

• AP: AP is computed as the Area Under the Curve (AUC) of the precision-recall curve of the detections, sampled at all unique recall values (r1, r2, ...) whenever the maximum precision value drops:

    AP = Σ_n (r_{n+1} − r_n) · p_interp(r_{n+1}),    (2)

with p_interp(r_{n+1}) = max_{r̃ ≥ r_{n+1}} p(r̃). Here, p(r_n) denotes the precision value at a given recall value. This definition ensures monotonically decreasing precision. AP was computed as an average of the APs for IoU thresholds from 0.25 to 0.75 with a step size of 0.05, which means an average over 11 IoU levels is used (AP @[.25 : .05 : .75]).
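For illustration, Eq. (1) for axis-aligned boxes and the interpolated AP of Eq. (2) can be written compactly in NumPy as sketched below (our illustration; the reported numbers follow the standard PASCAL VOC/COCO definitions [68], [69]):

    import numpy as np

    def box_iou(a, b):
        # Eq. (1) for boxes given as (x1, y1, x2, y2).
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
        union = ((a[2] - a[0]) * (a[3] - a[1])
                 + (b[2] - b[0]) * (b[3] - b[1]) - inter)
        return inter / union

    def average_precision(recall, precision):
        # Eq. (2): make precision monotonically decreasing, then sum the
        # areas of the resulting step function over the recall axis.
        r = np.concatenate(([0.0], recall, [1.0]))
        p = np.concatenate(([0.0], precision, [0.0]))
        for i in range(p.size - 2, -1, -1):   # p_interp(r) = max_{r~ >= r} p(r~)
            p[i] = max(p[i], p[i + 1])
        idx = np.where(r[1:] != r[:-1])[0]    # unique recall points
        return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))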
2) SEGMENTATION TASK
For the polyp segmentation task, we have used widely accepted computer vision metrics that include the Dice Coefficient (DSC), Jaccard Coefficient (JC), precision (p), recall (r), and overall accuracy (Acc). JC is also termed IoU. We have also included Frames Per Second (FPS) to evaluate the clinical applicability of the segmentation methods in terms of inference time during the test.
To define each metric, let tp, fp, tn, and fn represent true positives, false positives, true negatives, and false negatives, respectively:

    DSC = (2 · tp) / (2 · tp + fp + fn)    (3)
    IoU = tp / (tp + fp + fn)    (4)
    r = tp / (tp + fn)    (5)
    p = tp / (tp + fp)    (6)
    F2 = (5 · p · r) / (4 · p + r)    (7)
    Acc = (tp + tn) / (tp + tn + fp + fn)    (8)
    FPS = #frames / sec = 1 / (sec / frame)    (9)
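Eqs. (3)–(8) can be computed directly from a pair of binary masks; the NumPy sketch below (our illustration, not the evaluation script behind the reported numbers) mirrors the definitions, while FPS in Eq. (9) is obtained by timing inference:

    import numpy as np

    def segmentation_metrics(pred, gt, eps=1e-8):
        """Eqs. (3)-(8) from binary {0, 1} prediction and ground-truth masks."""
        pred, gt = pred.astype(bool), gt.astype(bool)
        tp = np.sum(pred & gt)
        fp = np.sum(pred & ~gt)
        fn = np.sum(~pred & gt)
        tn = np.sum(~pred & ~gt)
        p = tp / (tp + fp + eps)   # precision, Eq. (6)
        r = tp / (tp + fn + eps)   # recall, Eq. (5)
        return {
            "DSC": 2 * tp / (2 * tp + fp + fn + eps),   # Eq. (3)
            "IoU": tp / (tp + fp + fn + eps),           # Eq. (4)
            "precision": p,
            "recall": r,
            "F2": 5 * p * r / (4 * p + r + eps),        # Eq. (7)
            "Acc": (tp + tn) / (tp + tn + fp + fn),     # Eq. (8)
        }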
B. EXPERIMENTAL SETUP AND CONFIGURATION
The UNet, ResUNet, ResUNet++, DoubleUNet, and HRNet methods were implemented using Keras [70] with a TensorFlow [71] back-end and were run on a Volta 100 GPU and an NVIDIA DGX-2 AI system. A PyTorch implementation was used for the FCN8, PSPNet, DeepLabv3+, UNet-ResNet34, and ColonSegNet networks. Training of these methods was conducted on an NVIDIA Quadro RTX 6000. An NVIDIA GTX 2080Ti was used for test inference for all methods reported in the paper. All of the detection methods were implemented using PyTorch and trained on NVIDIA Quadro RTX 6000 hardware.

In all of the cases, we used 880 images for training and the remaining 120 images for validation. Due to the different image sizes in the dataset, we resized the images to 512 × 512. Hyperparameters are important for DL algorithms to find the optimal solution. However, picking the optimal hyperparameters is difficult. There are algorithms such as grid search, random search, and advanced solutions such as Bayesian optimization for finding the optimal parameters. However, an algorithm such as Bayesian optimization is computationally costly, making it difficult to use when testing several DL algorithms.
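As an illustration, the simplest of these strategies, grid search, exhaustively evaluates every combination of a small set of candidate values. In the sketch below, build_model and train_and_validate are hypothetical placeholders, and the candidate values are illustrative rather than the search space behind Tables 2 and 4:

    from itertools import product

    LEARNING_RATES = [1e-3, 1e-4, 1e-5]   # illustrative candidates
    BATCH_SIZES = [8, 16]

    best_dsc, best_cfg = 0.0, None
    for lr, batch_size in product(LEARNING_RATES, BATCH_SIZES):
        model = build_model()                                # hypothetical helper
        val_dsc = train_and_validate(model, lr, batch_size)  # hypothetical helper
        if val_dsc > best_dsc:
            best_dsc, best_cfg = val_dsc, (lr, batch_size)
    print("best configuration:", best_cfg, "validation DSC:", best_dsc)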
FIGURE 4. Detection and localisation results on the test dataset: On the right of the black solid line are images where EfficientDet-D0, YOLOv4, Faster R-CNN and RetinaNet (with ResNet50 backbone) have similar results and in most cases obtained the highest IoU. On the left are images with a failure case (worse localisation) for at least one of the methods. Confidence scores are provided on the top-left of the red prediction boxes.
TABLE 2. Hyperparameters used for the baseline methods for the polyp detection and localisation task on Kvasir-SEG. Here, CIoU: complete intersection-over-union loss, MSE: mean square error, CE: cross-entropy.
TABLE 3. Results on the polyp detection and localisation task on the Kvasir-SEG dataset. The two best scores are highlighted in bold.
We have done an extensive hyperparameter search to find the optimal hyperparameters for the polyp detection, localisation, and segmentation tasks. These sets of hyperparameters were chosen based on empirical evaluation. The hyperparameters used for the Kvasir-SEG dataset are reported in Table 2 and Table 4.

C. QUANTITATIVE EVALUATION
1) DETECTION AND LOCALISATION
Table 3 shows the detailed results for the polyp detection and localisation task on the Kvasir-SEG dataset. It can be observed that RetinaNet shows improvement over YOLOv3 and YOLOv4 for mean average precision computed over multiple IoU thresholds and for average precision at IoU thresholds of 0.25 (AP25) and 0.50 (AP50). RetinaNet with the ResNet101 backbone achieved an average precision of 0.8745, while YOLOv4 yielded 0.8513. However, for the IoU threshold of 0.75, YOLOv4 showed improvement over RetinaNet, with an AP75 of 0.7594 against 0.7132 for RetinaNet with the ResNet101 backbone. Similarly, an average IoU of 0.8248 was observed for YOLOv3, which is nearly an 8% improvement over RetinaNet. IoU determines the preciseness of the bounding box localisation. EfficientDet-D0 obtained the lowest AP of 0.4756 and IoU of 0.4322. Faster R-CNN obtained an AP of 0.7866.
TABLE 4. Hyperparameters used for the baseline methods for the polyp segmentation task on the Kvasir-SEG dataset.
TABLE 5. Baseline methods for polyp segmentation on the Kvasir-SEG dataset. The two best scores are highlighted in bold. ‘‘-’’ indicates that no backbone is used in the network.
However, Faster R-CNN only obtained an FPS of 8. YOLOv4 with Darknet53 as the backbone obtained an FPS of 48, which is 6× faster than Faster R-CNN. The other competitive network was YOLOv3, with an average FPS of 45.01. However, its average precision value is 5% less than that of YOLOv4. Thus, the quantitative results show that YOLOv4 with Darknet53 can detect different types of polyps at a real-time speed of 48 FPS and an average precision of 0.8513. Therefore, from the evaluation metrics comparison, YOLOv4 with Darknet53 is the best model for detection and localisation of polyps. The results suggest that the model can help gastroenterologists find missed polyps and decrease the polyp miss-rate. Even though the proposed ColonSegNet is primarily built for real-time segmentation of polyps, we compared the bounding box predictions of the proposed network with the SOTA detection methods. It can be observed that the inference of the proposed method is nearly four times faster (180 FPS) than YOLOv4. Additionally, it also obtains competitive scores on both AP and IoU metrics (IoU of 0.81 and AP of 0.80). Therefore, it can also be considered one of the best detection and localisation techniques.

2) SEGMENTATION
Table 5 shows the obtained results on the polyp segmentation task. It can be observed that the UNet with ResNet34 backbone performs better than the other SOTA segmentation methods in terms of DSC and IoU. However, the proposed ColonSegNet outperforms them in terms of processing speed. ColonSegNet is more than four times faster than UNet-ResNet34 in processing colonoscopy frames. The complexity of the network is six times smaller than that of the UNet-ResNet34 network. The proposed network is even smaller than the conventional UNet, with its size being only around 0.75 times that of UNet, while achieving higher scores on the evaluation metrics compared to the classical UNet and its derivatives such as ResUNet and ResUNet++. Additionally, the recall and overall accuracy metrics of ColonSegNet are close to those of the highest performing UNet-ResNet34 network, which shows the proposed method's efficiency.

The original implementation of UNet obtained the lowest DSC of 0.5969, whereas the UNet with ResNet34 as the backbone model obtained the highest DSC of 0.8757. The second and third best DSC scores of 0.8643 and 0.8572 were obtained for DeepLabv3+ with ResNet101 and DeepLabv3+ with ResNet50 as the backbone, respectively. From the table, it is seen that DeepLabv3+ with ResNet101 performs better than DeepLabv3+ with ResNet50. This may be because the top-5 accuracy (i.e., the validation results on the ImageNet model) of ResNet101 is slightly better than that of ResNet50.¹

¹ https://keras.io/api/applications/
FIGURE 5. Best and worst performing samples for polyp segmentation: a) top-scored (left) and bottom-scored (right) sets, b) predicted masks for the top-scored images and c) for the bottom-scored images for all methods, compared to the ground truth (GT) masks. Green rectangles represent the images selected from the top-scored set and red rectangles those from the bottom-scored set. Here, UNet-RN34: UNet-ResNet34, RUNet++: ResUNet++, D-UNet: DoubleUNet, DLabv3+: DeepLabv3+ (ResNet50).
Despite DeepLabv3+ with the ResNet101 backbone having more than 11 times the total number of trainable parameters, and DeepLabv3+ with ResNet34 having nearly eight times the computational complexity, the DSC of ColonSegNet is competitive with both of these networks. However, in terms of processing speed, it is almost 11 times faster than DeepLabv3+ with ResNet101 and nearly seven times faster than DeepLabv3 with the ResNet34 backbone.

FCN8, HRNet and DoubleUNet provided similar results with DSCs of 0.8310, 0.8446, and 0.8129, while ResUNet++ achieved a DSC of only 0.7143. A similar trend can be observed in the F2-score for all methods. For precision, UNet with the ResNet34 backbone achieved the maximum score of p = 0.9435, and DeepLabv3+ with the ResNet50 backbone achieved the highest score of r = 0.8616, while UNet scored the worst with p = 0.6722 and r = 0.6171. The overall accuracy was outstanding for most methods, with the highest for UNet with ResNet34 as the backbone. IoU is also provided in the table for each segmentation method for scientific completeness. Again, UNet with ResNet34 surpassed the others with an mIoU score of 0.8100. Also, UNet with ResNet34 achieved an FPS rate of 35, which is acceptable in terms of speed and relatively fast as compared to DeepLabv3+ with ResNet50 (27.90 FPS), DeepLabv3+ with ResNet101 (16.75 FPS), and other SOTA methods. Additionally, when we consider the number of parameters used (see Table 4), UNet with the ResNet34 backbone uses fewer parameters than the FCN8 or DeepLabv3+ networks.

Due to its low number of trainable parameters and fastest inference time, ColonSegNet is computationally efficient and becomes the best choice when considering the need for real-time segmentation of polyps (182.38 FPS on an NVIDIA GTX 2080Ti), with deployment possible even on low-end hardware devices, making it feasible for many clinical settings. In contrast, UNet with the ResNet34 backbone seems the best choice when taking the DSC metric into account, however with a speed of only 35 FPS on an NVIDIA GTX 2080Ti.

D. QUALITATIVE EVALUATION
Figure 4 shows the qualitative results for the polyp detection and localisation task along with the corresponding confidence scores. It can be observed that for most images on the left side of the vertical line, both YOLOv4 and RetinaNet are able to detect and localise polyps with higher confidence, except for the third-column sample where most of these methods can identify only some polyp areas. Similarly, on the right side of the vertical line, the detected bounding boxes for the 5th and 6th column images are too wide for RetinaNet, while YOLOv4 has the best localisation of the polyp (observe the bounding box). Also, in the seventh column, RetinaNet and EfficientDet-D0 miss the polyp. In the eighth column, YOLOv4 and EfficientDet-D0 miss the small polyp completely, while both stool and polyp are detected as polyp by Faster R-CNN and RetinaNet.

Figure 5 shows the results for the top-scored and bottom-scored sets, selected based on their dice similarity coefficient values for the semantic segmentation methods. It can be seen that all the algorithms are able to detect large polyps and produce high-quality masks (see Figure 5(b)). Here, the best segmentation results can be observed for DeepLabv3+ and UNet-ResNet34. However, as shown in Figure 5(c), the segmentation results are affected for flat polyps (very small), images with a certain degree of inclined view, and images with saturated areas. The proposed ColonSegNet is able to achieve shapes similar to those of the ground truth with some outliers in the predictions, as can be seen in Figure 5(b), while for the predictions on the worst performing images in Figure 5(c), our proposed network provides comparatively improved predictions on almost all samples.

VI. DISCUSSION
It is evident that there is a growing interest in the investigation of computational support systems for decision making through endoscopic images. For the first time, we are using Kvasir-SEG for detection and localisation tasks, and comparing segmentation methods with the most recent SOTA methods. We provide a reproducible benchmarking of the DL methods using standard computer vision metrics in object detection and localisation, and semantic segmentation. The choice of methods is based on their popularity in the medical image domain for detection and segmentation (e.g., UNet, Faster R-CNN), speed (e.g., UNet with ResNet34, YOLOv3), accuracy (e.g., PSPNet, FCN8, or DoubleUNet), or a combination of all (e.g., DeepLabv3+, YOLOv4).

From the experimental results in Table 3, we can observe that the combination of YOLOv3 with the Darknet53 backbone shows improvement over other methods in terms of mIoU, which means a better localisation compared to its counterpart RetinaNet. However, YOLOv4 is 3× faster than RetinaNet and has a good trade-off between average precision and IoU. This is because of its Cross-Stage-Partial connections (CSP) and CIoU loss for bounding box regression. However, RetinaNet with the ResNet101 backbone shows competitive results, surpassing other methods on average precision, but with nearly 5% less IoU compared to YOLOv4 and nearly 5% less than YOLOv3-spp. Similarly, the state-of-the-art methods Faster R-CNN and EfficientDet-D0 provided the lowest AP and IoU.

A choice between computational speed, accuracy and precision is vital in object detection and localisation tasks, especially for colonoscopy video data, where speed is a vital element to achieve real-time performance. Therefore, we consider YOLOv4 with the Darknet53 and CSP backbone as the best approach in the table for the polyp detection and localisation task.

For the semantic segmentation tasks, ColonSegNet showed improvement over all the methods in terms of speed. The method obtained the highest FPS of 182.38. The qualitative results in Figure 5(b) showed the most accurate delineation of polyp pixels compared to the other SOTA methods considered in this paper. The most competitive method to ColonSegNet was UNet with the ResNet34 backbone. The other comparable method was DeepLabv3+, whose accuracy can be attributed to its ability to navigate the semantically meaningful regions with its atrous convolution and spatial-pyramid pooling mechanism. Additionally, the feature concatenation from previous feature maps may have helped to compute more accurate maps for object semantic representation and hence segmentation.
The other competitor was PSPNet, which is based on a similar idea but aggregates the global context information from different regions rather than using dilated convolutions. The higher computational speed of DeepLabv3+ with the same ResNet50 backbone as used in PSPNet in our experiments comes from the fact that 1D separable convolutions and an SPP network are used in DeepLabv3+. We also evaluated the most recent popular SOTA method in segmentation, ‘‘HRNet’’ [65]. While HRNet produced competitive results compared to other SOTA methods, UNet with the ResNet34 backbone and DeepLabv3+ outperformed it on most evaluation metrics, with ColonSegNet being competitive in recall and overall accuracy and outperforming the other SOTA methods significantly.

Figure 5 shows an example of the 16 top-scored and 16 bottom-scored images on DSC for segmentation. From the results in Figure 5(c), it can be observed that there are polyps whose appearance under the given lighting conditions is very similar to the healthy surrounding gastrointestinal tissue. We suggest that including more samples with variable texture, different lighting conditions, and different angular views (refer to the samples in Figure 5(a) on the right, and (c)) can help to improve the DSC and other segmentation metrics. We also observed that the presence of sessile or flat polyps was a major limiting factor for algorithm robustness. Thus, including smaller polyps with respect to image size can help the algorithms generalise better, thereby making these methods more usable for early detection of hard-to-find polyps. In this regard, we also suggest the use of spatial pyramid layers to handle small polyps and the use of context-aware methods, such as the incorporation of artifact or shape information, to improve the robustness of these methods.

A possible limitation of the study is its retrospective design. Clinical studies are required for the validation of the approach in a real-world setting [72]. Additionally, in the presented study design we have resized the images, which can lead to loss of information and affect the algorithm performance. Moreover, we have optimized all the algorithms based on empirical evaluation. Even though optimal hyper-parameters have been set after experiments, we acknowledge that these can be further adjusted. Similarly, meta-learning approaches can be exploited to optimize the hyper-parameters so that they work even in resource-constrained settings.

VII. CONCLUSION
In this paper, we benchmarked deep learning methods on the Kvasir-SEG dataset. We conducted thorough and extensive experiments for the polyp detection, localisation, and segmentation tasks and have shown how different algorithms perform on variable polyp sizes and image resolutions. The proposed ColonSegNet detected and localised polyps at 180 frames per second. Similarly, ColonSegNet segmented polyps at a speed of 182.38 frames per second. The automatic polyp detection, localisation, and segmentation algorithms showed good performance, as evidenced by high average precision, IoU, and FPS for the detection algorithms and DSC, IoU, precision, recall, F2-score, and FPS for the segmentation algorithms. While the algorithms investigated in this paper show a clear strength to be used in clinical settings to help gastroenterologists with the polyp detection, localisation, and segmentation tasks, computational scientists can build upon these methods to further improve them in terms of accuracy, speed and robustness.

Additionally, the qualitative results provide insight into failure cases. This gives an opportunity to address the challenges present in the Kvasir-SEG dataset. Moreover, we have provided experimental results using well-established performance metrics along with the dataset for a fair comparison of the approaches. We believe that further data augmentation, fine-tuning, and more advanced methods can improve the results. Additionally, incorporating artifact issues [73] (e.g., saturation, specularity, bubbles, and contrast) can help improve the performance of polyp detection, localisation, and segmentation. In the future, research should be more focused on designing even better algorithms for the detection, localisation, and segmentation tasks, and models should be built taking the number of parameters into consideration, as required by most clinical systems.

ACKNOWLEDGMENT
Debesh Jha is funded by the Research Council of Norway, project number 263248 (Privaton). The computations in this paper were performed on equipment provided by the Experimental Infrastructure for Exploration of Exascale Computing (eX3), which is financially supported by the Research Council of Norway under contract 270053. Parts of the computational resources were also used from the research supported by the National Institute for Health Research (NIHR) Oxford BRC with additional support from the Wellcome Trust Core Award Grant Number 203141/Z/16/Z. Sharib Ali is supported by the NIHR Oxford Biomedical Research Centre. The views expressed are those of the author(s) and not necessarily those of the NHS, the NIHR or the Department of Health. (Debesh Jha and Sharib Ali contributed equally to this work.)

REFERENCES
[1] J. Asplund, J. H. Kauppila, F. Mattsson, and J. Lagergren, ‘‘Survival trends in gastric adenocarcinoma: A population-based study in Sweden,’’ Ann. Surgical Oncol., vol. 25, no. 9, pp. 2693–2702, Sep. 2018.
[2] Ø. Holme, M. Bretthauer, A. Fretheim, J. Odgaard-Jensen, and G. Hoff, ‘‘Flexible sigmoidoscopy versus faecal occult blood testing for colorectal cancer screening in asymptomatic individuals,’’ Cochrane Database Systematic Rev., vol. 9. Munich, Germany: Zuckschwerdt, Oct. 2013.
[3] D. Jha, P. H. Smedsrud, M. A. Riegler, D. Johansen, T. D. Lange, P. Halvorsen, and H. D. Johansen, ‘‘ResUNet++: An advanced architecture for medical image segmentation,’’ in Proc. IEEE Int. Symp. Multimedia (ISM), Dec. 2019, pp. 225–2255.
[4] R. G. Holzheimer and J. A. Mannick, Surgical Treatment: Evidence-Based Problem-Oriented. 2001.
[5] J. Lee, ‘‘Resection of diminutive and small colorectal polyps: What is the optimal technique?’’ Clin. Endoscopy, vol. 49, no. 4, p. 355, 2016.
[6] P. L. Ponugoti, O. W. Cummings, and D. K. Rex, ‘‘Risk of cancer in small and diminutive colorectal polyps,’’ Digestive Liver Disease, vol. 49, no. 1, pp. 34–37, Jan. 2017.
[7] C. V. Tranquillini, W. M. Bernardo, V. O. Brunaldi, E. T. D. Moura, S. B. Marques, and E. G. H. D. Moura, ‘‘Best polypectomy technique for small and diminutive colorectal polyps: A systematic review and meta-analysis,’’ Arquivos de Gastroenterologia, vol. 55, no. 4, pp. 358–368, Dec. 2018.
[8] O. Kronborg and J. Regula, ‘‘Population screening for colorectal cancer: Advantages and drawbacks,’’ Digestive Diseases, vol. 25, no. 3, pp. 270–273, 2007.
[9] M. F. Kaminski, J. Regula, U. Wojciechowska, E. Kraszewska, M. Polkowski, J. Didkowska, M. Zwierko, M. Rupinski, M. P. Nowacki, and E. Butruk, ‘‘Quality indicators for colonoscopy and the risk of interval cancer,’’ New England J. Med., vol. 362, no. 19, pp. 1795–1803, May 2010.
[10] D. Castaneda, V. B. Popov, E. Verheyen, P. Wander, and S. A. Gross, ‘‘New technologies improve adenoma detection rate, adenoma miss rate, and polyp detection rate: A systematic review and meta-analysis,’’ Gastrointestinal Endoscopy, vol. 88, no. 2, pp. 209–222, 2018.
[11] M. Matyja, A. Pasternak, M. Szura, M. Wysocki, M. Pędziwiatr, and K. Rembiasz, ‘‘How to improve the adenoma detection rate in colorectal cancer screening? Clinical factors and technological advancements,’’ Arch. Med. Sci., AMS, vol. 15, no. 2, p. 424, 2019.
[12] M. Riegler, ‘‘Eir—A medical multimedia system for efficient computer aided diagnosis,’’ Ph.D. dissertation, Dept. Inform., Univ. Oslo, Oslo, Norway, 2017.
[13] T. D. Lange, P. Halvorsen, and M. Riegler, ‘‘Methodology to develop machine learning algorithms to improve performance in gastrointestinal endoscopy,’’ World J. Gastroenterol., vol. 24, no. 45, p. 5057, 2018.
[14] Y. Shin, H. A. Qadir, and I. Balasingham, ‘‘Abnormal colon polyp image synthesis using conditional adversarial networks for improved detection performance,’’ IEEE Access, vol. 6, pp. 56007–56017, 2018.
[15] J. Y. Lee, J. Jeong, E. M. Song, C. Ha, H. J. Lee, J. E. Koo, D.-H. Yang, N. Kim, and J.-S. Byeon, ‘‘Real-time detection of colon polyps during colonoscopy using deep learning: Systematic validation with four independent datasets,’’ Sci. Rep., vol. 10, no. 1, pp. 1–9, Dec. 2020.
[16] P. Wang, X. Xiao, J. R. Glissen Brown, T. M. Berzin, M. Tu, F. Xiong, X. Hu, P. Liu, Y. Song, D. Zhang, X. Yang, L. Li, J. He, X. Yi, J. Liu, and X. Liu, ‘‘Development and validation of a deep-learning algorithm for the detection of polyps during colonoscopy,’’ Nature Biomed. Eng., vol. 2, no. 10, pp. 741–748, Oct. 2018.
[17] D. Jha, P. H. Smedsrud, M. A. Riegler, P. Halvorsen, T. D. Lange, D. Johansen, and H. D. Johansen, ‘‘Kvasir-SEG: A segmented polyp dataset,’’ in Proc. Int. Conf. Multimedia Modeling (MMM), 2020, pp. 451–462.
[18] D. Jha, P. H. Smedsrud, D. Johansen, T. D. Lange, H. Johansen, P. Halvorsen, and M. Riegler, ‘‘A comprehensive study on colorectal polyp segmentation with ResUNet++, conditional random field and test-time augmentation,’’ IEEE J. Biomed. Health Inform., early access, Jan. 5, 2021, doi: 10.1109/JBHI.2021.3049304.
[19] K. Pogorelov, K. R. Randel, C. Griwodz, S. L. Eskeland, T. D. Lange, D. Johansen, C. Spampinato, D. T. Dang-Nguyen, M. Lux, P. T. Schmidt, and M. Riegler, ‘‘Kvasir: A multi-class image dataset for computer aided gastrointestinal disease detection,’’ in Proc. 8th ACM Multimedia Syst. Conf., 2017, pp. 164–169.
[20] K. Pogorelov, K. R. Randel, T. D. Lange, S. L. Eskeland, C. Griwodz, D. Johansen, C. Spampinato, M. Taschwer, M. Lux, P. T. Schmidt, and M. Riegler, ‘‘Nerthus: A bowel preparation quality video dataset,’’ in Proc. ACM Multimedia Syst. Conf. (MMSys), 2017, pp. 170–174.
[21] H. Borgli, V. Thambawita, P. H. Smedsrud, S. Hicks, D. Jha, S. L. Eskeland, K. R. Randel, K. Pogorelov, M. Lux, D. T. D. Nguyen, D. Johansen, C. Griwodz, H. K. Stensland, E. Garcia-Ceja, P. T. Schmidt, H. L. Hammer, M. A. Riegler, P. Halvorsen, and T. D. Lange, ‘‘HyperKvasir, a comprehensive multi-class image and video dataset for gastrointestinal endoscopy,’’ Sci. Data, vol. 7, no. 1, pp. 1–14, Dec. 2020.
[22] J. Silva, A. Histace, O. Romain, X. Dray, and B. Granado, ‘‘Toward embedded detection of polyps in WCE images for early diagnosis of colorectal cancer,’’ Int. J. Comput. Assist. Radiol. Surgery, vol. 9, no. 2, pp. 283–293, Mar. 2014.
[23] J. Bernal, F. J. Sánchez, G. Fernández-Esparrach, D. Gil, C. Rodríguez, and F. Vilariño, ‘‘WM-DOVA maps for accurate polyp highlighting in colonoscopy: Validation vs. Saliency maps from physicians,’’ Computerized Med. Imag. Graph., vol. 43, pp. 99–111, Jul. 2015.
[24] P. H. Smedsrud, H. L. Gjestang, O. O. Nedrejord, E. Næss, V. Thambawita, S. Hicks, H. Borgli, D. Jha, T. J. Berstad, S. L. Eskeland, and M. Lux, ‘‘Kvasir-capsule, a video capsule endoscopy dataset,’’ Sci. Data, 2021.
[25] S. Ali et al., ‘‘Deep learning for detection and segmentation of artefact and disease instances in gastrointestinal endoscopy,’’ Med. Image Anal., vol. 70, May 2021, Art. no. 102002.
[26] D. Jha, S. Ali, K. Emanuelsen, S. A. Hicks, V. Thambawita, E. Garcia-Ceja, M. A. Riegler, T. D. Lange, P. T. Schmidt, H. D. Johansen, D. Johansen, and P. Halvorsen, ‘‘Kvasir-instrument: Diagnostic and therapeutic tool segmentation dataset in gastrointestinal endoscopy,’’ in Proc. Int. Conf. Multimedia Modeling (MMM), 2021, pp. 218–229.
[27] S. A. Karkanis, D. K. Iakovidis, D. E. Maroulis, D. A. Karras, and M. Tzivras, ‘‘Computer-aided tumor detection in endoscopic video using color wavelet features,’’ IEEE Trans. Inf. Technol. Biomed., vol. 7, no. 3, pp. 141–152, Sep. 2003.
[28] S. Ameling, S. Wirth, D. Paulus, G. Lacey, and F. Vilarino, ‘‘Texture-based polyp detection in colonoscopy,’’ in Bildverarbeitung für die Medizin 2009. Informatik aktuell, H. P. Meinzer, T. M. Deserno, H. Handels, and T. Tolxdorff, Eds. Berlin, Germany: Springer, 2009, pp. 346–350, doi: 10.1007/978-3-540-93860-6_70.
[29] N. Tajbakhsh, J. Y. Shin, S. R. Gurudu, R. T. Hurst, C. B. Kendall, M. B. Gotway, and J. Liang, ‘‘Convolutional neural networks for medical image analysis: Full training or fine tuning?’’ IEEE Trans. Med. Imag., vol. 35, no. 5, pp. 1299–1312, May 2016.
[30] H.-C. Shin, H. R. Roth, M. Gao, L. Lu, Z. Xu, I. Nogues, J. Yao, D. Mollura, and R. M. Summers, ‘‘Deep convolutional neural networks for computer-aided detection: CNN architectures, dataset characteristics and transfer learning,’’ IEEE Trans. Med. Imag., vol. 35, no. 5, pp. 1285–1298, May 2016.
[31] J. Bernal et al., ‘‘Comparative validation of polyp detection methods in video colonoscopy: Results from the MICCAI 2015 endoscopic vision challenge,’’ IEEE Trans. Med. Imag., vol. 36, no. 6, pp. 1231–1249, Jun. 2017.
[32] S. Ali et al., ‘‘An objective comparison of detection and segmentation algorithms for artefacts in clinical endoscopy,’’ Sci. Rep., vol. 10, no. 1, pp. 1–15, Dec. 2020.
[33] Y. Wang, W. Tavanapong, J. Wong, J. H. Oh, and P. C. D. Groen, ‘‘Polyp-alert: Near real-time feedback during colonoscopy,’’ Comput. Methods Programs Biomed., vol. 120, no. 3, pp. 164–179, Jul. 2015.
[34] Y. Shin, H. A. Qadir, L. Aabakken, J. Bergsland, and I. Balasingham, ‘‘Automatic colon polyp detection using region based deep CNN and post learning approaches,’’ IEEE Access, vol. 6, pp. 40950–40962, 2018.
[35] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, ‘‘Generative adversarial nets,’’ in Proc. Adv. Neural Inf. Process. Syst., 2014, pp. 2672–2680.
[36] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, ‘‘You only look once: Unified, real-time object detection,’’ in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 779–788.
[37] J. Redmon and A. Farhadi, ‘‘YOLO9000: Better, faster, stronger,’’ in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 7263–7271.
[38] M. Yamada, Y. Saito, H. Imaoka, M. Saiko, S. Yamada, H. Kondo, H. Takamaru, T. Sakamoto, J. Sese, A. Kuchiba, T. Shibata, and R. Hamamoto, ‘‘Development of a real-time endoscopic image diagnosis support system using deep learning technology in colonoscopy,’’ Sci. Rep., vol. 9, no. 1, pp. 1–9, Dec. 2019.
[39] V. Badrinarayanan, A. Kendall, and R. Cipolla, ‘‘SegNet: A deep convolutional encoder-decoder architecture for image segmentation,’’ IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 12, pp. 2481–2495, Dec. 2017.
[40] Y. Guo and B. Matuszewski, ‘‘GIANA polyp segmentation with fully convolutional dilation neural networks,’’ in Proc. 14th Int. Joint Conf. Comput. Vis., Imag. Comput. Graph. Theory Appl., 2019, pp. 632–641.
[41] S. Ali, F. Zhou, C. Daul, B. Braden, A. Bailey, S. Realdon, J. East, G. Wagnières, V. Loschenov, E. Grisan, W. Blondel, and J. Rittscher, ‘‘Endoscopy artifact detection (EAD 2019) challenge dataset,’’ 2019, arXiv:1905.03209. [Online]. Available: http://arxiv.org/abs/1905.03209
[42] R. Wang, S. Chen, C. Ji, J. Fan, and Y. Li, ‘‘Boundary-aware context neural network for medical image segmentation,’’ 2020, arXiv:2005.00966. [Online]. Available: http://arxiv.org/abs/2005.00966
[43] D. Jha, M. A. Riegler, D. Johansen, P. Halvorsen, and H. D. Johansen, ‘‘DoubleU-Net: A deep convolutional neural network for medical image segmentation,’’ in Proc. IEEE 33rd Int. Symp. Comput.-Based Med. Syst. (CBMS), Jul. 2020, pp. 558–564.
[44] S. Minaee, Y. Boykov, F. Porikli, A. Plaza, N. Kehtarnavaz, and D. Terzopoulos, ‘‘Image segmentation using deep learning: A survey,’’ 2020, arXiv:2001.05566. [Online]. Available: http://arxiv.org/abs/2001.05566
[45] M. Baldeon-Calisto and S. K. Lai-Yuen, ‘‘AdaResU-Net: Multiobjective adaptive convolutional neural network for medical image segmentation,’’ Neurocomputing, vol. 392, pp. 325–340, Jun. 2020.
[46] N. Saeedizadeh, S. Minaee, R. Kafieh, S. Yazdani, and M. Sonka, ‘‘COVID TV-UNet: Segmenting COVID-19 chest CT images using connectivity imposed U-Net,’’ 2020, arXiv:2007.12303. [Online]. Available: http://arxiv.org/abs/2007.12303
[47] Y. Meng, M. Wei, D. Gao, Y. Zhao, X. Yang, X. Huang, and Y. Zheng, ‘‘CNN-GCN aggregation enabled boundary regression for biomedical image segmentation,’’ in Proc. Int. Conf. Med. Image Comput. Comput.-Assist. Intervent., 2020, pp. 352–362.
[48] D. Vázquez, A. M. López, F. J. Sánchez, J. Bernal, A. Romero, G. Fernández-Esparrach, M. Drozdzal, and A. Courville, ‘‘A benchmark for endoluminal scene segmentation of colonoscopy images,’’ J. Healthcare Eng., vol. 2017, Jul. 2017, Art. no. 4037190.
[49] T. Roß et al., ‘‘Comparative validation of multi-instance instrument segmentation in endoscopy: Results of the ROBUST-MIS 2019 challenge,’’ Med. Image Anal., vol. 70, Nov. 2020, Art. no. 101920.
[50] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollar, ‘‘Focal loss for dense object detection,’’ in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 2980–2988.
[51] S. Ren, K. He, R. Girshick, and J. Sun, ‘‘Faster R-CNN: Towards real-time object detection with region proposal networks,’’ in Proc. Adv. Neural Inf. Process. Syst., 2015, pp. 91–99.
[52] J. Dai, Y. Li, K. He, and J. Sun, ‘‘R-FCN: Object detection via region-based fully convolutional networks,’’ in Proc. Adv. Neural Inf. Process. Syst., 2016, pp. 379–387.
[53] M. Tan, R. Pang, and Q. V. Le, ‘‘EfficientDet: Scalable and efficient object detection,’’ in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2020, pp. 10781–10790.
[54] M. Tan and Q. V. Le, ‘‘EfficientNet: Rethinking model scaling for convolutional neural networks,’’ 2019, arXiv:1905.11946. [Online]. Available: http://arxiv.org/abs/1905.11946
[55] R. Girshick, ‘‘Fast R-CNN,’’ in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Dec. 2015, pp. 1440–1448.
[56] J. Redmon and A. Farhadi, ‘‘YOLOv3: An incremental improvement,’’ 2018, arXiv:1804.02767. [Online]. Available: http://arxiv.org/abs/1804.02767
[57] A. Bochkovskiy, C.-Y. Wang, and H.-Y. M. Liao, ‘‘YOLOv4: Optimal speed and accuracy of object detection,’’ 2020, arXiv:2004.10934. [Online]. Available: http://arxiv.org/abs/2004.10934
[58] J. Long, E. Shelhamer, and T. Darrell, ‘‘Fully convolutional networks for semantic segmentation,’’ in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2015, pp. 3431–3440.
[59] O. Ronneberger, P. Fischer, and T. Brox, ‘‘U-Net: Convolutional networks for biomedical image segmentation,’’ in Proc. Int. Conf. Med. Image Comput. Comput.-Assist. Intervent. (MICCAI), 2015, pp. 234–241.
[60] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia, ‘‘Pyramid scene parsing network,’’ in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 2881–2890.
[61] L.-C. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam, ‘‘Encoder-decoder with atrous separable convolution for semantic image segmentation,’’ in Proc. Eur. Conf. Comput. Vis. (ECCV), 2018, pp. 801–818.
[62] Z. Zhang, Q. Liu, and Y. Wang, ‘‘Road extraction by deep residual U-Net,’’ IEEE Geosci. Remote Sens. Lett., vol. 15, no. 5, pp. 749–753, May 2018.
[63] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, ‘‘ImageNet: A large-scale hierarchical image database,’’ in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2009, pp. 248–255.
[64] K. Simonyan and A. Zisserman, ‘‘Very deep convolutional networks for large-scale image recognition,’’ 2014, arXiv:1409.1556. [Online]. Available: http://arxiv.org/abs/1409.1556
[65] J. Wang, K. Sun, T. Cheng, B. Jiang, C. Deng, Y. Zhao, D. Liu, Y. Mu, M. Tan, X. Wang, W. Liu, and B. Xiao, ‘‘Deep high-resolution representation learning for visual recognition,’’ IEEE Trans. Pattern Anal. Mach. Intell., early access, Apr. 1, 2020, doi: 10.1109/TPAMI.2020.2983686.
[66] K. He, X. Zhang, S. Ren, and J. Sun, ‘‘Deep residual learning for image recognition,’’ in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 770–778.
[67] J. Hu, L. Shen, and G. Sun, ‘‘Squeeze-and-excitation networks,’’ in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 7132–7141.
[68] M. Everingham, S. M. A. Eslami, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, ‘‘The Pascal visual object classes challenge: A retrospective,’’ Int. J. Comput. Vis., vol. 111, no. 1, pp. 98–136, Jan. 2015.
[69] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, ‘‘Microsoft COCO: Common objects in context,’’ in Proc. Eur. Conf. Comput. Vis., 2014, pp. 740–755, doi: 10.1007/978-3-319-10602-1_48.
[70] F. Chollet et al., ‘‘Keras,’’ Tech. Rep., 2015.
[71] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, and M. Kudlur, ‘‘TensorFlow: A system for large-scale machine learning,’’ in Proc. USENIX Symp. Operating Syst. Design Implement. (OSDI), 2016, pp. 265–283.
[72] Y. Mori et al., ‘‘Real-time use of artificial intelligence in identification of diminutive polyps during colonoscopy: A prospective study,’’ Ann. Internal Med., vol. 169, no. 6, pp. 357–366, 2018.
[73] S. Ali, F. Zhou, A. Bailey, B. Braden, J. E. East, X. Lu, and J. Rittscher, ‘‘A deep learning framework for quality assessment and restoration in video endoscopy,’’ Med. Image Anal., vol. 68, Feb. 2021, Art. no. 101900.

DEBESH JHA received the master's degree in information and communication engineering from Chosun University, Gwangju, Republic of Korea. He is currently pursuing the Ph.D. degree with SimulaMet, Oslo, Norway, and UiT—The Arctic University of Norway, Tromsø, Norway. His research interests include computer vision, machine learning, deep learning, and medical image analysis.

SHARIB ALI received the Ph.D. degree from the University of Lorraine, France. He worked as a Postdoctoral Researcher at the Biomedical Computer Vision Group and the German Cancer Research Center (DKFZ), University of Heidelberg, Heidelberg, Germany. He is currently working at the Department of Engineering Science, Institute of Biomedical Engineering, University of Oxford, Oxford, U.K. His research interests include computer vision and medical image analysis.

NIKHIL KUMAR TOMAR received the bachelor's degree in computer application from Indira Gandhi Open University, New Delhi, India. He is currently doing collaborative research at SimulaMet. His research interests include computer vision, artificial intelligence, parallel processing, and medical image segmentation.
HÅVARD D. JOHANSEN received the Ph.D. degree from the UiT—The Arctic University of Norway. He is currently a Professor with the Department of Informatics, UiT—The Arctic University of Norway. His major research interests include computing networks, cloud computing, network security, information security, and network architecture.

MICHAEL A. RIEGLER received the Ph.D. degree from the Department of Informatics, University of Oslo, Oslo, Norway, in 2015. He is currently working as a Chief Research Scientist at SimulaMet, Oslo, Norway. His research interests include machine learning, video analysis and understanding, image processing, image retrieval, crowdsourcing, social computing, and user intentions.