
UOLO - automatic object detection and segmentation in biomedical images

Teresa Araújo*1,2, Guilherme Aresta*1,2, Adrian Galdran1, Pedro Costa1, Ana Maria Mendonça1,2, and Aurélio Campilho1,2

* these authors contributed equally to this work

1 INESC TEC - Institute for Systems and Computer Engineering, Technology and Science, Porto, Portugal
{tfaraujo,guilherme.m.aresta,adrian.galdran,pvcosta}@inesctec.pt
2 Faculdade de Engenharia da Universidade do Porto, Porto, Portugal
{amendon,campilho}@fe.up.pt

arXiv:1810.05729v1 [cs.CV] 9 Oct 2018

Abstract. We propose UOLO, a novel framework for the simultaneous detection and segmentation of structures of interest in medical images. UOLO consists of an object segmentation module whose intermediate abstract representations are processed and used as input for object detection. The resulting system is optimized simultaneously for detecting a class of objects and segmenting an optionally different class of structures. UOLO is trained on a set of bounding boxes enclosing the objects to detect, as well as pixel-wise segmentation information, when available. A new loss function is devised, taking into account whether a reference segmentation is accessible for each training image, in order to suitably backpropagate the error. We validate UOLO on the task of simultaneous optic disc (OD) detection, fovea detection, and OD segmentation from retinal images, achieving state-of-the-art performance on public datasets.

Keywords: detection, segmentation, biomedical images, eye fundus images, convolutional neural networks

1 Introduction

Detection and segmentation of anatomical structures are central medical image analysis tasks, since they make it possible to delimit Regions-Of-Interest (ROIs), create landmarks, and improve feature collection. In terms of segmentation, Deep Fully-Convolutional (FC) Neural Networks (NNs) achieve the highest performance on a variety of images and problems. Namely, U-Net [1] has become a reference model: its autoencoder structure with skip connections enables the propagation of information from the encoding to the decoding part of the network, allowing a more robust multi-scale analysis while reducing the need for training data.
Similarly, Deep Neural Networks (DNNs) have become the technique of choice in many medical imaging detection problems. The standard approach is to use networks pre-trained on large datasets of natural images as feature extractors of a detection module. For instance, Faster R-CNN [2] uses these features to identify ROIs via a specialized layer. ROIs are then pooled, rescaled, and supplied to a pair of fully-connected NNs responsible for adjusting the size of the bounding boxes and labeling them. Alternatively, YOLOv2 [3] avoids the use of an auxiliary ROI proposal model by directly using region-wise activations from pre-trained weights to predict the coordinates and labels of ROIs.

Fig. 1. Using UOLO for fovea detection and optic disc detection and segmentation.
When a ROI has been identified, the segmentation of an object contained in it becomes much easier. For this reason, the combination of detection and segmentation models into a single method is being explored. For instance, Mask R-CNN [4] extends Faster R-CNN with the addition of FC layers after its final pooling, enabling a fine segmentation without a significant computational overhead. In this architecture, the segmentation and detection modules are decoupled, i.e. the segmentation part is only responsible for predicting a mask, which is then labeled class-wise by the detection module. However, despite the high performance achieved by Mask R-CNN in computer vision, its application to medical image analysis problems remains limited. This is due to the large amount of data annotated at the pixel level that it requires, which is usually not available in medical applications.
In this paper we propose UOLO (Fig. 1), a novel architecture that performs simultaneous detection and segmentation of structures of interest in biomedical images. UOLO combines the strengths of its individual detection and segmentation modules to allow robust and efficient predictions even when little training data is available. Moreover, training UOLO is simple, since the entire network can be updated during back-propagation. We experimentally validate UOLO on eye fundus images for the joint task of fovea (FV) detection, optic disc (OD) detection, and OD segmentation, where we achieve state-of-the-art performance.

2 UOLO framework

2.1 Object segmentation module

For object segmentation we consider an adapted version of the U-Net network presented in [1]. U-Net is composed of FC layers organized in an auto-encoder scheme, which makes it possible to obtain an output of the same size as the input, thus enabling pixel-wise predictions. Skip connections between the encoding and decoding parts are used to avoid the information loss inherent to encoding. The model's upsampling path includes a large number of feature channels with the aim of propagating the multi-scale context information to higher resolution layers. Ultimately, the segmentation prediction results from the analysis of abstract representations of the images at multiple scales, with the majority of the relevant classification information being available on the decoder portion of the network due to the skip connections. We modify the network by adding batch normalization after each convolutional layer and replacing the pooling layers with strided convolutions. The soft intersection over union (IoU) is used as loss:
L_U-Net = 1 − IoU = 1 − (Σ I_t ∘ I_p) / (Σ (I_t + I_p) − Σ I_t ∘ I_p),   (1)

where I_t and I_p are the ground truth mask and the soft prediction mask, respectively, and ∘ is the Hadamard product.

2.2 Object detection module

For object detection we take inspiration from YOLOv2 [3], a network composed of: 1) a DNN that extracts features from an image (F_YOLO); 2) a feature interpretation block that predicts both labels and bounding boxes for the objects of interest (D_YOLO). YOLOv2 assumes that every image patch can contain an object of size similar to one of various template bounding boxes (or anchors) computed a priori from the objects' shape distribution in the training data.

Let the output of F_YOLO be a tensor F of shape S × S × N, where S is the dimension of the spatial grid and N is the number of maps. D_YOLO convolves and reshapes F into Y, a tensor of shape S × S × A × (C + 5), where A is the number of anchors, C is the number of object classes, and 5 is the number of variables to be optimized: the center coordinates x and y, the width w, the height h, and the confidence c (how likely the bounding box is to contain an object). For each anchor A_k in Y, the value of each feature map element m_i,j is responsible for adjusting a property of the predicted bounding box b̂:

(b̂_x, b̂_y) = (σ(x̂) + x_i,j,k, σ(ŷ) + y_i,j,k)
(b̂_w, b̂_h) = (w_i,j,k · e^ŵ, h_i,j,k · e^ĥ)   (2)
confidence = σ(ĉ)

where σ is the sigmoid function. YOLOv2 is trained by optimizing the loss function:

L_YOLO = λ_1 L_centers + λ_2 L_dimensions + λ_3 L_confidence + λ_4 L_classes   (3)

where λ_i are predefined weighting factors, L_centers, L_dimensions and L_confidence are mean squared errors, and L_classes is the cross-entropy loss. Each loss term penalizes a different error: 1) L_centers penalizes the error in the center position of the cells; 2) L_dimensions penalizes an incorrect size, i.e. height and width, of the bounding box; 3) L_confidence penalizes the incorrect prediction of a box presence; 4) L_classes penalizes the misclassification of the objects.
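To make Eq. (2) concrete, the sketch below decodes a raw output tensor into boxes. The tensor layout and the interpretation of x_i,j,k, y_i,j,k as grid-cell offsets and w_i,j,k, h_i,j,k as anchor dimensions follow the YOLOv2 convention and are our assumptions.

```python
import numpy as np

def decode_boxes(raw, anchors, S):
    # raw: (S, S, A, C + 5) tensor with raw (x, y, w, h, c, classes...) per anchor;
    # anchors: list of A (width, height) templates, in grid units.
    sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))
    boxes = []
    for i in range(S):          # grid row
        for j in range(S):      # grid column
            for k, (aw, ah) in enumerate(anchors):
                x, y, w, h, c = raw[i, j, k, :5]
                bx = (sigmoid(x) + j) / S    # center x, relative to the image
                by = (sigmoid(y) + i) / S    # center y, relative to the image
                bw = aw * np.exp(w) / S      # width, scaled from the anchor
                bh = ah * np.exp(h) / S      # height, scaled from the anchor
                boxes.append((bx, by, bw, bh, sigmoid(c)))
    return boxes
```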
2.3 UOLO for joint object detection and segmentation

The UOLO framework for object detection and segmentation is depicted in Fig. 2: the segmentation module itself is used as a feature extraction module, adopting the role of F_YOLO and serving as input for the localization module D_YOLO. The intuition behind this design is that the abstract representation learned by the decoding part of U-Net contains multi-scale information that can be useful not only for segmenting objects, but also for detecting them. In addition, the class of objects that UOLO can detect is not limited to those for which segmentation ground-truth is available.
Let M_U-Net be a U-Net-like network that, given pairs of images and binary masks, can be trained to perform segmentation by minimizing L_U-Net (Eq. 1). M_U-Net has a second output corresponding to the concatenation of the downsampled decoding maps with its bottleneck (last encoder layer). The resulting tensor corresponds to a set of multi-scale representations of the original image that are supplied to the object detection block D_YOLO, which, in turn, can be optimized via L_YOLO, defined in Eq. 3. D_YOLO and M_U-Net are then merged by concatenation into M_UOLO, a single model that can be optimized by minimizing the sum of the corresponding loss functions:

L_UOLO = L_YOLO + L_U-Net   (4)


Thanks to the straightforward definition of the loss function in Eq. (4), M_UOLO can be trained with the simple iterative scheme detailed in Algorithm 1. In essence, L_U-Net is updated only when segmentation information is available; however, a global weight update is performed at every step based on backpropagation of the prediction error. Furthermore, the outlined training scheme allows for different numbers of strong (pixel-wise) and weak (bounding box) annotations, easing its application to medical images. A sketch of the multi-scale feature tensor construction is shown below.
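The following Keras sketch illustrates how M_U-Net's second output could be formed. The paper states only that the decoding maps are downsampled and concatenated with the bottleneck; the pooling operator and the exact set of maps are our assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers

def multiscale_feature_tensor(decoder_maps, bottleneck, grid=16):
    # Downsample every decoder feature map to the S x S detection grid and
    # concatenate it with the bottleneck (last encoder layer) channel-wise.
    # Average pooling is our assumption; the paper does not name the operator.
    pooled = []
    for fmap in decoder_maps:
        stride = fmap.shape[1] // grid   # e.g. a 256x256 map uses pool size 16
        pooled.append(layers.AveragePooling2D(pool_size=stride)(fmap))
    return layers.Concatenate(axis=-1)(pooled + [bottleneck])
```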

Fig. 2. UOLO framework, nesting a U-Net responsible for segmentation and feature extraction for a YOLOv2-based detector. M_U-Net: U-Net part; M_UOLO: full UOLO.
Algorithm 1 Loss computation scheme of UOLO. M_U-Net: U-Net part from the UOLO model; M_UOLO: full UOLO model; b_det: batches of images with objects' bounding box ground truth; b_seg: batches of images with segmentation ground truth.

  L_U-Net ← 1
  for each training step do
    M_UOLO ← train(M_UOLO, b_det, L_UOLO)   {train on n_det batches from b_det, backpropagating L_UOLO}
    update(L_YOLO)
    M_U-Net ← train(M_U-Net, b_seg, L_U-Net)   {train on n_seg batches from b_seg, backpropagating L_U-Net}
    update(L_U-Net)
    L_UOLO ← L_YOLO + L_U-Net
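Below is a minimal Python rendering of Algorithm 1, assuming two Keras-style models that share the U-Net weights; all names (model_uolo, model_unet, det_batches, seg_batches, n_det, n_seg, num_steps) are hypothetical.

```python
# Sketch of Algorithm 1. model_uolo outputs detections and masks and is
# compiled with L_UOLO; model_unet is the U-Net part, sharing its weights
# with model_uolo and compiled with L_U-Net (Eq. 1).
for step in range(num_steps):
    # Train the full model on n_det batches with bounding-box ground truth,
    # backpropagating L_UOLO through both D_YOLO and M_U-Net.
    for x, y_det in det_batches(n_det):
        model_uolo.train_on_batch(x, y_det)
    # When pixel-wise masks are available, train the shared U-Net part on
    # n_seg batches, backpropagating only L_U-Net; the shared weights are
    # updated in place, so model_uolo benefits as well.
    for x, y_seg in seg_batches(n_seg):
        model_unet.train_on_batch(x, y_seg)
```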

3 Experiments and results

3.1 Datasets and experimental details

We test UOLO on 3 public eye fundus datasets with healthy and pathological images: 1) Messidor [5] has 1200 images (1440×960, 2240×1488 and 2304×1536 pixels, 45° field-of-view (FOV)), 1136 of which have ground truth (GT) for OD segmentation and FV centers¹; 2) the IDRID² training set has 413 images (4288×2848 pixels, 50° FOV) with OD and FV centers, 54 of which also have OD segmentation; 3) DRIVE [6] has 40 images (768×584 pixels, 45° FOV) with OD segmentation³.
All images are cropped around the FOV (determined via Otsu's thresholding) and resized to 256×256 pixels. The side of the square GT bounding boxes is set to 32 pixels for the FV and 64 for the OD, following their relative sizes in the image. For training, n_det and n_seg (Alg. 1) are set to 256 and 32, respectively. Online data augmentation, a mini-batch size of 8, and the Adam optimizer (learning rate of 1e-4) were used for training, while 25% of the data was kept for validation. At test time, the bounding box with the highest confidence for each class is kept, and the predicted soft segmentations are binarized using a threshold of 0.5.
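A sketch of the described FOV cropping, assuming OpenCV and an Otsu threshold on the grayscale image (the paper does not specify the implementation details):

```python
import cv2
import numpy as np

def crop_fov_and_resize(img, size=256):
    # Segment the circular field of view with Otsu's threshold on the
    # grayscale image, crop to its bounding box, and resize to size x size.
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    _, mask = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    ys, xs = np.nonzero(mask)
    cropped = img[ys.min():ys.max() + 1, xs.min():xs.max() + 1]
    return cv2.resize(cropped, (size, size))
```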
The OD segmentation is evaluated with the IoU and Sørensen-Dice overlap metrics. Detection is evaluated in terms of the mean Euclidean distance (ED) between the prediction and the GT. We also evaluate the ED relative to the OD radius, D̄ [7,8]. Finally, detection success, S1R, is assessed using a maximum distance criterion of 1 OD radius.
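For reference, these metrics can be computed as below; the binary-mask and pixel-coordinate inputs are our assumptions.

```python
import numpy as np

def evaluation_metrics(mask_gt, mask_pred, c_gt, c_pred, od_radius):
    # Segmentation: IoU and Sørensen-Dice between binary masks.
    inter = np.logical_and(mask_gt, mask_pred).sum()
    union = np.logical_or(mask_gt, mask_pred).sum()
    iou = inter / union
    dice = 2 * inter / (mask_gt.sum() + mask_pred.sum())
    # Detection: Euclidean distance (ED) between predicted and GT centers,
    # its OD-radius normalized version (D̄), and the 1R success criterion.
    ed = float(np.linalg.norm(np.asarray(c_gt) - np.asarray(c_pred)))
    d_norm = ed / od_radius
    success_1r = ed <= od_radius   # counts toward S1R
    return iou, dice, ed, d_norm, success_1r
```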

3.2 Results and discussion

We evaluate UOLO both inter- and intra-dataset-wise. For inter-dataset experiments, UOLO was trained on Messidor and tested on the other datasets, whereas for intra-dataset studies stratified 5-fold cross-validation was used.
¹ http://www.uhu.es/retinopathy
² https://idrid.grand-challenge.org/, available since January 20, 2018
³ https://sites.google.com/a/uw.edu/src/useful-links
Table 1. UOLO performance on optic disc (OD) detection and segmentation and fovea (FV) detection. n seg., n det.: number of training images for segmentation and detection. Rows without a test dataset correspond to intra-dataset 5-fold cross-validation.

                                 OD seg.      OD det.        FV det.
Train     Test    n seg.  n det. IoU   Dice   D̄      S1R    D̄      S1R
Messidor  -       680     680    0.88  0.93   0.111  99.74  0.121  99.38
Messidor  -       100     680    0.87  0.93   0.114  99.74  0.114  97.89
IDRID     -       30      280    0.88  0.93   0.095  99.79  0.288  93.78
Messidor  IDRID   852     852    0.84  0.91   0.138  99.78  0.403  89.06
Messidor  DRIVE   852     852    0.82  0.89   0.171  97.50  -      -

We do not extensively optimize the training parameters, so as to verify how robust UOLO is when dealing with segmentation and detection simultaneously. Table 1 shows the results of UOLO for the OD detection and segmentation and FV detection tasks, Table 2 compares our performance with state-of-the-art methods, and Fig. 3 shows two prediction examples in complex detection and segmentation cases.
UOLO achieves equal or better performance in comparison to the state-of-the-art on both detection and segmentation tasks (IoU 0.88 ± 0.09 on Messidor) in a single-step prediction. Furthermore, the proposed network is robust even in inter-dataset scenarios, maintaining both segmentation and detection performance. This indicates that the abstract representations learned by UOLO are highly effective for solving the task at hand. It is worth noting that our segmentation and detection performances do not change significantly even when UOLO is trained with only 15% of the pixel-wise annotated images. This means that UOLO does not require a significant amount of pixel-wise annotations, easing its application in the medical field, where these are expensive to obtain.
Our results also suggest that UOLO is capable of using multi-scale information (e.g., relative position to the OD or vessel tree) to perform predictions.

Fig. 3. Examples of results of UOLO on Messidor images. Green curve: segmented optic disc (OD); green and blue boxes: predicted OD and FV locations, respectively; black curve: ground truth OD segmentation; black and blue dots: ground truth OD and FV locations, respectively. The object detection confidence is shown next to each box. IoU (intersection over union) and normalized distance (D̄) values are also shown.
Table 2. State-of-the-art for OD detection and segmentation and FV detection.

(a) OD segmentation

Dataset   Messidor      DRIVE
Method    IoU    Dice   IoU    Dice
UOLO      0.88   0.93   0.82   0.89
U-Net     0.88   0.93   0.81   0.88
[9]       0.91   -      -      -
[10]      0.89   0.94   -      -
[11]      0.84   -      0.81   -
[12]      0.82   -      0.72   -
[13]      -      -      0.82   -

(b) OD and FV detection

Task      OD det.                     FV det.
Dataset   Messidor       DRIVE       Messidor
Method    ED     S1R     ED    S1R   ED     S1R
UOLO      9.40   99.74   8.13  97.5  10.44  99.38
YOLOv2    6.86   100     7.20  97.5  9.01   100
[14]      -      97      -     -     -      96.6
[8]       -      -       -     -     16.09  98.24
[7]       -      -       -     -     20.17  98.24
[15]      -      98.83   -     -     -      -
[16]      23.17  99.75   15.57 100   34.88  99.40
[10]      -      99.75   -     -     -      -

For instance, Fig. 3 shows UOLO's output for two Messidor images, illustrating that the network is capable of detecting the FV in a low-contrast scenario. On the other hand, the segmentation and detection processes are not completely interdependent, as expected from the proposed training scheme, since the network segments OD confounders outside the detected OD region. Another advantage of UOLO is that these segmentation errors are easily correctable by limiting the pixel-wise predictions to the found OD region. Unlike hand-crafted feature-based methods, UOLO does not require extensive parameter tuning and is simple to extend to different applications.
We also evaluate U-Net (M_U-Net, Fig. 2) for OD segmentation and YOLOv2 (with a pretrained Inception-v3 as feature extractor) for OD and FV detection (Table 2). The training conditions were set as for UOLO. UOLO's segmentation performance is practically the same as U-Net's, whereas its detection performance drops slightly compared with YOLOv2, mainly for OD detection. However, one has to consider the trade-off between computational burden and performance: the UOLO network has 23 347 063 parameters, whereas U-Net has 15 063 985 and YOLOv2 has 21 831 470, so training U-Net and YOLOv2 separately requires optimizing a total of 36 895 455 parameters (a 60% increase).

4 Conclusions

We presented UOLO, a novel network that performs joint detection and segmentation of objects of interest in medical images by using the abstract representations learned by U-Net. Furthermore, UOLO can detect objects from a class different from the one for which segmentation ground-truth is available.

We tested UOLO for simultaneous fovea detection and optic disc detection and segmentation, achieving state-of-the-art results. The network can be trained with relatively few images with segmentation ground-truth and still maintain a high performance. UOLO is also robust in inter-dataset settings, thus showing great potential for applications in the medical image analysis field.
Acknowledgements T. Araújo is funded by the FCT grant SFRH/BD/122365/2016. G. Aresta is funded by the FCT grant SFRH/BD/120435/2016. This work is funded by the ERDF European Regional Development Fund, Operational Programme for Competitiveness and Internationalisation - COMPETE 2020, and by National Funds through the FCT - project CMUP-ERI/TIC/0028/2014.

References
1. Ronneberger, O., Fischer, P., Brox, T.: U-Net: Convolutional Networks for Biomedical Image Segmentation. Medical Image Computing and Computer-Assisted Intervention (MICCAI), LNCS 9351 (2015) 234–241
2. Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: Towards Real-Time Object
Detection with Region Proposal Networks. IEEE Transactions on Pattern Analysis
and Machine Intelligence 39(6) (2017) 1137–1149
3. Redmon, J., Farhadi, A.: YOLO9000: Better, Faster, Stronger. arXiv (2016)
4. He, K., Gkioxari, G., Dollár, P., Girshick, R.: Mask R-CNN. arXiv (2017)
5. Decencière, E., Zhang, X., Cazuguel, G., et al.: Feedback on a publicly distributed
image database: The Messidor database. Image Analysis and Stereology 33(3)
(2014) 231–234
6. Staal, J., Niemeijer, M., Viergever, M.A., van Ginneken, B.: Ridge-Based Vessel Segmentation in Color Images of the Retina. IEEE Transactions on Medical Imaging 23(4) (2004) 501–509
7. Gegundez-Arias, M.E., Marin, D., Bravo, J.M., Suero, A.: Locating the fovea
center position in digital fundus images using thresholding and feature extraction
techniques. Computerized Medical Imaging and Graphics 37(5-6) (2013) 386–393
8. Aquino, A.: Establishing the macular grading grid by means of fovea centre detec-
tion using anatomical-based and visual-based features. Computers in Biology and
Medicine 55 (2014) 61–73
9. Dai, B., Wu, X., Bu, W.: Optic disc segmentation based on variational model with
multiple energies. Pattern Recognition 64 (2017) 226–235
10. Dashtbozorg, B., Mendonça, A., Campilho, A.: Optic disc segmentation using the
sliding band filter. Computers in Biology and Medicine 56 (2015) 1–12
11. Roychowdhury, S., Koozekanani, D.D., Kuchinka, S.N., Parhi, K.K.: Optic disc
boundary and vessel origin segmentation of fundus images. IEEE Journal of
Biomedical and Health Informatics 20(6) (2016) 1562–1574
12. Morales, S., Naranjo, V., Angulo, J., Alcañiz, M.: Automatic detection of optic disc based on PCA and mathematical morphology. IEEE Transactions on Medical Imaging 32(4) (2013) 786–796
13. Salazar-Gonzalez, A., Kaba, D., Li, Y., Liu, X.: Segmentation of Blood Vessels and
Optic Disc in Retinal Images. IEEE Journal of Biomedical and Health Informatics
18(6) (2014) 1874–1886
14. Al-Bander, B., Al-Nuaimy, W., Williams, B.M., Zheng, Y.: Multiscale sequential
convolutional neural networks for simultaneous detection of fovea and optic disc.
Biomedical Signal Processing and Control 40 (2018) 91–101
15. Aquino, A., Gegúndez-Arias, M.E., Marín, D.: Detecting the Optic Disc Boundary in Digital Fundus Images Using Morphological, Edge Detection, and Feature Extraction Techniques. IEEE Transactions on Medical Imaging 29(11) (2010) 1860–1869
16. Kamble, R., Kokare, M., Deshmukh, G., Hussin, F.A., Mériaudeau, F.: Localization
of optic disc and fovea in retinal images using intensity based line scanning analysis.
Computers in Biology and Medicine 87 (2017) 382–396
