UOLO - Automatic Object Detection and Segmentation

1 INESC TEC - Institute for Systems and Computer Engineering, Technology and Science, Porto, Portugal
{tfaraujo,guilherme.m.aresta,adrian.galdran,pvcosta}@inesctec.pt
2 Faculdade de Engenharia da Universidade do Porto, Porto, Portugal
{amendon,campilho}@fe.up.pt
1 Introduction
Detection and segmentation of anatomical structures are central medical image analysis tasks, since they allow delimiting Regions-Of-Interest (ROIs), creating landmarks, and improving feature collection. In terms of segmentation, Deep Fully-Convolutional (FC) Neural Networks (NNs) achieve the highest performance on a variety of images and problems. Namely, U-Net [1] has become a reference model: its autoencoder structure with skip connections enables the propagation of features from the encoding to the decoding part of the network, allowing a more robust multi-scale analysis while reducing the need for training data.
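To make the skip-connection pattern concrete, the sketch below builds a toy U-Net-style encoder-decoder in Keras; the depths and filter counts are illustrative assumptions, not the configuration used in this paper.

```python
# Minimal sketch of the U-Net pattern: encoder feature maps are concatenated
# (skip connections) with the corresponding decoder stages.
import tensorflow as tf
from tensorflow.keras import layers

def tiny_unet(input_shape=(256, 256, 3)):
    inp = layers.Input(input_shape)
    # Encoder
    e1 = layers.Conv2D(32, 3, padding="same", activation="relu")(inp)
    p1 = layers.MaxPooling2D()(e1)
    e2 = layers.Conv2D(64, 3, padding="same", activation="relu")(p1)
    p2 = layers.MaxPooling2D()(e2)
    # Bottleneck
    b = layers.Conv2D(128, 3, padding="same", activation="relu")(p2)
    # Decoder with skip connections: concatenate the encoder maps
    u2 = layers.UpSampling2D()(b)
    d2 = layers.Conv2D(64, 3, padding="same", activation="relu")(
        layers.Concatenate()([u2, e2]))
    u1 = layers.UpSampling2D()(d2)
    d1 = layers.Conv2D(32, 3, padding="same", activation="relu")(
        layers.Concatenate()([u1, e1]))
    out = layers.Conv2D(1, 1, activation="sigmoid")(d1)  # soft mask
    return tf.keras.Model(inp, out)
```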
Similarly, Deep Neural Networks (DNNs) have become the technique of choice
in many medical imaging detection problems. The standard approach is to use
networks pre-trained on large datasets of natural images as feature extractors of
a detection module. For instance, Faster R-CNN [2] uses these features to identify ROIs via a specialized layer; the ROIs are then pooled, rescaled, and supplied to a pair of Fully-Connected NNs responsible for adjusting the size and the label of the bounding boxes. Alternatively, YOLOv2 [3] avoids the use of an auxiliary ROI proposal model by directly using region-wise activations from pre-trained weights to predict the coordinates and labels of ROIs.

Fig. 1. Using UOLO for fovea detection and optic disc detection and segmentation.
Once a ROI has been identified, the segmentation of an object contained in it becomes much easier. For this reason, the combination of detection and segmentation models into a single method is being explored. For instance, Mask R-CNN [4] extends Faster R-CNN with the addition of FC layers after its final pooling, enabling a fine segmentation without a significant computational overhead. In this architecture, the segmentation and detection modules are decoupled, i.e. the segmentation part is only responsible for predicting a mask, which is then labeled class-wise by the detection module. However, despite the high performance achieved by Mask R-CNN in computer vision, its application to medical image analysis problems remains limited. This is due to the large amount of data annotated at the pixel level that it requires, which is usually not available in medical applications.
In this paper we propose UOLO (Fig. 1), a novel architecture that performs
simultaneous detection and segmentation of structures of interest in biomedical
images. UOLO combines the strengths of its individual detection and segmentation modules to allow robust and efficient predictions even when little training data is available. Moreover, training UOLO is simple, since the entire network can
be updated during back-propagation. We experimentally validate UOLO on eye
fundus images for the joint task of fovea (FV) detection, optic disc (OD) detec-
tion, and OD segmentation, where we achieve state-of-the-art performance.
2 UOLO framework
For object segmentation we use a U-Net (MU-Net in Fig. 2), trained with a Dice-based overlap loss,

LU-Net = 1 − 2 Σ(It ◦ Ip) / (Σ It + Σ Ip),

where It and Ip are the ground truth mask and the soft prediction mask, respectively, and ◦ is the Hadamard product.
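A minimal sketch of a loss with this Dice form (an assumption consistent with the definitions above; the function name is ours):

```python
# Soft Dice loss: 1 - 2*sum(It ∘ Ip) / (sum(It) + sum(Ip)).
import tensorflow as tf

def dice_loss(i_t, i_p, eps=1e-7):
    inter = tf.reduce_sum(i_t * i_p)          # sum of the Hadamard product
    denom = tf.reduce_sum(i_t) + tf.reduce_sum(i_p)
    return 1.0 - (2.0 * inter + eps) / (denom + eps)
```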
For object detection we take inspiration from YOLOv2 [3], a network composed of: 1) a DNN that extracts features from an image (FYOLO); 2) a feature interpretation block that predicts both labels and bounding boxes for the objects of interest (DYOLO). YOLOv2 assumes that every image patch can contain an object of size similar to one of several template bounding boxes (or anchors) computed a priori from the objects' shape distribution in the training data.
Let the output of FYOLO be a tensor F of shape S × S × N, where S is the dimension of the spatial grid and N is the number of feature maps. DYOLO convolves and reshapes F into Y, a tensor of shape S × S × A × (C + 5), where A is the number of anchors, C is the number of object classes, and 5 is the number of variables to be optimized: the center coordinates x and y, the width w, the height h, and the confidence c (how likely the bounding box is to contain an object). For each anchor Ak in Y, the value of each feature map element mi,j is responsible for adjusting a property of the predicted bounding box b̂, following the standard YOLOv2 parametrization: sigmoid-activated offsets for the center and confidence, and exponential scaling of the anchor dimensions for the width and height.
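The sketch below illustrates how such an output tensor Y can be decoded into boxes under the standard YOLOv2 parametrization; the channel layout (x, y, w, h, c, then class scores) and the function names are our assumptions, not the paper's code.

```python
# Decode a YOLOv2-style output tensor Y of shape (S, S, A, C+5) into boxes.
# `anchors` holds the template (w, h) pairs, expressed in grid-cell units.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def decode_yolo(Y, anchors):
    S = Y.shape[0]
    boxes = []
    for i in range(S):
        for j in range(S):
            for k, (aw, ah) in enumerate(anchors):
                tx, ty, tw, th, tc = Y[i, j, k, :5]
                x = (j + sigmoid(tx)) / S       # box center, relative to image
                y = (i + sigmoid(ty)) / S
                w = aw * np.exp(tw) / S         # anchor width scaled by exp(tw)
                h = ah * np.exp(th) / S
                conf = sigmoid(tc)              # objectness confidence
                cls = int(np.argmax(Y[i, j, k, 5:]))  # most likely class
                boxes.append((x, y, w, h, conf, cls))
    return boxes
```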
Fig. 2. UOLO framework, nesting a U-Net responsible for segmentation and feature extraction for a YOLOv2-based detector. MU-Net: U-Net part; MUOLO: full UOLO.
Algorithm 1 Loss computation scheme of UOLO. MU-Net: U-Net part of the UOLO model; MUOLO: full UOLO model; bdet: batches of images with objects' bounding box ground truth; bseg: batches of images with segmentation ground truth.

LU-Net ← 1
for each training step do
    MUOLO ← train(MUOLO, bdet, LUOLO)  {train on ndet batches from bdet, backpropagating LUOLO}
    update(LYOLO)
    MU-Net ← train(MU-Net, bseg, LU-Net)  {train on nseg batches from bseg, backpropagating LU-Net}
    update(LU-Net)
    LUOLO ← LYOLO + LU-Net
end for
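The following sketch mirrors the structure of Algorithm 1; `train_step_det` and `train_step_seg` are hypothetical callables standing in for the detection and segmentation training phases, each running the stated number of batches and returning the resulting loss value.

```python
# Alternating training scheme of Algorithm 1 (structural sketch).
def train_uolo(train_step_det, train_step_seg, steps, n_det=256, n_seg=32):
    l_unet = 1.0                       # L_U-Net is initialized to 1 (Alg. 1)
    l_uolo = None
    for _ in range(steps):
        # Detection phase: backpropagate L_UOLO = L_YOLO + L_U-Net
        # through the full model on n_det detection batches.
        l_yolo = train_step_det(n_det, l_unet)
        # Segmentation phase: backpropagate L_U-Net through the U-Net part
        # on n_seg segmentation batches.
        l_unet = train_step_seg(n_seg)
        # Combined loss used in the next detection phase.
        l_uolo = l_yolo + l_unet
    return l_uolo
```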
3 Experiments and results

We test UOLO on 3 public eye fundus datasets with healthy and pathological images: 1) Messidor [5] has 1200 images (1440×960, 2240×1488 and 2304×1536 pixels, 45° field-of-view (FOV)), 1136 of which have ground truth (GT) for OD segmentation and FV centers; 2) the IDRiD training set has 413 images (4288×2848 pixels, 50° FOV) with OD and FV centers, 54 of which also have OD segmentation GT; 3) DRIVE [6] has 40 images (768×584 pixels, 45° FOV) with OD segmentation GT.
All images are cropped around the FOV (determined via Otsu's thresholding) and resized to 256×256 pixels. The side of the square GT bounding boxes is set to 32 and 64 pixels for the FV and OD, respectively, following their relative size in the image. For training, ndet and nseg (Alg. 1) are set to 256 and 32, respectively. Online data augmentation, a mini-batch size of 8, and the Adam optimizer (learning rate of 1e-4) were used for training, while 25% of the data was kept for validation. At inference, the bounding box with the highest confidence for each class is kept, and the predicted soft segmentations are binarized using a threshold of 0.5.
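A sketch of this preprocessing (FOV estimation via Otsu's threshold, cropping to its bounding box, and resizing) might look as follows; the function name is ours, not from the paper's code.

```python
# Crop a fundus image to its field of view and resize to 256x256.
import numpy as np
from skimage.filters import threshold_otsu
from skimage.transform import resize

def crop_fov(img):
    gray = img.mean(axis=-1)                # rough luminance channel
    mask = gray > threshold_otsu(gray)      # FOV vs. black background
    rows, cols = np.nonzero(mask)
    r0, r1 = rows.min(), rows.max() + 1
    c0, c1 = cols.min(), cols.max() + 1
    cropped = img[r0:r1, c0:c1]             # bounding box of the FOV
    return resize(cropped, (256, 256), preserve_range=True)
```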
The OD segmentation is evaluated with the IoU and Sørensen-Dice overlap metrics. The detection is evaluated in terms of the mean Euclidean distance (ED) between the prediction and the GT. We also evaluate the ED relative to the OD radius, D̄ [7,8]. Finally, detection success, S1R, is assessed using the maximum distance criterion of 1 OD radius.
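For concreteness, these metrics can be computed as in the sketch below (function names are ours):

```python
# Overlap metrics for binary masks and distance metrics for detections.
import numpy as np

def iou_dice(gt, pred, eps=1e-7):
    inter = np.logical_and(gt, pred).sum()
    union = np.logical_or(gt, pred).sum()
    iou = inter / (union + eps)
    dice = 2 * inter / (gt.sum() + pred.sum() + eps)
    return iou, dice

def detection_metrics(p_pred, p_gt, od_radius):
    ed = np.linalg.norm(np.asarray(p_pred) - np.asarray(p_gt))
    d_norm = ed / od_radius            # D-bar: ED relative to the OD radius
    success = ed <= od_radius          # S_1R: within 1 OD radius
    return ed, d_norm, success
```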
We do not extensively optimize the training parameters, in order to verify how robust UOLO is when dealing with segmentation and detection simultaneously. Table 1 shows the results of UOLO for the OD detection and segmentation and FV detection tasks, Table 2 compares our performance with state-of-the-art methods, and Fig. 3 shows two prediction examples in complex detection and segmentation cases.
UOLO achieves performance equal to or better than the state of the art on both detection and segmentation tasks (IoU 0.88 ± 0.09 on Messidor) in a single prediction step. Furthermore, the proposed network is robust even in inter-dataset scenarios, maintaining both segmentation and detection performance. This indicates that the abstract representations learned by UOLO are highly effective for solving the task at hand. It is worth noting that our segmentation and detection performances do not change significantly even when UOLO is trained with only 15% of the pixel-wise annotated images. This means that UOLO does not require a significant amount of pixel-wise annotations, easing its application in the medical field, where these are expensive to obtain.
[Fig. 3 panel annotations: left image - OD 0.96, FV 0.16, IoU 0.941, D̄ 0.183; right image - OD 0.93, FV 0.87, IoU 0.625, D̄ 0.172.]

Our results also suggest that UOLO is capable of using multi-scale information (e.g. relative position to the OD or vessel tree) to perform predictions. For instance, Fig. 3 shows UOLO's output for two Messidor images, illustrating that
the network is capable of detecting the FV in a low-contrast scenario. On the other hand, the segmentation and detection processes are not completely interdependent, as expected from the proposed training scheme, since the network segments OD confounders outside the detected OD region. Another advantage of UOLO is that these segmentation errors are easily correctable by limiting the pixel-wise predictions to the found OD region, as sketched below. Unlike hand-crafted feature-based methods, UOLO does not require extensive parameter tuning and is simple to extend to different applications.
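A sketch of this correction, assuming the detected OD box is given in pixel coordinates (the function name is ours):

```python
# Zero out soft segmentation predictions outside the detected OD box.
import numpy as np

def restrict_to_box(soft_mask, box):
    x0, y0, x1, y1 = box                   # detected OD bounding box
    out = np.zeros_like(soft_mask)
    out[y0:y1, x0:x1] = soft_mask[y0:y1, x0:x1]
    return out
```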
We also evaluate U-Net (MU-Net, Fig. 2) for OD segmentation and YOLOv2 (with a pretrained Inception-v3 as feature extractor) for OD and FV detection (Table 2), with training conditions set as in UOLO. UOLO's segmentation performance is practically the same as U-Net's, whereas the detection performance drops slightly when compared with YOLOv2, mainly for OD detection. However, one has to consider the trade-off between computational burden and performance: the UOLO network has 23 347 063 parameters, whereas U-Net has 15 063 985 and YOLOv2 has 21 831 470, so training U-Net and YOLOv2 separately requires optimizing a total of 36 895 455 parameters (a 60% increase).
4 Conclusions
We presented UOLO, a novel network that performs joint detection and segmentation of objects of interest in medical images by using the abstract representations learned by U-Net. Furthermore, UOLO can detect objects of classes different from those for which segmentation ground truth is available.
We tested UOLO for simultaneous fovea detection and optic disc detection and segmentation, achieving state-of-the-art results. The network can be trained with relatively few segmentation-annotated images and still maintain a high performance. UOLO is also robust in inter-dataset settings, thus showing great potential for applications in the medical image analysis field.
Acknowledgements T. Araújo is funded by the FCT grant SFRH/BD/122365/2016. G. Aresta is funded by the FCT grant SFRH/BD/120435/2016. This work is funded by the ERDF - European Regional Development Fund, Operational Programme for Competitiveness and Internationalisation - COMPETE 2020, and by National Funds through the FCT - project CMUP-ERI/TIC/0028/2014.
References
1. Ronneberger, O., Fischer, P., Brox, T.: U-Net: Convolutional Networks for Biomedical Image Segmentation. In: Medical Image Computing and Computer-Assisted Intervention (MICCAI), LNCS 9351 (2015) 234–241
2. Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: Towards Real-Time Object
Detection with Region Proposal Networks. IEEE Transactions on Pattern Analysis
and Machine Intelligence 39(6) (2017) 1137–1149
3. Redmon, J., Farhadi, A.: YOLO9000: Better, Faster, Stronger. arXiv (2016)
4. He, K., Gkioxari, G., Dollár, P., Girshick, R.: Mask R-CNN. arXiv (2017)
5. Decenciere, E., Zhang, X., Cazuguel, G., et al.: Feedback on a publicly distributed
image database: The Messidor database. Image Analysis and Stereology 33(3)
(2014) 231–234
6. Staal, J., Niemeijer, M., Viergever, M.A., van Ginneken, B.: Ridge-Based Vessel Segmentation in Color Images of the Retina. IEEE Transactions on Medical Imaging 23(4) (2004) 501–509
7. Gegundez-Arias, M.E., Marin, D., Bravo, J.M., Suero, A.: Locating the fovea
center position in digital fundus images using thresholding and feature extraction
techniques. Computerized Medical Imaging and Graphics 37(5-6) (2013) 386–393
8. Aquino, A.: Establishing the macular grading grid by means of fovea centre detec-
tion using anatomical-based and visual-based features. Computers in Biology and
Medicine 55 (2014) 61–73
9. Dai, B., Wu, X., Bu, W.: Optic disc segmentation based on variational model with
multiple energies. Pattern Recognition 64 (2017) 226–235
10. Dashtbozorg, B., Mendonça, A., Campilho, A.: Optic disc segmentation using the
sliding band filter. Computers in Biology and Medicine 56 (2015) 1–12
11. Roychowdhury, S., Koozekanani, D.D., Kuchinka, S.N., Parhi, K.K.: Optic disc
boundary and vessel origin segmentation of fundus images. IEEE Journal of
Biomedical and Health Informatics 20(6) (2016) 1562–1574
12. Morales, S., Naranjo, V., Angulo, U., Alcaniz, M.: Automatic detection of optic
disc based on PCA and mathematical morphology. IEEE Transactions on Medical
Imaging 32(4) (2013) 786–796
13. Salazar-Gonzalez, A., Kaba, D., Li, Y., Liu, X.: Segmentation of Blood Vessels and
Optic Disc in Retinal Images. IEEE Journal of Biomedical and Health Informatics
18(6) (2014) 1874–1886
14. Al-Bander, B., Al-Nuaimy, W., Williams, B.M., Zheng, Y.: Multiscale sequential
convolutional neural networks for simultaneous detection of fovea and optic disc.
Biomedical Signal Processing and Control 40 (2018) 91–101
15. Aquino, A., Gegúndez-Arias, M.E., Marín, D.: Detecting the Optic Disc Boundary in Digital Fundus Images Using Morphological, Edge Detection, and Feature Extraction Techniques. IEEE Transactions on Medical Imaging 29(11) (2010) 1860–1869
16. Kamble, R., Kokare, M., Deshmukh, G., Hussin, F.A., Mériaudeau, F.: Localization
of optic disc and fovea in retinal images using intensity based line scanning analysis.
Computers in Biology and Medicine 87 (2017) 382–396