Concurrent Segmentation and Object Detection
Damien Grosgeorge, Maxime Arbelot, Alex Goupilleau, Tugdual Ceillier, Renaud Allioux
ABSTRACT
Detecting and identifying objects in satellite images is a
very challenging task: objects of interest are often very small
and features can be difficult to recognize even using very high
resolution imagery. For most applications, this translates into
a trade-off between recall and precision. We present here a
dedicated method to detect and identify aircraft, combining
two very different convolutional neural networks (CNNs): a
segmentation model, based on a modified U-net architecture
[1], and a detection model, based on the RetinaNet architecture
[2]. The results we present show that this combination
significantly outperforms each individual model, drastically
reducing the false negative rate.
Fig. 1. Illustration of the data diversity (with ground truth):
(a) Soukhoï Su-25, (b) F-16 Fighting Falcon.
Index Terms— CNNs, deep learning, segmentation,
identification, aircraft, satellite images
1. INTRODUCTION
The last decade has seen a huge increase of available
high-resolution satellite images, which are used more and more for
surveillance tasks. When monitoring military sites, it is necessary
to automatically detect and identify objects of interest to derive
trends. In this domain, aircraft recognition is of particular
interest: each aircraft model has its own role, and a variation in
the number of a specific type of aircraft at a given location can be
a highly relevant insight. This recognition task needs to be
reliable to allow the automation of site analysis, in particular to
derive alerts corresponding to unusual events. Robustness to noise,
shadows, illumination or ground texture variation is challenging to
obtain but mandatory for real-life applications (see Fig. 1).
Nowadays, CNNs are considered one of the best techniques to analyse
image content and are the most widely used ML technique in computer
vision applications. They have recently produced state-of-the-art
results for image recognition, segmentation and detection related
tasks [3]. A typical CNN architecture is generally composed of
alternate layers of convolution and pooling (encoder) followed by a
decoder that can comprise one or more fully connected layers
(classification), a set of transpose convolutions (segmentation) or
some classification and regression branches (object detection). The
arrangement of the CNN components plays a fundamental role in
designing new architectures and thus in achieving higher
performance [4].
For segmentation tasks, the U-net architecture has been widely used
since its creation by [1]. This architecture allows a better
reconstruction in the decoder by using skip connections from the
encoder (Fig. 2). Various improvements have been made in the
literature considering each CNN component [4], but the global
architecture of the U-net is still one of the state-of-the-art
architectures for the segmentation task.
For detection tasks, two main categories have been developed in the
literature. The most well-known uses a two-stage, proposal-driven
mechanism: the first stage generates a sparse set of candidate
object locations and the second stage classifies each candidate
location either as one of the foreground classes or as background
using a CNN. One of the most widely used two-stage models is the
Faster-RCNN [5], which has been considered the state-of-the-art
detector by achieving top accuracy on the challenging COCO
benchmark. However, in the last few years, one-stage detectors,
such as the Feature Pyramid Network (FPN) [6], have matched the
accuracy of the most complex two-stage detectors on the COCO
benchmark. In [2], the authors have identified that, since
one-stage detectors are applied over a regular, dense sampling of
object locations, scales, and aspect ratios, class imbalance during
training is the main obstacle impeding them from achieving
state-of-the-art accuracy. They thus proposed a new loss function
that eliminates this barrier (the focal loss) while integrating
improvements such as the FPN [6] in their model
known as the RetinaNet [2].
In this paper, we are looking for a dedicated and robust approach
to address the aircraft detection and identification problems that
can be easily adapted to multiple applications. We propose a hybrid
solution based on different CNN strategies: a segmentation model
based on the U-Net architecture [1] for a better detection rate and
an object detection model based on the RetinaNet [2], a fast
one-stage detector, for identifying and improving the precision.
Section 2 details this concurrent approach while Section 3 presents
results obtained on high-resolution satellite images.
• maxpool layers have been replaced by convolutional layers with a
stride of 2 (we reduce the spatial information while increasing the
number of feature maps);
• the depth and the width of the network have been set according to
the application: spatial information is only reduced twice (while
doubling filters), the encoding is composed of 36 IM blocks and the
decoding of 8 IM blocks (resp. 72 and 16 conv. layers).
Skip connections of the U-net are used for a better reconstruction
of the prediction map.
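The effect of the two modifications listed above on feature-map shapes can be traced with a minimal Python sketch. It is only an illustration under stated assumptions: the 3x3 kernels with padding 1, the 512-pixel input tile and the 32 base filters are hypothetical values not given in the text, and the helper names are invented for this example. The figure of two convolutions per IM block is inferred from the stated 36 blocks for 72 layers and 8 blocks for 16 layers.

```python
from dataclasses import dataclass
from typing import List


def conv_out_size(size: int, kernel: int = 3, stride: int = 1, padding: int = 1) -> int:
    """Spatial output size of a convolution (standard floor formula)."""
    return (size + 2 * padding - kernel) // stride + 1


@dataclass
class Stage:
    name: str
    size: int      # height and width of the (square) feature maps
    channels: int  # number of feature maps


def encoder_stages(input_size: int, base_channels: int,
                   n_downsamplings: int = 2) -> List[Stage]:
    """Trace shapes when each downsampling is a stride-2 convolution
    that also doubles the number of filters (instead of a maxpool)."""
    stages = [Stage("full resolution", input_size, base_channels)]
    size, channels = input_size, base_channels
    for i in range(n_downsamplings):
        # Assumed 3x3 kernel with padding 1: a stride of 2 halves the size.
        size = conv_out_size(size, kernel=3, stride=2, padding=1)
        channels *= 2
        stages.append(Stage(f"after strided conv {i + 1}", size, channels))
    return stages


def conv_layer_count(n_blocks: int, convs_per_block: int = 2) -> int:
    """Number of conv. layers, assuming two convolutions per IM block."""
    return n_blocks * convs_per_block


# Hypothetical input tile of 512x512 pixels and 32 base filters.
stages = encoder_stages(input_size=512, base_channels=32)
for stage in stages:
    print(f"{stage.name}: {stage.size}x{stage.size}, {stage.channels} feature maps")

print("encoder conv. layers:", conv_layer_count(36))
print("decoder conv. layers:", conv_layer_count(8))
```

Because the spatial information is only reduced twice, the coarsest feature maps in this sketch keep a quarter of the input resolution per side, and the skip connections would link encoder and decoder stages of matching size.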