

DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs

Liang-Chieh Chen, George Papandreou, Senior Member, IEEE, Iasonas Kokkinos, Member, IEEE, Kevin Murphy, and Alan L. Yuille, Fellow, IEEE

Abstract—In this work we address the task of semantic image segmentation with Deep Learning and make three main contributions that are experimentally shown to have substantial practical merit. First, we highlight convolution with upsampled filters, or 'atrous convolution', as a powerful tool in dense prediction tasks. Atrous convolution allows us to explicitly control the resolution at which feature responses are computed within Deep Convolutional Neural Networks. It also allows us to effectively enlarge the field of view of filters to incorporate larger context without increasing the number of parameters or the amount of computation. Second, we propose atrous spatial pyramid pooling (ASPP) to robustly segment objects at multiple scales. ASPP probes an incoming convolutional feature layer with filters at multiple sampling rates and effective fields-of-view, thus capturing objects as well as image context at multiple scales. Third, we improve the localization of object boundaries by combining methods from DCNNs and probabilistic graphical models. The commonly deployed combination of max-pooling and downsampling in DCNNs achieves invariance but takes a toll on localization accuracy. We overcome this by combining the responses at the final DCNN layer with a fully connected Conditional Random Field (CRF), which is shown both qualitatively and quantitatively to improve localization performance. Our proposed "DeepLab" system sets the new state-of-art on the PASCAL VOC-2012 semantic image segmentation task, reaching 79.7 percent mIOU on the test set, and advances the results on three other datasets: PASCAL-Context, PASCAL-Person-Part, and Cityscapes. All of our code is made publicly available online.

Index Terms—Convolutional neural networks, semantic segmentation, atrous convolution, conditional random fields

1 INTRODUCTION

Deep Convolutional Neural Networks (DCNNs) [1] have pushed the performance of computer vision systems to soaring heights on a broad array of high-level problems, including image classification [2], [3], [4], [5], [6] and object detection [7], [8], [9], [10], [11], [12], where DCNNs trained in an end-to-end manner have delivered strikingly better results than systems relying on hand-crafted features. Essential to this success is the built-in invariance of DCNNs to local image transformations, which allows them to learn increasingly abstract data representations [13]. This invariance is clearly desirable for classification tasks, but can hamper dense prediction tasks such as semantic segmentation, where abstraction of spatial information is undesired.

In particular we consider three challenges in the application of DCNNs to semantic image segmentation: (1) reduced feature resolution, (2) existence of objects at multiple scales, and (3) reduced localization accuracy due to DCNN invariance. Next, we discuss these challenges and our approach to overcome them in our proposed DeepLab system.

The first challenge is caused by the repeated combination of max-pooling and downsampling ('striding') performed at consecutive layers of DCNNs originally designed for image classification [2], [4], [5]. This results in feature maps with significantly reduced spatial resolution when the DCNN is employed in a fully convolutional fashion [14]. In order to overcome this hurdle and efficiently produce denser feature maps, we remove the downsampling operator from the last few max pooling layers of DCNNs and instead upsample the filters in subsequent convolutional layers, resulting in feature maps computed at a higher sampling rate. Filter upsampling amounts to inserting holes ('trous' in French) between nonzero filter taps. This technique has a long history in signal processing, originally developed for the efficient computation of the undecimated wavelet transform in a scheme also known as "algorithme à trous" [15]. We use the term atrous convolution as a shorthand for convolution with upsampled filters. Various flavors of this idea have been used before in the context of DCNNs by [3], [6], [16].

In practice, we recover full resolution feature maps by a combination of atrous convolution, which computes feature maps more densely, followed by simple bilinear interpolation of the feature responses to the original image size. This scheme offers a simple yet powerful alternative to using deconvolutional layers [13], [14] in dense prediction tasks. Compared to regular convolution with larger filters, atrous convolution allows us to effectively enlarge the field of view of filters without increasing the number of parameters or the amount of computation.

The second challenge is caused by the existence of objects at multiple scales. A standard way to deal with this is to present to the DCNN rescaled versions of the same image and then aggregate the feature or score maps [6], [17], [18]. We show that this approach indeed increases the performance of our system, but comes at the cost of computing feature responses at all DCNN layers for multiple scaled versions of the input image. Instead, motivated by spatial pyramid pooling [19], [20], we propose a computationally efficient scheme of resampling a given feature layer at multiple rates prior to convolution. This amounts to probing the original image with multiple filters that have complementary effective fields of view, thus capturing objects as well as useful image context at multiple scales. Rather than actually resampling features, we efficiently implement this mapping using multiple parallel atrous convolutional layers with different sampling rates; we call the proposed technique "atrous spatial pyramid pooling" (ASPP).

The third challenge relates to the fact that an object-centric classifier requires invariance to spatial transformations, inherently limiting the spatial accuracy of a DCNN. One way to mitigate this problem is to use skip-layers to extract "hyper-column" features from multiple network layers when computing the final segmentation result [14], [21]. Our work explores an alternative approach which we show to be highly effective. In particular, we boost our model's ability to capture fine details by employing a fully-connected Conditional Random Field (CRF) [22]. CRFs have been broadly used in semantic segmentation to combine class scores computed by multi-way classifiers with the low-level information captured by the local interactions of pixels and edges [23], [24] or superpixels [25]. Even though works of increased sophistication have been proposed to model the hierarchical dependency [26], [27], [28] and/or high-order dependencies of segments [29], [30], [31], [32], [33], we use the fully connected pairwise CRF proposed by [22] for its efficient computation and its ability to capture fine edge details while also catering for long range dependencies. That model was shown in [22] to improve the performance of a boosting-based pixel-level classifier. In this work, we demonstrate that it leads to state-of-the-art results when coupled with a DCNN-based pixel-level classifier.

A high-level illustration of the proposed DeepLab model is shown in Fig. 1. A deep convolutional neural network (VGG-16 [4] or ResNet-101 [11] in this work) trained on the task of image classification is re-purposed to the task of semantic segmentation by (1) transforming all the fully connected layers to convolutional layers (i.e., fully convolutional network [14]) and (2) increasing feature resolution through atrous convolutional layers, allowing us to compute feature responses every 8 pixels instead of every 32 pixels in the original network. We then employ bilinear interpolation to upsample the score map by a factor of 8 to reach the original image resolution, yielding the input to a fully-connected CRF [22] that refines the segmentation results.

From a practical standpoint, the three main advantages of our DeepLab system are: (1) Speed: by virtue of atrous convolution, our dense DCNN operates at 8 FPS on an NVidia Titan X GPU, while Mean Field Inference for the fully-connected CRF requires 0.5 secs on a CPU. (2) Accuracy: we obtain state-of-art results on several challenging datasets, including the PASCAL VOC 2012 semantic segmentation benchmark [34], PASCAL-Context [35], PASCAL-Person-Part [36], and Cityscapes [37]. (3) Simplicity: our system is composed of a cascade of two very well-established modules, DCNNs and CRFs.

The updated DeepLab system we present in this paper features several improvements compared to its first version reported in our original conference publication [38]. Our new version can better segment objects at multiple scales, via either multi-scale input processing [17], [39], [40] or the proposed ASPP. We have built a residual net variant of DeepLab by adapting the state-of-art ResNet [11] image classification DCNN, achieving better semantic segmentation performance compared to our original model based on VGG-16 [4]. Finally, we present a more comprehensive experimental evaluation of multiple model variants and report state-of-art results not only on the PASCAL VOC 2012 benchmark but also on other challenging tasks. We have implemented the proposed methods by extending the Caffe framework [41]. We share our code and models at a companion web site http://liangchiehchen.com/projects/DeepLab.html.

2 RELATED WORK

Most of the successful semantic segmentation systems developed in the previous decade relied on hand-crafted features combined with flat classifiers, such as Boosting [24], [42], Random Forests [43], or Support Vector Machines [44]. Substantial improvements have been achieved by incorporating richer information from context [45] and structured prediction techniques [22], [26], [27], [46], but the performance of these systems has always been compromised by the limited expressive power of the features. Over the past few years the breakthroughs of Deep Learning in image classification were quickly transferred to the semantic segmentation task. Since this task involves both segmentation and classification, a central question is how to combine the two tasks.

The first family of DCNN-based systems for semantic segmentation typically employs a cascade of bottom-up image segmentation, followed by DCNN-based region classification. For instance, the bounding box proposals and masked regions delivered by [47], [48] are used in [7] and [49] as inputs to a DCNN to incorporate shape information into the classification process. Similarly, the authors of [50] rely on a superpixel representation. Even though these approaches can benefit from the sharp boundaries delivered by a good segmentation, they also cannot recover from any of its errors.

The second family of works relies on using convolutionally computed DCNN features for dense image labeling, and couples them with segmentations that are obtained independently.

Fig. 1. Model illustration. A deep convolutional neural network such as VGG-16 or ResNet-101 is employed in a fully convolutional fashion, using atrous convolution to reduce the degree of signal downsampling (from 32x down to 8x). A bilinear interpolation stage enlarges the feature maps to the original image resolution. A fully connected CRF is then applied to refine the segmentation result and better capture the object boundaries.
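As a concrete illustration of the last two stages of this pipeline, the following minimal NumPy sketch bilinearly upsamples a coarse (stride-8) score map by a factor of 8 and takes a per-pixel argmax. It is not the authors' released implementation; the CRF refinement step is omitted and the 4x4 toy score map is purely hypothetical.

```python
import numpy as np

def bilinear_upsample(scores, factor):
    """Bilinearly upsample an (H, W, L) score map by an integer factor."""
    h, w, c = scores.shape
    out_h, out_w = h * factor, w * factor
    ys = np.linspace(0, h - 1, out_h)      # output rows mapped to input coords
    xs = np.linspace(0, w - 1, out_w)
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, h - 1)
    x1 = np.minimum(x0 + 1, w - 1)
    wy = (ys - y0)[:, None, None]          # fractional offsets
    wx = (xs - x0)[None, :, None]
    top = scores[y0][:, x0] * (1 - wx) + scores[y0][:, x1] * wx
    bottom = scores[y1][:, x0] * (1 - wx) + scores[y1][:, x1] * wx
    return top * (1 - wy) + bottom * wy

coarse = np.random.rand(4, 4, 3)           # hypothetical 4x4 score map, 3 classes
full = bilinear_upsample(coarse, 8)        # (32, 32, 3): stride 8 -> full resolution
labels = full.argmax(axis=-1)              # per-pixel decision (CRF step omitted)
print(full.shape, labels.shape)
```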

Among the first were [39], who apply DCNNs at multiple image resolutions and then employ a segmentation tree to smooth the prediction results. More recently, [21] propose to use skip layers and concatenate the computed intermediate feature maps within the DCNNs for pixel classification. Further, [51] propose to pool the intermediate feature maps by region proposals. These works still employ segmentation algorithms that are decoupled from the DCNN classifier's results, thus risking commitment to premature decisions.

The third family of works uses DCNNs to directly provide dense category-level pixel labels, which makes it possible to even discard segmentation altogether. The segmentation-free approaches of [14], [52] directly apply DCNNs to the whole image in a fully convolutional fashion, transforming the last fully connected layers of the DCNN into convolutional layers. In order to deal with the spatial localization issues outlined in the introduction, [14] upsample and concatenate the scores from intermediate feature maps, while [52] refine the prediction result from coarse to fine by propagating the coarse results to another DCNN. Our work builds on these works, and as described in the introduction extends them by exerting control on the feature resolution, introducing multi-scale pooling techniques, and integrating the densely connected CRF of [22] on top of the DCNN. We show that this leads to significantly better segmentation results, especially along object boundaries. The combination of DCNN and CRF is of course not new, but previous works only tried locally connected CRF models. Specifically, [53] use CRFs as a proposal mechanism for a DCNN-based reranking system, while [39] treat superpixels as nodes for a local pairwise CRF and use graph-cuts for discrete inference. As such their models were limited by errors in superpixel computations or ignored long-range dependencies. Our approach instead treats every pixel as a CRF node receiving unary potentials from the DCNN. Crucially, the Gaussian CRF potentials in the fully connected CRF model of [22] that we adopt can capture long-range dependencies and at the same time the model is amenable to fast mean field inference. We note that mean field inference had been extensively studied for traditional image segmentation tasks [54], [55], [56], but these older models were typically limited to short-range connections. In independent work, [57] use a very similar densely connected CRF model to refine the results of a DCNN for the problem of material classification. However, the DCNN module of [57] was only trained with sparse point supervision instead of dense supervision at every pixel.

Since the first version of this work was made publicly available [38], the area of semantic segmentation has progressed drastically. Multiple groups have made important advances, significantly raising the bar on the PASCAL VOC 2012 semantic segmentation benchmark, as reflected in the high level of activity in the benchmark's leaderboard¹ [17], [40], [58], [59], [60], [61], [62], [63]. Interestingly, most top-performing methods have adopted one or both of the key ingredients of our DeepLab system: atrous convolution for efficient dense feature extraction, and refinement of the raw DCNN scores by means of a fully connected CRF. We outline below some of the most important and interesting advances.

End-to-end training for structured prediction has more recently been explored in several related works. While we employ the CRF as a post-processing method, [40], [59], [62], [64], [65] have successfully pursued joint learning of the DCNN and CRF. In particular, [59], [65] unroll the CRF mean-field inference steps to convert the whole system into an end-to-end trainable feed-forward network, while [62] approximates one iteration of the dense CRF mean field inference [22] by convolutional layers with learnable filters. Another fruitful direction pursued by [40], [66] is to learn the pairwise terms of a CRF via a DCNN, significantly improving performance at the cost of heavier computation. In a different direction, [63] replace the bilateral filtering module used in mean field inference with a faster domain transform module [67], improving the speed and lowering the memory requirements of the overall system, while [18], [68] combine semantic segmentation with edge detection.

Weaker supervision has been pursued in a number of papers, relaxing the assumption that pixel-level semantic annotations are available for the whole training set [58], [69], [70], [71], achieving significantly better results than weakly-supervised pre-DCNN systems such as [72]. In another line of research, [49], [73] pursue instance segmentation, jointly tackling object detection and semantic segmentation.

1. http://host.robots.ox.ac.uk:8080/leaderboard/displaylb.php?challengeid=11&compid=6

Fig. 2. Illustration of atrous convolution in 1-D. (a) Sparse feature extraction with standard convolution on a low resolution input feature map. (b) Dense feature extraction with atrous convolution with rate r = 2, applied on a high resolution input feature map.

Fig. 3. Illustration of atrous convolution in 2-D. Top row: sparse feature extraction with standard convolution on a low resolution input feature map. Bottom row: dense feature extraction with atrous convolution with rate r = 2, applied on a high resolution input feature map.

What we call here atrous convolution was originally developed for the efficient computation of the undecimated wavelet transform in the "algorithme à trous" scheme of [15]. We refer the interested reader to [74] for early references from the wavelet literature. Atrous convolution is also intimately related to the "noble identities" in multi-rate signal processing, which builds on the same interplay of input signal and filter sampling rates [75]. Atrous convolution is a term we first used in [6]. The same operation was later called dilated convolution by [76], a term they coined motivated by the fact that the operation corresponds to regular convolution with upsampled (or dilated, in the terminology of [15]) filters. Various authors have used the same operation before for denser feature extraction in DCNNs [3], [6], [16]. Beyond mere resolution enhancement, atrous convolution allows us to enlarge the field of view of filters to incorporate larger context, which we have shown in [38] to be beneficial. This approach has been pursued further by [76], who employ a series of atrous convolutional layers with increasing rates to aggregate multiscale context. The atrous spatial pyramid pooling scheme proposed here to capture multiscale objects and context also employs multiple atrous convolutional layers with different sampling rates, which we however lay out in parallel instead of in serial. Interestingly, the atrous convolution technique has also been adopted for a broader set of tasks, such as object detection [12], [77], instance-level segmentation [78], visual question answering [79], and optical flow [80].

We also show that, as expected, integrating into DeepLab more advanced image classification DCNNs such as the residual net of [11] leads to better results. This has also been observed independently by [81].

3 METHODS

3.1 Atrous Convolution for Dense Feature Extraction and Field-of-View Enlargement

The use of DCNNs for semantic segmentation, or other dense prediction tasks, has been shown to be simply and successfully addressed by deploying DCNNs in a fully convolutional fashion [3], [14]. However, the repeated combination of max-pooling and striding at consecutive layers of these networks significantly reduces the spatial resolution of the resulting feature maps, typically by a factor of 32 across each direction in recent DCNNs. A partial remedy is to use 'deconvolutional' layers as in [14], which however requires additional memory and time.

We advocate instead the use of atrous convolution, originally developed for the efficient computation of the undecimated wavelet transform in the "algorithme à trous" scheme of [15] and used before in the DCNN context by [3], [6], [16]. This algorithm allows us to compute the responses of any layer at any desirable resolution. It can be applied post-hoc, once a network has been trained, but can also be seamlessly integrated with training.

Considering one-dimensional signals first, the output $y[i]$ of atrous convolution² of a 1-D input signal $x[i]$ with a filter $w[k]$ of length $K$ is defined as

$$y[i] = \sum_{k=1}^{K} x[i + r \cdot k]\, w[k]. \quad (1)$$

The rate parameter $r$ corresponds to the stride with which we sample the input signal. Standard convolution is a special case for rate $r = 1$. See Fig. 2 for an illustration.

We illustrate the algorithm's operation in 2-D through a simple example in Fig. 3: given an image, we assume that we first have a downsampling operation that reduces the resolution by a factor of 2, and then perform a convolution with a kernel, here the vertical Gaussian derivative. If one implants the resulting feature map in the original image coordinates, we realize that we have obtained responses at only 1/4 of the image positions. Instead, we can compute responses at all image positions if we convolve the full resolution image with a filter 'with holes', in which we upsample the original filter by a factor of 2 and introduce zeros in between filter values. Although the effective filter size increases, we only need to take into account the non-zero filter values, hence both the number of filter parameters and the number of operations per position stay constant. The resulting scheme allows us to easily and explicitly control the spatial resolution of neural network feature responses.

2. We follow the standard practice in the DCNN literature and use non-mirrored filters in this definition.

Fig. 4. Atrous Spatial Pyramid Pooling (ASPP). To classify the center pixel (orange), ASPP exploits multi-scale features by employing multiple parallel filters with different rates. The effective fields-of-view are shown in different colors.

Fig. 5. Score map (input before softmax function) and belief map (output of softmax function) for Aeroplane. We show the score (1st row) and belief (2nd row) maps after each mean field iteration. The output of the last DCNN layer is used as input to the mean field inference.

In the context of DCNNs one can use atrous convolution in a chain of layers, effectively allowing us to compute the final DCNN network responses at an arbitrarily high resolution. For example, in order to double the spatial density of computed feature responses in the VGG-16 or ResNet-101 networks, we find the last pooling or convolutional layer that decreases resolution ('pool5' or 'conv5_1', respectively), set its stride to 1 to avoid signal decimation, and replace all subsequent convolutional layers with atrous convolutional layers having rate r = 2. Pushing this approach all the way through the network could allow us to compute feature responses at the original image resolution, but this ends up being too costly. We have adopted instead a hybrid approach that strikes a good efficiency/accuracy trade-off, using atrous convolution to increase the density of computed feature maps by a factor of 4, followed by fast bilinear interpolation by an additional factor of 8 to recover feature maps at the original image resolution. Bilinear interpolation is sufficient in this setting because the class score maps (corresponding to log-probabilities) are quite smooth, as illustrated in Fig. 5. Unlike the deconvolutional approach adopted by [14], the proposed approach converts image classification networks into dense feature extractors without requiring learning any extra parameters, leading to faster DCNN training in practice.

Atrous convolution also allows us to arbitrarily enlarge the field-of-view of filters at any DCNN layer. State-of-the-art DCNNs typically employ spatially small convolution kernels (typically 3x3) in order to keep both computation and number of parameters contained. Atrous convolution with rate r introduces r - 1 zeros between consecutive filter values, effectively enlarging the kernel size of a k x k filter to k_e = k + (k - 1)(r - 1) without increasing the number of parameters or the amount of computation. It thus offers an efficient mechanism to control the field-of-view and to find the best trade-off between accurate localization (small field-of-view) and context assimilation (large field-of-view). We have successfully experimented with this technique: our DeepLab-LargeFOV model variant [38] employs atrous convolution with rate r = 12 in the VGG-16 'fc6' layer with significant performance gains, as detailed in Section 4.

Turning to implementation aspects, there are two efficient ways to perform atrous convolution. The first is to implicitly upsample the filters by inserting holes (zeros), or equivalently to sparsely sample the input feature maps [15]. We implemented this in our earlier work [6], [38], followed by [76], within the Caffe framework [41] by adding to the im2col function (it extracts vectorized patches from multi-channel feature maps) the option to sparsely sample the underlying feature maps. The second method, originally proposed by [82] and used in [3], [16], is to subsample the input feature map by a factor equal to the atrous convolution rate r, deinterlacing it to produce r^2 reduced-resolution maps, one for each of the r x r possible shifts. This is followed by applying standard convolution to these intermediate feature maps and reinterlacing them to the original image resolution. By reducing atrous convolution to regular convolution, this approach allows us to use off-the-shelf highly optimized convolution routines. We have implemented the second approach in the TensorFlow framework [83].

3.2 Multiscale Image Representations Using Atrous Spatial Pyramid Pooling

DCNNs have shown a remarkable ability to implicitly represent scale, simply by being trained on datasets that contain objects of varying size. Still, explicitly accounting for object scale can improve the DCNN's ability to successfully handle both large and small objects [6].

We have experimented with two approaches to handling scale variability in semantic segmentation. The first approach amounts to standard multiscale processing [17], [18]. We extract DCNN score maps from multiple (three in our experiments) rescaled versions of the original image using parallel DCNN branches that share the same parameters. To produce the final result, we bilinearly interpolate the feature maps from the parallel DCNN branches to the original image resolution and fuse them, by taking at each position the maximum response across the different scales. We do this both during training and testing. Multiscale processing significantly improves performance, but at the cost of computing feature responses at all DCNN layers for multiple scales of input.

The second approach is inspired by the success of the R-CNN spatial pyramid pooling method of [20], which showed that regions of an arbitrary scale can be accurately and efficiently classified by resampling convolutional features extracted at a single scale. We have implemented a variant of their scheme which uses multiple parallel atrous convolutional layers with different sampling rates. The features extracted for each sampling rate are further processed in separate branches and fused to generate the final result. The proposed "atrous spatial pyramid pooling" (DeepLab-ASPP) approach generalizes our DeepLab-LargeFOV variant and is illustrated in Fig. 4.
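The first implementation route mentioned above, upsampling a filter by inserting r - 1 zeros between its taps, can be sketched as follows. The 3x3 filter values are arbitrary; the printout simply checks the effective size k_e = k + (k - 1)(r - 1) and that the number of nonzero taps (i.e., the parameter count) stays constant.

```python
import numpy as np

def upsample_filter(w, rate):
    """Insert (rate - 1) zeros between consecutive taps of a k x k filter;
    the effective size becomes k_e = k + (k - 1) * (rate - 1)."""
    k = w.shape[0]
    ke = k + (k - 1) * (rate - 1)
    w_up = np.zeros((ke, ke), dtype=w.dtype)
    w_up[::rate, ::rate] = w             # original taps land on a sparse grid
    return w_up

w = np.arange(1.0, 10.0).reshape(3, 3)   # an arbitrary 3x3 filter
for rate in (1, 2, 12):
    w_up = upsample_filter(w, rate)
    print(rate,
          w_up.shape[0],                 # measured effective size
          3 + 2 * (rate - 1),            # k_e = k + (k - 1)(r - 1)
          np.count_nonzero(w_up))        # parameter count stays 9
```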
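The ASPP idea of Section 3.2 can likewise be sketched with plain NumPy: several parallel atrous branches with rates {6, 12, 18, 24} process the same feature map and their outputs are fused. This sketch uses a single-channel toy feature map and summation as one plausible fusion choice; the per-branch fc6-fc7-fc8 stacks of the actual DeepLab-ASPP head are not reproduced here.

```python
import numpy as np

def dilated_conv2d(feat, w, rate):
    """'Same'-size 2-D atrous convolution of an (H, W) map with a k x k filter."""
    k = w.shape[0]
    pad = rate * (k // 2)
    f = np.pad(feat, pad)
    out = np.zeros_like(feat, dtype=float)
    for dy in range(k):
        for dx in range(k):
            out += w[dy, dx] * f[dy * rate: dy * rate + feat.shape[0],
                                 dx * rate: dx * rate + feat.shape[1]]
    return out

def aspp(feat, filters, rates=(6, 12, 18, 24)):
    """Parallel atrous branches over the same feature map, fused here by
    summation (one plausible fusion choice for this sketch)."""
    return sum(dilated_conv2d(feat, w, r) for w, r in zip(filters, rates))

feat = np.random.rand(65, 65)                       # toy single-channel features
filters = [np.random.randn(3, 3) * 0.1 for _ in range(4)]
fused = aspp(feat, filters)
print(fused.shape)                                  # (65, 65)
```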

3.3 Structured Prediction with Fully-Connected Conditional Random Fields for Accurate Boundary Recovery

A trade-off between localization accuracy and classification performance seems to be inherent in DCNNs: deeper models with multiple max-pooling layers have proven most successful in classification tasks; however, the increased invariance and the large receptive fields of top-level nodes can only yield smooth responses. As illustrated in Fig. 5, DCNN score maps can predict the presence and rough position of objects but cannot really delineate their borders.

Previous work has pursued two directions to address this localization challenge. The first approach is to harness information from multiple layers in the convolutional network in order to better estimate the object boundaries [14], [21], [52]. The second is to employ a super-pixel representation, essentially delegating the localization task to a low-level segmentation method [50].

We pursue an alternative direction based on coupling the recognition capacity of DCNNs with the fine-grained localization accuracy of fully connected CRFs, and show that it is remarkably successful in addressing the localization challenge, producing accurate semantic segmentation results and recovering object boundaries at a level of detail that is well beyond the reach of existing methods. This direction has been extended by several follow-up papers [17], [40], [58], [59], [60], [61], [62], [63], [65] since the first version of our work was published [38].

Traditionally, conditional random fields (CRFs) have been employed to smooth noisy segmentation maps [23], [31]. Typically these models couple neighboring nodes, favoring same-label assignments to spatially proximal pixels. Qualitatively, the primary function of these short-range CRFs is to clean up the spurious predictions of weak classifiers built on top of local hand-engineered features.

Compared to these weaker classifiers, modern DCNN architectures such as the one we use in this work produce score maps and semantic label predictions which are qualitatively different. As illustrated in Fig. 5, the score maps are typically quite smooth and produce homogeneous classification results. In this regime, using short-range CRFs can be detrimental, as our goal should be to recover detailed local structure rather than further smooth it. Using contrast-sensitive potentials [23] in conjunction with local-range CRFs can potentially improve localization, but such models still miss thin structures and typically require solving an expensive discrete optimization problem.

To overcome these limitations of short-range CRFs, we integrate into our system the fully connected CRF model of [22]. The model employs the energy function

$$E(\mathbf{x}) = \sum_i \theta_i(x_i) + \sum_{ij} \theta_{ij}(x_i, x_j), \quad (2)$$

where $\mathbf{x}$ is the label assignment for pixels. We use as unary potential $\theta_i(x_i) = -\log P(x_i)$, where $P(x_i)$ is the label assignment probability at pixel $i$ as computed by a DCNN. The pairwise potential has a form that allows for efficient inference while using a fully-connected graph, i.e., when connecting all pairs of image pixels $i, j$. In particular, as in [22], we use the following expression:

$$\theta_{ij}(x_i, x_j) = \mu(x_i, x_j) \left[ w_1 \exp\!\left( -\frac{\lVert p_i - p_j \rVert^2}{2\sigma_\alpha^2} - \frac{\lVert I_i - I_j \rVert^2}{2\sigma_\beta^2} \right) + w_2 \exp\!\left( -\frac{\lVert p_i - p_j \rVert^2}{2\sigma_\gamma^2} \right) \right], \quad (3)$$

where $\mu(x_i, x_j) = 1$ if $x_i \neq x_j$, and zero otherwise, which, as in the Potts model, means that only nodes with distinct labels are penalized. The remaining expression uses two Gaussian kernels in different feature spaces; the first, 'bilateral' kernel depends on both pixel positions (denoted as $p$) and RGB color (denoted as $I$), while the second kernel only depends on pixel positions. The hyper-parameters $\sigma_\alpha$, $\sigma_\beta$ and $\sigma_\gamma$ control the scale of the Gaussian kernels. The first kernel forces pixels with similar color and position to have similar labels, while the second kernel only considers spatial proximity when enforcing smoothness.

Crucially, this model is amenable to efficient approximate probabilistic inference [22]. The message passing updates under a fully decomposable mean field approximation $b(\mathbf{x}) = \prod_i b_i(x_i)$ can be expressed as Gaussian convolutions in bilateral space. High-dimensional filtering algorithms [84] significantly speed up this computation, resulting in an algorithm that is very fast in practice, requiring less than 0.5 sec on average for Pascal VOC images using the publicly available implementation of [22].

4 EXPERIMENTAL RESULTS

We finetune the model weights of the Imagenet-pretrained VGG-16 or ResNet-101 networks to adapt them to the semantic segmentation task in a straightforward fashion, following the procedure of [14]. We replace the 1000-way Imagenet classifier in the last layer with a classifier having as many targets as the number of semantic classes of our task (including the background, if applicable). Our loss function is the sum of cross-entropy terms for each spatial position in the CNN output map (subsampled by 8 compared to the original image). All positions and labels are equally weighted in the overall loss function (except for unlabeled pixels, which are ignored). Our targets are the ground truth labels (subsampled by 8). We optimize the objective function with respect to the weights at all network layers by the standard SGD procedure of [2]. We decouple the DCNN and CRF training stages, assuming the DCNN unary terms are fixed when setting the CRF parameters.

We evaluate the proposed models on four challenging datasets: PASCAL VOC 2012, PASCAL-Context, PASCAL-Person-Part, and Cityscapes. We first report the main results of our conference version [38] on PASCAL VOC 2012, and then move on to the latest results on all datasets.
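The training loss just described (a sum of per-position cross-entropy terms over the subsampled output map, with unlabeled pixels ignored) can be sketched as follows. The ignore value of 255 is an assumed labeling convention for this illustration, and the random logits and labels are purely synthetic.

```python
import numpy as np

def segmentation_loss(logits, labels, ignore_label=255):
    """Sum of per-position cross-entropy terms over the (subsampled) output map.
    logits: (H, W, L) raw scores; labels: (H, W) ints. Pixels marked with
    ignore_label (an assumed convention here) do not contribute."""
    # Numerically stable log-softmax over the class axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    log_prob = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    valid = labels != ignore_label
    idx_y, idx_x = np.nonzero(valid)
    picked = log_prob[idx_y, idx_x, labels[idx_y, idx_x]]
    return -picked.sum()                  # all positions and labels weighted equally

logits = np.random.randn(40, 40, 21)      # e.g. a 21-class score map at stride 8
labels = np.random.randint(0, 21, size=(40, 40))
labels[:5] = 255                          # mark some pixels as unlabeled
print(segmentation_loss(logits, labels))
```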
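The CRF energy of Eqs. (2) and (3), whose parameters are set with the DCNN unaries held fixed, can be illustrated by evaluating the unary and pairwise terms directly. In this sketch only w2 = 3 and sigma_gamma = 3 follow the defaults quoted in Section 4.1.1; the other kernel weights and bandwidths are illustrative placeholders, and the mean field inference itself is not shown.

```python
import numpy as np

def unary(prob):
    """theta_i(x_i) = -log P(x_i): unary potentials from DCNN softmax
    probabilities of shape (H, W, L)."""
    return -np.log(np.clip(prob, 1e-12, 1.0))

def pairwise(xi, xj, pi, pj, Ii, Ij,
             w1=5.0, w2=3.0, sigma_alpha=60.0, sigma_beta=5.0, sigma_gamma=3.0):
    """theta_ij(x_i, x_j) of Eq. (3) for a single pixel pair; w1, sigma_alpha
    and sigma_beta are illustrative values, not cross-validated ones."""
    if xi == xj:                       # Potts compatibility: mu(x_i, x_j) = 0
        return 0.0
    d_pos = np.sum((pi - pj) ** 2)     # squared distance in position space
    d_col = np.sum((Ii - Ij) ** 2)     # squared distance in RGB space
    bilateral = w1 * np.exp(-d_pos / (2 * sigma_alpha ** 2)
                            - d_col / (2 * sigma_beta ** 2))
    spatial = w2 * np.exp(-d_pos / (2 * sigma_gamma ** 2))
    return bilateral + spatial

probs = np.full((2, 2, 3), 1.0 / 3.0)          # a tiny uniform DCNN output
print(unary(probs)[0, 0])                      # -log(1/3) for every label
p1, p2 = np.array([10.0, 10.0]), np.array([11.0, 10.0])
c1, c2 = np.array([120.0, 80.0, 60.0]), np.array([122.0, 81.0, 59.0])
print(pairwise(0, 1, p1, p2, c1, c2))          # nearby, similar color: large penalty
print(pairwise(0, 0, p1, p2, c1, c2))          # same label: no penalty
```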

4.1 PASCAL VOC 2012

Dataset. The PASCAL VOC 2012 segmentation benchmark [34] involves 20 foreground object classes and one background class. The original dataset contains 1,464 (train), 1,449 (val), and 1,456 (test) pixel-level labeled images for training, validation, and testing, respectively. The dataset is augmented by the extra annotations provided by [85], resulting in 10,582 (trainaug) training images. The performance is measured in terms of pixel intersection-over-union (IOU) averaged across the 21 classes.

4.1.1 Results from Our Conference Version

We employ the VGG-16 network pre-trained on Imagenet, adapted for semantic segmentation as described in Section 3.1. We use a mini-batch of 20 images and an initial learning rate of 0.001 (0.01 for the final classifier layer), multiplying the learning rate by 0.1 every 2,000 iterations. We use momentum of 0.9 and weight decay of 0.0005.

After the DCNN has been fine-tuned on trainaug, we cross-validate the CRF parameters along the lines of [22]. We use default values of w2 = 3 and σγ = 3, and we search for the best values of w1, σα, and σβ by cross-validation on 100 images from val. We employ a coarse-to-fine search scheme. The initial search ranges of the parameters are w1 ∈ [3:6], σα ∈ [30:10:100] and σβ ∈ [3:6] (MATLAB notation), and we then refine the search step sizes around the first round's best values. We employ 10 mean field iterations.

Field of View and CRF. In Table 1, we report experiments with DeepLab model variants that use different field-of-view sizes, obtained by adjusting the kernel size and atrous sampling rate r in the 'fc6' layer, as described in Section 3.1. We start with a direct adaptation of the VGG-16 net, using the original 7x7 kernel size and r = 4 (since we use no stride for the last two max-pooling layers). This model yields a performance of 67.64 percent after CRF, but is relatively slow (1.44 images per second during training). We have improved model speed to 2.9 images per second by reducing the kernel size to 4x4. We have experimented with two such network variants with smaller (r = 4) and larger (r = 8) FOV sizes; the latter performs better. Finally, we employ kernel size 3x3 and an even larger atrous sampling rate (r = 12), also making the network thinner by retaining a random subset of 1,024 out of the 4,096 filters in layers 'fc6' and 'fc7'. The resulting model, DeepLab-CRF-LargeFOV, matches the performance of the direct VGG-16 adaptation (7x7 kernel size, r = 4). At the same time, DeepLab-LargeFOV is 3.36 times faster and has significantly fewer parameters (20.5 M instead of 134.3 M).

TABLE 1
Effect of Field-Of-View by Adjusting the Kernel Size and Atrous Sampling Rate r at the 'fc6' Layer

Kernel | Rate | FOV | Params | Speed | bef/aft CRF
7x7 | 4 | 224 | 134.3M | 1.44 | 64.38 / 67.64
4x4 | 4 | 128 | 65.1M | 2.90 | 59.80 / 63.74
4x4 | 8 | 224 | 65.1M | 2.90 | 63.41 / 67.14
3x3 | 12 | 224 | 20.5M | 4.84 | 62.25 / 67.64

We show the number of model parameters, training speed (img/sec), and val set mean IOU before and after CRF. DeepLab-LargeFOV (kernel size 3x3, r = 12) strikes the best balance.

The CRF substantially boosts the performance of all model variants, offering a 3-5 percent absolute increase in mean IOU.

Test Set Evaluation. We have evaluated our DeepLab-CRF-LargeFOV model on the PASCAL VOC 2012 official test set. It achieves 70.3 percent mean IOU performance.

4.1.2 Improvements after Conference Version of This Work

After the conference version of this work [38], we have pursued three main improvements of our model, which we discuss below: (1) a different learning policy during training, (2) atrous spatial pyramid pooling, and (3) employment of deeper networks and multi-scale processing.

Learning Rate Policy. We have explored different learning rate policies when training DeepLab-LargeFOV. Similar to [86], we found that employing a "poly" learning rate policy (the learning rate is multiplied by (1 - iter/max_iter)^power) is more effective than the "step" policy (reduce the learning rate at a fixed step size). As shown in Table 2, employing "poly" (with power = 0.9) with the same batch size and the same number of training iterations yields 1.17 percent better performance than the "step" policy. Fixing the batch size and increasing the number of training iterations to 10K improves the performance to 64.90 percent (a 1.48 percent gain); however, the total training time increases due to the additional iterations. We then reduce the batch size to 10 and find that comparable performance is still maintained (64.90 percent versus 64.71 percent). In the end, we employ batch size = 10 and 20K iterations in order to maintain a training time similar to the previous "step" policy. Surprisingly, this gives us a performance of 65.88 percent (a 3.63 percent improvement over "step") on val, and 67.7 percent on test, compared to 65.1 percent of the original "step" setting for DeepLab-LargeFOV before CRF. We employ the "poly" learning rate policy for all experiments reported in the rest of the paper.

TABLE 2
PASCAL VOC 2012 val Set Results (%) (before CRF) as Different Learning Hyper-Parameters Vary

Learning policy | Batch size | Iteration | mean IOU
step | 30 | 6K | 62.25
poly | 30 | 6K | 63.42
poly | 30 | 10K | 64.90
poly | 10 | 10K | 64.71
poly | 10 | 20K | 65.88

Employing the "poly" learning policy is more effective than "step" when training DeepLab-LargeFOV.
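A minimal sketch of the two learning rate policies compared in Table 2 follows; it is meant only to show the shape of the "poly" decay (power = 0.9) against the "step" decay, not to reproduce the training code.

```python
def poly_lr(base_lr, it, max_iter, power=0.9):
    """'poly' policy: base_lr * (1 - iter / max_iter) ** power."""
    return base_lr * (1.0 - float(it) / max_iter) ** power

def step_lr(base_lr, it, step_size=2000, gamma=0.1):
    """'step' policy: multiply the learning rate by gamma every step_size iters."""
    return base_lr * gamma ** (it // step_size)

base_lr, max_iter = 0.001, 20000
for it in (0, 5000, 10000, 19999):
    print(it, round(poly_lr(base_lr, it, max_iter), 6), step_lr(base_lr, it))
```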
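The coarse-to-fine cross-validation of the CRF parameters described in Section 4.1.1 amounts to a small grid search. The sketch below covers only the first (coarse) round, and the evaluate function passed in is a stand-in for scoring mean IOU on held-out validation images; a second round would refine the grid around the winning setting.

```python
import itertools
import numpy as np

def crf_grid_search(evaluate, w1_range, s_alpha_range, s_beta_range):
    """Exhaustively score every (w1, sigma_alpha, sigma_beta) combination with
    a caller-supplied evaluate function and keep the best."""
    best, best_score = None, -np.inf
    for w1, sa, sb in itertools.product(w1_range, s_alpha_range, s_beta_range):
        score = evaluate(w1, sa, sb)
        if score > best_score:
            best, best_score = (w1, sa, sb), score
    return best, best_score

# Stand-in evaluation function, purely for illustration.
fake_eval = lambda w1, sa, sb: -((w1 - 4) ** 2 + (sa - 50) ** 2 / 100 + (sb - 4) ** 2)
print(crf_grid_search(fake_eval, range(3, 7), range(30, 101, 10), range(3, 7)))
```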

Fig. 6. PASCAL VOC 2012 val results. Input image and our DeepLab results before/after CRF.

Atrous Spatial Pyramid Pooling. We have experimented with the proposed Atrous Spatial Pyramid Pooling (ASPP) scheme, described in Section 3.2. As shown in Fig. 7, ASPP for VGG-16 employs several parallel fc6-fc7-fc8 branches. They all use 3x3 kernels but different atrous rates r in 'fc6' in order to capture objects of different sizes. In Table 3, we report results with several settings: (1) our baseline LargeFOV model, having a single branch with r = 12, (2) ASPP-S, with four branches and smaller atrous rates (r = {2, 4, 8, 12}), and (3) ASPP-L, with four branches and larger rates (r = {6, 12, 18, 24}). For each variant we report results before and after CRF. As shown in the table, ASPP-S yields a 1.22 percent improvement over the baseline LargeFOV before CRF. However, after CRF both LargeFOV and ASPP-S perform similarly. On the other hand, ASPP-L yields consistent improvements over the baseline LargeFOV both before and after CRF. We evaluate the proposed ASPP-L + CRF model on test, attaining 72.6 percent. We visualize the effect of the different schemes in Fig. 8.

Fig. 7. DeepLab-ASPP employs multiple filters with different rates to capture objects and context at multiple scales.

Fig. 8. Qualitative segmentation results with ASPP compared to the baseline LargeFOV model. The ASPP-L model, employing multiple large FOVs, can successfully capture objects as well as image context at multiple scales.

Deeper Networks and Multiscale Processing. We have experimented with building DeepLab around the recently proposed residual net ResNet-101 [11] instead of VGG-16. Similar to what we did for the VGG-16 net, we re-purpose ResNet-101 by atrous convolution, as described in Section 3.1. On top of that, we adopt several other features, following recent work of [17], [18], [39], [40], [58], [59], [62]: (1) Multi-scale inputs: we separately feed to the DCNN images at scale = {0.5, 0.75, 1}, fusing their score maps by taking the maximum response across scales for each position separately [17]. (2) Models pretrained on MS-COCO [87]. (3) Data augmentation by randomly scaling the input images (from 0.5 to 1.5) during training. In Table 4, we evaluate how each of these factors, along with LargeFOV and atrous spatial pyramid pooling (ASPP), affects val set performance. Adopting ResNet-101 instead of VGG-16 significantly improves DeepLab performance (e.g., our simplest ResNet-101 based model attains 68.72 percent, compared to 65.76 percent of our DeepLab-LargeFOV VGG-16 based variant, both before CRF). Multiscale fusion [17] brings an extra 2.55 percent improvement, while pretraining the model on MS-COCO gives another 2.01 percent gain. Data augmentation during training is effective (about 1.6 percent improvement). Employing LargeFOV (adding an atrous convolutional layer on top of ResNet, with a 3x3 kernel and rate = 12) is beneficial (about 0.6 percent improvement). A further 0.8 percent improvement is achieved by atrous spatial pyramid pooling (ASPP). Post-processing our best model with the dense CRF yields a performance of 77.69 percent.

Qualitative Results. We provide qualitative visual comparisons of DeepLab's results (our best model variant) before and after CRF in Fig. 6. The visualizations obtained by DeepLab before CRF already yield excellent segmentation results, while employing the CRF further improves the performance by removing false positives and refining object boundaries.

Test Set Results. We have submitted the result of our final best model to the official server, obtaining test set performance of 79.7 percent, as shown in Table 5. The model substantially outperforms previous DeepLab variants (e.g., DeepLab-LargeFOV with the VGG-16 net) and is currently the top performing method on the PASCAL VOC 2012 segmentation leaderboard.
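The multi-scale input scheme described under Deeper Networks and Multiscale Processing (shared-weight branches at scales {0.5, 0.75, 1}, fused by a per-position maximum) can be sketched as follows. Here run_dcnn is a stand-in for the network forward pass, and nearest-neighbor resizing is used as a simplification of the bilinear interpolation in the paper.

```python
import numpy as np

def resize_nearest(scores, out_h, out_w):
    """Nearest-neighbor resize of an (H, W, L) score map (a simplification;
    the paper bilinearly interpolates before fusing)."""
    h, w, _ = scores.shape
    ys = np.minimum((np.arange(out_h) * h / out_h).astype(int), h - 1)
    xs = np.minimum((np.arange(out_w) * w / out_w).astype(int), w - 1)
    return scores[ys][:, xs]

def multiscale_max_fusion(image_hw, run_dcnn, scales=(0.5, 0.75, 1.0)):
    """Run the (shared-weight) network on rescaled inputs and take the
    per-position, per-class maximum across scales. run_dcnn is a stand-in
    mapping an input size to an (h, w, L) score map."""
    H, W = image_hw
    fused = None
    for s in scales:
        scores = run_dcnn(int(H * s), int(W * s))
        scores = resize_nearest(scores, H, W)
        fused = scores if fused is None else np.maximum(fused, scores)
    return fused

toy_dcnn = lambda h, w: np.random.rand(max(h // 8, 1), max(w // 8, 1), 21)
print(multiscale_max_fusion((320, 480), toy_dcnn).shape)   # (320, 480, 21)
```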

TABLE 3
Effect of ASPP on PASCAL VOC 2012 val Set Performance (Mean IOU) for the VGG-16 Based DeepLab Model

Method | before CRF | after CRF
LargeFOV | 65.76 | 69.84
ASPP-S | 66.98 | 69.73
ASPP-L | 68.96 | 71.57

LargeFOV: single branch, r = 12. ASPP-S: four branches, r = {2, 4, 8, 12}. ASPP-L: four branches, r = {6, 12, 18, 24}.

TABLE 4
Employing ResNet-101 for DeepLab on the PASCAL VOC 2012 val Set

MSC | COCO | Aug | LargeFOV | ASPP | CRF | mIOU
 |  |  |  |  |  | 68.72
✓ |  |  |  |  |  | 71.27
✓ | ✓ |  |  |  |  | 73.28
✓ | ✓ | ✓ |  |  |  | 74.87
✓ | ✓ | ✓ | ✓ |  |  | 75.54
✓ | ✓ | ✓ |  | ✓ |  | 76.35
✓ | ✓ | ✓ |  | ✓ | ✓ | 77.69

MSC: employing multi-scale inputs with max fusion. COCO: models pretrained on MS-COCO. Aug: data augmentation by randomly rescaling inputs.

TABLE 5
Performance on the PASCAL VOC 2012 test Set

Method | mIOU
DeepLab-CRF-LargeFOV-COCO [58] | 72.7
MERL_DEEP_GCRF [88] | 73.2
CRF-RNN [59] | 74.7
POSTECH_DeconvNet_CRF_VOC [61] | 74.8
BoxSup [60] | 75.2
Context + CRF-RNN [76] | 75.3
QO_4^mres [66] | 75.5
DeepLab-CRF-Attention [17] | 75.7
CentraleSuperBoundaries++ [18] | 76.0
DeepLab-CRF-Attention-DT [63] | 76.3
H-ReNet + DenseCRF [89] | 76.8
LRR_4x_COCO [90] | 76.8
DPN [62] | 77.5
Adelaide_Context [40] | 77.8
Oxford_TVG_HO_CRF [91] | 77.9
Context CRF + Guidance CRF [92] | 78.1
Adelaide_VeryDeep_FCN_VOC [93] | 79.1
DeepLab-CRF (ResNet-101) | 79.7

We have added some results from recent arXiv papers on top of the official leaderboard results.

VGG-16 versus ResNet-101. We have observed that DeepLab based on ResNet-101 [11] delivers better segmentation results along object boundaries than DeepLab based on VGG-16 [4], as visualized in Fig. 9. We think the identity mapping [94] of ResNet-101 has a similar effect to hyper-column features [21], which exploit features from the intermediate layers to better localize boundaries. We further quantify this effect in Fig. 10 within the "trimap" [22], [31] (a narrow band along object boundaries). As shown in the figure, employing ResNet-101 before CRF has almost the same accuracy along object boundaries as employing VGG-16 in conjunction with a CRF. Post-processing the ResNet-101 result with a CRF further improves the segmentation result.

Fig. 9. DeepLab results based on the VGG-16 net or ResNet-101 before and after CRF. The CRF is critical for accurate prediction along object boundaries with VGG-16, whereas ResNet-101 has acceptable performance even before CRF.

Fig. 10. (a) Trimap examples (top-left: image; top-right: ground-truth; bottom-left: trimap of 2 pixels; bottom-right: trimap of 10 pixels). (b) Pixel mean IOU as a function of the band width around the object boundaries when employing VGG-16 or ResNet-101, before and after CRF.

4.2 PASCAL-Context

Dataset. The PASCAL-Context dataset [35] provides detailed semantic labels for the whole scene, including both object (e.g., person) and stuff (e.g., sky) categories. Following [35], the proposed models are evaluated on the most frequent 59 classes along with one background category. The training set and validation set contain 4,998 and 5,105 images, respectively.

Evaluation. We report the evaluation results in Table 6.

TABLE 6
Comparison with Other State-of-Art Methods on the PASCAL-Context Dataset

Method | MSC | COCO | Aug | LargeFOV | ASPP | CRF | mIOU
VGG-16
DeepLab [38] |  |  |  | ✓ |  |  | 37.6
DeepLab [38] |  |  |  | ✓ |  | ✓ | 39.6
ResNet-101
DeepLab |  |  |  |  |  |  | 39.6
DeepLab | ✓ |  |  |  |  |  | 41.4
DeepLab | ✓ | ✓ |  |  |  |  | 42.9
DeepLab | ✓ | ✓ | ✓ | ✓ |  |  | 43.5
DeepLab | ✓ | ✓ | ✓ |  | ✓ |  | 44.7
DeepLab | ✓ | ✓ | ✓ |  | ✓ | ✓ | 45.7
O2P [45] |  |  |  |  |  |  | 18.1
CFM [51] |  |  |  |  |  |  | 34.4
FCN-8s [14] |  |  |  |  |  |  | 37.8
CRF-RNN [59] |  |  |  |  |  |  | 39.3
ParseNet [86] |  |  |  |  |  |  | 40.4
BoxSup [60] |  |  |  |  |  |  | 40.5
HO_CRF [91] |  |  |  |  |  |  | 41.3
Context [40] |  |  |  |  |  |  | 43.3
VeryDeep [93] |  |  |  |  |  |  | 44.5

Fig. 11. PASCAL-Context results. Input image, ground-truth, and our DeepLab results before/after CRF.

Our VGG-16 based LargeFOV variant yields 37.6 percent before CRF and 39.6 percent after CRF. Repurposing ResNet-101 [11] for DeepLab improves over the VGG-16 LargeFOV by 2 percent. Similar to [17], employing multi-scale inputs and max-pooling to merge the results improves the performance to 41.4 percent. Pretraining the model on MS-COCO brings an extra 1.5 percent improvement. Employing atrous spatial pyramid pooling is more effective than LargeFOV. After further employing the dense CRF as post-processing, our final model yields 45.7 percent, outperforming the current state-of-art method [40] by 2.4 percent without using their non-linear pairwise term. Our final model is also slightly better (by 1.2 percent) than the concurrent work [93], which likewise employs atrous convolution to repurpose the residual net of [11] for semantic segmentation.

Qualitative Results. We visualize the segmentation results of our best model, with and without CRF as post-processing, in Fig. 11. DeepLab before CRF can already predict most of the object/stuff regions with high accuracy. Employing the CRF, our model further removes isolated false positives and improves the prediction along object/stuff boundaries.

4.3 PASCAL-Person-Part

Dataset. We further perform experiments on semantic part segmentation [98], [99], using the extra PASCAL VOC 2010 annotations by [36]. We focus on the person part of the dataset, which contains more training data and larger variation in object scale and human pose. Specifically, the dataset contains detailed part annotations for every person, e.g., eyes and nose. We merge the annotations into Head, Torso, Upper/Lower Arms and Upper/Lower Legs, resulting in six person part classes and one background class. We only use those images containing persons for training (1,716 images) and validation (1,817 images).

TABLE 7
Comparison with Other State-of-Art Methods on the PASCAL-Person-Part Dataset

Method | MSC | COCO | Aug | LFOV | ASPP | CRF | mIOU
ResNet-101
DeepLab |  |  |  |  |  |  | 58.90
DeepLab | ✓ |  | ✓ |  |  |  | 63.10
DeepLab | ✓ | ✓ | ✓ |  |  |  | 64.40
DeepLab | ✓ | ✓ | ✓ |  |  | ✓ | 64.94
DeepLab | ✓ | ✓ | ✓ | ✓ |  |  | 62.18
DeepLab | ✓ | ✓ | ✓ |  | ✓ |  | 62.76
Attention [17] |  |  |  |  |  |  | 56.39
HAZN [95] |  |  |  |  |  |  | 57.54
LG-LSTM [96] |  |  |  |  |  |  | 57.97
Graph LSTM [97] |  |  |  |  |  |  | 60.16

Evaluation. The human part segmentation results on PASCAL-Person-Part are reported in Table 7. [17] has already conducted experiments on this dataset with a re-purposed VGG-16 net for DeepLab, attaining 56.39 percent (with multi-scale inputs). Therefore, in this part, we mainly focus on the effect of repurposing ResNet-101 for DeepLab. With ResNet-101, DeepLab alone yields 58.9 percent, significantly outperforming DeepLab-LargeFOV (VGG-16 net) and DeepLab-Attention (VGG-16 net) by about 7 percent and 2.5 percent, respectively. Incorporating multi-scale inputs and fusion by max-pooling further improves performance to 63.1 percent. Additionally pretraining the model on MS-COCO yields another 1.3 percent improvement. However, we do not observe any improvement when adopting either LargeFOV or ASPP on this dataset. Employing the dense CRF to post-process our final output substantially outperforms the concurrent work [97], by 4.78 percent.

Qualitative Results. We visualize the results in Fig. 12.

4.4 Cityscapes

Dataset. Cityscapes [37] is a recently released large-scale dataset, which contains high quality pixel-level annotations of 5,000 images collected in street scenes from 50 different cities. Following the evaluation protocol [37], 19 semantic labels (belonging to 7 super categories: ground, construction, object, nature, sky, human, and vehicle) are used for evaluation (the void label is not considered for evaluation). The training, validation, and test sets contain 2,975, 500, and 1,525 images, respectively.

Test Set Results of Pre-Release. We have participated in benchmarking the Cityscapes dataset pre-release.
benchmarking the Cityscapes dataset pre-release. As

Fig. 12. PASCAL-Person-Part results. Input image, ground-truth, and our DeepLab results before/after CRF.

shown in the top of Table 8, our model attained third net with ResNet-101. We do not exploit multi-scale inputs
place, with performance of 63.1 and 64.8 percent (with due to the limited GPU memories at hand. Instead, we only
training on additional coarsely annotated images). explore (1) deeper networks (i.e., ResNet-101), (2) data aug-
Test Set Results of Pre-Release. We have participated in benchmarking the Cityscapes dataset pre-release. As shown in the top of Table 8, our model attained third place, with performance of 63.1 and 64.8 percent (with training on additional coarsely annotated images).
Val Set Results. After the initial release, we further explored the validation set in Table 9. The images of Cityscapes have resolution 2,048×1,024, making it a challenging problem to train deeper networks with limited GPU memory. During benchmarking the pre-release of the dataset, we downsampled the images by a factor of 2. However, we have found that it is beneficial to process the images in their original resolution. With the same training protocol, using images of original resolution brings significant improvements of 1.9 and 1.8 percent before and after CRF, respectively. In order to perform inference on this dataset with high resolution images, we split each image into overlapped regions, similar to [37]. We have also replaced the VGG-16 net with ResNet-101. We do not exploit multi-scale inputs due to the limited GPU memory at hand. Instead, we only explore (1) deeper networks (i.e., ResNet-101), (2) data augmentation, (3) LargeFOV or ASPP, and (4) CRF as post processing on this dataset. We first find that employing ResNet-101 alone is better than using VGG-16 net. Employing LargeFOV brings 2.6 percent improvement and using ASPP further improves results by 1.2 percent. Adopting data augmentation and CRF as post processing brings another 0.6 and 0.4 percent, respectively.
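One simple way to realize the overlapped-region inference described above is to slide a fixed-size crop over the full-resolution image, score each crop, and average the logits where crops overlap. The PyTorch sketch below illustrates the idea; the crop size, stride, and model callable are assumptions for illustration, not the exact protocol used in our experiments.

    import torch
    import torch.nn.functional as F

    def tiled_inference(model, image, crop=713, stride=476, num_classes=19):
        """Score a high-resolution image (1 x 3 x H x W) by running `model`
        on overlapping crops and averaging logits where crops overlap."""
        _, _, h, w = image.shape
        logits = torch.zeros(1, num_classes, h, w)
        counts = torch.zeros(1, 1, h, w)
        ys = list(range(0, max(h - crop, 0) + 1, stride))
        xs = list(range(0, max(w - crop, 0) + 1, stride))
        # Make sure the last row/column of crops reaches the image border.
        if ys[-1] + crop < h:
            ys.append(h - crop)
        if xs[-1] + crop < w:
            xs.append(w - crop)
        with torch.no_grad():
            for y in ys:
                for x in xs:
                    patch = image[:, :, y:y + crop, x:x + crop]
                    out = model(patch)  # 1 x C x h' x w' logits
                    out = F.interpolate(out, size=patch.shape[2:],
                                        mode='bilinear', align_corners=False)
                    logits[:, :, y:y + crop, x:x + crop] += out
                    counts[:, :, y:y + crop, x:x + crop] += 1
        return logits / counts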
Current Test Result. We have uploaded our best model to the evaluation server, obtaining performance of 70.4 percent. Note that our model is only trained on the train set.
Qualitative Results. We visualize the results in Fig. 13.

TABLE 8
Test Set Results on the Cityscapes Dataset, Comparing Our DeepLab System with Other State-of-Art Methods

Method                                   mIOU
pre-release version of dataset
Adelaide_Context [40]                    66.4
FCN-8s [14]                              65.3
DeepLab-CRF-LargeFOV-StrongWeak [58]     64.8
DeepLab-CRF-LargeFOV [38]                63.1
CRF-RNN [59]                             62.5
DPN [62]                                 59.1
Segnet basic [100]                       57.0
Segnet extended [100]                    56.1
official version
Adelaide_Context [40]                    71.6
Dilation10 [76]                          67.1
DPN [62]                                 66.8
Pixel-level Encoding [101]               64.3
DeepLab-CRF (ResNet-101)                 70.4

TABLE 9
Val Set Results on Cityscapes Dataset

Full   Aug   LargeFOV   ASPP   CRF   mIOU
VGG-16
              ✓                       62.97
              ✓                 ✓     64.18
 ✓            ✓                       64.89
 ✓            ✓                 ✓     65.94
ResNet-101
 ✓                                    66.6
 ✓            ✓                       69.2
 ✓                        ✓           70.4
 ✓      ✓                 ✓           71.0
 ✓      ✓                 ✓     ✓     71.4

Full: model trained with full resolution images.

Fig. 13. Cityscapes results. Input image, ground-truth, and our DeepLab results before/after CRF.

4.5 Failure Modes
We further qualitatively analyze some failure modes of our best model variant on the PASCAL VOC 2012 val set. As shown in Fig. 14, our proposed model fails to capture the delicate boundaries of objects, such as bicycle and chair. The details could not even be recovered by the CRF post processing, since the unary term is not confident enough. We hypothesize that the encoder-decoder structure of [100], [102] may alleviate the problem by exploiting the high resolution feature maps in the decoder path. How to efficiently incorporate the method is left as future work.

Fig. 14. Failure modes. Input image, ground-truth, and our DeepLab results before/after CRF.
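To make the hypothesis above concrete, the following PyTorch sketch shows one way a decoder could fuse the coarse score map with a higher-resolution encoder feature map, in the spirit of [100], [102]; the layer widths are arbitrary illustrative choices and this module is not part of the DeepLab system evaluated here.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class DecoderRefine(nn.Module):
        """Fuse coarse class scores with a higher-resolution encoder
        feature map so that fine structures (e.g., bicycle wheels or
        chair legs) can be recovered in the decoder path."""
        def __init__(self, low_level_channels=256, num_classes=21):
            super().__init__()
            self.reduce = nn.Conv2d(low_level_channels, 48, kernel_size=1)
            self.refine = nn.Sequential(
                nn.Conv2d(48 + num_classes, 256, kernel_size=3, padding=1),
                nn.ReLU(inplace=True),
                nn.Conv2d(256, num_classes, kernel_size=1))

        def forward(self, coarse_scores, low_level_feat):
            # Upsample the coarse scores to the skip feature's resolution.
            x = F.interpolate(coarse_scores, size=low_level_feat.shape[2:],
                              mode='bilinear', align_corners=False)
            skip = self.reduce(low_level_feat)
            return self.refine(torch.cat([x, skip], dim=1))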
5 CONCLUSION
Our proposed "DeepLab" system re-purposes networks trained on image classification to the task of semantic segmentation by applying the 'atrous convolution' with upsampled filters for dense feature extraction. We further extend it to atrous spatial pyramid pooling, which encodes objects as well as image context at multiple scales. To produce semantically accurate predictions and detailed segmentation maps along object boundaries, we also combine ideas from deep convolutional neural networks and fully-connected conditional random fields. Our experimental results show that the proposed method significantly advances the state of the art on several challenging datasets, including the PASCAL VOC 2012 semantic image segmentation benchmark and the PASCAL-Context, PASCAL-Person-Part, and Cityscapes datasets.
ACKNOWLEDGMENTS
This work was partly supported by the ARO 62250-CS, FP7-RECONFIG, FP7-MOBOT, and H2020-ISUPPORT EU projects. We gratefully acknowledge the support of NVIDIA Corporation with the donation of GPUs used for this research. The first two authors contributed equally to this work.

REFERENCES
[1] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proc. IEEE, vol. 86, no. 11, pp. 2278–2324, Nov. 1998.
[2] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Proc. 25th Int. Conf. Neural Inf. Process. Syst., 2013, pp. 1097–1105.
[3] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun, "OverFeat: Integrated recognition, localization and detection using convolutional networks," arXiv:1312.6229, 2013.
[4] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," in Proc. Int. Conf. Learn. Representations, 2015.
[5] C. Szegedy, et al., "Going deeper with convolutions," arXiv:1409.4842, 2014.
[6] G. Papandreou, I. Kokkinos, and P.-A. Savalle, "Modeling local and global deformations in deep learning: Epitomic convolution, multiple instance learning, and sliding window detection," in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2015, pp. 390–399.
[7] R. Girshick, J. Donahue, T. Darrell, and J. Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation," in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2014, pp. 580–587.
[8] D. Erhan, C. Szegedy, A. Toshev, and D. Anguelov, "Scalable object detection using deep neural networks," in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2014, pp. 2155–2162.
[9] R. Girshick, "Fast R-CNN," in Proc. IEEE Int. Conf. Comput. Vis., 2015.
[10] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," in Proc. 28th Int. Conf. Neural Inf. Process. Syst., 2015, pp. 91–99.
[11] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," arXiv:1512.03385, 2015.

[12] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, and S. Reed, "SSD: Single shot multibox detector," arXiv:1512.02325, 2015.
[13] M. D. Zeiler and R. Fergus, "Visualizing and understanding convolutional networks," in Proc. Eur. Conf. Comput. Vis., 2014, pp. 818–833.
[14] J. Long, E. Shelhamer, and T. Darrell, "Fully convolutional networks for semantic segmentation," in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2015.
[15] M. Holschneider, R. Kronland-Martinet, J. Morlet, and P. Tchamitchian, "A real-time algorithm for signal analysis with the help of the wavelet transform," in Proc. Wavelets: Time-Frequency Methods Phase Space, 1989, pp. 289–297.
[16] A. Giusti, D. Ciresan, J. Masci, L. Gambardella, and J. Schmidhuber, "Fast image scanning with deep max-pooling convolutional neural networks," in Proc. IEEE Int. Conf. Image Process., 2013, pp. 4034–4038.
[17] L.-C. Chen, Y. Yang, J. Wang, W. Xu, and A. L. Yuille, "Attention to scale: Scale-aware semantic image segmentation," in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2016.
[18] I. Kokkinos, "Pushing the boundaries of boundary detection using deep learning," in Proc. Int. Conf. Learn. Representations, 2016.
[19] S. Lazebnik, C. Schmid, and J. Ponce, "Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories," in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recog., 2006, pp. 2169–2178.
[20] K. He, X. Zhang, S. Ren, and J. Sun, "Spatial pyramid pooling in deep convolutional networks for visual recognition," in Proc. Eur. Conf. Comput. Vis., 2014, pp. 346–361.
[21] B. Hariharan, P. Arbelaez, R. Girshick, and J. Malik, "Hypercolumns for object segmentation and fine-grained localization," in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2015, pp. 447–456.
[22] P. Krähenbühl and V. Koltun, "Efficient inference in fully connected CRFs with Gaussian edge potentials," in Proc. Advances Neural Inf. Process. Syst., 2011, pp. 109–117.
[23] C. Rother, V. Kolmogorov, and A. Blake, "GrabCut: Interactive foreground extraction using iterated graph cuts," in Proc. ACM SIGGRAPH, 2004, pp. 309–314.
[24] J. Shotton, J. Winn, C. Rother, and A. Criminisi, "Textonboost for image understanding: Multi-class object recognition and segmentation by jointly modeling texture, layout, and context," Int. J. Comput. Vis., vol. 81, pp. 2–23, 2009.
[25] A. Lucchi, Y. Li, X. Boix, K. Smith, and P. Fua, "Are spatial and global constraints really necessary for segmentation?" in Proc. Int. Conf. Comput. Vis., 2011, pp. 9–16.
[26] X. He, R. S. Zemel, and M. Carreira-Perpiñán, "Multiscale conditional random fields for image labeling," in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recog., 2004, pp. 695–703.
[27] L. Ladicky, C. Russell, P. Kohli, and P. H. Torr, "Associative hierarchical CRFs for object class image segmentation," in Proc. Int. Conf. Comput. Vis., 2009, pp. 739–746.
[28] V. Lempitsky, A. Vedaldi, and A. Zisserman, "Pylon model for semantic segmentation," in Proc. Advances Neural Inf. Process. Syst., 2011, pp. 1485–1493.
[29] A. Delong, A. Osokin, H. N. Isack, and Y. Boykov, "Fast approximate energy minimization with label costs," Int. J. Comput. Vis., vol. 96, pp. 1–27, 2012.
[30] J. M. Gonfaus, X. Boix, J. Van de Weijer, A. D. Bagdanov, J. Serrat, and J. Gonzalez, "Harmony potentials for joint classification and segmentation," in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2010.
[31] P. Kohli, P. H. Torr, and L. Ladicky, "Robust higher order potentials for enforcing label consistency," Int. J. Comput. Vis., vol. 82, no. 3, pp. 302–324, 2009.
[32] L.-C. Chen, G. Papandreou, and A. Yuille, "Learning a dictionary of shape epitomes with applications to image labeling," in Proc. IEEE Int. Conf. Comput. Vis., 2013, pp. 337–344.
[33] P. Wang, X. Shen, Z. Lin, S. Cohen, B. Price, and A. Yuille, "Towards unified depth and semantic prediction from a single image," in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2015, pp. 2800–2809.
[34] M. Everingham, S. M. A. Eslami, L. V. Gool, C. K. I. Williams, J. Winn, and A. Zisserman, "The PASCAL visual object classes challenge: A retrospective," Int. J. Comput. Vis., 2014.
[35] R. Mottaghi, et al., "The role of context for object detection and semantic segmentation in the wild," in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2014, pp. 891–898.
[36] X. Chen, R. Mottaghi, X. Liu, S. Fidler, R. Urtasun, and A. Yuille, "Detect what you can: Detecting and representing objects using holistic models and body parts," in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2014.
[37] M. Cordts, et al., "The cityscapes dataset for semantic urban scene understanding," in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2016.
[38] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, "Semantic image segmentation with deep convolutional nets and fully connected CRFs," in Proc. Int. Conf. Learn. Representations, 2015.
[39] C. Farabet, C. Couprie, L. Najman, and Y. LeCun, "Learning hierarchical features for scene labeling," IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 8, pp. 1915–1929, Aug. 2013.
[40] G. Lin, C. Shen, I. Reid, et al., "Efficient piecewise training of deep structured models for semantic segmentation," arXiv:1504.01013, 2015.
[41] Y. Jia, et al., "Caffe: Convolutional architecture for fast feature embedding," arXiv:1408.5093, 2014.
[42] Z. Tu and X. Bai, "Auto-context and its application to high-level vision tasks and 3D brain image segmentation," IEEE Trans. Pattern Anal. Mach. Intell., vol. 32, no. 10, pp. 1744–1757, Oct. 2010.
[43] J. Shotton, M. Johnson, and R. Cipolla, "Semantic texton forests for image categorization and segmentation," in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2008, pp. 1–8.
[44] B. Fulkerson, A. Vedaldi, and S. Soatto, "Class segmentation and object localization with superpixel neighborhoods," in Proc. IEEE 12th Int. Conf. Comput. Vis., 2009, pp. 670–677.
[45] J. Carreira, R. Caseiro, J. Batista, and C. Sminchisescu, "Semantic segmentation with second-order pooling," in Proc. Eur. Conf. Comput. Vis., 2012, pp. 430–443.
[46] J. Carreira and C. Sminchisescu, "CPMC: Automatic object segmentation using constrained parametric min-cuts," IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, no. 7, pp. 1312–1328, Jul. 2012.
[47] P. Arbelaez, J. Pont-Tuset, J. T. Barron, F. Marques, and J. Malik, "Multiscale combinatorial grouping," in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2014, pp. 328–335.
[48] J. Uijlings, K. van de Sande, T. Gevers, and A. Smeulders, "Selective search for object recognition," Int. J. Comput. Vis., vol. 104, pp. 154–171, 2013.
[49] B. Hariharan, P. Arbelaez, R. Girshick, and J. Malik, "Simultaneous detection and segmentation," in Proc. Eur. Conf. Comput. Vis., 2014, pp. 297–312.
[50] M. Mostajabi, P. Yadollahpour, and G. Shakhnarovich, "Feedforward semantic segmentation with zoom-out features," in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2015, pp. 3376–3385.
[51] J. Dai, K. He, and J. Sun, "Convolutional feature masking for joint object and stuff segmentation," arXiv:1412.1283, 2014.
[52] D. Eigen and R. Fergus, "Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture," arXiv:1411.4734, 2014.
[53] M. Cogswell, X. Lin, S. Purushwalkam, and D. Batra, "Combining the best of graphical models and convnets for semantic segmentation," arXiv:1412.4313, 2014.
[54] D. Geiger and F. Girosi, "Parallel and deterministic algorithms from MRFs: Surface reconstruction," IEEE Trans. Pattern Anal. Mach. Intell., vol. 13, no. 5, pp. 401–412, May 1991.
[55] D. Geiger and A. Yuille, "A common framework for image segmentation," Int. J. Comput. Vis., vol. 6, no. 3, pp. 227–243, 1991.
[56] I. Kokkinos, R. Deriche, O. Faugeras, and P. Maragos, "Computational analysis and learning for a biologically motivated model of boundary detection," Neurocomputing, vol. 71, no. 10, pp. 1798–1812, 2008.
[57] S. Bell, P. Upchurch, N. Snavely, and K. Bala, "Material recognition in the wild with the materials in context database," arXiv:1412.0623, 2014.
[58] G. Papandreou, L.-C. Chen, K. Murphy, and A. L. Yuille, "Weakly- and semi-supervised learning of a DCNN for semantic image segmentation," in Proc. IEEE Int. Conf. Comput. Vis., 2015, pp. 1742–1750.
[59] S. Zheng, et al., "Conditional random fields as recurrent neural networks," in Proc. IEEE Int. Conf. Comput. Vis., 2015, pp. 1529–1537.
[60] J. Dai, K. He, and J. Sun, "Boxsup: Exploiting bounding boxes to supervise convolutional networks for semantic segmentation," in Proc. Int. Conf. Comput. Vis., 2015.


[61] H. Noh, S. Hong, and B. Han, "Learning deconvolution network for semantic segmentation," in Proc. IEEE Int. Conf. Comput. Vis., 2015, pp. 1520–1528.
[62] Z. Liu, X. Li, P. Luo, C. C. Loy, and X. Tang, "Semantic image segmentation via deep parsing network," in Proc. Int. Conf. Comput. Vis., 2015, pp. 1377–1385.
[63] L.-C. Chen, J. T. Barron, G. Papandreou, K. Murphy, and A. L. Yuille, "Semantic image segmentation with task-specific edge detection using CNNs and a discriminatively trained domain transform," in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2016, pp. 4545–4554.
[64] L.-C. Chen, A. Schwing, A. Yuille, and R. Urtasun, "Learning deep structured models," in Proc. 32nd Int. Conf. Mach. Learn., 2015, pp. 1785–1794.
[65] A. G. Schwing and R. Urtasun, "Fully connected deep structured networks," arXiv:1503.02351, 2015.
[66] S. Chandra and I. Kokkinos, "Fast, exact and multi-scale inference for semantic image segmentation with deep Gaussian CRFs," arXiv:1603.08358, 2016.
[67] E. S. L. Gastal and M. M. Oliveira, "Domain transform for edge-aware image and video processing," in Proc. ACM SIGGRAPH, 2011, Art. no. 69.
[68] G. Bertasius, J. Shi, and L. Torresani, "High-for-low and low-for-high: Efficient boundary detection from deep object features and its applications to high-level vision," in Proc. IEEE Int. Conf. Comput. Vis., 2015.
[69] P. O. Pinheiro and R. Collobert, "Weakly supervised semantic segmentation with convolutional networks," arXiv:1411.6228, 2014.
[70] D. Pathak, P. Krähenbühl, and T. Darrell, "Constrained convolutional neural networks for weakly supervised segmentation," in Proc. IEEE Int. Conf. Comput. Vis., 2015.
[71] S. Hong, H. Noh, and B. Han, "Decoupled deep neural network for semi-supervised semantic segmentation," in Proc. 28th Int. Conf. Neural Inf. Process. Syst., 2015, pp. 1495–1503.
[72] A. Vezhnevets, V. Ferrari, and J. M. Buhmann, "Weakly supervised semantic segmentation with a multi-image model," in Proc. Int. Conf. Comput. Vis., 2011, pp. 643–650.
[73] X. Liang, Y. Wei, X. Shen, J. Yang, L. Lin, and S. Yan, "Proposal-free network for instance-level object segmentation," arXiv:1509.02636, 2015.
[74] J. E. Fowler, "The redundant discrete wavelet transform and additive noise," IEEE Signal Process. Lett., vol. 12, no. 9, pp. 629–632, Sep. 2005.
[75] P. P. Vaidyanathan, "Multirate digital filters, filter banks, polyphase networks, and applications: A tutorial," Proc. IEEE, vol. 78, no. 1, pp. 56–93, Jan. 1990.
[76] F. Yu and V. Koltun, "Multi-scale context aggregation by dilated convolutions," in Proc. Int. Conf. Learn. Representations, 2016.
[77] J. Dai, Y. Li, K. He, and J. Sun, "R-FCN: Object detection via region-based fully convolutional networks," arXiv:1605.06409, 2016.
[78] J. Dai, K. He, Y. Li, S. Ren, and J. Sun, "Instance-sensitive fully convolutional networks," arXiv:1603.08678, 2016.
[79] K. Chen, J. Wang, L.-C. Chen, H. Gao, W. Xu, and R. Nevatia, "ABC-CNN: An attention based convolutional neural network for visual question answering," arXiv:1511.05960, 2015.
[80] L. Sevilla-Lara, D. Sun, V. Jampani, and M. J. Black, "Optical flow with semantic segmentation and localized layers," arXiv:1603.03911, 2016.
[81] Z. Wu, C. Shen, and A. van den Hengel, "High-performance semantic segmentation using very deep fully convolutional networks," arXiv:1604.04339, 2016.
[82] M. J. Shensa, "The discrete wavelet transform: Wedding the à trous and Mallat algorithms," IEEE Trans. Signal Process., vol. 40, no. 10, pp. 2464–2482, Oct. 1992.
[83] M. Abadi, A. Agarwal, et al., "Tensorflow: Large-scale machine learning on heterogeneous distributed systems," arXiv:1603.04467, 2016.
[84] A. Adams, J. Baek, and M. A. Davis, "Fast high-dimensional filtering using the permutohedral lattice," in Eurographics, vol. 29, pp. 753–762, 2010.
[85] B. Hariharan, P. Arbelaez, L. Bourdev, S. Maji, and J. Malik, "Semantic contours from inverse detectors," in Proc. IEEE Int. Conf. Comput. Vis., 2011, pp. 991–998.
[86] W. Liu, A. Rabinovich, and A. C. Berg, "Parsenet: Looking wider to see better," arXiv:1506.04579, 2015.
[87] T.-Y. Lin, et al., "Microsoft COCO: Common objects in context," in Proc. Eur. Conf. Comput. Vis., 2014, pp. 740–755.
[88] R. Vemulapalli, O. Tuzel, M.-Y. Liu, and R. Chellappa, "Gaussian conditional random field network for semantic segmentation," in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2016, pp. 3224–3233.
[89] Z. Yan, H. Zhang, Y. Jia, T. Breuel, and Y. Yu, "Combining the best of convolutional layers and recurrent layers: A hybrid network for semantic segmentation," arXiv:1603.04871, 2016.
[90] G. Ghiasi and C. C. Fowlkes, "Laplacian reconstruction and refinement for semantic segmentation," arXiv:1605.02264, 2016.
[91] A. Arnab, S. Jayasumana, S. Zheng, and P. Torr, "Higher order potentials in end-to-end trainable conditional random fields," arXiv:1511.08119, 2015.
[92] F. Shen and G. Zeng, "Fast semantic image segmentation with high order context and guided filtering," arXiv:1605.04068, 2016.
[93] Z. Wu, C. Shen, and A. van den Hengel, "Bridging category-level and instance-level semantic image segmentation," arXiv:1605.06885, 2016.
[94] K. He, X. Zhang, S. Ren, and J. Sun, "Identity mappings in deep residual networks," arXiv:1603.05027, 2016.
[95] F. Xia, P. Wang, L.-C. Chen, and A. L. Yuille, "Zoom better to see clearer: Human part segmentation with auto zoom net," arXiv:1511.06881, 2015.
[96] X. Liang, X. Shen, D. Xiang, J. Feng, L. Lin, and S. Yan, "Semantic object parsing with local-global long short-term memory," arXiv:1511.04510, 2015.
[97] X. Liang, X. Shen, J. Feng, L. Lin, and S. Yan, "Semantic object parsing with graph LSTM," arXiv:1603.07063, 2016.
[98] J. Wang and A. Yuille, "Semantic part segmentation using compositional model combining shape and appearance," in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2015.
[99] P. Wang, X. Shen, Z. Lin, S. Cohen, B. Price, and A. Yuille, "Joint object and part segmentation using deep learned potentials," in Proc. IEEE Int. Conf. Comput. Vis., 2015, pp. 1573–1581.
[100] V. Badrinarayanan, A. Kendall, and R. Cipolla, "Segnet: A deep convolutional encoder-decoder architecture for image segmentation," arXiv:1511.00561, 2015.
[101] J. Uhrig, M. Cordts, U. Franke, and T. Brox, "Pixel-level encoding and depth layering for instance-level semantic labeling," arXiv:1604.05096, 2016.
[102] O. Ronneberger, P. Fischer, and T. Brox, "U-net: Convolutional networks for biomedical image segmentation," in Proc. Int. Conf. Medical Image Comput. Comput.-Assisted Intervention, 2015, pp. 234–241.

Liang-Chieh Chen received the BSc degree from National Chiao Tung University, Taiwan, the MS degree from the University of Michigan-Ann Arbor, and the PhD degree from the University of California-Los Angeles. He is currently working at Google. His research interests include semantic image segmentation, probabilistic graphical models, and machine learning.

George Papandreou (S'03–M'09–SM'14) received the Diploma and the PhD degree in electrical engineering and computer science, in 2003 and 2009, respectively, both from the National Technical University of Athens (NTUA), Greece. He is currently a research scientist at Google, following appointments as a research assistant professor at the Toyota Technological Institute, Chicago (2013-2014) and a postdoctoral research scholar with the University of California, Los Angeles (2009-2013). His research interests include computer vision and machine learning, with a current emphasis on deep learning. He regularly serves as a reviewer and program committee member for the main journals and conferences in computer vision, image processing, and machine learning. He has been a co-organizer of the NIPS 2012, 2013, and 2014 Workshops on Perturbations, Optimization, and Statistics and co-editor of a book on the same topic (MIT Press, 2016). He is a senior member of the IEEE.


Iasonas Kokkinos (S'02–M'06) received the Diploma of engineering in 2001 and the PhD degree in 2006 from the School of Electrical and Computer Engineering of the National Technical University of Athens, Greece, and the Habilitation degree in 2013 from Universite Paris-Est. In 2006 he joined the University of California, Los Angeles as a postdoctoral scholar, and in 2008 he joined the Department of Applied Mathematics of Ecole Centrale Paris (CentraleSupelec) as faculty, working as an associate professor in the Center for Visual Computing of CentraleSupelec and as an affiliate researcher at INRIA-Saclay. In 2016 he joined University College London and Facebook Artificial Intelligence Research. His current research activity is on deep learning for computer vision, focusing in particular on structured prediction for deep learning, shape modeling, and multi-task learning architectures. He has been awarded a young researcher grant by the French National Research Agency, has served as associate editor for the Image and Vision Computing and Computer Vision and Image Understanding journals, and serves regularly as a reviewer and area chair for all major computer vision conferences and journals. He is a member of the IEEE.

Kevin Murphy received the MEng degree from U. Penn and the PhD degree from UC Berkeley, did a postdoc at MIT, and then became a professor in the Computer Science and Statistics Departments at the University of British Columbia in Vancouver, Canada in 2004. After getting tenure, he went to Google in Mountain View, California for his sabbatical. In 2011, he converted to a full-time research scientist at Google. Kevin has published more than 50 papers in refereed conferences and journals related to machine learning and graphical models. He has recently published an 1,100-page textbook called Machine Learning: A Probabilistic Perspective (MIT Press, 2012).

Alan L. Yuille (F'09) received the BA degree in mathematics from the University of Cambridge in 1976, and his PhD thesis on theoretical physics, supervised by Prof. S. W. Hawking, was approved in 1981. He was a research scientist in the Artificial Intelligence Laboratory at MIT and the Division of Applied Sciences at Harvard University from 1982 to 1988. He served as an assistant and associate professor at Harvard until 1996. He was a senior research scientist at the Smith-Kettlewell Eye Research Institute from 1996 to 2002. He joined the University of California, Los Angeles, as a full professor with a joint appointment in statistics and psychology in 2002, and computer science in 2007. He was appointed a Bloomberg Distinguished Professor at Johns Hopkins University in January 2016. He holds a joint appointment between the Departments of Cognitive Science and Computer Science. His research interests include computational models of vision, mathematical models of cognition, and artificial intelligence and neural networks. He has been an IEEE fellow since 2009.

