
Fully Convolutional Networks for Semantic Segmentation

Jonathan Long∗ Evan Shelhamer∗ Trevor Darrell


UC Berkeley
{jonlong,shelhamer,trevor}@cs.berkeley.edu

Abstract
Convolutional networks are powerful visual models that yield hierarchies of features. We show that convolutional networks by themselves, trained end-to-end, pixels-to-pixels, exceed the state-of-the-art in semantic segmentation. Our key insight is to build "fully convolutional" networks that take input of arbitrary size and produce correspondingly-sized output with efficient inference and learning. We define and detail the space of fully convolutional networks, explain their application to spatially dense prediction tasks, and draw connections to prior models. We adapt contemporary classification networks (AlexNet [20], the VGG net [31], and GoogLeNet [32]) into fully convolutional networks and transfer their learned representations by fine-tuning [3] to the segmentation task. We then define a skip architecture that combines semantic information from a deep, coarse layer with appearance information from a shallow, fine layer to produce accurate and detailed segmentations. Our fully convolutional network achieves state-of-the-art segmentation of PASCAL VOC (20% relative improvement to 62.2% mean IU on 2012), NYUDv2, and SIFT Flow, while inference takes less than one fifth of a second for a typical image.

Figure 1. Fully convolutional networks can efficiently learn to make dense predictions for per-pixel tasks like semantic segmentation.

1. Introduction

Convolutional networks are driving advances in recognition. Convnets are not only improving for whole-image classification [20, 31, 32], but also making progress on local tasks with structured output. These include advances in bounding box object detection [29, 10, 17], part and keypoint prediction [39, 24], and local correspondence [24, 8].

The natural next step in the progression from coarse to fine inference is to make a prediction at every pixel. Prior approaches have used convnets for semantic segmentation [27, 2, 7, 28, 15, 13, 9], in which each pixel is labeled with the class of its enclosing object or region, but with shortcomings that this work addresses.

We show that a fully convolutional network (FCN) trained end-to-end, pixels-to-pixels on semantic segmentation exceeds the state-of-the-art without further machinery. To our knowledge, this is the first work to train FCNs end-to-end (1) for pixelwise prediction and (2) from supervised pre-training. Fully convolutional versions of existing networks predict dense outputs from arbitrary-sized inputs. Both learning and inference are performed whole-image-at-a-time by dense feedforward computation and backpropagation. In-network upsampling layers enable pixelwise prediction and learning in nets with subsampled pooling.

This method is efficient, both asymptotically and absolutely, and precludes the need for the complications in other works. Patchwise training is common [27, 2, 7, 28, 9], but lacks the efficiency of fully convolutional training. Our approach does not make use of pre- and post-processing complications, including superpixels [7, 15], proposals [15, 13], or post-hoc refinement by random fields or local classifiers [7, 15]. Our model transfers recent success in classification [20, 31, 32] to dense prediction by reinterpreting classification nets as fully convolutional and fine-tuning from their learned representations. In contrast, previous works have applied small convnets without supervised pre-training [7, 28, 27].

Semantic segmentation faces an inherent tension between semantics and location: global information resolves what while local information resolves where. Deep feature hierarchies encode location and semantics in a nonlinear local-to-global pyramid. We define a skip architecture to take advantage of this feature spectrum that combines deep, coarse, semantic information and shallow, fine, appearance information in Section 4.2 (see Figure 3).

∗ Authors contributed equally

In the next section, we review related work on deep classification nets, FCNs, and recent approaches to semantic segmentation using convnets. The following sections explain FCN design and dense prediction tradeoffs, introduce our architecture with in-network upsampling and multi-layer combinations, and describe our experimental framework. Finally, we demonstrate state-of-the-art results on PASCAL VOC 2011-2, NYUDv2, and SIFT Flow.

2. Related work

Our approach draws on recent successes of deep nets for image classification [20, 31, 32] and transfer learning [3, 38]. Transfer was first demonstrated on various visual recognition tasks [3, 38], then on detection, and on both instance and semantic segmentation in hybrid proposal-classifier models [10, 15, 13]. We now re-architect and fine-tune classification nets to direct, dense prediction of semantic segmentation. We chart the space of FCNs and situate prior models, both historical and recent, in this framework.

Fully convolutional networks To our knowledge, the idea of extending a convnet to arbitrary-sized inputs first appeared in Matan et al. [26], which extended the classic LeNet [21] to recognize strings of digits. Because their net was limited to one-dimensional input strings, Matan et al. used Viterbi decoding to obtain their outputs. Wolf and Platt [37] expand convnet outputs to 2-dimensional maps of detection scores for the four corners of postal address blocks. Both of these historical works do inference and learning fully convolutionally for detection. Ning et al. [27] define a convnet for coarse multiclass segmentation of C. elegans tissues with fully convolutional inference.

Fully convolutional computation has also been exploited in the present era of many-layered nets. Sliding window detection by Sermanet et al. [29], semantic segmentation by Pinheiro and Collobert [28], and image restoration by Eigen et al. [4] do fully convolutional inference. Fully convolutional training is rare, but used effectively by Tompson et al. [35] to learn an end-to-end part detector and spatial model for pose estimation, although they do not exposit on or analyze this method.

Alternatively, He et al. [17] discard the non-convolutional portion of classification nets to make a feature extractor. They combine proposals and spatial pyramid pooling to yield a localized, fixed-length feature for classification. While fast and effective, this hybrid model cannot be learned end-to-end.

Dense prediction with convnets Several recent works have applied convnets to dense prediction problems, including semantic segmentation by Ning et al. [27], Farabet et al. [7], and Pinheiro and Collobert [28]; boundary prediction for electron microscopy by Ciresan et al. [2] and for natural images by a hybrid convnet/nearest neighbor model by Ganin and Lempitsky [9]; and image restoration and depth estimation by Eigen et al. [4, 5]. Common elements of these approaches include
• small models restricting capacity and receptive fields;
• patchwise training [27, 2, 7, 28, 9];
• post-processing by superpixel projection, random field regularization, filtering, or local classification [7, 2, 9];
• input shifting and output interlacing for dense output [29, 28, 9];
• multi-scale pyramid processing [7, 28, 9];
• saturating tanh nonlinearities [7, 4, 28]; and
• ensembles [2, 9],
whereas our method does without this machinery. However, we do study patchwise training (Section 3.4) and "shift-and-stitch" dense output (Section 3.2) from the perspective of FCNs. We also discuss in-network upsampling (Section 3.3), of which the fully connected prediction by Eigen et al. [5] is a special case.

Unlike these existing methods, we adapt and extend deep classification architectures, using image classification as supervised pre-training, and fine-tune fully convolutionally to learn simply and efficiently from whole image inputs and whole image ground truths.

Hariharan et al. [15] and Gupta et al. [13] likewise adapt deep classification nets to semantic segmentation, but do so in hybrid proposal-classifier models. These approaches fine-tune an R-CNN system [10] by sampling bounding boxes and/or region proposals for detection, semantic segmentation, and instance segmentation. Neither method is learned end-to-end. They achieve state-of-the-art segmentation results on PASCAL VOC and NYUDv2 respectively, so we directly compare our standalone, end-to-end FCN to their semantic segmentation results in Section 5.

We fuse features across layers to define a nonlinear local-to-global representation that we tune end-to-end. In contemporary work Hariharan et al. [16] also use multiple layers in their hybrid model for semantic segmentation.

3. Fully convolutional networks

Each layer of data in a convnet is a three-dimensional array of size h × w × d, where h and w are spatial dimensions, and d is the feature or channel dimension. The first layer is the image, with pixel size h × w, and d color channels. Locations in higher layers correspond to the locations in the image they are path-connected to, which are called their receptive fields.

Convnets are built on translation invariance. Their basic components (convolution, pooling, and activation functions) operate on local input regions, and depend only on relative spatial coordinates.

Writing x_ij for the data vector at location (i, j) in a particular layer, and y_ij for the following layer, these functions compute outputs y_ij by

y_ij = f_ks({x_{si+δi, sj+δj}}, 0 ≤ δi, δj ≤ k)

where k is called the kernel size, s is the stride or subsampling factor, and f_ks determines the layer type: a matrix multiplication for convolution or average pooling, a spatial max for max pooling, or an elementwise nonlinearity for an activation function, and so on for other types of layers.

This functional form is maintained under composition, with kernel size and stride obeying the transformation rule

f_ks ∘ g_k's' = (f ∘ g)_{k' + (k − 1)s', ss'}.

While a general deep net computes a general nonlinear function, a net with only layers of this form computes a nonlinear filter, which we call a deep filter or fully convolutional network. An FCN naturally operates on an input of any size, and produces an output of corresponding (possibly resampled) spatial dimensions.

A real-valued loss function composed with an FCN defines a task. If the loss function is a sum over the spatial dimensions of the final layer, ℓ(x; θ) = Σ_ij ℓ'(x_ij; θ), its gradient will be a sum over the gradients of each of its spatial components. Thus stochastic gradient descent on ℓ computed on whole images will be the same as stochastic gradient descent on ℓ', taking all of the final layer receptive fields as a minibatch.

When these receptive fields overlap significantly, both feedforward computation and backpropagation are much more efficient when computed layer-by-layer over an entire image instead of independently patch-by-patch.
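To make the minibatch interpretation concrete, here is a minimal PyTorch sketch (an editorial illustration, not the paper's Caffe implementation; the shapes and class count are arbitrary) checking that a per-pixel loss summed over a whole score map equals the same loss computed with every final-layer cell treated as an independent minibatch element:

```python
import torch
import torch.nn.functional as F

# Illustrative shapes: one image, 21 classes, a 16x16 coarse final-layer map.
scores = torch.randn(1, 21, 16, 16, requires_grad=True)
labels = torch.randint(0, 21, (1, 16, 16))

# Whole-image loss as a sum over spatial terms l'(x_ij; theta).
loss_whole = F.cross_entropy(scores, labels, reduction="sum")

# The same loss with every final-layer cell treated as a minibatch element.
flat_scores = scores.permute(0, 2, 3, 1).reshape(-1, 21)
flat_labels = labels.reshape(-1)
loss_cells = F.cross_entropy(flat_scores, flat_labels, reduction="sum")

assert torch.allclose(loss_whole, loss_cells)  # identical loss, identical gradients
```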
We next explain how to convert classification nets into fully convolutional nets that produce coarse output maps. For pixelwise prediction, we need to connect these coarse outputs back to the pixels. Section 3.2 describes a trick, fast scanning [11], introduced for this purpose. We gain insight into this trick by reinterpreting it as an equivalent network modification. As an efficient, effective alternative, we introduce deconvolution layers for upsampling in Section 3.3. In Section 3.4 we consider training by patchwise sampling, and give evidence in Section 4.3 that our whole image training is faster and equally effective.

3.1. Adapting classifiers for dense prediction

Typical recognition nets, including LeNet [21], AlexNet [20], and its deeper successors [31, 32], ostensibly take fixed-sized inputs and produce non-spatial outputs. The fully connected layers of these nets have fixed dimensions and throw away spatial coordinates. However, these fully connected layers can also be viewed as convolutions with kernels that cover their entire input regions. Doing so casts them into fully convolutional networks that take input of any size and output classification maps. This transformation is illustrated in Figure 2.
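The reinterpretation of a fully connected layer as a convolution can be sketched as follows (an editorial illustration in PyTorch with toy sizes; the paper's nets are converted in Caffe):

```python
import torch
import torch.nn as nn

# Toy classifier head: a fully connected layer over a 512 x 7 x 7 feature map.
fc = nn.Linear(512 * 7 * 7, 1000)

# The same weights viewed as a 7x7 convolution over the 512-channel feature map.
conv = nn.Conv2d(512, 1000, kernel_size=7)
conv.weight.data.copy_(fc.weight.data.view(1000, 512, 7, 7))
conv.bias.data.copy_(fc.bias.data)

x = torch.randn(1, 512, 7, 7)
assert torch.allclose(fc(x.flatten(1)), conv(x).flatten(1), atol=1e-5)

# On a larger input the convolutional version outputs a spatial map of scores.
heatmap = conv(torch.randn(1, 512, 12, 12))   # shape (1, 1000, 6, 6)
```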
Figure 2. Transforming fully connected layers into convolution layers enables a classification net to output a heatmap. Adding layers and a spatial loss (as in Figure 1) produces an efficient machine for end-to-end dense learning.

Furthermore, while the resulting maps are equivalent to the evaluation of the original net on particular input patches, the computation is highly amortized over the overlapping regions of those patches. For example, while AlexNet takes 1.2 ms (on a typical GPU) to infer the classification scores of a 227×227 image, the fully convolutional net takes 22 ms to produce a 10×10 grid of outputs from a 500×500 image, which is more than 5 times faster than the naïve approach.¹

¹ Assuming efficient batching of single image inputs. The classification scores for a single image by itself take 5.4 ms to produce, which is nearly 25 times slower than the fully convolutional version.

The spatial output maps of these convolutionalized models make them a natural choice for dense problems like semantic segmentation. With ground truth available at every output cell, both the forward and backward passes are straightforward, and both take advantage of the inherent computational efficiency (and aggressive optimization) of convolution. The corresponding backward times for the AlexNet example are 2.4 ms for a single image and 37 ms for a fully convolutional 10×10 output map, resulting in a speedup similar to that of the forward pass.

While our reinterpretation of classification nets as fully convolutional yields output maps for inputs of any size, the output dimensions are typically reduced by subsampling. The classification nets subsample to keep filters small and computational requirements reasonable. This coarsens the output of a fully convolutional version of these nets, reducing it from the size of the input by a factor equal to the pixel stride of the receptive fields of the output units.

3.2. Shift-and-stitch is filter rarefaction

Dense predictions can be obtained from coarse outputs by stitching together output from shifted versions of the input. If the output is downsampled by a factor of f, shift the input x pixels to the right and y pixels down, once for every (x, y) s.t. 0 ≤ x, y < f. Process each of these f² inputs, and interlace the outputs so that the predictions correspond to the pixels at the centers of their receptive fields.
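The interlacing itself is easy to write down. The following NumPy sketch (an editorial illustration; coarse_predict is only a stand-in for an FCN with output stride f, and border handling is simplified) runs the coarse model on every shifted input and interlaces the results:

```python
import numpy as np

def coarse_predict(x, f):
    """Stand-in for a net whose output is downsampled by a factor f
    (a strided slice here; a real net would be a fully convolutional net)."""
    return x[::f, ::f]

def shift_and_stitch(x, f):
    """Run the coarse model on all f*f shifted inputs and interlace the
    outputs so each prediction lands at the center of its receptive field."""
    dense = np.zeros_like(x)
    for dy in range(f):          # shift down by dy pixels
        for dx in range(f):      # shift right by dx pixels
            out = coarse_predict(x[dy:, dx:], f)
            dense[dy::f, dx::f] = out   # interlace onto the strided grid
    return dense

x = np.arange(36, dtype=float).reshape(6, 6)
assert np.allclose(shift_and_stitch(x, f=2), x)  # identity stand-in is recovered
```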

Although performing this transformation naïvely increases the cost by a factor of f², there is a well-known trick for efficiently producing identical results [11, 29] known to the wavelet community as the à trous algorithm [25]. Consider a layer (convolution or pooling) with input stride s, and a subsequent convolution layer with filter weights f_ij (eliding the irrelevant feature dimensions). Setting the lower layer's input stride to 1 upsamples its output by a factor of s. However, convolving the original filter with the upsampled output does not produce the same result as shift-and-stitch, because the original filter only sees a reduced portion of its (now upsampled) input. To reproduce the trick, rarefy the filter by enlarging it as

f′_ij = f_{i/s, j/s}   if s divides both i and j;
f′_ij = 0              otherwise,

(with i and j zero-based). Reproducing the full net output of the trick involves repeating this filter enlargement layer-by-layer until all subsampling is removed. (In practice, this can be done efficiently by processing subsampled versions of the upsampled input.)
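In code, the rarefaction is just zero insertion. A NumPy sketch (an editorial illustration):

```python
import numpy as np

def rarefy(f, s):
    """Enlarge filter f by inserting s-1 zeros between its taps:
    f'[i, j] = f[i/s, j/s] if s divides both i and j, else 0."""
    k = f.shape[0]
    out = np.zeros((s * (k - 1) + 1, s * (k - 1) + 1), dtype=f.dtype)
    out[::s, ::s] = f
    return out

f = np.arange(1.0, 10.0).reshape(3, 3)   # a toy 3x3 filter
print(rarefy(f, s=2))                    # 5x5 filter, nonzero taps on the stride-2 grid
```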
Decreasing subsampling within a net is a tradeoff: the filters see finer information, but have smaller receptive fields and take longer to compute. The shift-and-stitch trick is another kind of tradeoff: the output is denser without decreasing the receptive field sizes of the filters, but the filters are prohibited from accessing information at a finer scale than their original design.

Although we have done preliminary experiments with this trick, we do not use it in our model. We find learning through upsampling, as described in the next section, to be more effective and efficient, especially when combined with the skip layer fusion described later on.

3.3. Upsampling is backwards strided convolution

Another way to connect coarse outputs to dense pixels is interpolation. For instance, simple bilinear interpolation computes each output y_ij from the nearest four inputs by a linear map that depends only on the relative positions of the input and output cells.

In a sense, upsampling with factor f is convolution with a fractional input stride of 1/f. So long as f is integral, a natural way to upsample is therefore backwards convolution (sometimes called deconvolution) with an output stride of f. Such an operation is trivial to implement, since it simply reverses the forward and backward passes of convolution. Thus upsampling is performed in-network for end-to-end learning by backpropagation from the pixelwise loss.

Note that the deconvolution filter in such a layer need not be fixed (e.g., to bilinear upsampling), but can be learned. A stack of deconvolution layers and activation functions can even learn a nonlinear upsampling.

In our experiments, we find that in-network upsampling is fast and effective for learning dense prediction. Our best segmentation architecture uses these layers to learn to upsample for refined prediction in Section 4.2.
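A deconvolution (transposed convolution) layer initialized to bilinear interpolation can be sketched as follows in PyTorch (an editorial illustration; the released models use a Caffe bilinear weight filler, and the kernel size, padding, and 21-channel score maps here are assumptions for the example):

```python
import numpy as np
import torch
import torch.nn as nn

def bilinear_kernel(size):
    """Weights of a 2D bilinear interpolation filter of the given size."""
    factor = (size + 1) // 2
    center = factor - 1 if size % 2 == 1 else factor - 0.5
    og = np.ogrid[:size, :size]
    return ((1 - abs(og[0] - center) / factor) *
            (1 - abs(og[1] - center) / factor))

# A learnable 2x upsampling layer over 21-channel score maps, initialized so
# that each channel is bilinearly interpolated (off-diagonal weights are zero).
up = nn.ConvTranspose2d(21, 21, kernel_size=4, stride=2, padding=1, bias=False)
w = torch.zeros_like(up.weight)                     # (21, 21, 4, 4)
k = torch.from_numpy(bilinear_kernel(4)).float()
for c in range(21):
    w[c, c] = k                                     # each class upsamples itself
up.weight.data.copy_(w)

coarse = torch.randn(1, 21, 16, 16)
print(up(coarse).shape)                             # torch.Size([1, 21, 32, 32])
```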

3.4. Patchwise training is loss sampling

In stochastic optimization, gradient computation is driven by the training distribution. Both patchwise training and fully convolutional training can be made to produce any distribution, although their relative computational efficiency depends on overlap and minibatch size. Whole image fully convolutional training is identical to patchwise training where each batch consists of all the receptive fields of the units below the loss for an image (or collection of images). While this is more efficient than uniform sampling of patches, it reduces the number of possible batches. However, random selection of patches within an image may be recovered simply. Restricting the loss to a randomly sampled subset of its spatial terms (or, equivalently, applying a DropConnect mask [36] between the output and the loss) excludes patches from the gradient computation.

If the kept patches still have significant overlap, fully convolutional computation will still speed up training. If gradients are accumulated over multiple backward passes, batches can include patches from several images.²

² Note that not every possible patch is included this way, since the receptive fields of the final layer units lie on a fixed, strided grid. However, by shifting the image right and down by a random value up to the stride, random selection from all possible patches may be recovered.

Sampling in patchwise training can correct class imbalance [27, 7, 2] and mitigate the spatial correlation of dense patches [28, 15]. In fully convolutional training, class balance can also be achieved by weighting the loss, and loss sampling can be used to address spatial correlation.

We explore training with sampling in Section 4.3, and do not find that it yields faster or better convergence for dense prediction. Whole image training is effective and efficient.
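Loss sampling of this kind amounts to masking spatial terms of the loss. A PyTorch sketch (an editorial illustration; the keep probability p and the shapes are arbitrary):

```python
import torch
import torch.nn.functional as F

def sampled_loss(scores, labels, p=0.25):
    """Per-pixel cross-entropy restricted to a random subset of spatial terms,
    i.e. a DropConnect-style mask between the output and the loss."""
    per_pixel = F.cross_entropy(scores, labels, reduction="none")  # (N, H, W)
    keep = (torch.rand_like(per_pixel) < p).float()
    return (per_pixel * keep).sum()

scores = torch.randn(2, 21, 32, 32, requires_grad=True)
labels = torch.randint(0, 21, (2, 32, 32))
loss = sampled_loss(scores, labels)
loss.backward()   # cells with keep == 0 contribute no gradient
```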
4. Segmentation Architecture

We cast ILSVRC classifiers into FCNs and augment them for dense prediction with in-network upsampling and a pixelwise loss. We train for segmentation by fine-tuning. Next, we add skips between layers to fuse coarse, semantic and local, appearance information. This skip architecture is learned end-to-end to refine the semantics and spatial precision of the output.

For this investigation, we train and validate on the PASCAL VOC 2011 segmentation challenge [6]. We train with a per-pixel multinomial logistic loss and validate with the standard metric of mean pixel intersection over union, with the mean taken over all classes, including background. The training ignores pixels that are masked out (as ambiguous or difficult) in the ground truth.

Figure 3. Our DAG nets learn to combine coarse, high layer information with fine, low layer information. Pooling and prediction layers are shown as grids that reveal relative spatial coarseness, while intermediate layers are shown as vertical lines. First row (FCN-32s): Our single-stream net, described in Section 4.1, upsamples stride 32 predictions back to pixels in a single step. Second row (FCN-16s): Combining predictions from both the final layer and the pool4 layer, at stride 16, lets our net predict finer details, while retaining high-level semantic information. Third row (FCN-8s): Additional predictions from pool3, at stride 8, provide further precision.

4.1. From classifier to dense FCN

We begin by convolutionalizing proven classification architectures as in Section 3. We consider the AlexNet³ architecture [20] that won ILSVRC12, as well as the VGG nets [31] and the GoogLeNet⁴ [32] which did exceptionally well in ILSVRC14. We pick the VGG 16-layer net⁵, which we found to be equivalent to the 19-layer net on this task. For GoogLeNet, we use only the final loss layer, and improve performance by discarding the final average pooling layer. We decapitate each net by discarding the final classifier layer, and convert all fully connected layers to convolutions. We append a 1 × 1 convolution with channel dimension 21 to predict scores for each of the PASCAL classes (including background) at each of the coarse output locations, followed by a deconvolution layer to bilinearly upsample the coarse outputs to pixel-dense outputs as described in Section 3.3. Table 1 compares the preliminary validation results along with the basic characteristics of each net. We report the best results achieved after convergence at a fixed learning rate (at least 175 epochs).

³ Using the publicly available CaffeNet reference model.
⁴ Since there is no publicly available version of GoogLeNet, we use our own reimplementation. Our version is trained with less extensive data augmentation, and gets 68.5% top-1 and 88.4% top-5 ILSVRC accuracy.
⁵ Using the publicly available version from the Caffe model zoo.
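The prediction head described above amounts to a 1 × 1 scoring convolution followed by a learnable stride-32 deconvolution. A PyTorch sketch with illustrative sizes (an editorial addition; the released Caffe models additionally pad and crop to align offsets, which is omitted here):

```python
import torch
import torch.nn as nn

# Stand-in for the coarse output of a convolutionalized backbone (e.g. the
# 4096-channel "conv7" layer at stride 32); sizes are illustrative.
features = torch.randn(1, 4096, 16, 16)

num_classes = 21   # 20 PASCAL classes plus background
score = nn.Conv2d(4096, num_classes, kernel_size=1)        # 1x1 scoring layer
upscore = nn.ConvTranspose2d(num_classes, num_classes,     # stride-32 deconvolution
                             kernel_size=64, stride=32, padding=16, bias=False)

coarse = score(features)   # (1, 21, 16, 16) coarse class scores
dense = upscore(coarse)    # (1, 21, 512, 512) pixel-dense scores
```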
Table 1. We adapt and extend three classification convnets. We compare performance by mean intersection over union on the validation set of PASCAL VOC 2011 and by inference time (averaged over 20 trials for a 500 × 500 input on an NVIDIA Tesla K40c). We detail the architecture of the adapted nets with regard to dense prediction: number of parameter layers, receptive field size of output units, and the coarsest stride within the net. (These numbers give the best performance obtained at a fixed learning rate, not best performance possible.)

               FCN-AlexNet   FCN-VGG16   FCN-GoogLeNet⁴
mean IU        39.8          56.0        42.5
forward time   50 ms         210 ms      59 ms
conv. layers   8             16          22
parameters     57M           134M        6M
rf size        355           404         907
max stride     32            32          32

Fine-tuning from classification to segmentation gave reasonable predictions for each net. Even the worst model achieved ∼75% of state-of-the-art performance. The segmentation-equipped VGG net (FCN-VGG16) already appears to be state-of-the-art at 56.0 mean IU on val, compared to 52.6 on test [15]. Training on extra data raises FCN-VGG16 to 59.4 mean IU and FCN-AlexNet to 48.0 mean IU on a subset of val⁷. Despite similar classification accuracy, our implementation of GoogLeNet did not match the VGG16 segmentation result.

4.2. Combining what and where

We define a new fully convolutional net (FCN) for segmentation that combines layers of the feature hierarchy and refines the spatial precision of the output. See Figure 3.

While fully convolutionalized classifiers can be fine-tuned to segmentation as shown in Section 4.1, and even score highly on the standard metric, their output is dissatisfyingly coarse (see Figure 4). The 32 pixel stride at the final prediction layer limits the scale of detail in the upsampled output.
Figure 4. Refining fully convolutional nets by fusing information from layers with different strides improves segmentation detail. The first three images show the output from our 32, 16, and 8 pixel stride nets (see Figure 3).

We address this by adding skips [1] that combine the final prediction layer with lower layers with finer strides. This turns a line topology into a DAG, with edges that skip ahead from lower layers to higher ones (Figure 3). As they see fewer pixels, the finer scale predictions should need fewer layers, so it makes sense to make them from shallower net outputs. Combining fine layers and coarse layers lets the model make local predictions that respect global structure. By analogy to the jet of Koenderink and van Doorn [19], we call our nonlinear feature hierarchy the deep jet.

We first divide the output stride in half by predicting from a 16 pixel stride layer. We add a 1 × 1 convolution layer on top of pool4 to produce additional class predictions. We fuse this output with the predictions computed on top of conv7 (convolutionalized fc7) at stride 32 by adding a 2× upsampling layer and summing⁶ both predictions (see Figure 3). We initialize the 2× upsampling to bilinear interpolation, but allow the parameters to be learned as described in Section 3.3. Finally, the stride 16 predictions are upsampled back to the image. We call this net FCN-16s. FCN-16s is learned end-to-end, initialized with the parameters of the last, coarser net, which we now call FCN-32s. The new parameters acting on pool4 are zero-initialized so that the net starts with unmodified predictions. The learning rate is decreased by a factor of 100.

⁶ Max fusion made learning difficult due to gradient switching.
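The FCN-16s fusion can be sketched as follows (an editorial illustration with assumed feature shapes for a 512 × 512 input; offset cropping and the bilinear initialization of the upsampling layers, described above, are omitted for brevity):

```python
import torch
import torch.nn as nn

# Stand-ins for pool4 (512 channels, stride 16) and conv7 (4096 channels,
# stride 32) feature maps of a 512 x 512 input; sizes are illustrative.
pool4 = torch.randn(1, 512, 32, 32)
conv7 = torch.randn(1, 4096, 16, 16)
C = 21

score_conv7 = nn.Conv2d(4096, C, kernel_size=1)    # stride-32 class scores
score_pool4 = nn.Conv2d(512, C, kernel_size=1)     # extra scores from pool4
nn.init.zeros_(score_pool4.weight)                 # zero-init so the net starts
nn.init.zeros_(score_pool4.bias)                   # from the FCN-32s predictions

up2 = nn.ConvTranspose2d(C, C, 4, stride=2, padding=1, bias=False)     # 2x upsampling
up16 = nn.ConvTranspose2d(C, C, 32, stride=16, padding=8, bias=False)  # 16x to pixels

fused = up2(score_conv7(conv7)) + score_pool4(pool4)   # sum at stride 16
dense = up16(fused)                                    # (1, 21, 512, 512)
```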
Learning this skip net improves performance on the validation set by 3.0 mean IU to 62.4. Figure 4 shows improvement in the fine structure of the output. We compared this fusion with learning only from the pool4 layer, which resulted in poor performance, and simply decreasing the learning rate without adding the skip, which resulted in an insignificant performance improvement without improving the quality of the output.

We continue in this fashion by fusing predictions from pool3 with a 2× upsampling of predictions fused from pool4 and conv7, building the net FCN-8s. We obtain a minor additional improvement to 62.7 mean IU, and find a slight improvement in the smoothness and detail of our output. At this point our fusion improvements have met diminishing returns, both with respect to the IU metric which emphasizes large-scale correctness, and also in terms of the improvement visible e.g. in Figure 4, so we do not continue fusing even lower layers.

Table 2. Comparison of skip FCNs on a subset⁷ of PASCAL VOC 2011 segval. Learning is end-to-end, except for FCN-32s-fixed, where only the last layer is fine-tuned. Note that FCN-32s is FCN-VGG16, renamed to highlight stride.

                 pixel acc.   mean acc.   mean IU   f.w. IU
FCN-32s-fixed    83.0         59.7        45.4      72.0
FCN-32s          89.1         73.3        59.4      81.4
FCN-16s          90.0         75.7        62.4      83.0
FCN-8s           90.3         75.9        62.7      83.2

Refinement by other means Decreasing the stride of pooling layers is the most straightforward way to obtain finer predictions. However, doing so is problematic for our VGG16-based net. Setting the pool5 stride to 1 requires our convolutionalized fc6 to have kernel size 14 × 14 to maintain its receptive field size. In addition to their computational cost, we had difficulty learning such large filters. We attempted to re-architect the layers above pool5 with smaller filters, but did not achieve comparable performance; one possible explanation is that the ILSVRC initialization of the upper layers is important.

Another way to obtain finer predictions is to use the shift-and-stitch trick described in Section 3.2. In limited experiments, we found the cost to improvement ratio from this method to be worse than layer fusion.

4.3. Experimental framework

Optimization We train by SGD with momentum. We use a minibatch size of 20 images and fixed learning rates of 10⁻³, 10⁻⁴, and 5⁻⁵ for FCN-AlexNet, FCN-VGG16, and FCN-GoogLeNet, respectively, chosen by line search. We use momentum 0.9, weight decay of 5⁻⁴ or 2⁻⁴, and doubled learning rate for biases, although we found training to be sensitive to the learning rate alone. We zero-initialize the class scoring layer, as random initialization yielded neither better performance nor faster convergence. Dropout was included where used in the original classifier nets.
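The doubled learning rate for biases corresponds to separate parameter groups in a modern framework. A PyTorch sketch (an editorial illustration; the placeholder model stands for any of the FCNs above, and excluding biases from weight decay is an assumption borrowed from common Caffe practice, not stated in the paper):

```python
import torch
import torch.nn as nn

model = nn.Conv2d(4096, 21, 1)   # placeholder for an FCN; any nn.Module works

base_lr, momentum, weight_decay = 1e-4, 0.9, 5e-4
weights = [p for n, p in model.named_parameters() if not n.endswith("bias")]
biases = [p for n, p in model.named_parameters() if n.endswith("bias")]

optimizer = torch.optim.SGD(
    [{"params": weights, "lr": base_lr, "weight_decay": weight_decay},
     {"params": biases, "lr": 2 * base_lr, "weight_decay": 0}],  # doubled lr for biases
    lr=base_lr, momentum=momentum)
```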
Fine-tuning We fine-tune all layers by backpropagation through the whole net. Fine-tuning the output classifier alone yields only 70% of the full fine-tuning performance as compared in Table 2. Training from scratch is not feasible considering the time required to learn the base classification nets. (Note that the VGG net is trained in stages, while we initialize from the full 16-layer version.) Fine-tuning takes three days on a single GPU for the coarse FCN-32s version, and about one day each to upgrade to the FCN-16s and FCN-8s versions.

More Training Data The PASCAL VOC 2011 segmentation training set labels 1112 images. Hariharan et al. [14] collected labels for a larger set of 8498 PASCAL training images, which was used to train the previous state-of-the-art system, SDS [15]. This training data improves the FCN-VGG16 validation score⁷ by 3.4 points to 59.4 mean IU.

⁷ There are training images from [14] included in the PASCAL VOC 2011 val set, so we validate on the non-intersecting set of 736 images.

Patch Sampling As explained in Section 3.4, our full image training effectively batches each image into a regular grid of large, overlapping patches. By contrast, prior work randomly samples patches over a full dataset [27, 2, 7, 28, 9], potentially resulting in higher variance batches that may accelerate convergence [22]. We study this tradeoff by spatially sampling the loss in the manner described earlier, making an independent choice to ignore each final layer cell with some probability 1 − p. To avoid changing the effective batch size, we simultaneously increase the number of images per batch by a factor 1/p. Note that due to the efficiency of convolution, this form of rejection sampling is still faster than patchwise training for large enough values of p (e.g., at least for p > 0.2 according to the numbers in Section 3.1). Figure 5 shows the effect of this form of sampling on convergence. We find that sampling does not have a significant effect on convergence rate compared to whole image training, but takes significantly more time due to the larger number of images that need to be considered per batch. We therefore choose unsampled, whole image training in our other experiments.

Figure 5. Training on whole images is just as effective as sampling patches, but results in faster (wall time) convergence by making more efficient use of data. Left shows the effect of sampling on convergence rate for a fixed expected batch size, while right plots the same by relative wall time.

Class Balancing Fully convolutional training can balance classes by weighting or sampling the loss. Although our labels are mildly unbalanced (about 3/4 are background), we find class balancing unnecessary.

Dense Prediction The scores are upsampled to the input dimensions by deconvolution layers within the net. Final layer deconvolutional filters are fixed to bilinear interpolation, while intermediate upsampling layers are initialized to bilinear upsampling, and then learned.

Augmentation We tried augmenting the training data by randomly mirroring and "jittering" the images by translating them up to 32 pixels (the coarsest scale of prediction) in each direction. This yielded no noticeable improvement.

Implementation All models are trained and tested with Caffe [18] on a single NVIDIA Tesla K40c. Our models and code are publicly available at http://fcn.berkeleyvision.org.

5. Results

We test our FCN on semantic segmentation and scene parsing, exploring PASCAL VOC, NYUDv2, and SIFT Flow. Although these tasks have historically distinguished between objects and regions, we treat both uniformly as pixel prediction. We evaluate our FCN skip architecture on each of these datasets, and then extend it to multi-modal input for NYUDv2 and multi-task prediction for the semantic and geometric labels of SIFT Flow.

Metrics We report four metrics from common semantic segmentation and scene parsing evaluations that are variations on pixel accuracy and region intersection over union (IU). Let n_ij be the number of pixels of class i predicted to belong to class j, where there are n_cl different classes, and let t_i = Σ_j n_ij be the total number of pixels of class i. We compute:
• pixel accuracy: Σ_i n_ii / Σ_i t_i
• mean accuracy: (1/n_cl) Σ_i n_ii / t_i
• mean IU: (1/n_cl) Σ_i n_ii / (t_i + Σ_j n_ji − n_ii)
• frequency weighted IU: (Σ_k t_k)⁻¹ Σ_i t_i n_ii / (t_i + Σ_j n_ji − n_ii)
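These four metrics follow directly from the confusion matrix n_ij. A NumPy sketch (an editorial illustration; the label maps are random placeholders):

```python
import numpy as np

def segmentation_metrics(pred, gt, n_cl):
    """Pixel acc., mean acc., mean IU, and frequency weighted IU from
    label maps, following the formulas above (n_ij = confusion[i, j])."""
    confusion = np.bincount(gt.ravel() * n_cl + pred.ravel(),
                            minlength=n_cl * n_cl).reshape(n_cl, n_cl)
    n_ii = np.diag(confusion)
    t_i = confusion.sum(axis=1)          # pixels of class i (ground truth)
    pred_i = confusion.sum(axis=0)       # pixels predicted as class i
    iu = n_ii / (t_i + pred_i - n_ii)
    return {"pixel acc": n_ii.sum() / t_i.sum(),
            "mean acc":  np.nanmean(n_ii / t_i),
            "mean IU":   np.nanmean(iu),
            "f.w. IU":   (t_i * iu).sum() / t_i.sum()}

pred = np.random.randint(0, 21, (2, 64, 64))
gt = np.random.randint(0, 21, (2, 64, 64))
print(segmentation_metrics(pred, gt, n_cl=21))
```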
PASCAL VOC Table 3 gives the performance of our FCN-8s on the test sets of PASCAL VOC 2011 and 2012, and compares it to the previous state-of-the-art, SDS [15], and the well-known R-CNN [10]. We achieve the best results on mean IU⁸ by a relative margin of 20%. Inference time is reduced 114× (convnet only, ignoring proposals and refinement) or 286× (overall).

⁸ This is the only metric provided by the test server.

Table 3. Our fully convolutional net gives a 20% relative improvement over the state-of-the-art on the PASCAL VOC 2011 and 2012 test sets and reduces inference time.

              mean IU        mean IU        inference
              VOC2011 test   VOC2012 test   time
R-CNN [10]    47.9           -              -
SDS [15]      52.6           51.6           ∼ 50 s
FCN-8s        62.7           62.2           ∼ 175 ms

NYUDv2 [30] is an RGB-D dataset collected using the Microsoft Kinect.

It has 1449 RGB-D images, with pixelwise labels that have been coalesced into a 40 class semantic segmentation task by Gupta et al. [12]. We report results on the standard split of 795 training images and 654 testing images. (Note: all model selection is performed on PASCAL 2011 val.) Table 4 gives the performance of our model in several variations. First we train our unmodified coarse model (FCN-32s) on RGB images. To add depth information, we train on a model upgraded to take four-channel RGB-D input (early fusion). This provides little benefit, perhaps due to the difficulty of propagating meaningful gradients all the way through the model. Following the success of Gupta et al. [13], we try the three-dimensional HHA encoding of depth, training nets on just this information, as well as a "late fusion" of RGB and HHA where the predictions from both nets are summed at the final layer, and the resulting two-stream net is learned end-to-end. Finally we upgrade this late fusion net to a 16-stride version.

Table 4. Results on NYUDv2. RGBD is early-fusion of the RGB and depth channels at the input. HHA is the depth embedding of [13] as horizontal disparity, height above ground, and the angle of the local surface normal with the inferred gravity direction. RGB-HHA is the jointly trained late fusion model that sums RGB and HHA predictions.

                     pixel acc.   mean acc.   mean IU   f.w. IU
Gupta et al. [13]    60.3         -           28.6      47.0
FCN-32s RGB          60.0         42.2        29.2      43.9
FCN-32s RGBD         61.5         42.4        30.5      45.5
FCN-32s HHA          57.1         35.2        24.2      40.4
FCN-32s RGB-HHA      64.3         44.9        32.8      48.0
FCN-16s RGB-HHA      65.4         46.1        34.0      49.5
SIFT Flow is a dataset of 2,688 images with pixel labels for 33 semantic categories ("bridge", "mountain", "sun"), as well as three geometric categories ("horizontal", "vertical", and "sky"). An FCN can naturally learn a joint representation that simultaneously predicts both types of labels. We learn a two-headed version of FCN-16s with semantic and geometric prediction layers and losses. The learned model performs as well on both tasks as two independently trained models, while learning and inference are essentially as fast as each independent model by itself. The results in Table 5, computed on the standard split into 2,488 training and 200 test images,⁹ show state-of-the-art performance on both tasks.

⁹ Three of the SIFT Flow categories are not present in the test set. We made predictions across all 33 categories, but only included categories actually present in the test set in our evaluation.

Table 5. Results on SIFT Flow⁹ with class segmentation (center) and geometric segmentation (right). Tighe [33] is a non-parametric transfer method. Tighe 1 is an exemplar SVM while 2 is SVM + MRF. Farabet is a multi-scale convnet trained on class-balanced samples (1) or natural frequency samples (2). Pinheiro is a multi-scale, recurrent convnet, denoted RCNN3 (◦3). The metric for geometry is pixel accuracy.

                        pixel acc.   mean acc.   mean IU   f.w. IU   geom. acc.
Liu et al. [23]         76.7         -           -         -         -
Tighe et al. [33]       -            -           -         -         90.8
Tighe et al. [34] 1     75.6         41.1        -         -         -
Tighe et al. [34] 2     78.6         39.2        -         -         -
Farabet et al. [7] 1    72.3         50.8        -         -         -
Farabet et al. [7] 2    78.5         29.6        -         -         -
Pinheiro et al. [28]    77.7         29.8        -         -         -
FCN-16s                 85.2         51.7        39.5      76.1      94.3
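The two-headed net simply attaches a second scoring layer and loss to shared features. A PyTorch sketch (an editorial illustration; the shared feature map, its channel count, and the coarse loss resolution are placeholders, whereas the real model upsamples to pixel resolution before the losses):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Shared FCN features (illustrative); channel counts follow SIFT Flow's labels.
feat = torch.randn(1, 256, 32, 32)
sem_head = nn.Conv2d(256, 33, 1)    # 33 semantic categories
geo_head = nn.Conv2d(256, 3, 1)     # 3 geometric categories

sem_labels = torch.randint(0, 33, (1, 32, 32))
geo_labels = torch.randint(0, 3, (1, 32, 32))

loss = (F.cross_entropy(sem_head(feat), sem_labels) +
        F.cross_entropy(geo_head(feat), geo_labels))   # joint training signal
loss.backward()
```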
Figure 6. Fully convolutional segmentation nets produce state-of-the-art performance on PASCAL. The left column shows the output of our highest performing net, FCN-8s. The second shows the segmentations produced by the previous state-of-the-art system by Hariharan et al. [15]. Notice the fine structures recovered (first row), ability to separate closely interacting objects (second row), and robustness to occluders (third row). The fourth row shows a failure case: the net sees lifejackets in a boat as people.

6. Conclusion

Fully convolutional networks are a rich class of models, of which modern classification convnets are a special case. Recognizing this, extending these classification nets to segmentation, and improving the architecture with multi-resolution layer combinations dramatically improves the state-of-the-art, while simultaneously simplifying and speeding up learning and inference.

Acknowledgements This work was supported in part by DARPA's MSEE and SMISC programs, NSF awards IIS-1427425, IIS-1212798, IIS-1116411, and the NSF GRFP, Toyota, and the Berkeley Vision and Learning Center. We gratefully acknowledge NVIDIA for GPU donation. We thank Bharath Hariharan and Saurabh Gupta for their advice and dataset tools. We thank Sergio Guadarrama for reproducing GoogLeNet in Caffe. We thank Jitendra Malik for his helpful comments. Thanks to Wei Liu for pointing out an issue with our SIFT Flow mean IU computation and an error in our frequency weighted mean IU formula.

References

[1] C. M. Bishop. Pattern recognition and machine learning, page 229. Springer-Verlag New York, 2006.
[2] D. C. Ciresan, A. Giusti, L. M. Gambardella, and J. Schmidhuber. Deep neural networks segment neuronal membranes in electron microscopy images. In NIPS, pages 2852–2860, 2012.
[3] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell. DeCAF: A deep convolutional activation feature for generic visual recognition. In ICML, 2014.
[4] D. Eigen, D. Krishnan, and R. Fergus. Restoring an image taken through a window covered with dirt or rain. In Computer Vision (ICCV), 2013 IEEE International Conference on, pages 633–640. IEEE, 2013.
[5] D. Eigen, C. Puhrsch, and R. Fergus. Depth map prediction from a single image using a multi-scale deep network. arXiv preprint arXiv:1406.2283, 2014.
[6] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge 2011 (VOC2011) Results. http://www.pascal-network.org/challenges/VOC/voc2011/workshop/index.html.
[7] C. Farabet, C. Couprie, L. Najman, and Y. LeCun. Learning hierarchical features for scene labeling. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 2013.
[8] P. Fischer, A. Dosovitskiy, and T. Brox. Descriptor matching with convolutional neural networks: a comparison to SIFT. CoRR, abs/1405.5769, 2014.
[9] Y. Ganin and V. Lempitsky. N⁴-fields: Neural network nearest neighbor fields for image transforms. In ACCV, 2014.
[10] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Computer Vision and Pattern Recognition, 2014.
[11] A. Giusti, D. C. Cireşan, J. Masci, L. M. Gambardella, and J. Schmidhuber. Fast image scanning with deep max-pooling convolutional neural networks. In ICIP, 2013.
[12] S. Gupta, P. Arbelaez, and J. Malik. Perceptual organization and recognition of indoor scenes from RGB-D images. In CVPR, 2013.
[13] S. Gupta, R. Girshick, P. Arbelaez, and J. Malik. Learning rich features from RGB-D images for object detection and segmentation. In ECCV. Springer, 2014.
[14] B. Hariharan, P. Arbelaez, L. Bourdev, S. Maji, and J. Malik. Semantic contours from inverse detectors. In International Conference on Computer Vision (ICCV), 2011.
[15] B. Hariharan, P. Arbeláez, R. Girshick, and J. Malik. Simultaneous detection and segmentation. In European Conference on Computer Vision (ECCV), 2014.
[16] B. Hariharan, P. Arbeláez, R. Girshick, and J. Malik. Hypercolumns for object segmentation and fine-grained localization. In Computer Vision and Pattern Recognition, 2015.
[17] K. He, X. Zhang, S. Ren, and J. Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. In ECCV, 2014.
[18] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014.
[19] J. J. Koenderink and A. J. van Doorn. Representation of local geometry in the visual system. Biological cybernetics, 55(6):367–375, 1987.
[20] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
[21] Y. LeCun, B. Boser, J. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Backpropagation applied to hand-written zip code recognition. In Neural Computation, 1989.
[22] Y. A. LeCun, L. Bottou, G. B. Orr, and K.-R. Müller. Efficient backprop. In Neural networks: Tricks of the trade, pages 9–48. Springer, 1998.
[23] C. Liu, J. Yuen, and A. Torralba. SIFT Flow: Dense correspondence across scenes and its applications. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 33(5):978–994, 2011.
[24] J. Long, N. Zhang, and T. Darrell. Do convnets learn correspondence? In NIPS, 2014.
[25] S. Mallat. A wavelet tour of signal processing. Academic press, 2nd edition, 1999.
[26] O. Matan, C. J. Burges, Y. LeCun, and J. S. Denker. Multi-digit recognition using a space displacement neural network. In NIPS, pages 488–495. Citeseer, 1991.
[27] F. Ning, D. Delhomme, Y. LeCun, F. Piano, L. Bottou, and P. E. Barbano. Toward automatic phenotyping of developing embryos from videos. Image Processing, IEEE Transactions on, 14(9):1360–1371, 2005.
[28] P. H. Pinheiro and R. Collobert. Recurrent convolutional neural networks for scene labeling. In ICML, 2014.
[29] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun. OverFeat: Integrated recognition, localization and detection using convolutional networks. In ICLR, 2014.
[30] N. Silberman, D. Hoiem, P. Kohli, and R. Fergus. Indoor segmentation and support inference from RGBD images. In ECCV, 2012.
[31] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.
[32] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. CoRR, abs/1409.4842, 2014.
[33] J. Tighe and S. Lazebnik. Superparsing: scalable nonparametric image parsing with superpixels. In ECCV, pages 352–365. Springer, 2010.
[34] J. Tighe and S. Lazebnik. Finding things: Image parsing with regions and per-exemplar detectors. In CVPR, 2013.
[35] J. Tompson, A. Jain, Y. LeCun, and C. Bregler. Joint training of a convolutional network and a graphical model for human pose estimation. CoRR, abs/1406.2984, 2014.
[36] L. Wan, M. Zeiler, S. Zhang, Y. L. Cun, and R. Fergus. Regularization of neural networks using DropConnect. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), pages 1058–1066, 2013.
[37] R. Wolf and J. C. Platt. Postal address block location using a convolutional locator network. Advances in Neural Information Processing Systems, pages 745–745, 1994.
[38] M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. In Computer Vision–ECCV 2014, pages 818–833. Springer, 2014.
[39] N. Zhang, J. Donahue, R. Girshick, and T. Darrell. Part-based R-CNNs for fine-grained category detection. In Computer Vision–ECCV 2014, pages 834–849. Springer, 2014.
