
Understanding Convolution for Semantic Segmentation

Panqu Wang¹, Pengfei Chen¹, Ye Yuan², Ding Liu³, Zehua Huang¹, Xiaodi Hou¹, Garrison Cottrell⁴
¹TuSimple, ²Carnegie Mellon University, ³University of Illinois Urbana-Champaign, ⁴UC San Diego
¹{panqu.wang,pengfei.chen,zehua.huang,xiaodi.hou}@tusimple.ai, ²[email protected], ³[email protected], ⁴[email protected]

arXiv:1702.08502v3 [cs.CV] 1 Jun 2018

Abstract

Recent advances in deep learning, especially deep convolutional neural networks (CNNs), have led to significant improvement over previous semantic segmentation systems. Here we show how to improve pixel-wise semantic segmentation by manipulating convolution-related operations that are of both theoretical and practical value. First, we design dense upsampling convolution (DUC) to generate pixel-level prediction, which is able to capture and decode more detailed information that is generally missing in bilinear upsampling. Second, we propose a hybrid dilated convolution (HDC) framework in the encoding phase. This framework 1) effectively enlarges the receptive fields (RF) of the network to aggregate global information; 2) alleviates what we call the "gridding issue" caused by the standard dilated convolution operation. We evaluate our approaches thoroughly on the Cityscapes dataset, and achieve a state-of-the-art result of 80.1% mIoU on the test set at the time of submission. We have also achieved state-of-the-art results overall on the KITTI road estimation benchmark and the PASCAL VOC2012 segmentation task. Our source code can be found at https://github.com/TuSimple/TuSimple-DUC.
1. Introduction

Semantic segmentation aims to assign a categorical label to every pixel in an image, which plays an important role in image understanding and self-driving systems. The recent success of deep convolutional neural network (CNN) models [17, 26, 13] has enabled remarkable progress in pixel-wise semantic segmentation tasks due to rich hierarchical features and an end-to-end trainable framework [21, 31, 29, 20, 18, 3]. Most state-of-the-art semantic segmentation systems have three key components: 1) a fully convolutional network (FCN), first introduced in [21], replacing the last few fully connected layers by convolutional layers to make efficient end-to-end learning and inference that can take arbitrary input size; 2) Conditional Random Fields (CRFs), to capture both local and long-range dependencies within an image to refine the prediction map; 3) dilated convolution (or Atrous convolution), which is used to increase the resolution of intermediate feature maps in order to generate more accurate predictions while maintaining the same computational cost.

Since the introduction of the FCN in [21], improvements on fully-supervised semantic segmentation systems have generally focused on two perspectives. First, applying deeper FCN models. Significant gains in mean Intersection-over-Union (mIoU) scores on the PASCAL VOC2012 dataset [8] were reported when the 16-layer VGG-16 model [26] was replaced by a 101-layer ResNet-101 [13] model [3]; using a 152-layer ResNet-152 model yields further improvements [28]. This trend is consistent with the performance of these models on ILSVRC [23] object classification tasks, as deeper networks can generally model more complex representations and learn more discriminative features that better distinguish among categories. Second, making CRFs more powerful. This includes applying fully connected pairwise CRFs [16] as a post-processing step [3], integrating CRFs into the network by approximating their mean-field inference steps [31, 20, 18] to enable end-to-end training, and incorporating additional information into CRFs such as edges [15] and object detections [1].

We are pursuing further improvements on semantic segmentation from another perspective: the convolutional operations for both decoding (from intermediate feature map to output label map) and encoding (from input image to feature map). In decoding, most state-of-the-art semantic segmentation systems simply use bilinear upsampling (before the CRF stage) to get the output label map [18, 20, 3]. Bilinear upsampling is not learnable and may lose fine details. Inspired by work in image super-resolution [25], we propose a method called dense upsampling convolution (DUC), which is extremely easy to implement and can achieve pixel-level accuracy: instead of trying to recover the full-resolution label map at once, we learn an array of upscaling filters to upscale the downsized feature maps into the final dense feature map of the desired size. DUC naturally fits the FCN framework by enabling end-to-end training, and it increases the mIoU of pixel-level semantic segmentation on the Cityscapes dataset [5] significantly, especially on objects that are relatively small.

For the encoding part, dilated convolution recently became popular [3, 29, 28, 32], as it maintains the resolution and receptive field of the network by inserting "holes" in the convolution kernels, thus eliminating the need for downsampling (by max-pooling or strided convolution). However, an inherent problem exists in the current dilated convolution framework, which we identify as "gridding": as zeros are padded between two pixels in a convolutional kernel, the receptive field of this kernel only covers an area with a checkerboard pattern - only locations with non-zero values are sampled, losing some neighboring information. The problem gets worse when the rate of dilation increases, generally in higher layers where the receptive field is large: the convolutional kernel is too sparse to cover any local information, since the non-zero values are too far apart. Information that contributes to a fixed pixel always comes from its predefined gridding pattern, thus losing a huge portion of the available information. Here we propose a simple hybrid dilated convolution (HDC) framework as a first attempt to address this problem: instead of using the same rate of dilation for the same spatial resolution, we use a range of dilation rates and concatenate them serially in the same way as the "blocks" in ResNet-101 [13]. We show that HDC helps the network alleviate the gridding problem. Moreover, choosing proper rates can effectively increase the receptive field size and improve the accuracy for objects that are relatively big.

We design DUC and HDC to make convolution operations better serve the needs of pixel-level semantic segmentation. The technical details are described in Section 3 below. Combined with post-processing by Conditional Random Fields (CRFs), we show that this approach achieves state-of-the-art performance on the Cityscapes pixel-level semantic labeling task, the KITTI road estimation benchmark, and the PASCAL VOC2012 segmentation task.

2. Related Work

Decoding of Feature Representation: In the pixel-wise semantic segmentation task, the output label map has the same size as the input image. Because of the max-pooling or strided convolution operations in CNNs, the feature maps of the last few layers of the network are inevitably downsampled. Multiple approaches have been proposed to decode accurate information from the downsampled feature map to label maps. Bilinear interpolation is commonly used [18, 20, 3], as it is fast and memory-efficient. Another popular method is deconvolution, in which the unpooling operation, using stored pooling switches from the pooling step, recovers the information necessary for feature visualization [30]. In [21], a single deconvolutional layer is added in the decoding stage to produce the prediction result using stacked feature maps from intermediate layers. In [7], multiple deconvolutional layers are applied to generate chairs, tables, or cars from several attributes. Noh et al. [22] employ deconvolutional layers as mirrored versions of convolutional layers by using the stored pooled locations in the unpooling step. [22] show that coarse-to-fine object structures, which are crucial to recover fine-detailed information, can be reconstructed along the propagation of the deconvolutional layers. Fischer et al. [9] use a similar mirrored structure, but combine information from multiple deconvolutional layers and perform upsampling to make the final prediction.

Dilated Convolution: Dilated convolution (or Atrous convolution) was originally developed in the algorithme à trous for wavelet decomposition [14]. The main idea of dilated convolution is to insert "holes" (zeros) between pixels in convolutional kernels to increase image resolution, thus enabling dense feature extraction in deep CNNs. In the semantic segmentation framework, dilated convolution is also used to enlarge the field of view of convolutional kernels. Yu & Koltun [29] use serialized layers with increasing rates of dilation to enable context aggregation, while [3] design an "atrous spatial pyramid pooling (ASPP)" scheme to capture multi-scale objects and context information by placing multiple dilated convolution layers in parallel. More recently, dilated convolution has been applied to a broader range of tasks, such as object detection [6], optical flow [24], and audio generation [27].

3. Our Approach

3.1. Dense Upsampling Convolution (DUC)

Suppose an input image has height H, width W, and color channels C, and the goal of pixel-level semantic segmentation is to generate a label map of size H × W where each pixel is labeled with a category label. After feeding the image into a deep FCN, a feature map with dimension h × w × c is obtained at the final layer before making predictions, where h = H/d, w = W/d, and d is the downsampling factor. Instead of performing bilinear upsampling, which is not learnable, or using a deconvolution network (as in [22]), in which zeros have to be padded in the unpooling step before the convolution operation, DUC applies convolutional operations directly on the feature maps to get the dense pixel-wise prediction map. Figure 1 depicts the architecture of our network with a DUC layer.
Figure 1. Illustration of the architecture of the ResNet-101 network with Hybrid Dilated Convolution (HDC) and a Dense Upsampling Convolution (DUC) layer. HDC is applied within the ResNet blocks, and DUC is applied on top of the network and is used for decoding.

The DUC operation is all about convolution, which is performed on the feature map from ResNet of dimension h × w × c to get an output feature map of dimension h × w × (d² × L), where L is the total number of classes in the semantic segmentation task. Thus each layer of the dense convolution is learning the prediction for each pixel. The output feature map is then reshaped to H × W × L with a softmax layer, and an elementwise argmax operator is applied to get the final label map. In practice, the "reshape" operation may not be necessary, as the feature map can be collapsed directly to a vector to be fed into the softmax layer. The key idea of DUC is to divide the whole label map into d² equal subparts which have the same height and width as the incoming feature map. That is to say, we transform the whole label map into a smaller label map with multiple channels. This transformation allows us to apply the convolution operation directly between the input feature map and the output label maps, without the need to insert extra values as in deconvolutional networks (the "unpooling" operation).
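To make the channel-to-space rearrangement concrete, the following sketch (our illustration only; the function name, channel ordering, and array shapes are assumptions, not the released TuSimple-DUC code) turns an h × w × (d² × L) score map into an H × W × L map using plain reshape and transpose operations:

```python
import numpy as np

def duc_rearrange(scores, d):
    """Rearrange an (h, w, d*d*L) DUC output into an (h*d, w*d, L) score map.

    Assumes each group of d*d channels at a coarse location stores the class
    scores of the d*d fine-resolution pixels that the location projects to.
    """
    h, w, c = scores.shape
    L = c // (d * d)
    x = scores.reshape(h, w, d, d, L)   # split channels into a d x d offset grid per class
    x = x.transpose(0, 2, 1, 3, 4)      # interleave coarse rows/cols with their offsets
    return x.reshape(h * d, w * d, L)

# Toy check: a 68 x 68 feature map, downsampling rate 8, 19 Cityscapes classes.
out = duc_rearrange(np.random.randn(68, 68, 8 * 8 * 19), d=8)
assert out.shape == (544, 544, 19)
```

The softmax and elementwise argmax described above would then be applied along the last axis of the rearranged map.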
Since DUC is learnable, it is capable of capturing and recovering fine-detailed information that is generally missing in the bilinear interpolation operation. For example, if a network has a downsampling rate of 1/16 and an object has a length or width of less than 16 pixels (such as a pole or a person far away), then it is more than likely that bilinear upsampling will not be able to recover this object. Meanwhile, the corresponding training labels have to be downsampled to correspond with the output dimension, which already causes information loss for fine details. The prediction of DUC, on the other hand, is performed at the original resolution, thus enabling pixel-level decoding. In addition, the DUC operation can be naturally integrated into the FCN framework, and makes the whole encoding and decoding process end-to-end trainable.

3.2. Hybrid Dilated Convolution (HDC)

In 1-D, dilated convolution is defined as:

    g[i] = Σ_{l=1}^{L} f[i + r · l] · h[l],                                (1)

where f[i] is the input signal, g[i] is the output signal, h[l] denotes the filter of length L, and r corresponds to the dilation rate we use to sample f[i]. In standard convolution, r = 1.
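As a concrete reading of Eq. (1), the short sketch below (our own illustration, not code from the paper's repository) computes a 1-D dilated convolution with a plain loop:

```python
import numpy as np

def dilated_conv1d(f, h, r):
    """1-D dilated convolution following Eq. (1): g[i] = sum_{l=1..L} f[i + r*l] * h[l]."""
    L = len(h)
    n_out = max(len(f) - r * L, 0)      # keep only positions where every sampled tap exists
    g = np.zeros(n_out)
    for i in range(n_out):
        g[i] = sum(f[i + r * l] * h[l - 1] for l in range(1, L + 1))
    return g

# r = 1 reduces to an ordinary (shifted) convolution; r = 2 skips every other input sample.
g = dilated_conv1d(np.arange(10.0), np.array([1.0, -1.0, 2.0]), r=2)
```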
In a semantic segmentation system, 2-D dilated convolution is constructed by inserting "holes" (zeros) between the pixels of the convolutional kernel. For a convolution kernel of size k × k, the size of the resulting dilated filter is kd × kd, where kd = k + (k − 1) · (r − 1). Dilated convolution is used to maintain high-resolution feature maps in an FCN by replacing the max-pooling operation or strided convolution layer while maintaining the receptive field (or "field of view" in [3]) of the corresponding layer. For example, if a convolution layer in ResNet-101 has a stride s = 2, then the stride is reset to 1 to remove the downsampling, and the dilation rate r is set to 2 for all convolution kernels of subsequent layers. This process is applied iteratively through all layers that have a downsampling operation, so that the feature map in the output layer maintains the same resolution as the input layer. In practice, however, dilated convolution is generally applied on feature maps that are already downsampled, to achieve a reasonable efficiency/accuracy trade-off [3].

However, one theoretical issue exists in the above dilated convolution framework, and we call it "gridding" (Figure 2): for a pixel p in a dilated convolutional layer l, the information that contributes to pixel p comes from a nearby kd × kd region in layer l − 1 centered at p. Since dilated convolution introduces zeros into the convolutional kernel, the actual pixels that participate in the computation from the kd × kd region are just k × k, with a gap of r − 1 between them. If k = 3 and r = 2, only 9 out of the 25 pixels in the region are used for the computation (Figure 2 (a)). Since all layers have equal dilation rates r, then for pixel p in the top dilated convolution layer l_top, the maximum possible number of locations that contribute to the calculation of the value of p is (w′ × h′)/r², where w′ and h′ are the width and height of the bottom dilated convolution layer, respectively. As a result, pixel p can only view information in a checkerboard fashion, and loses a large portion (at least 75% when r = 2) of the information. When r becomes large in higher layers due to additional downsampling operations, the samples from the input can be very sparse, which may not be good for learning because 1) local information is completely missing; 2) the information can be irrelevant across large distances. Another outcome of the gridding effect is that pixels in nearby r × r regions at layer l receive information from completely different sets of "grids", which may impair the consistency of local information.
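The numbers quoted above can be checked directly; the snippet below (a small illustration we add here, not part of the original paper) computes the dilated kernel extent and the fraction of positions inside it that actually carry non-zero weights:

```python
def dilated_kernel_size(k, r):
    """Spatial extent kd of a k x k kernel with dilation rate r: kd = k + (k - 1) * (r - 1)."""
    return k + (k - 1) * (r - 1)

k, r = 3, 2
kd = dilated_kernel_size(k, r)            # 5
used, total = k * k, kd * kd              # 9 non-zero taps out of 25 covered positions
print(kd, used, total, used / total)      # 5 9 25 0.36
```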
Figure 2. Illustration of the gridding problem. Left to right: the pixels (marked in blue) that contribute to the calculation of the center pixel (marked in red) through three convolution layers with kernel size 3 × 3. (a) All convolutional layers have a dilation rate r = 2. (b) Subsequent convolutional layers have dilation rates of r = 1, 2, 3, respectively.

Here we propose a simple solution, hybrid dilated convolution (HDC), to address this theoretical issue. Suppose we have N convolutional layers with kernel size K × K that have dilation rates [r_1, ..., r_i, ..., r_n]; the goal of HDC is to let the final receptive field of the series of convolutional operations fully cover a square region without any holes or missing edges. We define the "maximum distance between two nonzero values" as

    M_i = max[ M_{i+1} − 2·r_i,  M_{i+1} − 2·(M_{i+1} − r_i),  r_i ],        (2)

with M_n = r_n. The design goal is to let M_2 ≤ K. For example, for kernel size K = 3, an r = [1, 2, 5] pattern works, as M_2 = 2; however, an r = [1, 2, 9] pattern does not work, as M_2 = 5.
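To make the criterion easy to reuse, here is a short helper (our own sketch; the function names are ours) that evaluates the recursion of Eq. (2) for a list of dilation rates and checks whether M_2 ≤ K:

```python
def hdc_m2(rates):
    """Compute M_2 from Eq. (2) for dilation rates [r_1, ..., r_n] (n >= 2)."""
    M = rates[-1]                      # M_n = r_n
    for r in reversed(rates[1:-1]):    # apply the recursion for i = n-1, ..., 2
        M = max(M - 2 * r, M - 2 * (M - r), r)
    return M

def hdc_ok(rates, K):
    """HDC design goal: M_2 <= kernel size K, so the receptive field has no holes."""
    return hdc_m2(rates) <= K

print(hdc_m2([1, 2, 5]), hdc_ok([1, 2, 5], 3))   # 2 True
print(hdc_m2([1, 2, 9]), hdc_ok([1, 2, 9], 3))   # 5 False
```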
Practically, instead of using the same dilation rate for all layers after the downsampling occurs, we use a different dilation rate for each layer. In our network, the assignment of dilation rates follows a sawtooth wave-like heuristic: a number of layers are grouped together to form the "rising edge" of the wave with an increasing dilation rate, and the next group repeats the same pattern. For example, for all layers that have dilation rate r = 2, we form three succeeding layers as a group and change their dilation rates to 1, 2, and 3, respectively. By doing this, the top layer can access information from a broader range of pixels, in the same region as the original configuration (Figure 2 (b)). This process is repeated through all layers, thus keeping the receptive field unchanged at the top layer.

Another benefit of HDC is that it can use arbitrary dilation rates through the process, thus naturally enlarging the receptive fields of the network without adding extra modules [29], which is important for recognizing objects that are relatively big. One important thing to note, however, is that the dilation rates within a group should not have a common-factor relationship (like 2, 4, 8, etc.); otherwise the gridding problem will still hold for the top layer. This is a key difference between our HDC approach and the atrous spatial pyramid pooling (ASPP) module in [3] or the context aggregation module in [29], where dilation factors with common-factor relationships are used. In addition, HDC is naturally integrated with the original layers of the network, without any need to add extra modules as in [29, 3].

4. Experiments and Results

We report our experiments and results on three challenging semantic segmentation datasets: Cityscapes [5], the KITTI dataset [10] for road estimation, and PASCAL VOC2012 [8]. We use ResNet-101 or ResNet-152 networks that have been pretrained on the ImageNet dataset as a starting point for all of our models. The output layer contains the number of semantic categories to be classified, depending on the dataset (including background, if applicable). We use the cross-entropy error at each pixel over the categories. This is then summed over all pixel locations of the output map, and we optimize this objective function using standard Stochastic Gradient Descent (SGD). We use MXNet [4] to train and evaluate all of our models on NVIDIA TITAN X GPUs.
4.1. Cityscapes Dataset

The Cityscapes dataset is a large dataset that focuses on semantic understanding of urban street scenes. It contains 5000 images with fine annotations across 50 cities, different seasons, and varying scene layouts and backgrounds. The dataset is annotated with 30 categories, of which 19 are included for training and evaluation (the others are ignored). The training, validation, and test sets contain 2975, 500, and 1525 fine images, respectively. An additional 20000 images with coarse (polygonal) annotations are also provided, but are only used for training.

4.1.1 Baseline Model

We use the DeepLab-V2 [3] ResNet-101 framework to train our baseline model. Specifically, the network has a downsampling rate of 8, and dilated convolutions with rates of 2 and 4 are applied to the res4b and res5b blocks, respectively. An ASPP module with dilation rates of 6, 12, 18, and 24 is added on top of the network to extract multiscale context information. The prediction maps and training labels are downsampled by a factor of 8 compared to the size of the original images, and bilinear upsampling is used to get the final prediction. Since the image size in the Cityscapes dataset is 1024 × 2048, which is too big to fit in GPU memory, we partition each image into twelve 800 × 800 patches with partial overlapping, thus augmenting the training set to 35700 images. This data augmentation strategy makes sure that all regions in an image can be visited. It is an improvement over random cropping, in which nearby regions may be visited repeatedly.

We train the network using mini-batch SGD with patch size 544 × 544 (randomly cropped from the 800 × 800 patches) and batch size 12, using multiple GPUs. The initial learning rate is set to 2.5 × 10⁻⁴, and a "poly" learning rate policy (as in [3]) with power = 0.9 is applied. Weight decay is set to 5 × 10⁻⁴, and momentum is 0.9. The network is trained for 20 epochs and achieves an mIoU of 72.3% on the validation set.
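For readers unfamiliar with the "poly" policy, a minimal sketch of the schedule we mean (our own helper, written against the rule described in [3]) is:

```python
def poly_lr(base_lr, cur_iter, max_iter, power=0.9):
    """'Poly' learning-rate policy: decay from base_lr toward 0 over max_iter iterations."""
    return base_lr * (1.0 - cur_iter / float(max_iter)) ** power

# With base_lr = 2.5e-4 as above, the rate halfway through training is about 1.34e-4.
lr = poly_lr(2.5e-4, cur_iter=5000, max_iter=10000)
```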
4.1.2 Dense Upsampling Convolution (DUC)

We examine the effect of DUC on the baseline network. In DUC the only thing we change is the shape of the top convolutional layer. For example, if the dimension of the top convolutional layer is 68 × 68 × 19 in the baseline model (19 is the number of classes), then the dimension of the same layer for a network with DUC will be 68 × 68 × (r² × 19), where r is the total downsampling rate of the network (r = 8 in this case). The prediction map is then reshaped to size 544 × 544 × 19. DUC will introduce extra parameters compared to the baseline model, but only at the top convolutional layer. We train the ResNet-DUC network the same way as the baseline model for 20 epochs, and achieve a mean IoU of 74.3% on the validation set, a 2% increase compared to the baseline model. A visualization of the results of ResNet-DUC and a comparison with the baseline model are shown in Figure 3.

From Figure 3, we can clearly see that DUC is very helpful for identifying small objects, such as poles, traffic lights, and traffic signs. Consistent with our intuition, pixel-level dense upsampling can recover detailed information that is generally missed by bilinear interpolation.

Figure 3. Effect of Dense Upsampling Convolution (DUC) on the Cityscapes validation set. From left to right: input image, ground truth (areas in black are ignored in evaluation), baseline model, and our ResNet-DUC model.

Ablation Studies  We examine the effect of different settings of the network on performance. Specifically, we examine: 1) the downsampling rate of the network, which controls the resolution of the intermediate feature map; 2) whether to apply the ASPP module, and the number of parallel paths in the module; 3) whether to perform 12-fold data augmentation; and 4) the cell size, which determines the size of the neighborhood region (cell × cell) that one predicted pixel projects to. Pixel-level DUC should use cell = 1; however, since the ground-truth labels generally cannot reach pixel-level precision, we also try cell = 2 in the experiments. From Table 1 we can see that making the downsampling rate smaller decreases the accuracy; it also significantly raises the computational cost due to the increased resolution of the feature maps. ASPP generally helps to improve the performance, and increasing the number of ASPP channels from 4 to 6 (dilation rates 6 to 36 with interval 6) yields a 0.2% boost. Data augmentation helps to achieve another 1.5% improvement. Using cell = 2 yields slightly better performance than cell = 1, and it helps to reduce the computational cost by decreasing the number of channels of the last convolutional layer by a factor of 4.

Network    DS   ASPP   Augmentation   Cell   mIoU
Baseline   8    4      yes            n/a    72.3
Baseline   4    4      yes            n/a    70.9
DUC        8    no     no             1      71.9
DUC        8    4      no             1      72.8
DUC        8    4      yes            1      74.3
DUC        4    4      yes            1      73.7
DUC        8    6      yes            1      74.5
DUC        8    6      yes            2      74.7

Table 1. Ablation studies for applying ResNet-101 on the Cityscapes dataset. DS: downsampling rate of the network. Cell: neighborhood region that one predicted pixel represents.

Bigger Patch Size  Since setting cell = 2 reduces the GPU memory cost of network training, we explore the effect of patch size on performance. Our assumption is that, since the original images are all 1024 × 2048, the network should be trained using patches as big as possible in order to aggregate both local detail and global context information that may help learning. As such, we set the patch size to 880 × 880 and the batch size to 1 on each of the 4 GPUs used in training. Since the patch size exceeds the maximum dimension (800 × 800) of the previous 12-fold data augmentation framework, we adopt a new 7-fold data augmentation strategy: seven center locations with x = 512, y = {256, 512, ..., 1792} are set in the original image; for each center location, an 880 × 880 patch is obtained by randomly setting its center within a 160 × 160 rectangle centered at that location. This strategy makes sure that we can sample all areas in the image, including the edges. Training with a bigger patch size boosts the performance to 75.7%, a 1% improvement over the previous best result.
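The center-sampling step can be written down in a few lines; the sketch below is our own reading of that strategy (the (x, y) axis convention and the jitter range are as stated above, but the helper itself is hypothetical):

```python
import random

def sample_patch_centers(jitter=160):
    """Sample the seven jittered centers used for the 880 x 880 training patches."""
    centers = []
    for y in range(256, 1793, 256):                        # nominal y = 256, 512, ..., 1792
        cx = 512 + random.randint(-jitter // 2, jitter // 2)
        cy = y + random.randint(-jitter // 2, jitter // 2)
        centers.append((cx, cy))                           # one 880 x 880 patch per center
    return centers

print(sample_patch_centers())
```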
Compared with Deconvolution  We compare our DUC model with deconvolution, which also involves learnable upsampling. In particular, we compare with 1) direct deconvolution from the prediction map (downsampled by 8) to the original resolution; 2) deconvolution with an upsampling factor of 2 first, followed by an upsampling factor of 4. We design the deconvolution networks to have approximately the same number of parameters as DUC, and we use the ResNet-DUC bigger-patch setting to train the networks. The above two models achieve mIoU of 75.1% and 75.0%, respectively, lower than the ResNet-DUC model (75.7% mIoU).

Conditional Random Fields (CRFs)  Fully connected CRFs [16] are widely used for improving semantic segmentation quality as a post-processing step of an FCN [3]. We follow the formulation of CRFs as in [3]. We perform a grid search over the parameters on the validation set, and use σα = 15, σβ = 3, σγ = 1, w1 = 3, and w2 = 3 for all of our models. Applying CRFs to our best ResNet-DUC model yields an mIoU of 76.7%, a 1% improvement over the model that does not use CRFs.

4.1.3 Hybrid Dilated Convolution (HDC)

We use the best 101-layer ResNet-DUC model as the starting point for applying HDC. Specifically, we experiment with several variants of the HDC module:

1. No dilation: For all ResNet blocks containing dilation, we make their dilation rate r = 1 (no dilation).

2. Dilation-conv: For all blocks containing dilation, we group every 2 blocks together and make r = 2 for the first block and r = 1 for the second block.

3. Dilation-RF: For the res4b module, which contains 23 blocks with dilation rate r = 2, we group every 3 blocks together and change their dilation rates to 1, 2, and 3, respectively. For the last two blocks, we keep r = 2. For the res5b module, which contains 3 blocks with dilation rate r = 4, we change them to 3, 4, and 5, respectively.

4. Dilation-bigger: For the res4b module, we group every 4 blocks together and change their dilation rates to 1, 2, 5, and 9, respectively. The rates for the last 3 blocks are 1, 2, and 5. For the res5b module, we set the dilation rates to 5, 9, and 17. (A short sketch of how these schedules map onto the block lists is given after this list.)
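As an illustration only (block counts as in the variant descriptions above; this is not the released training configuration), the following assigns per-block dilation rates for the Dilation-RF and Dilation-bigger variants:

```python
def sawtooth_rates(num_blocks, pattern, tail):
    """Tile `pattern` over the blocks and finish with the explicit `tail` rates."""
    body = (num_blocks - len(tail)) // len(pattern) * list(pattern)
    assert len(body) + len(tail) == num_blocks
    return body + list(tail)

# Dilation-RF: res4b has 23 blocks (7 groups of 1,2,3 plus two final blocks kept at r = 2);
# res5b has 3 blocks with rates 3, 4, 5.
res4b_rf, res5b_rf = sawtooth_rates(23, (1, 2, 3), (2, 2)), [3, 4, 5]

# Dilation-bigger: 5 groups of 1,2,5,9 plus a final 1,2,5 for res4b; res5b uses 5, 9, 17.
res4b_big, res5b_big = sawtooth_rates(23, (1, 2, 5, 9), (1, 2, 5)), [5, 9, 17]
```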
Figure 4. Effect of Hybrid Dilated Convolution (HDC) on the Cityscapes validation set. From left to right: input image, ground truth, result
of the ResNet-DUC model, result of the ResNet-DUC-HDC model (Dilation-bigger).
The results are summarized in Table 2. We can see that increasing the receptive field size generally yields higher accuracy. Figure 5 illustrates the effectiveness of the ResNet-DUC-HDC model in eliminating the gridding effect. A visualization result is shown in Figure 4. We can see that our best ResNet-DUC-HDC model performs particularly well on objects that are relatively big.

Network           RF increased   mIoU (without CRF)
No dilation       54             72.9
Dilation-conv     88             75.0
Dilation-RF       116            75.4
Dilation-bigger   256            76.4

Table 2. Results of different variants of the HDC module. "RF increased" is the total increase in receptive field size along a single dimension compared to the layer before the dilation operation.

Figure 5. Effectiveness of HDC in eliminating the gridding effect. First row: ground truth patch. Second row: prediction of the ResNet-DUC model; a strong gridding effect is observed. Third row: prediction of the ResNet-DUC-HDC (Dilation-RF) model.

Deeper Networks  We have also tried replacing our ResNet-101 based model with the ResNet-152 network, which is deeper and achieves better performance on the ILSVRC image classification task than ResNet-101 [13]. Due to the network difference, we first train the ResNet-152 network to learn the parameters in all batch normalization (BN) layers for 10 epochs, and continue fine-tuning the network with these BN parameters fixed for another 20 epochs. The results are summarized in Table 3. We can see that using the deeper ResNet-152 model generally yields better performance than the ResNet-101 model.

Network      Method    Data          mIoU
ResNet-101   Deconv    fine          75.1
ResNet-101   DUC+HDC   fine          76.4
ResNet-101   DUC+HDC   fine+coarse   76.2
ResNet-152   Deconv    fine          76.4
ResNet-152   DUC+HDC   fine          76.7
ResNet-152   DUC+HDC   fine+coarse   77.1

Table 3. Effect of network depth and upsampling method on the Cityscapes validation set (without CRF).

4.1.4 Test Set Results

Our results on the Cityscapes test set are summarized in Table 4. There are separate entries for models trained using fine labels only and for models trained using a combination of fine and coarse labels. Our ResNet-DUC-HDC model achieves 77.6% mIoU using fine data only. Adding coarse data helps us achieve 78.5% mIoU.

In addition, inspired by the design of the VGG network [26], in which a single 5 × 5 convolutional layer can be decomposed into two adjacent 3 × 3 convolutional layers to increase the expressiveness of the network while maintaining the receptive field size, we replaced the 7 × 7 convolutional layer in the original ResNet-101 network with three 3 × 3 convolutional layers. By retraining the updated network, we achieve an mIoU of 80.1% on the test set using a single model without CRF post-processing. Our result achieves state-of-the-art performance on the Cityscapes dataset at the time of submission. Compared with the strong baseline of Chen et al. [3], we improve the mIoU by a significant margin (9.7%), which demonstrates the effectiveness of our approach.

Method                                   mIoU
fine:
FCN 8s [21]                              65.3%
Dilation10 [29]                          67.1%
DeepLabv2-CRF [3]                        70.4%
Adelaide context [18]                    71.6%
ResNet-DUC-HDC (ours)                    77.6%
coarse:
LRR-4x [11]                              71.8%
ResNet-DUC-HDC-Coarse                    78.5%
ResNet-DUC-HDC-Coarse (better network)   80.1%

Table 4. Performance on the Cityscapes test set.
4.2. KITTI Road Segmentation

Dataset  The KITTI road segmentation task contains images of three different categories of road scenes, including 289 training images and 290 test images. The goal is to decide whether each pixel in an image is road or not. It is challenging to use neural-network-based methods due to the limited number of training images. In order to avoid overfitting, we crop patches of 320 × 320 pixels with a stride of 100 pixels from the training images, and use the ResNet-101-DUC model pretrained on ImageNet during training. Other training settings are the same as in the Cityscapes experiments. We did not apply CRFs for post-processing.
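A sliding-window crop like this can be generated as follows (our own sketch; the boundary handling, i.e., clamping a final crop to the image border, is our choice, and KITTI image sizes vary per frame):

```python
def crop_grid(height, width, patch=320, stride=100):
    """Top-left corners of overlapping patch x patch crops taken at the given stride."""
    def starts(size):
        s = list(range(0, max(size - patch, 0) + 1, stride))
        if s[-1] != max(size - patch, 0):
            s.append(max(size - patch, 0))   # clamp one final crop to the border
        return s
    return [(y, x) for y in starts(height) for x in starts(width)]

corners = crop_grid(375, 1242)               # e.g. a typical KITTI image size
```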
Figure 6. Examples of visualization on the KITTI road segmentation test set. The road is marked in red.

Results  We achieve state-of-the-art results at the time of submission without using any additional information from stereo, laser points, or GPS. Specifically, our model attains the highest maximum F1-measure in the urban unmarked (UU ROAD) and urban multiple marked (UMM ROAD) sub-categories and in the overall URBAN ROAD category, as well as the highest average precision across all three sub-categories and the overall category, by the time of submission of this paper. Examples of visualization results are shown in Figure 6. The detailed results are displayed in Table 5.¹

              MaxF     AP
UM ROAD       95.64%   93.50%
UMM ROAD      97.62%   95.53%
UU ROAD       95.17%   92.73%
URBAN ROAD    96.41%   93.88%

Table 5. Performance on different road scenes in the KITTI test set. MaxF: maximum F1-measure; AP: average precision.

¹ For a thorough comparison with other methods, please check http://www.cvlibs.net/datasets/kitti/eval_road.php.

4.3. PASCAL VOC2012 dataset

Dataset  The PASCAL VOC2012 segmentation benchmark contains 1464 training images, 1449 validation images, and 1456 test images. Using the extra annotations provided by [12], the training set is augmented to 10582 images. The dataset has 20 foreground object categories and 1 background class with pixel-level annotation.

Results  We first pretrain our 152-layer ResNet-DUC model using a combination of the augmented VOC2012 training set and the MS-COCO dataset [19], and then finetune the pretrained network using the augmented VOC2012 trainval set. We use a patch size of 512 × 512 (zero-padded) throughout training. All other training strategies are the same as in the Cityscapes experiments. We achieve an mIoU of 83.1% on the test set using a single model, without any model ensemble or multi-scale testing, which is the best-performing method at the time of submission.² The detailed results are displayed in Table 6, and visualizations are shown in Figure 7.

² Result link: http://host.robots.ox.ac.uk:8080/anonymous/LQ2ACW.html

Figure 7. Examples of visualization on the PASCAL VOC2012 segmentation validation set. Left to right: input image, ground truth, our result before CRF, and after CRF.

Method                           mIoU
DeepLabv2-CRF [3]                79.7%
CentraleSupelec Deep G-CRF [2]   80.2%
ResNet-DUC (ours)                83.1%

Table 6. Performance on the PASCAL VOC2012 test set.

5. Conclusion

We propose simple yet effective convolutional operations for improving semantic segmentation systems. We designed a new dense upsampling convolution (DUC) operation to enable pixel-level prediction on feature maps, and hybrid dilated convolution (HDC) to solve the gridding problem, effectively enlarging the receptive fields of the network. Experimental results demonstrate the effectiveness of our framework on various semantic segmentation tasks.


6. Acknowledgments

We thank the members of TuSimple and Gary's Unbelievable Research Unit (GURU) for comments on this work. GWC was supported in part by Guangzhou Science and Technology Planning Project (201704030051) and NSF cooperative agreement SMA 1041755 to the Temporal Dynamics of Learning Center, an NSF Science of Learning Center.

References

[1] A. Arnab, S. Jayasumana, S. Zheng, and P. Torr. Higher order potentials in end-to-end trainable conditional random fields. arXiv preprint arXiv:1511.08119, 2015.
[2] S. Chandra and I. Kokkinos. Fast, exact and multi-scale inference for semantic image segmentation with deep gaussian crfs. arXiv preprint arXiv:1603.08358, 2016.
[3] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. arXiv preprint arXiv:1606.00915, 2016.
[4] T. Chen, M. Li, Y. Li, M. Lin, N. Wang, M. Wang, T. Xiao, B. Xu, C. Zhang, and Z. Zhang. Mxnet: A flexible and efficient machine learning library for heterogeneous distributed systems. arXiv preprint arXiv:1512.01274, 2015.
[5] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele. The cityscapes dataset for semantic urban scene understanding. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[6] J. Dai, Y. Li, K. He, and J. Sun. R-fcn: Object detection via region-based fully convolutional networks. arXiv preprint arXiv:1605.06409, 2016.
[7] A. Dosovitskiy, J. Tobias Springenberg, and T. Brox. Learning to generate chairs with convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1538–1546, 2015.
[8] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge 2012 (VOC2012) Results. http://www.pascal-network.org/challenges/VOC/voc2012/workshop/index.html.
[9] P. Fischer, A. Dosovitskiy, E. Ilg, P. Häusser, C. Hazırbaş, V. Golkov, P. van der Smagt, D. Cremers, and T. Brox. Flownet: Learning optical flow with convolutional networks. arXiv preprint arXiv:1504.06852, 2015.
[10] J. Fritsch, T. Kuehnl, and A. Geiger. A new performance measure and evaluation benchmark for road detection algorithms. In International Conference on Intelligent Transportation Systems (ITSC), 2013.
[11] G. Ghiasi and C. Fowlkes. Laplacian reconstruction and refinement for semantic segmentation. arXiv preprint arXiv:1605.02264, 2016.
[12] B. Hariharan, P. Arbeláez, L. Bourdev, S. Maji, and J. Malik. Semantic contours from inverse detectors. In 2011 International Conference on Computer Vision, pages 991–998. IEEE, 2011.
[13] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385, 2015.
[14] M. Holschneider, R. Kronland-Martinet, J. Morlet, and P. Tchamitchian. A real-time algorithm for signal analysis with the help of the wavelet transform. In Wavelets, pages 286–297. Springer, 1990.
[15] I. Kokkinos. Pushing the boundaries of boundary detection using deep learning. arXiv preprint arXiv:1511.07386, 2015.
[16] P. Krähenbühl and V. Koltun. Efficient inference in fully connected crfs with gaussian edge potentials. In Advances in Neural Information Processing Systems, 2011.
[17] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
[18] G. Lin, C. Shen, I. Reid, et al. Efficient piecewise training of deep structured models for semantic segmentation. arXiv preprint arXiv:1504.01013, 2015.
[19] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft coco: Common objects in context. In European Conference on Computer Vision, pages 740–755. Springer, 2014.
[20] Z. Liu, X. Li, P. Luo, C.-C. Loy, and X. Tang. Semantic image segmentation via deep parsing network. In Proceedings of the IEEE International Conference on Computer Vision, pages 1377–1385, 2015.
[21] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3431–3440, 2015.
[22] H. Noh, S. Hong, and B. Han. Learning deconvolution network for semantic segmentation. In Proceedings of the IEEE International Conference on Computer Vision, pages 1520–1528, 2015.
[23] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
[24] L. Sevilla-Lara, D. Sun, V. Jampani, and M. J. Black. Optical flow with semantic segmentation and localized layers. arXiv preprint arXiv:1603.03911, 2016.
[25] W. Shi, J. Caballero, F. Huszár, J. Totz, A. P. Aitken, R. Bishop, D. Rueckert, and Z. Wang. Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1874–1883, 2016.
[26] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[27] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu. Wavenet: A generative model for raw audio. arXiv preprint arXiv:1609.03499, 2016.
[28] Z. Wu, C. Shen, and A. v. d. Hengel. High-performance semantic segmentation using very deep fully convolutional networks. arXiv preprint arXiv:1604.04339, 2016.
[29] F. Yu and V. Koltun. Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122, 2015.
[30] M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. In European Conference on Computer Vision, pages 818–833. Springer, 2014.
[31] S. Zheng, S. Jayasumana, B. Romera-Paredes, V. Vineet, Z. Su, D. Du, C. Huang, and P. H. Torr. Conditional random fields as recurrent neural networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 1529–1537, 2015.
[32] B. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso, and A. Torralba. Semantic understanding of scenes through the ade20k dataset. arXiv preprint arXiv:1608.05442, 2016.
