2018 - Understanding Convolution For Semantic Segmentation
Panqu Wang¹, Pengfei Chen¹, Ye Yuan², Ding Liu³, Zehua Huang¹, Xiaodi Hou¹, Garrison Cottrell⁴
¹TuSimple, ²Carnegie Mellon University, ³University of Illinois Urbana-Champaign, ⁴UC San Diego
¹{panqu.wang,pengfei.chen,zehua.huang,xiaodi.hou}@tusimple.ai, ²[email protected], ³[email protected], ⁴[email protected]
performed on the feature map from ResNet of dimension h × w × c to get the output feature map of dimension h × w × (d² × L), where L is the total number of classes in the semantic segmentation task. Thus each layer of the dense convolution is learning the prediction for each pixel. The output feature map is then reshaped to H × W × L with a softmax layer, and an elementwise argmax operator is applied to get the final label map. In practice, the "reshape" operation may not be necessary, as the feature map can be collapsed directly to a vector to be fed into the softmax layer. The key idea of DUC is to divide the whole label map into d² equal subparts which have the same height and width as the incoming feature map. That is to say, we transform the whole label map into a smaller label map with multiple channels. This transformation allows us to apply the convolution operation directly between the input feature map and the output label maps without needing to insert extra values as in deconvolutional networks (the "unpooling" operation).

Since DUC is learnable, it is capable of capturing and recovering fine-detailed information that is generally missing in the bilinear interpolation operation. For example, if a network has a downsampling rate of 1/16, and an object has a length or width less than 16 pixels (such as a pole or a person far away), then it is more than likely that bilinear upsampling will not be able to recover this object. Meanwhile, the corresponding training labels have to be downsampled to correspond with the output dimension, which already causes information loss for fine details. The prediction of DUC, on the other hand, is performed at the original resolution, thus enabling pixel-level decoding. In addition, the DUC operation can be naturally integrated into the FCN framework, and makes the whole encoding and decoding process end-to-end trainable.
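The DUC decoding step described above is essentially a channel-to-space rearrangement, closely related to what some frameworks call a "pixel shuffle". The following NumPy sketch illustrates it under our own assumption about channel ordering; the function name and layout are illustrative, not taken from the paper:

```python
import numpy as np

def duc_decode(feature_logits, d, L):
    # feature_logits: (h, w, d*d*L) output of the top convolution,
    # d: downsampling rate of the network, L: number of classes.
    h, w, _ = feature_logits.shape
    x = feature_logits.reshape(h, w, d, d, L)   # split channels into d x d subparts
    x = x.transpose(0, 2, 1, 3, 4)              # interleave subparts with spatial dims
    return x.reshape(h * d, w * d, L)           # class scores at input resolution

logits = np.random.randn(68, 68, 8 * 8 * 19)    # r = 8 downsampling, 19 classes
scores = duc_decode(logits, d=8, L=19)          # shape (544, 544, 19)
label_map = scores.argmax(axis=-1)              # elementwise argmax -> (544, 544)
print(scores.shape, label_map.shape)
```

Because the rearrangement is a fixed reshape, all learnable capacity sits in the convolution that produces the d² × L channels, which is why DUC adds parameters only at the top layer.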
3.2. Hybrid Dilated Convolution (HDC)

In 1-D, dilated convolution is defined as:

g[i] = Σ_{l=1}^{L} f[i + r · l] h[l],    (1)

where f[i] is the input signal, g[i] is the output signal, h[l] denotes the filter of length L, and r corresponds to the dilation rate we use to sample f[i]. In standard convolution, r = 1.

In a semantic segmentation system, 2-D dilated convolution is constructed by inserting "holes" (zeros) between each pixel in the convolutional kernel. For a convolution kernel of size k × k, the size of the resulting dilated filter is k_d × k_d, where k_d = k + (k − 1) · (r − 1). Dilated convolution is used to maintain high resolution of feature maps in an FCN by replacing the max-pooling operation or strided convolution layer while maintaining the receptive field (or "field of view" in [3]) of the corresponding layer. For example, if a convolution layer in ResNet-101 has a stride s = 2, then the stride is reset to 1 to remove downsampling, and the dilation rate r is set to 2 for all convolution kernels of subsequent layers. This process is applied iteratively through all layers that have a downsampling operation, so the feature map in the output layer can maintain the same resolution as the input layer. In practice, however, dilated convolution is generally applied on feature maps that are already downsampled to achieve a reasonable efficiency/accuracy trade-off [3].
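To make Eq. (1) concrete, here is a minimal NumPy sketch of 1-D dilated convolution; the function name and the "valid"-style boundary handling are our own illustrative choices:

```python
import numpy as np

def dilated_conv1d(f, h, r):
    # Eq. (1): g[i] = sum_{l=1}^{L} f[i + r*l] * h[l].
    # f: input signal, h: filter of length L, r: dilation rate
    # (r = 1 recovers standard convolution). Only output positions
    # where every dilated tap stays in bounds are computed.
    L = len(h)
    out_len = len(f) - r * L            # last tap reads f[i + r*L]
    g = np.zeros(out_len)
    for i in range(out_len):
        for l in range(1, L + 1):       # taps are r samples apart
            g[i] += f[i + r * l] * h[l - 1]
    return g

f = np.arange(20, dtype=float)
h = np.array([1.0, 0.0, -1.0])          # L = 3
print(dilated_conv1d(f, h, r=2))        # samples f at i+2, i+4, i+6
```

Note how the three taps of the r = 2 filter span 6 input samples, the 1-D analogue of the k_d = k + (k − 1)(r − 1) kernel-size formula above.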
However, one theoretical issue exists in the above dilated convolution framework, and we call it "gridding" (Figure 2): for a pixel p in a dilated convolutional layer l, the information that contributes to p comes from a nearby k_d × k_d region in layer l − 1 centered at p. Since dilated convolution introduces zeros in the convolutional kernel, the actual pixels that participate in the computation from the k_d × k_d region are just k × k, with a gap of r − 1 between them. If k = 3 and r = 2, only 9 out of the 25 pixels in the region are used for the computation (Figure 2 (a)). Since all layers have equal dilation rates r, then for pixel p in the top dilated convolution layer l_top, the maximum possible number of locations that contribute to the calculation of the value of p is (w′ × h′)/r², where w′ and h′ are the width and height of the bottom dilated convolution layer, respectively. As a result, pixel p can only view information in a checkerboard fashion, and lose a large portion (at least 75% when r = 2) of the information. When r becomes large in higher layers due to additional downsampling operations, the samples from the input can be very sparse, which may not be good for learning, because 1) local information is completely missing, and 2) the information can be irrelevant across large distances. Another outcome of the gridding effect is that pixels in nearby r × r regions at layer l receive information from completely different sets of "grids", which may impair the consistency of local information.

Figure 2. Illustration of the gridding problem. Left to right: the pixels (marked in blue) that contribute to the calculation of the center pixel (marked in red) through three convolution layers with kernel size 3 × 3. (a) All convolutional layers have a dilation rate r = 2. (b) Subsequent convolutional layers have dilation rates of r = 1, 2, and 3, respectively.
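The gridding effect is easy to verify numerically. The sketch below, our own illustration rather than code from the paper, marks which input pixels can reach the center output pixel through a stack of 3 × 3 dilated convolutions:

```python
import numpy as np

def contributing_pixels(rates, k=3, size=65):
    # Mark which input pixels reach the center output pixel after
    # stacking len(rates) k x k dilated conv layers; rates[i] is the
    # dilation of layer i. Returns a binary 2-D mask over the input.
    mask = np.zeros((size, size))
    mask[size // 2, size // 2] = 1.0        # start from one output pixel
    half = k // 2
    for r in reversed(rates):               # walk back toward the input
        new = np.zeros_like(mask)
        ys, xs = np.nonzero(mask)
        for y, x in zip(ys, xs):
            for dy in range(-half, half + 1):
                for dx in range(-half, half + 1):
                    yy, xx = y + r * dy, x + r * dx
                    if 0 <= yy < size and 0 <= xx < size:
                        new[yy, xx] = 1.0
        mask = new
    return mask

# All rates equal (gridding): a sparse checkerboard of inputs is used.
print(int(contributing_pixels([2, 2, 2]).sum()))   # 49
# HDC-style rates: the same region is covered without holes.
print(int(contributing_pixels([1, 2, 3]).sum()))   # 169 = 13 x 13
```

With three r = 2 layers, only 49 of the 169 pixels in the 13 × 13 receptive field ever participate; with rates 1, 2, 3 the same 13 × 13 region is fully covered, which is exactly the motivation for HDC below.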
Here we propose a simple solution, hybrid dilated convolution (HDC), to address this theoretical issue. Suppose we have N convolutional layers with kernel size K × K that have dilation rates of [r_1, ..., r_i, ..., r_n]; the goal of HDC is to let the final receptive field of this series of convolutional operations fully cover a square region without any holes or missing edges. We define the "maximum distance between two nonzero values" as

M_i = max[M_{i+1} − 2r_i, M_{i+1} − 2(M_{i+1} − r_i), r_i],    (2)

with M_n = r_n. The design goal is to let M_2 ≤ K. For example, for kernel size K = 3, an r = [1, 2, 5] pattern works, as M_2 = 2; however, an r = [1, 2, 9] pattern does not work, as M_2 = 5.
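Equation (2) folds from the top layer downward, so a candidate dilation pattern can be checked in a few lines. A minimal sketch (the helper name is ours), reproducing the two worked examples above:

```python
def max_gap(rates):
    # Eq. (2): maximum distance between two nonzero values.
    # rates: dilation rates [r1, ..., rn] of n stacked K x K layers.
    # Returns M2; the HDC design goal is M2 <= K.
    M = rates[-1]                        # Mn = rn
    for r in reversed(rates[1:-1]):      # fold M_{i+1} down to M2
        M = max(M - 2 * r, 2 * r - M, r)
    return M

print(max_gap([1, 2, 5]))   # 2 -> works for kernel size K = 3
print(max_gap([1, 2, 9]))   # 5 -> fails for kernel size K = 3
```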
Practically, instead of using the same dilation rate for all layers after the downsampling occurs, we use a different dilation rate for each layer. In our network, the assignment of dilation rates follows a sawtooth wave-like heuristic: a number of layers are grouped together to form the "rising edge" of the wave with an increasing dilation rate, and the next group repeats the same pattern. For example, for all layers that have dilation rate r = 2, we form 3 succeeding layers as a group, and change their dilation rates to 1, 2, and 3, respectively. By doing this, the top layer can access information from a broader range of pixels, in the same region as the original configuration (Figure 2 (b)). This process is repeated through all layers, thus leaving the receptive field unchanged at the top layer.

Another benefit of HDC is that it can use arbitrary dilation rates through the process, thus naturally enlarging the receptive field of the network without adding extra modules [29], which is important for recognizing objects that are relatively big. One important thing to note, however, is that the dilation rates within a group should not have a common-factor relationship (like 2, 4, 8, etc.), otherwise the gridding problem will still hold for the top layer. This is a key difference between our HDC approach and the atrous spatial pyramid pooling (ASPP) module in [3], or the context aggregation module in [29], where dilation rates with common-factor relationships are used. In addition, HDC is naturally integrated with the original layers of the network, without any need to add extra modules as in [29, 3].
4. Experiments and Results

We report our experiments and results on three challenging semantic segmentation datasets: Cityscapes [5], the KITTI dataset [10] for road estimation, and PASCAL VOC2012 [8]. We use ResNet-101 or ResNet-152 networks that have been pretrained on the ImageNet dataset as a starting point for all of our models. The output layer contains the number of semantic categories to be classified, depending on the dataset (including background, if applicable). We use the cross-entropy error at each pixel over the categories. This is then summed over all pixel locations of the output map, and we optimize this objective function using standard Stochastic Gradient Descent (SGD). We use MXNet [4] to train and evaluate all of our models on NVIDIA TITAN X GPUs.
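The objective is thus a per-pixel softmax cross-entropy summed over the output map. A NumPy sketch of that computation (shapes and names are our own):

```python
import numpy as np

def pixelwise_cross_entropy(logits, labels):
    # logits: (H, W, L) unnormalized class scores; labels: (H, W) ints.
    # Softmax over the class dimension, stabilized by the max trick.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    H, W, _ = logits.shape
    # Pick the log-probability of the true class at every pixel.
    picked = log_probs[np.arange(H)[:, None], np.arange(W)[None, :], labels]
    return -picked.sum()                 # summed over all pixel locations

logits = np.random.randn(4, 4, 19)       # 19 Cityscapes classes
labels = np.random.randint(0, 19, size=(4, 4))
print(pixelwise_cross_entropy(logits, labels))
```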
4.1. Cityscapes Dataset

The Cityscapes Dataset is a large dataset that focuses on semantic understanding of urban street scenes. The dataset contains 5000 images with fine annotations across 50 cities, different seasons, and varying scene layouts and backgrounds. The dataset is annotated with 30 categories, of which 19 are included for training and evaluation (the others are ignored). The training, validation, and test sets contain 2975, 500, and 1525 fine images, respectively. An additional 20000 images with coarse (polygonal) annotations are also provided, but are used only for training.
4.1.1 Baseline Model

We use the DeepLab-V2 [3] ResNet-101 framework to train our baseline model. Specifically, the network has a downsampling rate of 8, and dilated convolutions with rates of 2 and 4 are applied to the res4b and res5b blocks, respectively. An ASPP module with dilation rates of 6, 12, 18, and 24 is added on top of the network to extract multiscale context information. The prediction maps and training labels are downsampled by a factor of 8 compared to the size of the original images, and bilinear upsampling is used to get the final prediction. Since the image size in the Cityscapes dataset is 1024 × 2048, which is too big to fit in GPU memory, we partition each image into twelve 800 × 800 patches with partial overlapping, thus augmenting the training set to 35700 images. This data augmentation strategy makes sure that all regions in an image can be visited, an improvement over random cropping, in which nearby regions may be visited repeatedly.

We train the network using mini-batch SGD with patch size 544 × 544 (randomly cropped from the 800 × 800 patch) and batch size 12, using multiple GPUs. The initial learning rate is set to 2.5 × 10⁻⁴, and a "poly" learning rate policy (as in [3]) with power = 0.9 is applied. Weight decay is set to 5 × 10⁻⁴, and momentum is 0.9. The network is trained for 20 epochs and achieves an mIoU of 72.3% on the validation set.
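The "poly" policy decays the learning rate polynomially with training progress. A one-line sketch of the schedule (the max_iter value below is arbitrary, for illustration only):

```python
def poly_lr(base_lr, iteration, max_iter, power=0.9):
    # "poly" policy from [3]: lr = base_lr * (1 - iter/max_iter)^power
    return base_lr * (1.0 - iteration / max_iter) ** power

for it in [0, 10000, 20000]:             # base_lr = 2.5e-4 as in our setup
    print(poly_lr(2.5e-4, it, max_iter=20000))
```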
4.1.2 Dense Upsampling Convolution (DUC)
We examine the effect of DUC on the baseline network. In DUC, the only thing we change is the shape of the top convolutional layer. For example, if the dimension of the top convolutional layer is 68 × 68 × 19 in the baseline model (19 is the number of classes), then the dimension of the same layer for a network with DUC will be 68 × 68 × (r² × 19), where r is the total downsampling rate of the network (r = 8 in this case). The prediction map is then reshaped to size 544 × 544 × 19. DUC introduces extra parameters compared to the baseline model, but only at the top convolutional layer. We train the ResNet-DUC network the same way as the baseline model for 20 epochs, and achieve a mean IoU of 74.3% on the validation set, a 2% increase compared to the baseline model. A visualization of the results of ResNet-DUC and a comparison with the baseline model are shown in Figure 3.

From Figure 3, we can clearly see that DUC is very helpful for identifying small objects, such as poles, traffic lights, and traffic signs. Consistent with our intuition, pixel-level dense upsampling can recover detailed information that is generally missed by bilinear interpolation.

Ablation Studies We examine the effect of different settings of the network on performance. Specifically, we examine: 1) the downsampling rate of the network, which controls the resolution of the intermediate feature map; 2) whether to apply the ASPP module, and the number of parallel paths in the module; 3) whether to perform 12-fold data augmentation; and 4) cell size, which determines the size of the neighborhood region (cell × cell) that one predicted pixel projects to. Pixel-level DUC should use cell = 1; however, since the ground-truth label generally cannot reach pixel-level precision, we also try cell = 2 in the experiments.

From Table 1 we can see that making the downsampling rate smaller decreases the accuracy. It also significantly raises the computational cost due to the increasing resolution of the feature maps. ASPP generally helps to improve the performance, and increasing the number of ASPP channels from 4 to 6 (dilation rates 6 to 36 with an interval of 6) yields a 0.2% boost. Data augmentation helps to achieve another 1.5% improvement. Using cell = 2 yields slightly better performance than cell = 1, and it helps to reduce the computational cost by decreasing the number of channels of the last convolutional layer by a factor of 4.

Network    DS   ASPP   Augmentation   Cell   mIoU
Baseline   8    4      yes            n/a    72.3
Baseline   4    4      yes            n/a    70.9
DUC        8    no     no             1      71.9
DUC        8    4      no             1      72.8
DUC        8    4      yes            1      74.3
DUC        4    4      yes            1      73.7
DUC        8    6      yes            1      74.5
DUC        8    6      yes            2      74.7

Table 1. Ablation studies for applying ResNet-101 on the Cityscapes dataset. DS: downsampling rate of the network. ASPP: number of parallel paths in the ASPP module ("no" = module not applied). Cell: neighborhood region that one predicted pixel represents.

Bigger Patch Size Since setting cell = 2 reduces the GPU memory cost of network training, we explore the effect of patch size on performance. Our assumption is that, since the original images are all 1024 × 2048, the network should be trained using patches as big as possible in order to aggregate both local detail and global context information that may help learning. As such, we set the patch size to 880 × 880, and set the batch size to 1 on each of the 4 GPUs used in training. Since the patch size exceeds the maximum dimension (800 × 800) in the previous 12-fold data augmentation framework, we adopt a new 7-fold data augmentation strategy: seven center locations with x = 512, y = {256, 512, ..., 1792} are set in the original image; for each center location, an 880 × 880 patch is obtained by randomly setting its center within a 160 × 160 rectangular area centered at that location. This strategy makes sure that we can sample all areas in the image, including the edges. Training with a bigger patch size boosts the performance to 75.7%, a 1% improvement over the previous best result.
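The 7-fold sampling scheme is straightforward to implement. A sketch, assuming (row, col) image coordinates with x indexing rows and y indexing columns (the coordinate convention is our reading of the text):

```python
import random

def sample_patch_corners(img_h=1024, img_w=2048, patch=880, jitter=160):
    # Seven anchor centers: row x = 512, col y in {256, 512, ..., 1792}.
    # Each patch center is jittered within a jitter x jitter box around
    # its anchor, then clamped so the 880 x 880 patch stays in the image.
    corners = []
    for col in range(256, 1793, 256):
        r = 512 + random.randint(-jitter // 2, jitter // 2)
        c = col + random.randint(-jitter // 2, jitter // 2)
        top = min(max(r - patch // 2, 0), img_h - patch)
        left = min(max(c - patch // 2, 0), img_w - patch)
        corners.append((top, left))
    return corners

print(sample_patch_corners())   # seven (top, left) patch corners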
Figure 3. Effect of Dense Upsampling Convolution (DUC) on the Cityscapes validation set. From left to right: input image, ground truth (black areas are ignored in evaluation), baseline model, and our ResNet-DUC model.
Compared with Deconvolution We compare our DUC model with deconvolution, which also involves learning for upsampling. In particular, we compare with 1) direct deconvolution from the prediction map (downsampled by 8) to the original resolution, and 2) deconvolution with an upsampling factor of 2 first, followed by an upsampling factor of 4. We design the deconvolution network to have approximately the same number of parameters as DUC, and use the ResNet-DUC bigger-patch model to train the networks. The above two models achieve mIoU of 75.1% and 75.0%, respectively, lower than the ResNet-DUC model (75.7% mIoU).

Conditional Random Fields (CRFs) Fully-connected CRFs [16] are widely used for improving semantic segmentation quality as a post-processing step of an FCN [3]. We follow the formulation of CRFs as shown in [3]. We perform a grid search on the parameters on the validation set, and use σα = 15, σβ = 3, σγ = 1, w1 = 3, and w2 = 3 for all of our models. Applying CRFs to our best ResNet-DUC model yields an mIoU of 76.7%, a 1% improvement over the model that does not use CRFs.

3. Dilation-RF: For the res4b module that contains 23 blocks with dilation rate r = 2, we group every 3 blocks together and change their dilation rates to be 1, 2, and 3, respectively. For the last two blocks, we keep r = 2. For the res5b module, which contains 3 blocks with dilation rate r = 4, we change them to 3, 4, and 5, respectively.

4. Dilation-bigger: For the res4b module, we group every 4 blocks together and change their dilation rates to be 1, 2, 5, and 9, respectively. The rates for the last 3 blocks are 1, 2, and 5. For the res5b module, we set the dilation rates to be 5, 9, and 17. (Both grouping schemes are sketched in code at the end of this section.)

The results are summarized in Table 2. We can see that increasing the receptive field size generally yields higher accuracy. Figure 5 illustrates the effectiveness of the ResNet-DUC-HDC model in eliminating the gridding effect. A visualization result is shown in Figure 4. We can see that our best ResNet-DUC-HDC model performs particularly well on objects that are relatively big.
Network RF increased mIoU (without CRF)
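As noted above, the Dilation-RF and Dilation-bigger grouping schemes can be generated mechanically. A sketch (the helper name is ours) that assigns per-block dilation rates:

```python
def hdc_rates(num_blocks, pattern, tail):
    # Assign sawtooth dilation rates to a stack of blocks: repeat
    # `pattern` over full groups, then use `tail` for the leftover
    # blocks at the end of the stack.
    n_grouped = num_blocks - len(tail)
    rates = [pattern[i % len(pattern)] for i in range(n_grouped)]
    return rates + list(tail)

# Dilation-RF: res4b (23 blocks) -> groups of [1, 2, 3], last two keep r = 2
print(hdc_rates(23, [1, 2, 3], [2, 2]))
# Dilation-bigger: res4b -> groups of [1, 2, 5, 9], last three get [1, 2, 5]
print(hdc_rates(23, [1, 2, 5, 9], [1, 2, 5]))
# Dilation-bigger: res5b -> three blocks with rates [5, 9, 17]
print(hdc_rates(3, [5, 9, 17], []))
```

Note that each group avoids common-factor relationships among its rates, consistent with the HDC design rule from Section 3.2.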