LEARNING DENSE CONVOLUTIONAL EMBEDDINGS FOR SEMANTIC SEGMENTATION
ABSTRACT
This paper proposes a new deep convolutional neural network (DCNN) architec-
ture that learns pixel embeddings, such that pairwise distances between the em-
beddings can be used to infer whether or not the pixels lie on the same region.
That is, for any two pixels on the same object, the embeddings are trained to be
similar; for any pair that straddles an object boundary, the embeddings are trained
to be dissimilar. Experimental results show that when this embedding network
is used in conjunction with a DCNN trained on semantic segmentation, there is a
systematic improvement in per-pixel classification accuracy. Our contributions are
integrated in the popular Caffe deep learning framework, and consist of straight-
forward modifications to convolution routines. As such, they can be exploited for
any task involving convolution layers.
1 INTRODUCTION
Deep convolutional neural networks (DCNNs) (LeCun et al., 1998) are the method of choice for a
variety of high-level vision tasks (Razavian et al., 2014). Fully-convolutional DCNNs have recently
been a popular approach to semantic segmentation, because they can be efficiently trained end-to-
end for pixel-level classification (Sermanet et al., 2014; Chen et al., 2014; Long et al., 2014).
A weakness of DCNNs is that they tend to produce smooth and low-resolution predictions, partly
due to the subsampling that is a result of cascaded convolution and max-pooling layers. Many
different strategies have been explored for sharpening the boundaries of predictions produced by
fully-convolutional DCNNs. One popular strategy is to add a dense conditional random field (CRF)
to the end of the DCNN, introducing contextual information to the segmentation via long-range
dependencies in the CRF (Chen et al., 2014; Lin et al., 2015). Another strategy is to reduce the
subsampling effected by convolution and pooling, by using the “hole” algorithm for convolution
(Chen et al., 2014). A third strategy is to add trainable up-sampling stages to the network via “de-
convolution” layers in the DCNN (Noh et al., 2015; Long et al., 2014).
This paper’s strategy, which is complementary to those previously explored, is to train the network to
produce segmentation-like internal representations, so that foreground pixels and background pixels
within local patches can be treated differently. In particular, the aim is to increase the sharpness
of the DCNN’s final output by using local pixel affinities to filter and re-weight the final layer’s
activations. For instance, as can be seen in Figure 1, if a DCNN is centered on a “boat” pixel, but
the surrounding patch includes some pixels from an occluder or the background, the DCNN’s final
prediction will typically reflect the presence of the distractors by outputting a mix of “boat” and
“background”. The approach of this paper is to learn and use semantic affinities between pixels, so
that the DCNN output centered at the “boat” pixel can be strengthened by using information from
other “boat” pixels within the patch. More generally, the approach allows the prediction at any pixel
to be replaced with a weighted average of the similar neighboring predictions. This has the effect of
sharpening the predictions at object boundaries, while making predictions within object boundaries
more uniform.
The key to accomplishing this is to have the network produce internal representations that lend
themselves to pairwise comparisons, such that any pair that lies on the same object will produce a
high affinity measure, and pairs that straddle a boundary produce a low affinity measure. Prior work
has investigated the use of affinity cues in similar contexts (Ren & Malik, 2003; Dai et al., 2014), but
Figure 1: Semantic segmentation predictions produced by DCNNs can be sharpened by filtering, via this work’s learned embeddings trained to capture semantic region similarity. (Panels: input image, embeddings, FC8 output, sharpened FC8 output.)
these required handcrafted algorithms for computing the affinity information, and would typically
be pre-computed in a separate process. The current work is unique for learning the cues directly
from image data, and for computing the affinities densely and “on the fly” within a DCNN.
The learned embeddings and their distance functions are implemented efficiently as convolution-
like layers in Caffe (Jia et al., 2014). The embedding layers can either be trained independently,
or integrated in the full DCNN pipeline and trained end-to-end (with or without per-pixel labels).
Source code and trained embeddings will be publicly released.
2 RELATED WORK
This work is closely related to three major research topics in feature learning and computer vision:
metric learning, segmentation-aware descriptor construction, and DCNNs.
Metric learning. The goal of metric learning is to produce features from which one can estimate
similarity between pixels or regions in the input (Frome et al., 2007). Bromley et al. (1994) pioneered learning such descriptors with a convolutional network, and subsequent related work has
yielded compelling results for tasks such as wide-baseline stereo correspondence (Han et al., 2015;
Zagoruyko & Komodakis, 2015; Žbontar & LeCun, 2014). Recently, the topic of metric learning has
been studied extensively in conjunction with image descriptors, such as SIFT and SID (Trulls et al.,
2013; Simo-Serra et al., 2015), improving the applicability of those descriptors to patch-matching
problems. Most prior work in metric learning has concerned the task of finding one-to-one corre-
spondences between pixels seen from different viewpoints. The current work, in contrast, concerns
the task of matching all pairs of points that lie on the same region. This requires a higher degree
of invariance than has previously been necessary – not only to rotation, scale, and partial occlusion,
but to objects’ interior details.
Segmentation-aware descriptors. The purpose of a segmentation-aware descriptor is to capture the
appearance of the foreground while being invariant to changes in the background or occlusions. To
date, most work in this domain has been on developing handcrafted segmentation-aware descriptors.
For instance, soft segmentation masks (Ott & Everingham, 2009; Leordeanu et al., 2012) and bound-
ary cues (Maire et al., 2008; Shi & Malik, 2000) have been used to augment features like SIFT and
HOG, to suppress contributions from pixels likely to come from the background (Trulls et al., 2013;
2014). The intervening contours algorithm (Fowlkes et al., 2003) provides another type of affin-
ity measure, used previously for image segmentation. The current work’s “embeddings” are a first
attempt at developing fully-learned segmentation-aware descriptors. As a secondary contribution,
the current work also implements intervening contours as a layer in a DCNN, using deep-learned
boundary cues from another work (Xie & Tu, 2015). Since the boundary cues require a separate
DCNN, this is meant to represent a costly alternative to the learned embeddings featured here.
DCNNs for semantic segmentation. Fully-convolutional DCNNs are fast and effective semantic
segmentation systems (Long et al., 2014). Part of the appeal of DCNNs is they can be trained end-
to-end, without the need for any handcrafted feature representations. However, DCNNs’ repeated
subsampling (through strided convolution and max-pooling) presents an issue, because it reduces
the resolution at which the DCNN can make predictions. This has been partially addressed by
trainable up-sampling layers (Long et al., 2014), and the “hole” algorithm for convolution (Chen et al., 2014), but state-of-the-art systems also attach a dense CRF (Krähenbühl & Koltun, 2011) to
the DCNN to increase the sharpness of the output. The CRF can be trained as a separate module
(Chen et al., 2014), or jointly with the DCNN (Lin et al., 2015), though both cases are at significant
added computational cost. Another approach to the subsampling issue, more in line with the current
paper, is to incorporate segmentation cues into the DCNN. Dai et al. (2014) recently used superpixels
to generate masks for convolutional feature maps, enforcing sharp contours in their outputs. The
current paper takes this idea further, by replacing the sparse handcrafted segmentation cues with
dense learnable variants.
Contributions. In light of the related work, this paper’s main contributions are as follows.
First, the paper uses a DCNN architecture to learn pixel embeddings, such that pairwise distances
between the embeddings indicate whether or not the pixels belong to the same region. Second, these
embeddings (and their distance functions) are implemented as convolution-like layers in Caffe, with
minimal computational overhead. Third, the learned embeddings are integrated with the state-of-
the-art DeepLab semantic segmentation system, and this is shown to improve performance on the
VOC2012 segmentation task.
3 TECHNICAL APPROACH
This section establishes (1) how segmentation embeddings can be learned from pixel-wise labels, (2)
how the embeddings can be merged with convolution, such that they can be learned without pixel-
wise labels, and finally (3) how contour cues can be used to create an alternative affinity measure.
The goal of the current work is to train a set of convolutional layers to create dense “embeddings”,
which can be used to calculate pixel affinities relating to the semantic similarity of the underlying
regions. Pixel pairs that share a semantic category should produce similar embeddings (i.e., a high
affinity), and pairs that do not share a semantic category should produce dissimilar embeddings (i.e.,
a low affinity).
This goal is represented in a loss function L, which accumulates the quality of embedding pairs
sampled across the image. In this work, pairwise comparisons are made between each pixel i and its
spatial neighbours j ∈ N (i). Collecting pairs within a fixed window lends simplicity and tractabil-
ity, although in general the pairs can be collected at any range. Denoting the quality of a particular
pair of embeddings with ℓij, the overall loss is defined as
L = \sum_{i \in I} \sum_{j \in N(i)} \ell_{ij} .    (1)
The network is trained to minimize this loss through stochastic gradient descent.
The inner loss function ℓij represents how well a pair of embeddings ei and ej respects the affinity
goal. Pixel-wise labels are a convenient resource for quantifying this loss, since they can provide
information on whether or not the pixels belong to the same region. Using this information, the
distance between embeddings can be optimized according to label parity. That is, same-label pairs
can be optimized to have a small distance, and different-label pairs can be optimized to have a large
distance. Denoting the label of pixel i with li , and the embedding at that pixel with ei , the inner loss
is defined as
\ell_{ij} =
\begin{cases}
\max(|e_i - e_j| - \alpha, \, 0) & \text{if } l_i = l_j , \\
\max(\beta - |e_i - e_j|, \, 0) & \text{if } l_i \neq l_j ,
\end{cases}    (2)
where α and β are design parameters that specify the “near” and “far” thresholds against which the
embedding distances are compared. In this work, α = 0.5, and β = 2 are used.
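For illustration, the following minimal NumPy sketch evaluates the pairwise loss of equations (1) and (2) on a dense embedding map (this is not the Caffe implementation; the array shapes, the L1 distance, and the small 3 × 3 neighbourhood are illustrative assumptions):

import numpy as np

def pairwise_embedding_loss(emb, labels, alpha=0.5, beta=2.0, radius=1):
    # emb:    H x W x D array of per-pixel embeddings
    # labels: H x W integer array of per-pixel class labels
    # Accumulates the penalties of equation (2) over every pixel and its
    # neighbours within a (2*radius+1)^2 window, as in equation (1).
    H, W, _ = emb.shape
    total = 0.0
    for i in range(H):
        for j in range(W):
            for di in range(-radius, radius + 1):
                for dj in range(-radius, radius + 1):
                    ni, nj = i + di, j + dj
                    if (di == 0 and dj == 0) or not (0 <= ni < H and 0 <= nj < W):
                        continue
                    d = np.abs(emb[i, j] - emb[ni, nj]).sum()  # L1 embedding distance
                    if labels[i, j] == labels[ni, nj]:
                        total += max(d - alpha, 0.0)  # same label: penalize large distances
                    else:
                        total += max(beta - d, 0.0)   # different label: penalize small distances
    return total

# toy usage on random data
emb = np.random.randn(8, 8, 4)
labels = np.random.randint(0, 3, size=(8, 8))
print(pairwise_embedding_loss(emb, labels))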
The embedding distances can be computed with any distance function. In this work, L1 and L2
norms were tried. Embeddings learned from both distances produced visually appealing masks,
but the L1 -based embeddings were found to be easier to train. Specifically, it was found that the
L1 -based embeddings can safely be trained with a higher learning rate than L2 -based ones, because
they are less vulnerable to the problem of exploding gradients. Figure 2 shows visualizations of the
L1 -based embeddings learned by the network.
Figure 2: Embeddings and local masks are computed densely for input images. For four locations in the image shown on the left, the figure shows the extracted patch, embeddings (compressed to 3 dimensions by PCA for visualization purposes), and embedding-based mask (left to right). For comparison, the mask generated by photometric color distances is shown on the far right.
Once these embeddings are learned, they can be used to create segmentation masks. For a pixel i
and a neighbour pixel j ∈ N (i), one can define
m_j = \exp(-\lambda |e_i - e_j|)    (3)
to be the weight applied to pixel j in a mask centered on i, where λ is a parameter specifying the
hardness of the mask. This parameter can be learned inside a DCNN. The exponential function
scales the masks to the range [0, 1], where similar pixels are given values near 1, and dissimilar
pixels are given values near 0.
These masks can be applied convolutionally, so that the output at location i becomes
y_i = \sum_{j \in N(i)} m_j x_j ,    (4)
where xj is the input at location j. If xj is a vector, the mask is simply applied to every element
of the vector. The effect of this is to replace each input with a weighted average of its similar
neighbours. Since the affinities capture semantic similarity, this is expected to improve the quality
of the output.
Finally, each output is normalized by the sum of the mask, so that the output magnitudes do not
change as a function of neighbourhood size. With normalization, the masking equation (4) becomes
y_i = \frac{\sum_{j \in N(i)} m_j x_j}{\sum_{j \in N(i)} m_j} .    (5)
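For concreteness, a minimal NumPy sketch of equations (3)-(5), using explicit per-pixel loops rather than the convolution-like layers described later (the window radius and the L1 embedding distance are illustrative assumptions):

import numpy as np

def embedding_masked_filter(x, emb, lam=1.0, radius=4):
    # x:   H x W x C input to be filtered (e.g., class scores), float array
    # emb: H x W x D per-pixel embeddings
    # Replaces each input vector with a weighted average of its neighbours,
    # with weights exp(-lam * |e_i - e_j|) as in equations (3)-(5).
    H, W, C = x.shape
    out = np.zeros_like(x)
    for i in range(H):
        for j in range(W):
            acc = np.zeros(C)
            norm = 0.0
            for ni in range(max(0, i - radius), min(H, i + radius + 1)):
                for nj in range(max(0, j - radius), min(W, j + radius + 1)):
                    m = np.exp(-lam * np.abs(emb[i, j] - emb[ni, nj]).sum())  # Eq. (3)
                    acc += m * x[ni, nj]                                       # Eq. (4)
                    norm += m
            out[i, j] = acc / norm                                             # Eq. (5)
    return out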
Note that if xj is an RGB value, and mj is a Gaussian that jointly captures RGB and geometric
distance between pixels i and j, the masking equation (4) is equivalent to the bilateral filter (Tomasi
& Manduchi, 1998), which is a well-known technique in signal processing for smoothing while
preserving edges. Since the filter in the current work depends on the embeddings, and the em-
beddings are learned in a DCNN, the current approach represents a generalization of the bilateral
filter. Interestingly, the Krähenbühl & Koltun (2011) algorithm for dense CRFs is also related to
the bilateral filter, in the sense that the inference step, through mean field approximation, essentially
accomplishes a repeated application of a non-linear filter. This shared connection is appropriate,
since CRFs and embedding-based segmentation masks have common goals: to sharpen predictions
at object boundaries while smoothing in the interior.
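For concreteness, the classical bilateral filter is recovered from the normalized masking operation when the mask weight takes the Gaussian form

m_j = \exp\left( -\frac{\|I_i - I_j\|^2}{2\sigma_r^2} - \frac{\|p_i - p_j\|^2}{2\sigma_s^2} \right),

where I_i and p_i denote the RGB value and spatial coordinates of pixel i, and σ_r, σ_s are range and spatial bandwidths (this notation is introduced here only for illustration).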
When the embeddings are integrated into a larger network that uses them for masks, the embedding
loss function (1) is no longer necessary. Since all terms of the normalized masking equation (5) are
differentiable, the global objective (e.g., classification accuracy) can be used to tune not only the
input term xj , but also the mask term mj . Therefore, the embeddings can be learned end-to-end in
the network when used to create masks.
In this work, the embeddings are trained first with a dedicated loss, then fine-tuned in the larger
pipeline as masks. Figure 2 shows examples of the learned embeddings and masks, as compared
with masks created by photometric color distances.
This work also explores the use of contour cues to generate pixel affinities. In particular, the in-
terest is to see how contour-based affinities compare with the embedding-based affinities. Auto-
matic boundary detection with DCNNs has recently attained excellent results (Xie & Tu, 2015), so
contour-based affinity cues should be a strong baseline to compare against.
For contour cues, this work uses the learned contour cues of a state-of-the-art DCNN trained for the
task, named the Holistically-nested Edge Detection (HED) network (Xie & Tu, 2015). For each pixel
i, the intervening contours algorithm (Fowlkes et al., 2003) is computed for each of its neighbours
j ∈ N (i), to determine the maximum probability of a contour being on a line that travels from i to
j. If two pixels i and j lie on different objects, there is likely to be a boundary separating them; the
intervening contours step returns the probability of that boundary, as provided by the boundary cue.
As with the embeddings, this step is implemented entirely within the DCNN. Intervening contours
are computed with a specialized layer for the task.
4 NETWORK ARCHITECTURE AND IMPLEMENTATION

This section first describes how the ideas of the technical approach were integrated in a DCNN architecture, and then establishes details on how the individual components were implemented efficiently as convolution-like DCNN layers.
Figure 3 illustrates the full network featured in this paper. The input image is sent to two parallel
processing streams, which merge later: a DeepLab network (Chen et al., 2014), and an embeddings
network. Both networks are modelled after the VGG-16 network (Chatfield et al., 2014). The
DeepLab network is the multi-scale large field-of-view model from Chen et al. (2014).
The embeddings network has the following design. The first five layers are initialized from the
earliest convolutional layers in VGG-16. There is a pooling layer after the second layer, and after
the fourth layer, so the five layers capture information at different scales. The output from each of
these layers is sent to pairwise distance computations (im2dist) followed by a loss, so that each
layer develops embedding-like representations. The idea of using a loss at each intermediate layer
is inspired by Xie & Tu (2015), who used this strategy to learn boundary cues in a DCNN.
The outputs from the intermediate embedding layers are upsampled to a common resolution, con-
catenated, and sent to a randomly-initialized convolutional layer with 1 × 1 filters and 64 outputs.
This layer learns a weighted average of the first five convolutional layers’ outputs, and creates the
final 64-dimensional embeddings. The output of this layer is trained in the same way as the others
(with im2dist and a loss), and is used as the final embeddings. The final embeddings are used to
mask the output of DeepLab’s final convolutional layer (i.e., “fc-fusion”), and then sent to a softmax
layer to form prediction scores.
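As an illustration of this combination step, the following NumPy sketch upsamples intermediate embeddings to a common resolution, concatenates them, and applies a 1 × 1 convolution (a per-pixel linear map); the number of feature maps, their shapes, the nearest-neighbour upsampling, and the random weights are illustrative assumptions, not the actual Caffe configuration:

import numpy as np

def upsample_nn(feat, out_h, out_w):
    # Nearest-neighbour upsampling of an H x W x C feature map.
    H, W, _ = feat.shape
    rows = np.arange(out_h) * H // out_h
    cols = np.arange(out_w) * W // out_w
    return feat[rows][:, cols]

def combine_embeddings(feats, weights, out_h, out_w):
    # Upsample each intermediate embedding map to (out_h, out_w), concatenate
    # along channels, and apply a 1 x 1 convolution to produce the final embedding.
    ups = [upsample_nn(f, out_h, out_w) for f in feats]
    cat = np.concatenate(ups, axis=-1)   # H x W x sum(C_l)
    return cat @ weights                 # weights: sum(C_l) x 64

# toy usage with three feature maps at different scales (instead of five)
feats = [np.random.randn(32, 32, 64),
         np.random.randn(16, 16, 128),
         np.random.randn(8, 8, 256)]
W1x1 = 0.01 * np.random.randn(64 + 128 + 256, 64)
final_emb = combine_embeddings(feats, W1x1, 32, 32)
print(final_emb.shape)  # (32, 32, 64)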
To allow an evaluation that treats the embeddings as a modular upgrade to existing semantic seg-
mentation systems, the DeepLab network was kept as a fixed component, and never fine-tuned with
the masks. Better performance can be achieved by training the full pipeline end-to-end.
Although the size of the window used for computing embedding distances has an important effect
at test time (since it specifies the radius in which to search for contributing information), the win-
dow size was not found to have a substantial effect at training time. The embeddings used in the
experiments were trained with losses computed in 9 × 9 windows, with a stride of 2.
Figure 3: Schematic for the DCNN featured in this work. The embedding layers (bottom half of the
schematic) are the main contribution. Embedding layers are indicated with boxes labelled E; the
final embedding layer creates a weighted average of the other embeddings. Loss layers, indicated
with L boxes, provide gradients for each embedding layer.
To implement the intervening contours approach to pixel affinities, the state-of-the-art HED network
Xie & Tu (2015) was used to compute boundary cues, and affinities were computed with the inter-
vening contours algorithm. Cross-validation with a step size of 5 on the hardness parameter λ led to
the choice of λ = 5. Since the HED network has considerable memory requirements, achieving re-
sults directly comparable to those computed with embedding-based affinities is not feasible within the memory constraints of a Tesla K-40 GPU. To overcome this constraint, boundary cues were
pre-computed for the entire PASCAL VOC validation set.
This section provides the implementation details that were required to efficiently integrate the em-
beddings, masks, and intervening contours, with DCNNs. Source code for this work will be made
available online. All new layers are implemented both for CPU and GPU, and are as fast as im2col.
Computing pairwise distances densely across the image is a computationally expensive process. The
current work implements this efficiently by solving it in the same way Caffe (Jia et al., 2014) realizes
convolution: via an image-to-column (im2col) transformation, followed by matrix multiplication.
The current work implements dense local distance computation in a new DCNN layer named
im2dist. For every position i in the feature-map provided by the layer below, a patch of features
is extracted from the neighborhood j ∈ N (i), and local distances are computed between the central
feature and its neighbours. These distances are arranged into a column vector of length K, where K
is the total dimensionality of a patch. This process turns an H × W feature-map into an H × W × K
matrix, where each element in the K dimension holds a distance.
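The following NumPy sketch illustrates the im2dist idea (edge padding, the window radius, and the L1 distance are assumptions made for this illustration; the actual layer is implemented in Caffe for CPU and GPU):

import numpy as np

def im2dist(feat, radius=4):
    # feat: H x W x C feature map.
    # Returns an H x W x K matrix (K = window area) where each entry holds the
    # distance between the centre feature and one neighbour in the window.
    H, W, C = feat.shape
    k = 2 * radius + 1
    padded = np.pad(feat, ((radius, radius), (radius, radius), (0, 0)), mode='edge')
    dist = np.zeros((H, W, k * k))
    idx = 0
    for di in range(k):
        for dj in range(k):
            shifted = padded[di:di + H, dj:dj + W]
            dist[:, :, idx] = np.abs(feat - shifted).sum(axis=-1)  # L1 distance
            idx += 1
    return dist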
To turn these distances into masks, the matrix is passed through an exponential function with a
particular hardness. This corresponds to the mask term definition (3), where the hardness parameter
is specified by λ. In this work, λ = 30 was chosen, based on cross-validation with a step size of 5.
To perform the actual masking, the input to be masked must simply be processed by im2col (produc-
ing another H × W × K matrix), then multiplied pointwise with the masking matrix, and summed
across K. This accomplishes the masking equation (4).
The resulting matrix of predictions can optionally be normalized. To create the normalizing co-
efficients, i.e., the denominator in the normalized masking equation (5), the masking matrix must
simply be summed across K to create a mask sum for every location. The masked output can then
be divided pointwise by the mask sums, creating the final normalized masked output.
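Continuing the sketch above, the exponential, masking, summation, and normalization steps could look as follows (im2col is re-implemented in NumPy here for self-containment; this is an illustration rather than the Caffe code):

import numpy as np

def im2col(x, radius=4):
    # x: H x W x C. Returns H x W x K x C, gathering each pixel's neighbourhood.
    H, W, C = x.shape
    k = 2 * radius + 1
    padded = np.pad(x, ((radius, radius), (radius, radius), (0, 0)), mode='edge')
    cols = np.zeros((H, W, k * k, C))
    idx = 0
    for di in range(k):
        for dj in range(k):
            cols[:, :, idx] = padded[di:di + H, dj:dj + W]
            idx += 1
    return cols

def apply_masks(x, dist, lam=30.0, radius=4, normalize=True):
    # dist: H x W x K output of an im2dist-style computation (see above).
    # Turns distances into masks (Eq. 3), filters x with them (Eq. 4), and
    # optionally normalizes by the mask sums (Eq. 5).
    masks = np.exp(-lam * dist)                   # H x W x K
    cols = im2col(x, radius)                      # H x W x K x C
    out = (masks[..., None] * cols).sum(axis=2)   # H x W x C
    if normalize:
        out = out / masks.sum(axis=2, keepdims=True)
    return out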
The loss function for the embeddings (1) is implemented using similar computational techniques.
First, the label image is processed with a new layer named im2parity. This creates an H × W × K
matrix in which for each pixel i, the K-dimensional column specifies (with {0, 1}) whether or
not the local neighbours j ∈ N (i) have the same label. The result of this process can then be
straightforwardly combined with the result of im2dist, to penalize each distance according to the
correct loss case and threshold in the pairwise loss equation (2).
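A corresponding NumPy sketch of the im2parity idea (again with edge padding and window radius as illustrative assumptions):

import numpy as np

def im2parity(labels, radius=4):
    # labels: H x W integer label map.
    # Returns an H x W x K binary matrix: 1 where the neighbour shares the
    # centre pixel's label, 0 otherwise.
    H, W = labels.shape
    k = 2 * radius + 1
    padded = np.pad(labels, radius, mode='edge')
    parity = np.zeros((H, W, k * k), dtype=np.uint8)
    idx = 0
    for di in range(k):
        for dj in range(k):
            parity[:, :, idx] = (labels == padded[di:di + H, dj:dj + W])
            idx += 1
    return parity

Combined with an im2dist output of the same shape, the entries with parity 1 can be penalized with max(d − α, 0) and the entries with parity 0 with max(β − d, 0), as in equation (2).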
Finally, the intervening contours algorithm is implemented in a similar way, in a layer named
im2interv. For every position i in a boundary probability map provided by the layer below, a K-
dimensional column vector is generated, representing the intervening contour output for each posi-
tion in a neighborhood centered on i. Specifically, for each position in this neighborhood, a straight
line is traced from that position to the center position (using the Bresenham (1965) algorithm), and
the maximum boundary probability along that line is stored. The H × W × K output of this process
can be used in exactly the same way as the output of im2dist.
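A simplified NumPy sketch of the im2interv computation (the line is traced here by dense sampling and rounding rather than the exact Bresenham (1965) traversal, and the window radius is an illustrative assumption):

import numpy as np

def im2interv(boundary_prob, radius=4, samples=16):
    # boundary_prob: H x W map of boundary probabilities.
    # For every pixel, records the maximum boundary probability along a straight
    # line to each neighbour in a (2*radius+1)^2 window, giving H x W x K.
    H, W = boundary_prob.shape
    k = 2 * radius + 1
    out = np.zeros((H, W, k * k))
    for i in range(H):
        for j in range(W):
            idx = 0
            for di in range(-radius, radius + 1):
                for dj in range(-radius, radius + 1):
                    ni = min(max(i + di, 0), H - 1)
                    nj = min(max(j + dj, 0), W - 1)
                    rr = np.round(np.linspace(i, ni, samples)).astype(int)
                    cc = np.round(np.linspace(j, nj, samples)).astype(int)
                    out[i, j, idx] = boundary_prob[rr, cc].max()
                    idx += 1
    return out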
5 EVALUATION
The baseline for the evaluation is the current best publicly-released DeepLab network (“DeepLab-MSc-COCO-LargeFOV”; Chen et al., 2014), which is a strong baseline for semantic segmentation. This
model was initialized from a VGG network trained on ImageNet, then trained on the Microsoft
COCO training and validation sets (Lin et al., 2014), and finally fine-tuned on training and vali-
dation sets of the PASCAL VOC 2012 challenge (Everingham et al., 2012). This network is aug-
mented with embeddings learned on the COCO dataset. Additional baselines are provided by RGB
distances, pre-computed state-of-the-art handcrafted 8-dimensional embeddings (Leordeanu et al., 2012, with default parameters), and pre-computed intervening contour affinities. All baselines were tested using a 9 × 9 affinity window (applied as a mask once).
Evaluation on the PASCAL VOC validation set is presented in Table 1. Note that the publicly-
released DeepLab model was trained on the VOC trainval set, so results on the validation set cannot
be understood to reflect performance at test time, but rather reflect training error. However, since the
embeddings were not trained on any VOC data, improvements to this performance do reflect general
improvements in the model. Accordingly, the experiments of this paper show that improvements on
the validation set translate to improvements on the test set.
The results first of all show that using learned embeddings to mask the output of DeepLab systematically provides a 0.5% to 1.5% improvement in mean intersection-over-union (IOU) accuracy. Moreover, the improvements exceed those attainable by RGB distances, intervening contours, or handcrafted embeddings (Leordeanu et al., 2012). While the numerical difference between the performance of these affinity cues is small (e.g., there is a 0.03% difference between handcrafted embeddings and learned embeddings when both are computed in a 9 × 9 window and applied once), it is important to emphasize that the
learned embeddings are uniquely capable of being fine-tuned. That is, this initial comparison uses
the learned embeddings “off-the-shelf” (as they were learned from the COCO dataset), but realistic
use would involve fine-tuning on all available data. The current work’s experiments on the VOC test
set benefit from this fine-tuning.
Additional validation experiments reported in Table 1 explore the effects of two critical design pa-
rameters: filter window size, and the number of times to apply the filter. A wider window, though
more expensive, is expected to improve performance by allowing information from a wider radius to
contribute to each prediction. The approach was evaluated at window sizes of 7 × 7, 9 × 9, 11 × 11,
and finally 15 × 15. The results confirm that increasing the filter size improves performance.
The second design parameter is the number of times to apply the filter. Once the embeddings and
masks are computed, it is trivial to run the masking process repeatedly. Applying the process multi-
ple times is expected to improve performance, by strengthening the contribution from similar neigh-
bours in the radius. This second parameter also effectively increases the number of contributing
neighbours: after filtering with 7 × 7 filters, each new 7 × 7 region effectively contains information
from an 11 × 11 area in the original input. Therefore, along with results from applying each filter
once, results are presented for applying each filter the maximum possible number of times, given
the memory constraints of a Tesla K-40 GPU. As expected, repeating the application of the filter
improves performance. The best model applied a 9 × 9 mask seven times recursively.
The best embedding configuration determined from the PASCAL VOC validation set was fine-tuned
end-to-end in its full pipeline on the VOC trainval set, and submitted to the VOC test server. As
shown in Table 2, improvement on the test set was approximately 1.2% over the baseline, roughly
consistent with the results observed on the validation set.
As discussed earlier, dense CRFs are often used to sharpen the predictions produced by DCNN-
based semantic segmentation systems like DeepLab. To test if the improvement offered by
embedding-based masks continues through the application of the CRF, a dense CRF (Krähenbühl &
Koltun, 2011) was trained on top of the mask-sharpened outputs, using the PASCAL VOC valida-
tion set for cross-validation of its hyper-parameters. As shown in Table 2, the embedding-augmented
DeepLab outperforms the DeepLab-CRF baseline by 0.4%. Visualizations of the results are shown
in Figure 4.
6 CONCLUSION
This paper proposed a new deep convolutional neural network architecture for learning embeddings.
Results showed that integrating the embeddings into a strong baseline DCNN systematically improved performance by a noticeable margin on both the validation and test sets of the PASCAL VOC 2012 challenge. Compared to results achieved through RGB distances, intervening-contour-based affinities with state-of-the-art boundary cues, and state-of-the-art handcrafted embeddings, the learned embeddings perform better, despite having a much smaller memory footprint. This approach to
improvement is orthogonal to many others pursued in semantic segmentation, and it is implemented
efficiently in the popular Caffe deep-learning framework, making it a useful and straightforward aug-
mentation to existing semantic segmentation frameworks. Finally, although semantic segmentation
is the targeted application of the current work, the overall approach does not depend on pixel-wise
labels, and the embedding and masking layers can be used in any task involving DCNNs.
REFERENCES
Bresenham, J. E. Algorithm for computer control of a digital plotter. IBM Systems Journal, 4(1):25–30, 1965.
Bromley, J., Guyon, I., Lecun, Y., Sackinger, E., and Shah, R. Signature verification using a
“siamese” time delay neural network. In NIPS, 1994.
Chatfield, K., Simonyan, K., Vedaldi, A., and Zisserman, A. Return of the devil in the details:
Delving deep into convolutional nets. In BMVC, 2014.
Chen, L.-C., Papandreou, G., Kokkinos, I., Murphy, K., and Yuille, A. L. Semantic image segmentation with deep convolutional nets and fully connected CRFs. In ICLR, 2014.
Dai, J., He, K., and Sun, J. Convolutional feature masking for joint object and stuff segmentation.
arXiv, 2014.
Everingham, M., Van-Gool, L., Williams, C. K. I., Winn, J., and Zisserman, A. The
PASCAL Visual Object Classes Challenge 2012 (VOC2012) Results. http://www.pascal-
network.org/challenges/VOC/voc2012/workshop/index.html, 2012.
Fowlkes, C., Martin, D., and Malik, J. Learning affinity functions for image segmentation: Com-
bining patch-based and gradient-based approaches. In CVPR, volume 2, pp. 46–54, 2003.
Frome, A., Singer, Y., Sha, F., and Malik, J. Learning globally-consistent local distance functions
for shape-based image retrieval and classification. In ICCV, pp. 1–8, 2007.
Han, X., Leung, T., Jia, Y., Sukthankar, R., and Berg, A.C. MatchNet: Unifying feature and metric
learning for patch-based matching. In CVPR, pp. 3279–3286, 2015.
Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S., and Darrell,
T. Caffe: Convolutional architecture for fast feature embedding. arXiv, 2014.
Krähenbühl, P. and Koltun, V. Efficient inference in fully connected CRFs with Gaussian edge potentials. In NIPS, pp. 109–117, 2011.
LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document
recognition. Proc. IEEE, 86(11):2278–2324, 1998.
Leordeanu, M., Sukthankar, R., and Sminchisescu, C. Efficient closed-form solution to generalized
boundary detection. In ECCV, pp. 516–529, 2012.
Lin, G., Shen, C., Reid, I. D., and van den Hengel, A. Efficient piecewise training of deep structured
models for semantic segmentation. arXiv, 2015.
Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., and Zitnick, C. L. Microsoft COCO: Common objects in context. In ECCV, pp. 740–755, 2014.
Long, J., Shelhamer, E., and Darrell, T. Fully convolutional networks for semantic segmentation. In
CVPR, 2014.
Maire, M., Arbeláez, P., Fowlkes, C., and Malik, J. Using contours to detect and localize junctions
in natural images. In CVPR, pp. 1–8, 2008.
Noh, H., Hong, S., and Han, B. Learning deconvolution network for semantic segmentation. In ICCV, 2015.
Ott, P. and Everingham, M. Implicit color segmentation features for pedestrian and object detection.
In ICCV, pp. 723–730, 2009.
Razavian, A. S., Azizpour, H., Sullivan, J., and Carlsson, S. CNN features off-the-shelf: An astounding baseline for recognition. In CVPR, pp. 512–519, 2014.
Ren, X. and Malik, J. Learning a classification model for segmentation. In ICCV, pp. 10–17, 2003.
Sermanet, P., Eigen, D., Zhang, X., Mathieu, M., Fergus, R., and LeCun, Y. OverFeat: Integrated
recognition, localization and detection using convolutional networks. In ICLR, 2014.
Shi, J. and Malik, J. Normalized cuts and image segmentation. TPAMI, 22(8):888–905, 2000.
Simo-Serra, E., Trulls, E., Ferraz, L., Kokkinos, I., Fua, P., and Moreno-Noguer, F. Discriminative
learning of deep convolutional feature point descriptors. In ICCV, 2015.
Tomasi, C. and Manduchi, R. Bilateral filtering for gray and color images. In ICCV, pp. 839–846,
1998.
Trulls, E., Kokkinos, I., Sanfeliu, A., and Moreno-Noguer, F. Dense segmentation-aware descriptors.
In CVPR, pp. 2890–2897, 2013.
Trulls, E., Tsogkas, S., Kokkinos, I., Sanfeliu, A., and Moreno-Noguer, F. Segmentation-aware
deformable part models. In CVPR, pp. 168–175, 2014.
Xie, S. and Tu, Z. Holistically-nested edge detection. In CVPR, 2015.
Zagoruyko, S. and Komodakis, N. Learning to compare image patches via convolutional neural
networks. In CVPR, 2015.
Žbontar, J. and LeCun, Y. Computing the stereo matching cost with a convolutional neural network.
In CVPR, 2014.