Holistically-Nested Edge Detection
Abstract

We develop a new edge detection algorithm that addresses two important issues in this long-standing vision problem: (1) holistic image training and prediction; and (2) multi-scale and multi-level feature learning. Our proposed method, holistically-nested edge detection (HED), performs image-to-image prediction by means of a deep learning model that leverages fully convolutional neural networks and deeply-supervised nets. HED automatically learns rich hierarchical representations (guided by deep supervision on side responses) that are important in order to resolve the challenging ambiguity in edge and object boundary detection. We significantly advance the state-of-the-art on the BSD500 dataset (ODS F-score of .782) and the NYU Depth dataset (ODS F-score of .746), and do so with an improved speed (0.4 s per image) that is orders of magnitude faster than some recent CNN-based edge detection algorithms.

Figure 1. Illustration of the proposed HED algorithm. In the first row: (a) shows an example test image in the BSD500 dataset [28]; (b) shows its corresponding edges as annotated by human subjects; (c) displays the HED results. In the second row: (d), (e), and (f), respectively, show side edge responses from layers 2, 3, and 4 of our convolutional neural networks. In the third row: (g), (h), and (i), respectively, show edge responses from the Canny detector [4] at the scales σ = 2.0, σ = 4.0, and σ = 8.0. HED shows a clear advantage in consistency over Canny.

1. Introduction

In this paper, we address the problem of detecting edges and object boundaries in natural images. This problem is both fundamental and of great importance to a variety of computer vision areas, ranging from traditional tasks such as visual saliency, segmentation, object detection/recognition, tracking and motion analysis, medical imaging, structure-from-motion, and 3D reconstruction, to modern applications like autonomous driving, mobile computing, and image-to-text analysis. It has long been understood that precisely localizing edges in natural images involves visual perception at various "levels" [18, 27]. A relatively comprehensive data collection and cognitive study [28] shows that while different subjects do have somewhat different preferences regarding where to place edges and boundaries, there is nonetheless impressive consistency between subjects, e.g. reaching an F-score of 0.80 in the consistency study [28].
The history of computational edge detection is extremely rich; we now highlight a few representative works that have proven to be of great practical importance. Broadly speaking, one may categorize works into a few groups, such as: I, early pioneering methods like the Sobel detector [20], zero-crossing [27, 37], and the widely adopted Canny detector [4]; II, methods driven by information theory on top of features arrived at through careful manual design, such as Statistical Edges [22], Pb [28], and gPb [1]; and III, learning-based methods that remain reliant on features of human design, such as BEL [5], Multi-scale [30], Sketch Tokens [24], and Structured Edges [6]. In addition, there has been a recent wave of development using Convolutional Neural Networks that emphasizes the importance of automatic hierarchical feature learning, including N4-Fields [10], DeepContour [34], DeepEdge [2], and CSCNN [19]. Prior to this explosive development in deep learning, the Structured Edges method (typically abbreviated SE) [6] emerged as one of the most celebrated systems for edge detection, thanks to its state-of-the-art performance on the BSD500 dataset [28] (with, e.g., an F-score of .746) and its practically significant speed of 2.5 frames per second. Recent CNN-based methods [10, 34, 2, 19] have demonstrated promising F-score performance improvements over SE. However, there still remains large room for improvement in these CNN-based methods, in both F-score performance and in speed — at present, the time to make a prediction ranges from several seconds [10] to a few hours [2] (even when using modern GPUs).

Here, we develop an end-to-end edge detection system, holistically-nested edge detection (HED), that automatically learns the type of rich hierarchical features that are crucial if we are to approach the human ability to resolve ambiguity in natural image edge and object boundary detection. We use the term "holistic" because HED, despite not explicitly modeling structured output, aims to train and predict edges in an image-to-image fashion. With "nested", we emphasize the inherited and progressively refined edge maps produced as side outputs — we intend to show that the path along which each prediction is made is common to each of these edge maps, with successive edge maps being more concise. This integrated learning of hierarchical features is in distinction to previous multi-scale approaches [40, 41, 30] in which scale-space edge fields are neither automatically learned nor hierarchically connected. Figure 1 gives an illustration of an example image together with the human subject ground truth annotation, as well as results by the proposed HED edge detector (including the side responses of the individual layers), and results by the Canny edge detector [4] with different scale parameters. Not only are Canny edges at different scales not directly connected, they also exhibit spatial shift and inconsistency.

The proposed holistically-nested edge detector (HED) tackles two critical issues: (1) holistic image training and prediction, inspired by fully convolutional neural networks [26], for image-to-image classification (the system takes an image as input, and directly produces the edge map image as output); and (2) nested multi-scale feature learning, inspired by deeply-supervised nets [23], that performs deep layer supervision to "guide" early classification results. We find that the favorable characteristics of these underlying techniques manifest in HED being both accurate and computationally efficient.

2. Holistically-Nested Edge Detection

In this section, we describe in detail the formulation of our proposed edge detection system. We start by discussing related neural-network-based approaches, particularly those that emphasize multi-scale and multi-level feature learning. The task of edge and object boundary detection is inherently challenging. After decades of research, a number of key properties have emerged that are likely to play a role in a successful system: (1) carefully designed and/or learned features [28, 5]; (2) multi-scale response fusion [40, 32, 30]; (3) engagement of different levels of visual perception [18, 27, 39, 17], such as mid-level Gestalt law information [7]; (4) incorporation of structural information (the intrinsic correlation carried within the input data and output solution) [6] and context (both short- and long-range interactions) [38]; (5) holistic image prediction (referring to approaches that perform prediction by taking the image contents globally and directly) [25]; (6) exploitation of 3D geometry [15]; and (7) handling of occlusion boundaries [16].

Structured Edges (SE) [6] primarily focuses on three of these aspects: using a large number of manually designed features (property 1), fusing multi-scale responses (property 2), and incorporating structural information (property 4). A recent wave of work using CNNs for patch-based edge prediction [10, 34, 2, 19] contains an alternative common thread that focuses on three aspects: automatic feature learning (property 1), multi-scale response fusion (property 2), and possible engagement of different levels of visual perception (property 3). However, due to the lack of deep supervision (which we include in our method), the multi-scale responses produced at the hidden layers in [2, 19] are less semantically meaningful, since feedback must be back-propagated through the intermediate layers. More importantly, their patch-to-pixel or patch-to-patch strategy results in significantly downgraded training and prediction efficiency. By "holistically-nested", we intend to emphasize that we are producing an end-to-end edge detection system, a strategy inspired by fully convolutional neural networks [26], but with additional deep supervision on top of trimmed VGG nets [36] (shown in Figure 3). In the absence of deep supervision and side outputs, a fully convolutional network [26] (FCN) produces a less satisfactory result (e.g. an F-score of .745 on BSD500) than HED, since edge detection demands highly accurate edge pixel localization. One thing worth mentioning is that our image-to-image training and prediction strategy still does not explicitly engage contextual information, since constraints on neighboring pixel labels are not directly enforced in HED. In addition to the speed gain over patch-based CNN edge detection methods, the performance gain is largely due to three aspects: (1) FCN-like image-to-image training allows us to simultaneously train on a significantly larger number of samples (see Table 4); (2) deep supervision in our model guides the learning of more transparent features (see Table 2); and (3) interpolating the side outputs in the end-to-end learning encourages coherent contributions from each layer (see Table 3).
[Figure 2: schematic comparison of the multi-scale deep architecture variants (a)-(e) discussed in Section 2.1; the diagram's node labels are "Output Layer", "Hidden Layer", and "Output Data".]
2.1. Existing multi-scale and multi-level NN

Due to the nature of hierarchical learning in deep convolutional neural networks, the concept of multi-scale and multi-level learning might differ from situation to situation. For example, multi-scale learning can be "inside" the neural network, in the form of increasingly larger receptive fields and downsampled (strided) layers. In this "inside" case, the feature representations learned in each layer are naturally multi-scale. On the other hand, multi-scale learning can be "outside" of the neural network, for example by "tweaking the scales" of input images. While these two variants have some notable similarities, we have seen both of them applied to various tasks.

We continue by formalizing the possible configurations of multi-scale deep learning into four categories, namely, multi-stream learning, skip-net learning, a single model running on multiple inputs, and training of independent networks. An illustration is shown in Fig 2. Having these possibilities in mind will help make clearer the ways in which our proposed holistically-nested network approach differs from previous efforts and will help to highlight the important benefits in terms of representation and efficiency.

Multi-stream learning [3, 29]: A typical multi-stream learning architecture is illustrated in Fig 2(a). Note that the multiple (parallel) network streams have different parameter numbers and receptive field sizes, corresponding to multiple scales. Input data are simultaneously fed into multiple streams, after which the concatenated feature responses produced by the various streams are fed into a global output layer to produce the final result.

Skip-layer network learning: Examples of this form of network include [26, 14, 2, 33, 10]. The key concept in "skip-layer" network learning is shown in Fig 2(b). Instead of training multiple parallel streams, the topology for the skip-net architecture centers on a primary stream. Links are added to incorporate the feature responses from different levels of the primary network stream, and these responses are then combined in a shared output layer.

A common point in the two settings above is that, in both of the architectures, there is only one output loss function with a single prediction produced. However, in edge detection, it is often favorable (and indeed prevalent) to obtain multiple predictions and to combine the edge maps together.

Single model on multiple inputs: To get multi-scale predictions, one can also run a single network (or networks with tied weights) on multiple (scaled) input images, as illustrated in Fig 2(c). This strategy can be applied at both the training stage (as data augmentation) and the testing stage (as "ensemble testing"). One notable example is the tied-weight pyramid network [8]. This approach is also common in non-deep-learning-based methods [6]. Note that ensemble testing impairs the prediction efficiency of learning systems, especially with deeper models [2, 10].

Training independent networks: As an extreme variant of Fig 2(a), one might pursue Fig 2(d), in which multi-scale predictions are made by training multiple independent networks with different depths and different output loss layers. This might be practically challenging to implement, as the duplication would multiply the amount of resources required for training.

Holistically-nested networks: We list these variants to help clarify the distinction between existing approaches and our proposed holistically-nested network approach, illustrated in Fig 2(e). There is often significant redundancy in existing approaches, in terms of both representation and computational complexity. Our proposed holistically-nested network is a relatively simple variant that is able to produce predictions from multiple scales. The architecture can be interpreted as a "holistically-nested" version of the "independent networks" approach in Fig 2(d), motivating our choice of name. Our architecture comprises a single-stream deep network with multiple side outputs. This architecture resembles several previous works, particularly the deeply-supervised net [23] approach, in which the authors show that hidden layer supervision can improve both optimization and generalization for image classification tasks. The multiple side outputs also give us the flexibility to add an additional fusion layer if a unified output is desired; a minimal code sketch of this design follows.
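To make the single-stream, multi-side-output design concrete, here is a minimal sketch in PyTorch (the paper's own implementation is in Caffe, per Section 4.1; all module and variable names below are ours and illustrative). It wires a trimmed VGG-16-style trunk [36] with a 1x1 convolution classifier on the last layer of each convolutional stage, bilinearly upsamples each side output to the input resolution, and fuses them with a weighted-fusion layer whose weights are initialized to 1/5, matching the setting reported in Section 4.1.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class HEDSketch(nn.Module):
        """Illustrative HED-style network: a single-stream VGG-16-like trunk
        with one 1x1-conv side-output classifier per stage and a learned
        weighted fusion of the upsampled side outputs."""

        def __init__(self):
            super().__init__()
            # (in_channels, out_channels, number of 3x3 convs) per VGG-16 stage
            cfg = [(3, 64, 2), (64, 128, 2), (128, 256, 3),
                   (256, 512, 3), (512, 512, 3)]
            self.stages = nn.ModuleList()
            self.side = nn.ModuleList()
            for c_in, c_out, n in cfg:
                convs = []
                for i in range(n):
                    convs += [nn.Conv2d(c_in if i == 0 else c_out, c_out, 3, padding=1),
                              nn.ReLU(inplace=True)]
                self.stages.append(nn.Sequential(*convs))
                self.side.append(nn.Conv2d(c_out, 1, kernel_size=1))
            self.pool = nn.MaxPool2d(2, stride=2, ceil_mode=True)
            # fusion over the five side outputs, each weight initialized to 1/5
            self.fuse = nn.Conv2d(5, 1, kernel_size=1, bias=False)
            nn.init.constant_(self.fuse.weight, 0.2)

        def forward(self, x):
            h, w = x.shape[2:]
            side_outs, feats = [], x
            for i, stage in enumerate(self.stages):
                if i > 0:
                    feats = self.pool(feats)  # downsample between stages
                feats = stage(feats)
                s = self.side[i](feats)       # one-channel side-output logits
                # upsample each side output back to the input resolution
                side_outs.append(F.interpolate(s, size=(h, w), mode='bilinear',
                                               align_corners=False))
            fused = self.fuse(torch.cat(side_outs, dim=1))
            return side_outs, fused           # sigmoids are applied in the loss

At test time, sigmoid(fused), or alternatively an average of the side outputs (cf. Table 3), serves as the edge probability map.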
2.2. Formulation

Training Phase. We denote our input training data set by S = {(X_n, Y_n), n = 1, ..., N}, where sample X_n = {x_j^(n), j = 1, ..., |X_n|} denotes the raw input image and Y_n = {y_j^(n), j = 1, ..., |X_n|}, y_j^(n) ∈ {0, 1}, denotes the corresponding ground truth binary edge map for image X_n. We subsequently drop the subscript n for notational simplicity, since we consider each image holistically and independently. Our goal is to have a network that learns features from which it is possible to produce edge maps approaching the ground truth. For simplicity, we denote the collection of all standard network layer parameters as W. Suppose in the network we have M side-output layers. Each side-output layer is also associated with a classifier, in which the corresponding weights are denoted as w = (w^(1), ..., w^(M)). We consider the objective function

    L_side(W, w) = Σ_{m=1}^{M} α_m ℓ_side^(m)(W, w^(m)),    (1)

where ℓ_side denotes the image-level loss function for side-outputs. In our image-to-image training, the loss function is computed over all pixels in a training image X = (x_j, j = 1, ..., |X|) and edge map Y = (y_j, j = 1, ..., |X|), y_j ∈ {0, 1}. For a typical natural image, the distribution of edge/non-edge pixels is heavily biased: 90% of the ground truth is non-edge. A cost-sensitive loss function is proposed [...]

[Figure 3: the HED network architecture. An input image X is passed through a single-stream deep network; side-output layers 1-5 (with receptive field sizes 5, 14, 40, 92, and 196) each produce an edge map, and a weighted-fusion layer combines them. Error propagation paths run from both the side-output layers and the weighted-fusion layer back through the network, each supervised by the ground truth Y.]
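The detailed form of this cost-sensitive loss falls on pages missing from this copy, so the sketch below is illustrative only: a per-image class-balanced cross-entropy of the kind the passage motivates, plugged into the weighted sum of Eq. (1). A fusion-output term is included because Figure 3 shows an error propagation path from the weighted-fusion layer; the paper's exact balancing and fusion weighting may differ.

    import torch
    import torch.nn.functional as F

    def balanced_bce(logits, target):
        """Class-balanced cross-entropy for one side output.
        logits, target: (N, 1, H, W); target entries are in {0, 1}.
        Edge pixels are up-weighted by the non-edge fraction beta, so the
        roughly 90% non-edge pixels do not dominate the gradient."""
        beta = 1.0 - target.sum() / target.numel()   # fraction of non-edges
        weight = torch.where(target > 0.5, beta, 1.0 - beta)
        return F.binary_cross_entropy_with_logits(logits, target, weight=weight)

    def hed_loss(side_logits, fused_logits, target, alphas=None):
        """Weighted sum of side-output losses, as in Eq. (1), plus a term
        for the weighted-fusion output."""
        alphas = alphas or [1.0] * len(side_logits)  # alpha_m = 1 (Sec. 4.1)
        loss = sum(a * balanced_bce(s, target)
                   for a, s in zip(alphas, side_logits))
        return loss + balanced_bce(fused_logits, target)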
[...] stage, respectively conv1_2, conv2_2, conv3_3, conv4_3, [...]

[Figure 4: side-output edge responses trained with deep supervision ("w/ deep supervision") versus without ("w/o deep supervision").]
[...] experimental setup, the result on the benchmark dataset (row three of Table 2) differs only marginally in F-score but displays severely degenerated average precision; without direct control and guidance across multiple scales, this network is heavily biased towards learning large-structure edges.

4. Experiments

In this section we discuss our detailed implementation and report the performance of our proposed algorithm.

4.1. Implementation

We implement our framework using the publicly available Caffe library, building on top of the existing implementations of FCN [26] and DSN [23]. Thus, relatively little engineering hacking is required. In our HED system, the whole network is fine-tuned from an initialization with the pre-trained VGG-16 Net model.

Model parameters. In contrast to fine-tuning a CNN to perform image classification or semantic segmentation, adapting a CNN to perform low-level edge detection requires special care. Differences in data distribution, ground truth distribution, and loss function all contribute to difficulties in network convergence, even with the initialization of a pre-trained model. We first use a validation set and follow the evaluation strategy used in [6] to tune the deep model hyper-parameters. The hyper-parameters (and the values we choose) include: mini-batch size (10), learning rate (1e-6), loss weight α_m for each side-output layer (1), momentum (0.9), initialization of the nested filters (0), initialization of the fusion-layer weights (1/5), weight decay (0.0002), and number of training iterations (10,000; divide the learning rate by 10 after 5,000). We focus on the convergence behavior of the network. We observe that whenever training converges, the deviations in F-score on the validation set tend to be very small. In order to investigate whether including additional nonlinearity helps, we also consider a setting in which we add an additional layer (with 50 filters and a ReLU) before each side-output layer; we find that this worsens performance. On another note, we observe that our nested multi-scale framework is insensitive to input image scales; during our training process, we take advantage of this by resizing all the images to 400 × 400 to reduce GPU memory usage and to take advantage of efficient batch processing. In the experiments that follow, we fix the values of all hyper-parameters discussed above to explore the benefits of possible variants of HED.

Consensus sampling. In our approach, we duplicate the ground truth at each side-output layer and resize the (downsampled) side output to its original scale. Thus, there exists a mismatch in the high-level side-outputs: the edge predictions are coarse and global, while the ground truth still contains many weak edges that could even be considered as noise. This issue leads to problematic convergence behavior, even with the help of a pre-trained model. We observe that this mismatch leads to back-propagated gradients that explode at the high-level side-output layers. We therefore adjust how we make use of the ground truth labels in the BSDS dataset to combat this issue. Specifically, the ground truth labels are provided by multiple annotators and thus, implicitly, greater labeler consensus indicates stronger ground truth edges. We adopt a relatively brute-force solution: only assign a pixel a positive label if it is labeled as positive by at least three annotators, and regard all other labeled pixels as negatives. This helps with the problem of gradient explosion in the high-level side-output layers. For the low-level layers, this consensus approach brings additional robustness to edge classification and prevents the network from being distracted by weak edges. Although not fully explored in our paper, a careful handling of consensus levels of ground truth edges might lead to further improvement.
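The consensus rule just described is straightforward to implement. Below is a minimal NumPy sketch, assuming the BSDS ground truth is available as one binary map per annotator (the function name and data layout are ours).

    import numpy as np

    def consensus_ground_truth(annotator_maps, min_votes=3):
        """Binarize multi-annotator edge labels by consensus: a pixel gets a
        positive edge label only if at least `min_votes` annotators marked
        it; all other pixels are treated as negatives.
        annotator_maps: list of (H, W) binary arrays, one per annotator."""
        votes = np.sum([m.astype(np.int32) for m in annotator_maps], axis=0)
        return (votes >= min_votes).astype(np.float32)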
Data augmentation. Data augmentation has proven to be a crucial technique in deep networks. We rotate the images to 16 different angles and crop the largest rectangle in the rotated image; we also flip the image at each angle, leading to an augmented training set that is a factor of 32 larger than the unaugmented set. During testing we operate on an input image at its original size. We also note that "ensemble testing" (making predictions on rotated/flipped images and averaging the predictions) yields no improvement in F-score, nor in average precision.
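As a rough sketch of this augmentation pipeline (the paper does not publish the cropping code; the inscribed-rectangle geometry below is a standard construction, and the helper names are ours): rotate to each of 16 angles, crop the largest axis-aligned rectangle that fits inside the rotated frame, and mirror each crop, for a 32x larger set.

    import math
    from PIL import Image

    def largest_rotated_rect(w, h, angle_rad):
        """Width/height of the largest axis-aligned rectangle inside a
        w x h rectangle rotated by angle_rad (standard geometric result)."""
        width_is_longer = w >= h
        long_side, short_side = (w, h) if width_is_longer else (h, w)
        sin_a, cos_a = abs(math.sin(angle_rad)), abs(math.cos(angle_rad))
        if short_side <= 2.0 * sin_a * cos_a * long_side or abs(sin_a - cos_a) < 1e-10:
            # half-constrained case: two crop corners touch the longer side
            x = 0.5 * short_side
            wr, hr = (x / sin_a, x / cos_a) if width_is_longer else (x / cos_a, x / sin_a)
        else:
            cos_2a = cos_a * cos_a - sin_a * sin_a
            wr = (w * cos_a - h * sin_a) / cos_2a
            hr = (h * cos_a - w * sin_a) / cos_2a
        return wr, hr

    def augment(image, label, n_angles=16):
        """Yield 2 * n_angles (image, edge map) pairs: each rotation is
        cropped to its largest inscribed rectangle, then also mirrored."""
        w, h = image.size
        for k in range(n_angles):
            angle = 360.0 * k / n_angles
            rw, rh = largest_rotated_rect(w, h, math.radians(angle))
            img = image.rotate(angle, resample=Image.BILINEAR, expand=True)
            lab = label.rotate(angle, resample=Image.NEAREST, expand=True)
            cx, cy = img.size[0] / 2.0, img.size[1] / 2.0
            box = (int(cx - rw / 2), int(cy - rh / 2),
                   int(cx + rw / 2), int(cy + rh / 2))
            img, lab = img.crop(box), lab.crop(box)
            yield img, lab
            yield (img.transpose(Image.FLIP_LEFT_RIGHT),
                   lab.transpose(Image.FLIP_LEFT_RIGHT))

Note that the edge map is rotated with nearest-neighbor resampling so that its labels stay binary.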
Different pooling functions. Previous work [2] suggests that different pooling functions can have a major impact on edge detection results. We conduct a controlled experiment in which all pooling layers are replaced by average pooling. We find that using average pooling decreases the performance to ODS=.741.

In-network bilinear interpolation. Side-output prediction upsampling is implemented with in-network deconvolutional layers, similar to those in [26]. We fix all the deconvolutional layers to perform linear interpolation. Although it was pointed out in [26] that one can learn arbitrary interpolation functions, we find that learned deconvolutions provide no noticeable improvement in our experiments.
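Fixing a deconvolutional layer to perform bilinear interpolation amounts to initializing (and freezing) its kernel with bilinear weights. The sketch below follows the standard FCN-style recipe for building such a kernel; the helper name is ours.

    import numpy as np

    def bilinear_kernel(factor, channels=1):
        """Weights that make a transposed convolution (kernel size
        2*factor - factor % 2, stride `factor`) perform bilinear
        upsampling by `factor`; one independent filter per channel."""
        size = 2 * factor - factor % 2
        center = (size - 1) / 2.0 if size % 2 == 1 else factor - 0.5
        og = np.ogrid[:size, :size]
        filt = ((1 - abs(og[0] - center) / factor) *
                (1 - abs(og[1] - center) / factor))
        weights = np.zeros((channels, channels, size, size), dtype=np.float32)
        for c in range(channels):
            weights[c, c] = filt   # no cross-channel mixing
        return weights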
Running time. Training takes about 7 hours on a single NVIDIA K40 GPU. For a 320 × 480 image, it takes HED 400 ms to produce the final edge map (including the interface overhead), which is significantly faster than existing CNN-based methods [34, 2]. Some previous edge detectors also try to improve performance by the less desirable expedient of sacrificing efficiency (for example, by testing on input images at multiple scales and averaging the results).

4.2. BSDS500 dataset
We evaluate HED on the Berkeley Segmentation Dataset and Benchmark (BSDS 500) [1], which is composed of 200 training, 100 validation, and 200 testing images. Each image has manually annotated ground truth contours. Edge detection accuracy is evaluated using three standard measures: fixed contour threshold (ODS), per-image best threshold (OIS), and average precision (AP). We apply a standard non-maximal suppression technique to our edge maps to obtain thinned edges for evaluation. The results are shown in Figure 5 and Table 4.
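For readers unfamiliar with these measures, the sketch below shows how per-threshold match counts reduce to ODS and OIS F-scores. The correspondence between predicted and ground-truth edge pixels (with its spatial tolerance) is computed by the standard BSDS evaluation code and is assumed here as input; the function names and this aggregation layout are ours.

    import numpy as np

    def f_measure(p, r, eps=1e-12):
        return 2 * p * r / np.maximum(p + r, eps)

    def ods_ois(tp_pred, n_pred, tp_gt, n_gt, eps=1e-12):
        """ODS/OIS aggregation over per-image, per-threshold match counts.
        Each argument is an (n_images, n_thresholds) array: matched and
        total prediction pixels, matched and total ground-truth pixels."""
        # ODS: one contour threshold fixed over the whole dataset
        p_ds = tp_pred.sum(0) / np.maximum(n_pred.sum(0), eps)
        r_ds = tp_gt.sum(0) / np.maximum(n_gt.sum(0), eps)
        ods = f_measure(p_ds, r_ds).max()
        # OIS: the best threshold is chosen independently for each image
        p_im = tp_pred / np.maximum(n_pred, eps)
        r_im = tp_gt / np.maximum(n_gt, eps)
        best = f_measure(p_im, r_im).argmax(axis=1)
        rows = np.arange(tp_pred.shape[0])
        p_oi = tp_pred[rows, best].sum() / max(n_pred[rows, best].sum(), eps)
        r_oi = tp_gt[rows, best].sum() / max(n_gt[rows, best].sum(), eps)
        return ods, f_measure(p_oi, r_oi)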
Figure 5. Results on the BSDS500 dataset. Our proposed HED framework achieves the best result (ODS=.782). Compared to several recent CNN-based edge detectors, our approach is also orders of magnitude faster. See Table 4 for a detailed discussion. [Precision-recall plot; legend entries (each with its F-score): Human, HED (ours), DeepContour, CSCNN, DeepEdge, OEF, SE-multi-ucm, SE, SCG, Sketch Tokens, gPb-owt-ucm, ISCRA, Gb, Mean Shift, Normalized Cuts, Felz-Hutt, Canny.]
Table 3. Results of single and averaged side outputs of HED on the BSDS 500 dataset. Each individual side output contributes to the fused/averaged result. Note that the learned weighted fusion (Fusion-output) achieves the best F-score, while directly averaging all five layers (Average 1-5) produces better average precision. Merging those two readily available outputs further boosts the performance.

                 ODS   OIS   AP
Side-output 1   .595  .620  .582
Side-output 2   .697  .715  .673
Side-output 3   .738  .756  .717
Side-output 4   .740  .759  .672
Side-output 5   .606  .611  .429
Fusion-output   .782  .802  .787
Average 1-4     .760  .784  .800
Average 1-5     .774  .797  .822
Average 2-4     .766  .788  .798
Average 2-5     .777  .800  .814
Merged result   .782  .804  .833

Table 4. Results on BSDS500. ∗BSDS300 results; †GPU time.

                     ODS   OIS   AP    FPS
Human                .80   .80   -     -
Canny                .600  .640  .580  15
Felz-Hutt [9]        .610  .640  .560  10
BEL [5]              .660∗ -     -     1/10
gPb-owt-ucm [1]      .726  .757  .696  1/240
Sketch Tokens [24]   .727  .746  .780  1
SCG [31]             .739  .758  .773  1/280
SE-Var [6]           .746  .767  .803  2.5
OEF [13]             .749  .772  .817  -
DeepNets [21]        .738  .759  .758  1/5†
N4-Fields [10]       .753  .769  .784  1/6†
DeepEdge [2]         .753  .772  .807  1/10^3†
CSCNN [19]           .756  .775  .798  -
DeepContour [34]     .756  .773  .797  1/30†
HED (ours)           .782  .804  .833  2.5†, 1/12
Side outputs. To explicitly validate the side outputs, we summarize the results produced by the individual side outputs at different scales in Table 3, including different combinations of the multi-scale edge maps. We emphasize here that all the side-output predictions are obtained in one pass; this enables us to fully investigate different configurations of combining the outputs at no extra cost. There are several interesting observations from the results: for instance, combining predictions from multiple scales yields better performance; moreover, all of the side-output layers contribute to the performance gain, either in F-score or in average precision. To see this, note that in Table 3, side-output layers 1 and 5 (the lowest and highest layers) achieve similar, relatively low performance. One might expect these two side-output layers not to be useful in the averaged results. However, this turns out not to be the case — for example, Average 1-4 achieves ODS=.760, and incorporating side-output layer 5, the averaged prediction achieves ODS=.774. We find a similar phenomenon when considering other ranges. As mentioned above, the predictions obtained using different combination strategies are complementary, and a late merging of the averaged predictions with the learned fusion-layer predictions leads to the best result. Another observation is that, when compared to previous "non-deep" methods, the performance of all "deep" methods drops more in the high-recall regime. This might indicate that deep learned features are capable of (and favor) learning the global object boundary — thus many weak edges are omitted. HED is better than the other deep-learning-based methods in the high-recall regime because deep supervision helps us to take the low-level predictions into account.

Late merging to boost average precision. We find that the weighted-fusion layer output gives the best performance in F-score. However, the average precision degrades compared to directly averaging all the side outputs. This might be due to our focus on "global" object boundaries during fusion-layer weight learning. Taking advantage of the readily available side outputs in HED, we merge the fusion-layer output with the side outputs (at no extra cost) in order to compensate for the loss in average precision. This simple heuristic gives us the best performance across all measures that we report in Figure 5 and Table 4.
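The text leaves the exact merging rule unspecified; one natural reading, sketched below, is an equal-weight average of the two readily available outputs (the 0.5 weighting is our assumption).

    import numpy as np

    def merge_predictions(side_maps, fused_map):
        """Late merging: combine the mean of the upsampled side-output
        probability maps ("Average 1-5" in Table 3) with the learned
        weighted-fusion output.
        side_maps: list of (H, W) arrays; fused_map: (H, W) array."""
        average = np.mean(side_maps, axis=0)
        return 0.5 * (average + fused_map)   # equal weights: an assumption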
More training data. Deep models have significantly advanced results in a variety of computer vision applications, at least in part due to the availability of large training data. In edge detection, however, we are limited by the number of training images available in the existing benchmarks. Here we want to explore whether adding more training data will help further improve the results. To do this, we expand the training set by randomly sampling 100 images from the test set. We then evaluate the result on the remaining 100 test images. We report the averaged result over 5 such trials. We observe that by adding only 100 training images, performance improves from ODS=.782 to ODS=.797 (±.003), nearly touching the human benchmark. This shows a potentially promising direction for further enhancing HED by training it with a larger dataset.

4.3. NYUDv2 Dataset

The NYU Depth (NYUD) dataset [35] has 1449 RGB-D images. This dataset was used for edge detection in [31] and [11]. Here we use the setting described in [6] and evaluate HED on data processed by [11]. The NYUD dataset is split into 381 training, 414 validation, and 654 testing images. All images are made the same size, and we train our network on full-resolution images. As in [12, 6], during evaluation we increase the maximum tolerance allowed for correct matches of edge predictions to ground truth from .0075 to .011.

Table 5. Results on the NYUD dataset [35]. †GPU time.

                 ODS   OIS   AP    FPS
gPb-ucm          .632  .661  .562  1/360
Silberman [35]   .658  .661  -     <1/360
gPb+NG [11]      .687  .716  .629  1/375
SE [6]           .685  .699  .679  5
SE+NG+ [12]      .710  .723  .738  1/15
HED-RGB          .720  .734  .734  2.5†
HED-HHA          .682  .695  .702  2.5†
HED-RGB-HHA      .746  .761  .786  1†

Figure 6. Precision/recall curves on the NYUD dataset. Holistically-nested edge detection (HED) trained with RGB and HHA features achieves the best result (ODS=.746). See Table 5 for additional information. [Legend entries: HED (ours), SE+NG+, SE, gPb+NG, Silberman, gPb-owt-ucm.]

Depth information encoding. Following the success in [12] and [26], we leverage the depth information by utilizing HHA features, in which the depth information is embedded into three channels: horizontal disparity, height above ground, and angle of the local surface normal with the inferred direction of gravity. We use the same HED architecture and hyper-parameter settings as were used for BSDS 500. We train two different models in parallel, one on RGB images and another on HHA feature images, and report the results below. We directly average the RGB and HHA predictions to produce the final result, thereby leveraging the RGB-D information. We also tried other approaches to incorporating the depth information, for example, training on the raw depth channel, or concatenating the depth channel with the RGB channels before the first convolutional layer. None of these attempts yields notable improvement compared to the approach using HHA. The effectiveness of the HHA features shows that, although deep neural networks are capable of automatic feature learning, for depth data carefully hand-designed features are still necessary, especially when only limited training data is available.
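The RGB-D combination is a direct average of the two models' edge probability maps; a minimal sketch, with hypothetical model callables, follows.

    def rgbd_edges(rgb_input, hha_input, model_rgb, model_hha):
        """Average the edge probability maps of the RGB-trained and
        HHA-trained HED models, as described above. `model_rgb` and
        `model_hha` are assumed to map an input image to an (H, W)
        probability map."""
        return 0.5 * (model_rgb(rgb_input) + model_hha(hha_input))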
Table 5 and Figure 6 show the precision-recall evaluations of HED in comparison to other competing methods. Our network structures for training are kept the same as for BSDS. During testing we use the Average 2-4 prediction instead of the fusion-layer output, as it yields the best performance. We do not perform late merging, since combining two sources of edge map predictions (RGB and HHA) already gives good average precision. Note that the results achieved using the RGB modality only are already better than those of the previous approaches.

5. Conclusion

In this paper, we have developed a new convolutional-neural-network-based edge detection system that demonstrates state-of-the-art performance on natural images at a speed of practical relevance (e.g., 0.4 seconds using a GPU and 12 seconds using a CPU). Our algorithm builds on top of the ideas of fully convolutional neural networks and deeply-supervised nets. We also initialize our network structure and parameters by adopting a pre-trained trimmed VGGNet. Our method shows promising results in performing image-to-image learning by combining multi-scale and multi-level visual responses, even though explicit contextual and high-level information has not been enforced.

Acknowledgment. This work is supported by NSF IIS-1216528 (IIS-1360566), NSF award IIS-0844566 (IIS-1360568), and a Northrop Grumman Contextual Robotics grant. We gratefully thank Patrick Gallagher for helping improve this manuscript. We are grateful for the generous donation of the GPUs by NVIDIA.
References

[1] P. Arbelaez, M. Maire, C. Fowlkes, and J. Malik. Contour detection and hierarchical image segmentation. PAMI, 33(5):898-916, 2011.
[2] G. Bertasius, J. Shi, and L. Torresani. DeepEdge: A multi-scale bifurcated deep network for top-down contour detection. In CVPR, 2015.
[3] P. Buyssens, A. Elmoataz, and O. Lézoray. Multiscale convolutional neural networks for vision-based classification of cells. In ACCV, 2013.
[4] J. Canny. A computational approach to edge detection. PAMI, (6):679-698, 1986.
[5] P. Dollár, Z. Tu, and S. Belongie. Supervised learning of edges and object boundaries. In CVPR, 2006.
[6] P. Dollár and C. L. Zitnick. Fast edge detection using structured forests. PAMI, 2015.
[7] J. H. Elder and R. M. Goldberg. Ecological statistics of Gestalt laws for the perceptual organization of contours. Journal of Vision, 2(4):5, 2002.
[8] C. Farabet, C. Couprie, L. Najman, and Y. LeCun. Learning hierarchical features for scene labeling. PAMI, 2013.
[9] P. F. Felzenszwalb and D. P. Huttenlocher. Efficient graph-based image segmentation. IJCV, 59(2):167-181, 2004.
[10] Y. Ganin and V. Lempitsky. N4-Fields: Neural network nearest neighbor fields for image transforms. arXiv preprint arXiv:1406.6558, 2014.
[11] S. Gupta, P. Arbelaez, and J. Malik. Perceptual organization and recognition of indoor scenes from RGB-D images. In CVPR, 2013.
[12] S. Gupta, R. Girshick, P. Arbeláez, and J. Malik. Learning rich features from RGB-D images for object detection and segmentation. In ECCV, 2014.
[13] S. Hallman and C. C. Fowlkes. Oriented edge forests for boundary detection. arXiv preprint arXiv:1412.4181, 2014.
[14] B. Hariharan, P. Arbeláez, R. Girshick, and J. Malik. Hypercolumns for object segmentation and fine-grained localization. In CVPR, 2015.
[15] D. Hoiem, A. A. Efros, and M. Hebert. Putting objects in perspective. IJCV, 80(1):3-15, 2008.
[16] D. Hoiem, A. N. Stein, A. A. Efros, and M. Hebert. Recovering occlusion boundaries from a single image. In ICCV, 2007.
[17] X. Hou, A. Yuille, and C. Koch. Boundary detection benchmarking: Beyond F-measures. In CVPR, 2013.
[18] D. H. Hubel and T. N. Wiesel. Receptive fields, binocular interaction and functional architecture in the cat's visual cortex. The Journal of Physiology, 160(1):106-154, 1962.
[19] J.-J. Hwang and T.-L. Liu. Pixel-wise deep learning for contour detection. In ICLR, 2015.
[20] J. Kittler. On the accuracy of the Sobel edge detector. Image and Vision Computing, 1(1):37-42, 1983.
[21] J. J. Kivinen, C. K. Williams, and N. Heess. Visual boundary prediction: A deep neural prediction network and quality dissection. In AISTATS, 2014.
[22] S. Konishi, A. L. Yuille, J. M. Coughlan, and S. C. Zhu. Statistical edge detection: Learning and evaluating edge cues. PAMI, 25(1):57-74, 2003.
[23] C.-Y. Lee, S. Xie, P. Gallagher, Z. Zhang, and Z. Tu. Deeply-supervised nets. In AISTATS, 2015.
[24] J. J. Lim, C. L. Zitnick, and P. Dollár. Sketch tokens: A learned mid-level representation for contour and object detection. In CVPR, 2013.
[25] C. Liu, J. Yuen, and A. Torralba. Nonparametric scene parsing via label transfer. PAMI, 33(12):2368-2382, 2011.
[26] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015.
[27] D. Marr and E. Hildreth. Theory of edge detection. Proceedings of the Royal Society of London. Series B, Biological Sciences, 207(1167):187-217, 1980.
[28] D. R. Martin, C. C. Fowlkes, and J. Malik. Learning to detect natural image boundaries using local brightness, color, and texture cues. PAMI, 26(5):530-549, 2004.
[29] N. Neverova, C. Wolf, G. W. Taylor, and F. Nebout. Multi-scale deep learning for gesture detection and localization. In ECCV Workshops, 2014.
[30] X. Ren. Multi-scale improves boundary detection in natural images. In ECCV, 2008.
[31] X. Ren and L. Bo. Discriminatively trained sparse code gradients for contour detection. In NIPS, 2012.
[32] D. L. Ruderman and W. Bialek. Statistics of natural images: Scaling in the woods. Physical Review Letters, 73(6):814, 1994.
[33] P. Sermanet, S. Chintala, and Y. LeCun. Convolutional neural networks applied to house numbers digit classification. In ICPR, 2012.
[34] W. Shen, X. Wang, Y. Wang, X. Bai, and Z. Zhang. DeepContour: A deep convolutional feature learned by positive-sharing loss for contour detection. In CVPR, 2015.
[35] N. Silberman, D. Hoiem, P. Kohli, and R. Fergus. Indoor segmentation and support inference from RGBD images. In ECCV, 2012.
[36] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
[37] V. Torre and T. A. Poggio. On edge detection. PAMI, (2):147-163, 1986.
[38] Z. Tu. Auto-context and its application to high-level vision tasks. In CVPR, 2008.
[39] D. C. Van Essen and J. L. Gallant. Neural mechanisms of form and motion processing in the primate visual system. Neuron, 13(1):1-10, 1994.
[40] A. P. Witkin. Scale-space filtering: A new approach to multi-scale description. In ICASSP, 1984.
[41] A. L. Yuille and T. A. Poggio. Scaling theorems for zero crossings. PAMI, (1):15-25, 1986.