Predicting Images Using Convolutional Networks - Visual Scene Understanding With Pixel Maps
by
David Eigen
Doctor of Philosophy
May, 2015
——————————–
Rob Fergus
© David Eigen
I would like to thank my advisor Rob Fergus, whose optimistic enthusiasm and sense of inquiry helped guide and push me to complete the different works in this thesis, and a couple of others as well. He was always very available and I had many valuable discussions with him, sometimes at odd hours. I learned a lot about conducting research through working with him.
I would also like to thank all of my collaborators and coauthors, and the many students
and postdocs in the lab, especially: Ross Goroshin, Dilip Krishnan, Pierre Sermanet,
Christian Puhrsch, Li Wan, Nathan Silberman, Jason Rolfe, Matthew Zeiler, Jonathan
Tompson and Arthur Szlam. I learned much from our work together and numerous
inspiring conversations.
Thanks also to Yann LeCun and David Sontag for cultivating the lab on the 12th floor,
and the stimulating discussions and group meetings. I also thank Marc’Aurelio Ranzato
for hosting me at Google during my summer internship, and Ronan Collobert for being
on my dissertation committee.
Finally and most essentially, I am grateful for all the support from my parents, family, and friends.
Abstract
In the greater part of this thesis, we develop a set of convolutional networks that infer
predictions at each pixel of an input image. This is a common problem that arises in
many computer vision applications: For example, predicting a semantic label at each
pixel describes not only the image content, but also fine-grained locations and segmenta-
tions; at the same time, finding depth or surface normals provides 3D geometric relations between points. The second part of this thesis investigates convolutional models also in the contexts of architecture analysis and unsupervised feature learning.
In the first part, we develop a single multi-scale network architecture that can be applied to diverse vision problems using simple adaptations, and apply it to predict depth at each pixel, surface normals and semantic labels. Our model uses
a series of convolutional network stacks applied at progressively finer scales. The first
uses the entire image field of view to predict a spatially coarse set of feature maps based
on global relations; subsequent scales correct and refine the output, yielding a high
resolution prediction. We look exclusively at depth prediction first, then generalize our
method to multiple tasks. Our system achieves state-of-the-art results on all tasks we
investigate, and can match many image details without the need for superpixelation.
Leading to our multi-scale network, we also design a purely local convolutional network to remove dirt and raindrops present on a window surface, which learns to identify and remove the corruption while reconstructing the scene behind it. In addition, we describe a kNN-based semantic labeling system applied to superpixels, in which we learn weights for each example, and adapt the set of examples used for each query.
In the second part, we examine some of the factors involved in tuning a deep convolutional network, finding that network depth is most critical. We also develop a simple unsupervised model that learns features similar to sparse coding at a fraction of the cost, combine it with a local entropy objective, and describe a convolutional ZCA whitening method.
Contents

Acknowledgements
Abstract
List of Figures

1 Introduction

2 Background
2.2 Autoencoders

3.1 Introduction
3.2 Approach
3.4 Experiments
3.5 Discussion

4 Restoring An Image Taken Through a Window Covered with Dirt or Rain
4.1 Introduction
4.2 Approach
4.2.2 Training
4.3.1 Dirt
4.5 Experiments
4.5.1 Dirt
4.5.2 Rain
4.6 Discussion

5 Depth Map Prediction from a Single Image using a Multi-Scale Deep Network
5.1 Introduction
5.3 Approach
5.4 Experiments
5.4.2 KITTI
5.5 Results
5.5.2 KITTI
5.6 Discussion

6.1 Introduction
6.4 Tasks
6.4.1 Depth
6.5 Training
6.6.1 Depth
6.8 Discussion

7 Understanding Deep Architectures using a Recursive Convolutional Network
7.1 Introduction
7.3 Approach
7.4 Experiments

9 Conclusion

Bibliography

List of Figures

4.9 Smartphone Application Example
6.6 Example semantic labeling results for Pascal VOC 2011
8.8 NORB reconstructions

List of Tables
Chapter 1
Introduction
Computer vision systems may infer numerous types of estimates in order to relate an
input image to the scene it captures, to the world in which the scene is a part, and
to our own understanding of the world. For example, object recognition systems infer
objects present in a scene by predicting class labels (e.g. “bed”, “picture”), often along
with the objects’ locations, e.g. with a bitmask or bounding box. Such a detection
system is depicted in Fig. 1.1. These systems, however, scratch only the surface of
possible representations: beyond labels and bounding boxes, many other useful types of
understandings can be inferred as well: for example, geometric estimates such as world-space locations, depth maps or surface normals; per-pixel object labels that provide more detailed location information; object attributes that provide more fine-grained descriptions; or even decompositions of the image into the portions depicting the underlying scene and those caused by corruptions such as dirt or rain on a window.
While each of these problems might be tackled in different ways and have numerous
choices for output representation, this thesis focuses on a prevalent theme of inferring
2D pixel maps from a single input image. Pixel map prediction arises naturally for many
problems, several of which are depicted in Fig. 1.2. Finding semantic class labels for each
Figure 1.1: Object detection using bounding boxes. A ConvNet makes predictions at
multiple locations and scales, which are then merged. Figure reproduced from [113], to
which the author contributed.
Figure 1.2: Inferring pixel maps for depth, surface normals and semantic labels using a
convolutional network. Predictions made by the system described in Chapter 6.
pixel provides information both on which objects are in the scene (“what” is present)
and their image location and extent (“where” they are). Estimating the depth from
camera at each pixel provides a more 3D geometric understanding of the scene, as does
per-pixel estimation of surface normals; these may be useful for physical applications,
including robotics and 3D modeling. Likewise, denoising an image extracts the relevant low-level structures from the corrupted input, producing a clean version of the image as output.
The following chapters explore systems for each of these tasks in turn, and focus in particular on convolutional networks that are able to learn their own internal feature representations directly from the data. These systems have recently achieved large performance gains, most notably for object classification and detection tasks, and this thesis further extends their application to produce 2D outputs that align each pixel in the input with an inferred output value.
We start out by investigating a system to perform pixelwise semantic labeling in Chap-
ter 3, using superpixels and hand-crafted features. By learning weights for each feature descriptor of every database segment, we begin automatically tuning the relative importance for each feature and database point (although not yet the features themselves). This anticipates our later work in multilayer networks by learning how fixed features should be weighted and combined. In addition, we develop a descriptor for mid-level context information that we use to augment the number of rare class instances available at classification time. We also evaluate a rough use of global context by varying the number of best-matching database images used for k-NN queries for each image; the latter has a loose relationship to our later use of global context in multi-scale convolutional networks.
In Chapter 4, we use a ConvNet to remove dirt or raindrops present on a window surface in front of the camera, thus restoring the underlying scene. This network is trained end-to-end to produce clean output images, and is able to estimate low-level natural image structures using purely local fields of view. Here we also find a benefit to training convolutionally, i.e. through the final combination of strided local predictions using an averaging step, and analyze the source of this gain.
Chapter 5 then combines ConvNets at both global and local levels to infer depth maps.
In contrast to the denoising application, we find the global view is essential for depth
prediction, and show how we integrate this view along with the original image at a local
scale to produce detailed pixel-map outputs. The resulting system is simple, using a
sequence of two networks that first produces a coarse estimate of the depth, then refines it using finer local detail.
In Chapter 6 we extend the Multi-Scale ConvNet of the previous chapter to infer not
only depth, but also surface normals and semantic labels at each point, and implement
a third refinement scale to obtain higher resolution outputs. These provide both richer descriptions of each scene and improved performance on each of the tasks. We compare our semantic labeling system to the one we looked at originally in Chapter 3, and find that our newer method achieves substantially better accuracy, without the need for either hand-crafted feature descriptors or
superpixel preprocessing.
Following this, in Chapter 7 we look at some of the factors involved in tuning convolutional network architectures. Using a network whose weights are tied between layers, we find that adding layers alone can improve performance, and we measure the relative effects of the number of layers, feature maps and parameters; our results suggest that the numbers of layers and parameters are the most important factors.
We also look at three ideas using related convolutional models for unsupervised feature learning in Chapter 8. We first describe a simple feed-forward model able to learn features similar to sparse coding at a small fraction of the cost. We then explore an entropy objective that encourages feature map units to factorize into a few prototype templates with high activation, plus many deformation units that edit reconstruction details. We lastly describe a convolutional ZCA whitening method that can be applied to large images.
Chapter 2
Background
Convolutional networks have been used for many computer vision applications, starting
from their roots in digit classification [33, 77], to more recent systems for image classi-
fication and object detection [73, 113, 116, 125], as well as many others such as image
denoising [65], pose estimation [131, 94] and stereo depth [145, 87]. They have also been
applied with great success to speech recognition [95, 57] and natural language process-
ing [14, 15]. The success of these systems stems from their ability to learn both image
features, as well as the tiers of rules needed to combine them, all from the data itself. As
a consequence, they can leverage large amounts of data to maximize their effectiveness,
while still maintaining a compact size that can be deployed to future test cases.
A basic convolutional network, such as the one in Fig. 2.1, consists of applying multi-
ple layers of learned convolution kernels together with elementwise nonlinearities, often
intermixed with spatial pooling (subsampling). Each convolutional hidden layer is fed the output of the layer below:
Figure 2.1: A convolutional network for classification, from LeCun et al. [77].
$$h_0 = x, \qquad h_l = \mathrm{pool}_l\big(f(W_l \ast h_{l-1} + b_l)\big), \quad l = 1, \ldots, L$$

where x is the input image, h_l is the hidden layer activations at layer l, W_l and b_l are the learned convolution kernels and feature biases, f is an activation function applied to each unit (e.g. rectification/thresholding, or a sigmoid), and pool_l is the (optional) pooling at layer l. Common pooling schemes include max-, l2- or average-pooling [7, 67], or strided subsampling; pooling can serve to enforce local invariance to small feature movements, and/or increase the effective field of view of later layers while reducing computation.
For a classification model, one or more fully-connected layers are placed on top, followed by a softmax that produces a probability distribution over the classes:

$$y = \mathrm{softmax}(W_C h_L + b_C), \qquad \mathrm{softmax}(z)_i = \frac{e^{z_i}}{\sum_j e^{z_j}}$$
These models are commonly trained using backpropagation and stochastic gradient descent [77]. After defining the error to be optimized using a loss function, e.g. a cross-entropy classification loss, the gradient of the error with respect to all model parameters is computed with the chain rule and used to iteratively update the weights and biases.
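To make this concrete, the following sketch (my own illustration rather than anything from the thesis; a minimal PyTorch example with layer sizes loosely following Fig. 2.1) builds a small classification ConvNet and performs a single stochastic gradient descent step on a cross-entropy loss.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SmallConvNet(nn.Module):
    """LeNet-style classifier: two conv+pool stages followed by fully-connected layers."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 6, kernel_size=5)    # 32x32 input -> 28x28 feature maps
        self.conv2 = nn.Conv2d(6, 16, kernel_size=5)   # 14x14 -> 10x10
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, num_classes)

    def forward(self, x):
        x = F.max_pool2d(F.relu(self.conv1(x)), 2)     # pool to 14x14
        x = F.max_pool2d(F.relu(self.conv2(x)), 2)     # pool to 5x5
        x = x.flatten(1)
        x = F.relu(self.fc1(x))
        return self.fc2(x)                             # class scores (softmax folded into the loss)

model = SmallConvNet()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
images = torch.randn(8, 1, 32, 32)                     # a dummy mini-batch
labels = torch.randint(0, 10, (8,))
loss = F.cross_entropy(model(images), labels)          # softmax + cross-entropy loss
optimizer.zero_grad()
loss.backward()                                        # backpropagate gradients
optimizer.step()                                       # one SGD parameter update
```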
In the OverFeat detection system [113] (joint work between Pierre Sermanet, myself, Xiang Zhang, Michael Mathieu, Rob Fergus and Yann LeCun), we apply the fully-connected classification layers convolutionally over a larger image. This is equivalent to evaluating the classifier on many overlapping windows of the input image, but it applies all windows at once from the bottom up, thus sharing computation for the overlap regions. We simultaneously learn an object bounding box regressor, also applied at each input window, and then merge all the predictions together. In so doing, we use the multiple class/box predictions to vote on the final result, greatly increasing robustness. By additionally running the network at multiple scales, we can both better align different sized objects to the ConvNet windows and boost the number of prediction samples, improving accuracy even more. This system is depicted
in Fig. 1.1.
In any network, each convolution layer produces a set of output values at each spatial location (a “feature map”), using a weighted combination of the inputs from a local area. Sharing these local weights across spatial locations takes advantage of local correlation and long-range decorrelation present in images, particularly at lower levels. If a fully connected network were trained for the same task, with infinite diverse data, the vast majority of connections would end up being local, and with the same weights replicated among most spatial locations. Thus convolutions provide a double-win: They regularize by enforcing zeros for long-range connections that might not otherwise be learned from a smaller finite dataset, while at the same time greatly reducing the number of parameters and the amount of computation required.
Many kinds of regularization can also be used to improve generalization and make better
use of limited training data. One of the most effective is data augmentation: In the best
case, adding random perturbations of the provided data adds a large diverse set of new
samples; it can also encourage the model to be invariant to transformations not easily
encoded in the architecture, and derails its ability to memorize exact samples. Other
common techniques include several that average over injected training-time stochasticity,
such as Dropout [122] or DropConnect [139], l2 weight decay, and ensemble voting.
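As a small illustration of data augmentation (my own sketch with arbitrary parameter choices, not taken from the thesis), the transform pipeline below produces a different random crop, mirror and color perturbation of every training image each time it is sampled.

```python
from torchvision import transforms

# Random perturbations effectively enlarge the training set and discourage
# the model from memorizing exact samples.
train_transform = transforms.Compose([
    transforms.RandomCrop(224, padding=8),              # random translations
    transforms.RandomHorizontalFlip(),                  # mirror invariance
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.ToTensor(),
])
```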
Features learned by convolutional networks at lower layers have analogies in both biological and hand-engineered vision systems. In particular, the first convolutional layer in most networks tends to learn oriented Gabor-like filters and color contrasts, which are then pooled; the pooled features are comprised chiefly of aggregated quantized edge orientations. Such representations can explain many behaviors observed in early stages of visual processing.
In addition, computer vision features like HoG [29] or SIFT [84] perform many of the same
edge-summary operations: Each also identifies prevalent edge orientations, and combines
them over spatial areas using histograms. Systems such as spatial pyramid matching
[76] or deformable parts models [29] combine these further over larger areas; DPMs in
particular are in essence ConvNets with few layers and large offsets [138]. However, the performance of these hand-engineered pipelines is constrained by the many fixed choices made in their design. Moreover, while low-level features can be possible to intuit, it is often far from clear how to combine them. The need for choices that can limit a model's effectiveness only worsens at higher layers, since these are where the system must make progressively more abstract decisions. Convolutional networks instead use the training data to define both the low-level features themselves, as well as their combination. This thesis applies such networks to vision tasks that require the prediction of pixel maps, i.e. 2D arrays containing inferred values at each location of the input image.
2.2 Autoencoders
The use of neural networks to produce pixel maps has a close relation to autoencoders,
which also output an image using a neural network. However, instead of predicting a
pixel map of a different mode, the aim of an autoencoder is to reconstruct the original
input under constraints. These have been used in particular for learning initial hidden layer representations when pretraining supervised networks [59, 37]. They also have close relations to other unsupervised models, including sparse coding and restricted Boltzmann machines, which we review briefly below.
Figure 2.2: (a) A single-layer autoencoder. (b) A deep autoencoder formed by stacking multiple hidden layers.
At its most basic, an autoencoder reconstructs the input from a single hidden layer; multiple hidden layers may be stacked as well to form a deep autoencoder (see Fig. 2.2). In the case of a single layer, the output is formed using a linear combination of the hidden unit activations:

$$h_1 = f(W_1 x + b_1), \qquad \hat{x} = W_2 h_1 + b_2$$

Here, x and x̂ are the input and its reconstruction, h_1 is the vector of hidden layer activations, W_1 and W_2 are weight matrices, b_1 and b_2 the biases, and f is the activation function. The weights W_1 and W_2 are optimized according to an objective function that defines the desired relationship between the input x and its reconstruction x̂. For an autoencoder, a common choice is the squared reconstruction error:

$$L = \frac{1}{2} \sum_{i=1}^{N} \|x_i - \hat{x}_i\|^2$$
Note if f is the identity, x̂ is a linear projection of x, and the optimal solution for W2 W1
with respect to l2 error is a PCA projection to the first k principal components of the
data, where k is the size of h1 [3]. That is, W2 W1 = V V T where V are the first k PCA
directions of the training data, and the biases b1 and b2 are used to subtract and add
back the mean of the training data. If we also enforce W1 = W2T = W , then W = V T
up to isometry.
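For concreteness, here is a minimal sketch of the single-hidden-layer autoencoder above (my own PyTorch illustration with hypothetical input and hidden sizes), trained with the squared reconstruction error.

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    """x -> h1 = f(W1 x + b1) -> xhat = W2 h1 + b2, trained to reconstruct its input."""
    def __init__(self, dim_x=784, dim_h=64):
        super().__init__()
        self.encoder = nn.Linear(dim_x, dim_h)   # W1, b1
        self.decoder = nn.Linear(dim_h, dim_x)   # W2, b2

    def forward(self, x):
        h1 = torch.sigmoid(self.encoder(x))      # nonlinear activation f
        return self.decoder(h1)                  # linear reconstruction

model = Autoencoder()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
x = torch.rand(32, 784)                          # a dummy batch of inputs
loss = 0.5 * ((model(x) - x) ** 2).sum(dim=1).mean()   # squared reconstruction error
optimizer.zero_grad()
loss.backward()
optimizer.step()
```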
With a nonlinear choice of f, the solution will start to move away from PCA. Additional constraints or regularizations make the learned behavior more different yet. Common variations include dimension bottlenecks [59], sparsity-inducing constraints [100, 103, 37], and denoising [135, 136], which introduces random noise to the input and forces the network to learn to reconstruct the full original image from the corrupted data. To do so, it must learn correlations between parts of the input so that the missing regions can be restored.
Sparsity constraints in particular encourage the model to learn parsimonious representations, linking autoencoders with sparse coding. Sparse coding aims to find a hidden code that reconstructs the input, but with few active units; a prevalent form [93] minimizes

$$E(z) = \frac{1}{2} \|x - W^T z\|^2 + \lambda \|z\|_1$$

where z is the code vector (hidden layer), and W^T is a decoding dictionary. Iterative algorithms are often used to infer a code z given the input x [5]; however, sparse coding can also be approximated with learned feed-forward predictors of the code, e.g. PSD and LISTA [69, 45, 105]. Deconvolutional networks [147, 148] extend sparse coding to convolutional dictionaries applied over entire images.

Autoencoders are also closely related to restricted Boltzmann machines (RBMs), undirected models defined by an energy function over a weight matrix connecting visible units with hidden units. A simple instance is where visible and hidden units are Bernoulli random variables (i.e. can take on the discrete states 0 or 1); another common choice is to use Gaussian units for the visibles, in which case the energy is
$$E_{\mathrm{gbRBM}}(x, h) = -h^T W x - b_h^T h + \frac{1}{2} x^T x - b_x^T x, \qquad P_{\mathrm{gbRBM}}(x, h) = \frac{1}{Z} e^{-E_{\mathrm{gbRBM}}(x, h)}$$
Here, x is a vector of visible units (an image), h a vector of hidden units, bx and bh
are the visible and hidden biases, and W is the weight matrix connecting visibles and
hiddens.
The probability of an image x is obtained by marginalizing over h; this has the following associated energy (called “free energy”):

$$E_{\mathrm{gbRBM}}(x) = -\log \sum_{h \in \{0,1\}^k} \exp(-E_{\mathrm{gbRBM}}(x, h)) = \frac{1}{2} x^T x - b_x^T x - \sum_{i=1}^{k} \log\big(1 + \exp(W_i \cdot x + b_{h,i})\big)$$
Taking the derivative with respect to x and setting it to 0, we find that the energy has critical points at

$$x = \sum_{i=1}^{k} W_i^T \, \frac{\exp(W_i \cdot x + b_{h,i})}{1 + \exp(W_i \cdot x + b_{h,i})} + b_x = W^T \sigma(W x + b_h) + b_x$$
The right hand side is a feed-forward autoencoder with sigmoid hidden activation σ and
weights identical to the RBM; a fixed point of this autoencoder is a critical point of the
RBM energy. In addition, the derivative of the RBM energy is equal to the difference
between the input x and its autoencoder reconstruction (i.e. the reconstruction error).
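The connection between the free energy and the tied-weight sigmoid autoencoder can be verified numerically. The NumPy sketch below (my own check with randomly chosen parameters) compares the reconstruction error x − (W^T σ(Wx + b_h) + b_x) against a finite-difference gradient of the free energy.

```python
import numpy as np

rng = np.random.default_rng(0)
k, d = 8, 5                                  # hidden and visible sizes (arbitrary)
W = rng.normal(size=(k, d))
bh = rng.normal(size=k)
bx = rng.normal(size=d)
x = rng.normal(size=d)

sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

def free_energy(v):
    return 0.5 * v @ v - bx @ v - np.sum(np.logaddexp(0.0, W @ v + bh))

# Analytic gradient: input minus the tied sigmoid autoencoder reconstruction.
reconstruction = W.T @ sigmoid(W @ x + bh) + bx
grad_analytic = x - reconstruction

# Numerical gradient of the free energy, for comparison.
eps = 1e-6
grad_numeric = np.array([
    (free_energy(x + eps * np.eye(d)[i]) - free_energy(x - eps * np.eye(d)[i])) / (2 * eps)
    for i in range(d)
])
print(np.allclose(grad_analytic, grad_numeric, atol=1e-5))   # True
```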
Highly related in a more general setting, Alain et al. [1] show that any autoencoder
trained with a contractive or denoising method models the derivative of the log data
density, in the sense that the difference between the input and its reconstruction ap-
proaches the log density derivative. Although it is tempting to think of the autoencoder fixed points as energy minima, [1] note that some of these must be maxima or saddle points rather than minima.
Beyond applications to feature learning and density estimation, neural networks with
similar architectures are starting to be used for image prediction tasks, such as image
denoising [9, 10, 65, 143, 150], object detection [126] and semantic segmentation [16, 98].
This thesis explores this line of research further, applying image-generating convolutional networks to a broader set of per-pixel prediction tasks.
Chapter 3
The work presented in this chapter appeared in CVPR 2012 [22], and was a collaboration with Rob Fergus.
3.1 Introduction
In this chapter we describe a semantic labeling system based on k-nearest-neighbor (kNN) lookups over superpixel segments. Such a system provides an effective brute-force mechanism for using a database of labeled example regions, and serves as an essential point of comparison for our later convolutional network system in Chapter 6.
While simple, such kNN systems make only limited use of the data available. Features
must be hand-tuned, and feature sets and data both must be carefully calibrated so
that the different sources contribute relatively similar amounts and no single source
dominates. Furthermore, in the semantic labeling task there may be a potentially large
number of different label classes, stemming from the high diversity of the visual world,
and the distribution of classes is often highly uneven (see Fig. 3.8). kNN classifiers
present a trade-off here: They naturally handle large label sets, since one need only
consider the labels of those points retrieved during test queries; on the other hand, rare
class examples are hard to find and underrepresented in query results. Consequently,
many classes will have a small number of example instances even using a large dataset, making them particularly difficult to classify.
Starting from the kNN “superparsing” method of Tighe and Lazebnik [128] as a baseline, we make two main contributions:
1. In an off-line training phase, we learn a set of weights for each descriptor type of every segment in the training set. The weights are trained to minimize classification error in a leave-one-out sense over the training set. This has the effect of introducing a distance metric that varies throughout the descriptor space, rather than being uniform as in the baseline above. It also allows us to discard outlier descriptors that would otherwise hurt classification.
2. At query time, we adapt the set of training examples used for classification based on context from the query image. We first remove segments based on a global context match. Crucially, we then add back previously discarded segments from rare classes. Here we use the local context of segments to look up rare class examples from the training set. This boosts the representation of rare classes within the kNN sets, giving a more even class distribution that improves classification accuracy.
The overall theme of these methods is the customization of the dataset for each particular query.
In addition to Tighe and Lazebnik [128, 129], other related non-parametric approaches
to recognition include: the SIFT-Flow scene parsing method of Liu et al. [80, 81]; scene
classification using Tiny Images by Torralba et al. [132] and the Naive-Bayes NN approach from Boiman et al. [6]. However, none of these involve re-weighting of the data or otherwise adapting the dataset used for each query. Our weight learning is related to Neighbourhood Components Analysis (NCA) [38]. NCA also learns a distance metric for kNN classification using leave-one-out training, but it learns a single global linear transformation matrix applied to all feature descriptors. By contrast, we find neighbors using unmodified descriptors, then tune the weights of each to influence the class predictions, effectively learning a metric that varies according to the local region. It is possible that the two approaches may be combined, however we did not explore that in this work.
Our re-weighting approach has interesting similarities to Frome et al. [32] (and related
work from Malisiewicz & Efros [85, 86]). Motivated by the inadequacies of a single global
distance metric, they use a different metric for each exemplar in their training set, which
is integrated into an SVM framework. The main drawback to this is that the evaluation
of a query is slow (∼minutes/image). The weights learned by our scheme are equivalent
to a local modulation of the distance metric, with a large weight effectively moving the point closer to prospective queries.
The re-weighting scheme we propose also has connections to a traditional machine learn-
ing approach called editing [20, 71]. In editing, individual points in the dataset may be modified or removed; however, most schemes are binary in that they either keep or completely remove each training point. Of this family, the most similar to ours is Paredes and Vidal [96],
who also use real-valued weights on the points. However, their approach does not han-
dle multiple descriptor types and is demonstrated on a range of small text classification
datasets.
There is extensive work on using context to help recognition [61, 92, 133, 134]; the
most relevant approaches being those of Gould et al. [41, 42] and in particular Heitz &
Koller [55] who use “stuff” to help find “things.” Heitz et al. [54] use similar ideas in a
sophisticated graphical model that reasons about objects, regions and geometry. These
works have similar goals regarding the use of context but quite different methods. Our
approach is simpler, relying on NN lookups and standard gradient descent for learning
the weights.
Our work also has similar goals to multiple kernel learning approaches (e.g. [35]) which
combine weighted feature kernels, but the underlying mechanisms are quite different: we
do not use SVMs, and our weights are per-descriptor. By contrast, the weights used in
these methods are constant across all descriptors of a given type. Finally, Boosting [109] also weights individual training examples, though with the aim of combining weak learners rather than adapting a nearest-neighbor dataset.
3.2 Approach
Our approach builds on the nearest-neighbor voting framework of Tighe and Lazebnik
[128] and uses three distinct stages to classify image segments: (i) global context selec-
tion; (ii) learning descriptor weights; (iii) adding local context segments. Stages (i) and
(ii) are used in off-line training, while (i) and (iii) are used during evaluation. While
stage (i) is adopted from [128], the other two stages are novel and the main focus of our
paper.
We seek to classify each super-pixel segment of a query image into one of C classes. The training dataset T consists of super-pixel segments s, taken from images I_1 to I_M. The true class c*_s for each segment in T is known. Each segment is represented by the same set of descriptors used in [128]. These include quantized SIFT, color, position, shape and area features. Additionally, each image I_m has a set of global context descriptors {g_m} that capture the content of the entire image; these are computed in advance and stored in kd-trees for efficient retrieval.
3.2.1 Global Context Selection
In this stage, we use overall scene appearance to remove descriptors from scenes bearing
little resemblance to the query. For example, the segments taken from a street scene are
likely to be distractors when trying to parse a mountain scene. Thus their removal is
expected to improve performance. A secondary benefit is that the subsequent two stages
need only consider a small subset of the training dataset T , which gives a considerable
For each query Q we compute global context descriptors {g_q}, which consist of four types: (i) a spatial pyramid of vector quantized SIFT [76]; (ii) a color histogram spatial pyramid; and (iii) Gist computed with two different parameter settings [92]. For each of the types, we find the nearest neighbors amongst the training set {g_m}. The ranks across the four types of context descriptor are averaged to give an overall ranking. We then form a global match set consisting of all segments belonging to the top v images from our image-level ranking. We denote the global match set G = GlobalMatches(Q, v); the effect of the parameter v is explored in Section 3.4.
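As an illustration of this selection step, the sketch below (a simplified assumption of the procedure, not the original code) averages the per-descriptor-type ranks of the training images and returns the indices of the top-v global context matches.

```python
import numpy as np

def global_matches(dists_by_type, v):
    """Average per-descriptor-type ranks over training images; return the top-v image indices.

    dists_by_type: list of arrays of shape (num_train_images,), each giving the distance
    from one type of query context descriptor to the corresponding training descriptors.
    """
    ranks = []
    for dists in dists_by_type:
        order = np.argsort(dists)                 # nearest training image first
        rank = np.empty_like(order)
        rank[order] = np.arange(len(dists))       # rank of each training image for this type
        ranks.append(rank)
    mean_rank = np.mean(ranks, axis=0)            # average rank across descriptor types
    return np.argsort(mean_rank)[:v]              # indices of the top-v context images

# Hypothetical query: four context descriptor types over 1000 training images, v = 200.
rng = np.random.default_rng(0)
top_images = global_matches([rng.random(1000) for _ in range(4)], v=200)
```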
3.2.2 Learning Weights
To learn the weights, we adopt a leave-one-out strategy, using each segment s (from image I_m) in the training dataset T as a probe segment (a pretend query). The weights of the neighbors of s are then adjusted to increase the probability of correctly predicting the class of s.
For a query segment s, we first compute the global match set Gs = GlobalMatches(Im , v).
Let the set of descriptors of s be Ds . Following [128], the predicted class ĉ for each seg-
ment is the one that maximizes the ratio of posterior probabilities P (c|Ds )/P (c̄|Ds ).
After the application of Bayes' rule using a uniform class prior¹ and making a naive-Bayes assumption for combining descriptor types, this is equivalent to maximizing the likelihood ratio given in Eqn. 3.1 below.

¹ Using the true, highly skewed class distribution P(c)/P(c̄) dramatically impairs performance for rare classes.
(a) (b)
(c) (d)
Figure 3.1: Toy example of our re-weighting scheme. (a): Initially all descriptors have
uniform weight. (b), (c) & (d): a probe point is chosen (cross) and points in the neighbor-
hood (black circle) of the same class as the probe have their weights increased. Points
of a different class have their weights decreased, so rejecting outlier points. In prac-
tice, (i) there are multiple descriptor spaces, one for each descriptor type and (ii) the
GlobalMatch operation removes some of the descriptors.
$$\hat{c} = \arg\max_c L(s, c) = \arg\max_c \prod_{d \in D_s} \frac{P(d \mid c)}{P(d \mid \bar{c})} \qquad (3.1)$$

The probabilities P(d|c) and P(d|c̄) are computed using nearest-neighbor lookups in the space of the descriptor type of d, over all segments in the global match set G. In particular,

$$P(d \mid c) \propto p_d(c) = \frac{n^N_d(c)}{n_d(c)}, \qquad P(d \mid \bar{c}) \propto \bar{p}_d(c) = \frac{\bar{n}^N_d(c)}{\bar{n}_d(c)}$$

where n^N_d(c) is the number of points of class c in the nearest neighbor set N of d, determined by taking the closest k neighbors of d,² and n_d(c) is the total number of points in class c. Likewise, n̄^N_d(c) is the number of points not of class c in the nearest neighbor set N of d (i.e. Σ_{c'≠c} n^N_d(c')), and similarly for n̄_d(c). Conceptually, both n^N_d(c) and n_d(c) should be computed over the match set G; in practice, this sample may be small enough that using G just for n^N_d(c) and estimating n_d(c) over the entire training database T can reduce noise.

² We also include all points at zero distance from d, so n^N_d(c) is occasionally larger than k.
To eliminate zeros in P(d|c̄), we smooth the above probabilities using a smoothing factor t:

$$q_d(c) = \big(n^N_d(c) + \bar{n}^N_d(c)\big)^2 \cdot p_d(c) + t, \qquad \bar{q}_d(c) = \big(n^N_d(c) + \bar{n}^N_d(c)\big)^2 \cdot \bar{p}_d(c) + t \qquad (3.2)$$

$$L_d(c) = \frac{q_d(c)}{\bar{q}_d(c)}$$
We now introduce weights w_di for each descriptor d of each segment i. This changes the definitions of n_d and n^N_d:

$$n_d(c) = \sum_{i \in T} w_{di}\, \delta(c^*_i, c) = W^T \Delta, \qquad n^N_d(c) = \sum_{i \in N} w_{di}\, \delta(c^*_i, c) = W^T \Delta^N$$

where c*_i is the true class of point i and T is the training set. Note that when using only the match set G to estimate n_d(c), the sum over T need only be performed over G. In matrix form, W is the vector of weights w_di, and ∆ is the |T| × |C| class indicator matrix whose ci-th entry is δ(c_i, c). For neighbor counts, ∆^N is the restriction of ∆ to the rows in the neighbor set N (with zeros elsewhere). The complement counts are defined analogously:

$$\bar{n}_d(c) = \sum_{i \in T} w_{di}\, \delta(c^*_i, \bar{c}) = W^T \bar{\Delta}, \qquad \bar{n}^N_d(c) = \sum_{i \in N} w_{di}\, \delta(c^*_i, \bar{c}) = W^T \bar{\Delta}^N$$
The weights are trained to minimize a classification loss summed over all probe segments:

$$J(W) = \sum_{s \in T} J_s(W) = \sum_{s \in T} \Big[ -\log L(s, c^*) + \log \sum_{c \in C} L(s, c) \Big] = \sum_{s \in T} \Big[ -\sum_{d \in D_s} \log L_d(c^*) + \log \sum_{c \in C} \prod_{d \in D_s} L_d(c) \Big]$$
The derivatives with respect to W are back-propagated through the nearest neighbor probability calculations using five chain rule steps, applied to the vector of weights W_d (the weights associated with descriptor type d):

Step 1:
$$\frac{\partial n_d}{\partial W_d} = \Delta, \quad \frac{\partial n^N_d}{\partial W_d} = \Delta^N, \quad \frac{\partial \bar{n}_d}{\partial W_d} = \bar{\Delta}, \quad \frac{\partial \bar{n}^N_d}{\partial W_d} = \bar{\Delta}^N$$

Step 2:
$$\frac{\partial p_d}{\partial W_d} = \big(\Delta^N - p_d \cdot \Delta\big)/n_d, \qquad \frac{\partial \bar{p}_d}{\partial W_d} = \big(\bar{\Delta}^N - \bar{p}_d \cdot \bar{\Delta}\big)/\bar{n}_d$$

Step 3:
$$\frac{\partial q_d}{\partial W_d} = 2\,(n^N_d + \bar{n}^N_d)\cdot p_d \cdot \mathbf{1} + (n^N_d + \bar{n}^N_d)^2 \cdot \frac{\partial p_d}{\partial W_d}, \qquad \frac{\partial \bar{q}_d}{\partial W_d} = 2\,(n^N_d + \bar{n}^N_d)\cdot \bar{p}_d \cdot \mathbf{1} + (n^N_d + \bar{n}^N_d)^2 \cdot \frac{\partial \bar{p}_d}{\partial W_d}$$

Step 4:
$$\frac{\partial \log L_d}{\partial W_d} = \frac{1}{q_d}\frac{\partial q_d}{\partial W_d} - \frac{1}{\bar{q}_d}\frac{\partial \bar{q}_d}{\partial W_d}$$

Step 5:
$$\frac{\partial J_s}{\partial W_d} = -\frac{\partial \log L_d}{\partial W_d}(c^*) + \frac{1}{\sum_c L(c)} \sum_c L(c) \cdot \frac{\partial \log L_d(c)}{\partial W_d}$$

After computing these gradients for a probe segment, the weight matrix is updated using gradient descent:

$$W \leftarrow W - \eta \frac{\partial J_s}{\partial W}$$
where η is the learning rate parameter. In addition, we enforce positivity and upper bound constraints on each weight, so that 0 ≤ w_di ≤ 1 for all d, i. We initialize all weights to 0.5 (see Algorithm 1). Some care is needed to deploy this on large datasets: although the time to compute a single gradient step is O(|T||C|), we found that fixing n_d and n̄_d to their values under the initial weights yields good performance, and limits the time for each step to O(|G||C|).
Aside from smoothing the NN probabilities, the smoothing parameter t also modulates
Ld (c) as a function of nd (c), the number of descriptors of each class. As such, it gives a
natural way to bias the algorithm toward common classes or toward rare ones.
To see this, note that n^N_d(c) + n̄^N_d(c) = k, the number of retrieved neighbors. This lets us rearrange L_d(c) to obtain (omitting d for brevity and defining u = t/k²):

$$L(c) = \frac{n^N(c)\,\bar{n}(c) + u \cdot n(c)\,\bar{n}(c)}{\bar{n}^N(c)\,n(c) + u \cdot n(c)\,\bar{n}(c)}$$
Note that n(c)n̄(c) depends only on the frequency of class c in the dataset, not on the
NN lookup. The influence of t therefore becomes larger for progressively more com-
mon classes. So by increasing t we bias the algorithm toward rare classes, an effect we explore in the experiments of Section 3.4.
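The NumPy sketch below (an illustrative reimplementation, not the original code) computes the smoothed, weighted likelihood ratio L_d(c) for a single descriptor from its retrieved neighbors; a complete classifier multiplies these ratios over all descriptors of a segment, as in Eqn. 3.1.

```python
import numpy as np

def descriptor_likelihood_ratio(neighbor_classes, neighbor_weights, class_counts, t):
    """Smoothed likelihood ratio L_d(c) for one descriptor d of a query segment.

    neighbor_classes: (k,) class index of each retrieved neighbor of d
    neighbor_weights: (k,) learned weight w_di of each neighbor
    class_counts:     (C,) weighted count n_d(c) of each class over the database
    """
    C = len(class_counts)
    nN = np.zeros(C)
    for c, w in zip(neighbor_classes, neighbor_weights):
        nN[c] += w                                # weighted neighbors of class c
    total = neighbor_weights.sum()                # n^N_d(c) + n̄^N_d(c), the same for every c
    nN_bar = total - nN                           # weighted neighbors not of class c
    n = class_counts
    n_bar = class_counts.sum() - class_counts

    p = nN / n                                    # proportional to P(d|c)
    p_bar = nN_bar / n_bar                        # proportional to P(d|c̄)
    q = total ** 2 * p + t                        # smoothed numerator, Eqn. 3.2
    q_bar = total ** 2 * p_bar + t
    return q / q_bar                              # L_d(c)

# Hypothetical query descriptor with k = 10 neighbors over 33 classes, smoothing t = 1.
rng = np.random.default_rng(0)
scores = descriptor_likelihood_ratio(rng.integers(0, 33, 10), rng.random(10),
                                     rng.integers(50, 2000, 33).astype(float), t=1.0)
predicted_class = int(np.argmax(scores))
```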
3.2.3 Adding Segments
The global context selection procedure discards a large fraction of segments from the
training set T , leaving a significantly smaller match set G. This restriction means that
rare classes may have very few examples in G — and sometimes none at all. Conse-
quently, (i) the sample resolution of rare classes is too small to accurately represent their
density, and (ii) for NN classifiers that use only a single lookup among points of all
classes (as ours does), common points may fill a search window before any rare ones are
reached. We seek to remedy this by explicitly adding more segments of rare classes back
into G.
To decide which points to add, we index rare classes using a descriptor based on semantic
context. Since the classifier is already fairly accurate at common background classes, we
can use its existing output to find probable background labels around a given segment.
The context descriptor of a segment is the normalized histogram of class labels in the
50 pixel dilated region around it (excluding the segment region itself). See Fig. 3.2(a) & (b). We compute these context descriptors over the training set, and index each super-pixel whose class occurs below a threshold of r times in its image's match set G. In this way, the definition of a rare class adapts naturally to the content of each match set.
When classifying a test image, we first classify the image without any extra segments.
These labels are used to generate the context descriptors as described above. For each
super-pixel, we look up the nearest r points in the rare segments index, and add these to
the set of points G used to classify that super-pixel. See Algorithm 2 for more details.
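A context descriptor of this kind can be computed as in the sketch below (my own approximation using SciPy morphology; the 50-pixel dilation is implemented here as an iterated binary dilation).

```python
import numpy as np
from scipy import ndimage

def context_descriptor(label_map, segment_mask, num_classes, dilation=50):
    """Normalized histogram of class labels in a dilated ring around a segment.

    label_map:    (H, W) integer array of predicted class labels for the image
    segment_mask: (H, W) boolean mask of the super-pixel of interest
    """
    ring = ndimage.binary_dilation(segment_mask, iterations=dilation) & ~segment_mask
    hist = np.bincount(label_map[ring], minlength=num_classes).astype(float)
    return hist / max(hist.sum(), 1.0)            # normalize to sum to one

# Hypothetical usage: a 100x100 initial labeling over 33 classes.
rng = np.random.default_rng(0)
labels = rng.integers(0, 33, (100, 100))
mask = np.zeros((100, 100), dtype=bool)
mask[40:60, 40:60] = True
descriptor = context_descriptor(labels, mask, num_classes=33, dilation=10)
```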
Figure 3.2: Context-based addition of segments to the global match set G. (a): Segment
in the query image, surrounded by an initial label map. (b): Histogram of class labels,
built by dilating the segment over the label map, which captures the semantic context of
the region. This is matched with histograms built in the same manner from the training
set T . (c): Segments in T with a similar surrounding class distribution are added to G.
The overall training procedure is summarized in Algorithm 1. We first learn the weights
for each segment/descriptor, before building the context index that will be used to add
segments at test time. Note that we do not rely on ground truth labels for constructing
this index, since not all segments in T are necessarily labeled. Instead, we use the
predictions from our weighted NN classifier. NN algorithms work better with more data,
so to boost performance we make a horizontally flipped copy of each training image and add it to the training set.
At test time, classification proceeds in two passes. The first uses the weighted NN scheme to give an initial label set for the query image. This initial labeling provides the context used to add more segments from rare classes. We then run a second weighted classification using this augmented match set to produce the final labeling.
Algorithm 1 Training Procedure
 1: procedure LearnWeights(T)
 2:   Parameters: v, k
 3:   Initialize W_di = 0.5 for all d, i
 4:   for all segments s ∈ T do
 5:     G = GlobalMatches(I_m, v)
 6:     NN-lookup to obtain ∆^N, ∆̄^N
 7:     Compute ∂J_s/∂W_d
 8:     W_d ← W_d − η ∂J_s/∂W_d
 9:   end for
10: end procedure
3.4 Experiments
We evaluate our approach on two datasets: (i) Stanford background [41] (572/143 training/test images, 8 classes) and (ii) the larger SIFT-Flow [80] dataset (2488/200 training/test images, 33 classes).
In evaluating scene parsing algorithms there are two metrics that are commonly used:
per-pixel classification rate and per-class classification rate. If the class distribution were
uniform then the two would be the same, but this is not the case for real-world scenes.
A problem with optimizing pixel error alone is that rare classes are ignored since they
occupy only a few percent of image pixels. Consequently, the mean class error is a
more useful metric for applications that require performance on all classes, not just the
common ones. Our algorithm is able to smoothly trade off between the two performance
measures by varying the smoothing parameter t at evaluation time. Using a 2D plot for the pair of metrics, the curve produced by varying t gives a fuller picture of performance than a single operating point.
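For reference, the two metrics can be computed from predicted and ground-truth label maps as in the short sketch below (my own illustration, not the original evaluation code).

```python
import numpy as np

def pixel_and_class_accuracy(pred, gt, num_classes):
    """Return (per-pixel accuracy, mean per-class accuracy) for two label maps."""
    correct = (pred == gt)
    per_pixel = correct.mean()
    per_class = [correct[gt == c].mean() for c in range(num_classes) if (gt == c).any()]
    return float(per_pixel), float(np.mean(per_class))

# Hypothetical 100x100 labeling over 8 classes, correct on roughly 70% of pixels.
rng = np.random.default_rng(0)
gt = rng.integers(0, 8, (100, 100))
pred = np.where(rng.random((100, 100)) < 0.7, gt, rng.integers(0, 8, (100, 100)))
print(pixel_and_class_accuracy(pred, gt, 8))
```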
Our baseline is the system described in Section 3.2, but with no image flips, no learned
weights (i.e. they are uniform) and no added segments. It is essentially the same as
the system of Tighe and Lazebnik [128], but with a slightly different smoothing of the NN counts.
Our method relies on the same set of 19 super-pixel descriptors used by [128]. As other
authors do, we compare the performance without an additional CRF layer so that any
differences in local classification performance can be seen clearly. Our algorithm uses
the following parameters for all experiments (unless otherwise stated): v = 200, k = 10,
r = 200.
Fig. 3.3 shows the performance curve of our algorithm on the Stanford Background
dataset, along with the baseline system. Also shown is the result from Gould et al. [41],
but since they do not measure per-class performance, we show an estimated range on
the x-axis. While we convincingly beat the baseline and do better than Gould et al.³, our best per-pixel performance of 75.3% falls short of the current state-of-the-art on the
dataset, 78.1% by Socher et al. [121]. The small size of the training set is problematic
for our algorithm, since it relies on good density estimates from the NN lookup. Indeed,
the limited size of the dataset means that the global match set is most of the dataset
(i.e. |G| is close to |T |), so the global context stage is not effective. Furthermore, since
there are only 8 classes, adding segments using contextual cues gave no performance gain
either. We therefore focus on the SIFT-Flow dataset which is larger and better suited
to our algorithm.
Figure 3.3: Evaluation of our algorithm on the Stanford background dataset, using
local labeling only. x-axis is mean per-class classification rate, y-axis is mean per-pixel
classification rate. Better performance corresponds to the top right corner. Black = Our
version of [128]; Red = Our algorithm (without added segments step); Blue = Gould
et al. [41] (estimated range).
The results of our algorithm on the SIFT-Flow dataset are shown in Fig. 3.4, where
we compare to other approaches using local labeling only. Both the trained weights and
adding segments procedures give a significant jump in performance. The latter procedure only gives a per-class improvement, consistent with its goal of helping the rare classes.

³ Assuming a per-class performance consistent with their per-pixel performance.
To the best of our knowledge, Tighe and Lazebnik [128] is the current state-of-the-art
method on this dataset (Fig. 3.4, black square). For local labeling, our overall system
outperforms their approach by 10.1% (29.1% vs 39.2%) in per-class accuracy, for the
same per-pixel performance, a 35% relative improvement. The gain in per-pixel accuracy at a matched per-class rate is also substantial.
Adding an MRF to our approach (Fig. 3.4, cyan curve) gives 77.1% per-pixel and 32.5%
per-class accuracy, outperforming the best published result of Tighe and Lazebnik [128]
(76.9% per-pixel and 29.4% per-class ). Note that their result uses geometric features not
used by our approach. Adding an MRF to our implementation of their system gives a
small improvement over the baseline which is significantly outperformed by our approach
+ an MRF.
Figure 3.4: Evaluation of our algorithm on the SIFT-Flow dataset. Better performance is
in the top right corner. Our implementation of [128] (black + curve) closely matches their
published result (black square). Adding flipped versions of the images to the training
set improves the baseline a small amount (blue). A more significant gain is seen after training the NN weights (green). Refining our classification after adding segments
(red) gives a further gain in per-class performance. Adding an MRF (cyan) also gives
further gain. Also shown is Liu et al. [80] (magenta). Not shown is Shotton et al. [114]:
0.13 class, 0.52 pixel.
Sample images classified by our algorithm are shown in Fig. 3.9. We also demonstrate the significance of our results by re-running our methods on a different train/test split of the SIFT-Flow dataset. The results obtained are very similar to those on the original split, and are shown in Fig. 3.5.
Figure 3.5: Results for a different train/test split of the SIFT-Flow dataset than the standard one used in Fig. 3.4. Similar results are obtained on both test sets.
In Fig. 3.6, we explore the role of the global context selection by varying the number
of image-level matches, controlled by the v parameter which dictates |G|. For small v, too few relevant segments are available and performance suffers on both metrics. But if v is too large, G contains many unrelated descriptors and the per-class performance is decreased. This demonstrates the value of the global context selection stage.
In Fig. 3.7 we visualize the descriptor weights, showing how they vary across class and
descriptor type (by averaging them over all instances of each class, since they differ for
each segment). Note how the weights jointly vary across both class and descriptor. For
example, the min height descriptor usually has high weight, except for some spatially variable classes.
Fig. 3.8 shows the expected class distribution of super-pixels in G for the SIFT-Flow dataset before and after the adding segments procedure, demonstrating its efficacy.
Figure 3.6: The global context selection procedure. Changing the parameter v (value at
each magenta dot) affects both types of error. See text for details. For comparison, the
baseline approach using a fixed v = 200 (and varying the smoothing t) is shown.
Figure 3.7: A visualization of the mean weight for different classes by descriptor type.
Red/Blue corresponds to high/low weights. See text for details.
The increase in rare segments is important in improving per-class accuracy (see Fig. 3.4).
Table 3.1: Timing breakdown (seconds) for the evaluation of a single query image using
the full system and our system without adding segments (just global context match +
learning weights). Note the descriptor computation makes up around half of the time.
In Table 3.1, we list the timings for each stage of our algorithm running on the SIFT-Flow dataset, implemented in Matlab.
Figure 3.8: Expected number of super-pixels in G with the same true class c∗s of a query
segment, ordered by frequency (blue). Note the power-law distribution of frequencies,
with many classes having fewer than 50 counts. Following the Adding Segments proce-
dure, counts of rare classes are significantly boosted while those for common classes are
unaltered (red). Queries were performed using the SIFT-Flow dataset.
Note that a substantial fraction of the time is taken up just by descriptor computation. The search parts of our algorithm run in considerably less time than methods that use per-exemplar distance measures (e.g. Frome et al. [32], which takes on the order of minutes per image).
3.5 Discussion
In this chapter we have described two mechanisms for enhancing the performance of
non-parametric scene parsing based on kNN methods. Both share the underlying idea
of customizing the dataset for each kNN query. Rather than assuming that the full
training set is optimally discriminative, adapting the dataset allows for better use of
imperfectly generated descriptors with limited power. Learning weights focuses the clas-
sifier on more discriminative features and removes outlier points. Likewise, context-based addition of rare class examples restores density lost in the initial global pruning. On sufficiently large datasets, these adaptations yield substantial gains, particularly in per-class accuracy.
Figure 3.9: Example images from the SIFT-Flow dataset, annotated with classification rates using per-pixel (“p”) and per-class (“c”) metrics. Learning weights improves overall performance. Adding rare class examples improves classification of less common classes, like the boat in (b) and sidewalk in (g). Failures include labeling the road as sand in (h) and the mountain as rock (a rarer class) in (c).
While this kNN system can learn weights for each feature descriptor indicating its relative
importance, it does not yet learn the features themselves, nor the steps that combine
them into a classification prediction. We now begin to investigate systems that learn these as well: convolutional networks take the principle of model customization much further, learning entire stacks of features automatically tuned for the task and the data.
Chapter 4
Restoring An Image Taken Through a Window Covered with Dirt or Rain
The work presented in this chapter appeared in ICCV 2013 [24], and was a collaboration with Dilip Krishnan and Rob Fergus.
Figure 4.1: A photograph taken through a glass pane covered in dirt (left) and rain
(right), along with the output of our neural network model trained to remove this type
of corruption.
4.1 Introduction
In the previous chapter, we described a system for predicting per-pixel semantic labels using hand-designed features and kNN retrieval; we now turn to convolutional networks to infer per-pixel outputs. As we will see in Chapter 6, the ability of these net-
works to learn multiple layers of weighted combinations will allow them to leverage large
datasets and achieve greatly better performance on the same task, using a combination
of global and local fields of view. First, however, we investigate an effective application
of convolutional networks using purely local fields of view: restoring images that contain
compact structured noise in the form of dirt or rain droplets. In this chapter we also
find a benefit to training convolutionally, i.e. evaluating the loss on the final predicted
image obtained after averaging individual patch predictions, and examine the effects of
this.
Photographs taken through a window are often compromised by dirt or rain present on
the window surface. Common cases of this include when a person is inside a car, train or
building and wishes to photograph the scene outside, or exhibits in museums displayed
behind protective glass. Such scenarios have become increasingly common with the ubiquity of portable and smartphone cameras. In addition, many cameras are now mounted outdoors, e.g. on buildings for surveillance or on vehicles to prevent collisions.
These cameras are protected from the elements by an enclosure with a transparent
window.
Such images are affected by many factors including reflections and attenuation. However,
in this paper we address the particular situation where the window is covered with dirt
or water drops, resulting from rain. As shown in Fig. 4.1, these artifacts significantly degrade the quality of the captured image.
The classic approach to removing occluders from an image is to defocus them to the
point of invisibility at the time of capture. This requires placing the camera right up
against the glass and using a large aperture to produce small depth-of-field. However,
in practice it can be hard to move the camera sufficiently close, and aperture control may be limited, especially on smartphone cameras. Consequently, pictures taken with smartphone cameras through dirty or rainy glass still have significant artifacts, as we show later in this chapter.
In this paper we instead restore the image post-capture, treating the dirt or rain as a
structured form of image noise. Our method only relies on the artifacts being spatially
compact, thus is aided by the rain/dirt being in focus; hence the shots need not be taken right against the glass or with a large aperture.
Our approach is to use a convolutional neural network to predict clean patches, given
dirty or clean ones as input. By asking the network to produce a clean output, regardless
of the corruption level of the input, it implicitly must both detect the corruption and, if
present, in-paint over it. Integrating both tasks simplifies and speeds test-time operation, since the entire restoration is performed in a single forward pass.
Training the models requires a large set of patch pairs to adequately cover the space of inputs and corruption, the gathering of which was non-trivial and required the development of a new data collection procedure. Once trained, however, test-time operation is simple: a new image is presented to the neural network and it directly produces a restored output.
Image denoising is a very well studied problem, with current approaches such as BM3D
[17] approaching theoretical performance limits [79]. However, the vast majority of
this literature is concerned with additive white Gaussian noise, quite different to the
image artifacts resulting from dirt or water drops. Our problem is closer to shot-noise
removal, but differs in that the artifacts are not constrained to single pixels and have characteristic structure; approaches such as median filtering have no way of leveraging this structure, thus cannot effectively remove the artifacts (see Section 4.5).
Learning-based methods have found widespread use in image denoising, e.g. [152, 93,
99, 153]. These approaches remove additive white Gaussian noise (AWGN) by building
a generative model of clean image patches. In this paper, however, we focus on more
complex structured corruption, and address it using a neural network that directly maps
corrupt images to clean ones; this obviates the slow inference procedures used by most
generative models.
Neural networks have previously been explored for denoising natural images, mostly in
the context of AWGN, e.g. Jain and Seung [65], and Zhang and Salari [150]. Algorith-
mically, the closest work to ours is that of Burger et al. [9], which applies a large neural network to a range of denoising tasks, including non-AWGN corruption such as salt-and-pepper noise and JPEG quantization artifacts. Although more challenging than AWGN, the corruption
is still significantly easier than the highly variable dirt and rain drops that we address.
Furthermore, our network has important architectural differences that are crucial for obtaining good performance on this task.
Removing localized corruption can be considered a form of blind inpainting, where the
position of the corrupted regions is not given (unlike traditional inpainting [27]). Dong
et al. [21] show how salt-and-pepper noise can be removed, but the approach does not
extend to multi-pixel corruption. Recently, Xie et al. [143] showed how a neural network
can perform blind inpainting, demonstrating the removal of text synthetically placed
in an image. This work is close to ours, but the solid-color text has quite different
statistics to natural images, thus is easier to remove than rain or dirt which vary greatly
in appearance and can resemble legitimate image structures. Jancsary et al. [66] denoise
images with a Gaussian conditional random field, constructed using decision trees on
local regions of the input; however, they too consider only synthetic corruptions.
Several papers explore the removal of rain from images. Garg and Nayar [34] and Bar-
num et al. [4] address airborne rain. The former uses defocus, while the latter uses
frequency-domain filtering. Both require video sequences rather than a single image,
however. Roser and Geiger [104] detect raindrops in single images; although they do
not demonstrate removal, their approach could be paired with a standard inpainting algorithm.
Closely related to our application is Gu et al. [47], who show how lens dust and nearby
occluders can be removed, but their method requires extensive calibration or a video
sequence, as opposed to a single frame. Wilson et al. [142] and Zhou and Lin [151]
demonstrate dirt and dust removal. The former removes defocused dust for a Mars
Rover camera, while the latter removes sensor dust using multiple images and a physics
model.
4.2 Approach
To restore an image from a corrupt input, we predict a clean output using a specialized
form of convolutional neural network [77]. The same network architecture is used for all
forms of corruption; however, a different network is trained for dirt and for rain. This
allows the network to tailor its detection capabilities for each task.
Given a noisy image x, our goal is to predict a clean image y that is close to the true clean image y*. We accomplish this using a multilayer convolutional network F, so that y = F(x).
Concretely, if the number of layers in the network is L, then

$$F_0(x) = x, \qquad F_l(x) = \tanh\big(W_l \ast F_{l-1}(x) + b_l\big), \;\; l = 1, \ldots, L-1, \qquad F(x) = W_L \ast F_{L-1}(x) + b_L$$

where W_l consists of n_l learned convolution kernels of spatial size p_l × p_l operating over the n_{l−1} maps of the previous layer, with p_l the spatial support. b_l is a vector of size n_l containing the output bias (the same bias is used at each spatial location).
While the first and last layer kernels have a nontrivial spatial component, we restrict
the middle layers (2 ≤ l ≤ L − 1) to use pl = 1, i.e. they apply a linear map at each
spatial location. We also element-wise divide the final output by the overlap mask1 m
to account for different amounts of kernel overlap near the image boundary. The first
layer uses a “valid” convolution, while the last layer uses a “full” one (these are the same for the 1 × 1 kernels of the middle layers). In our system, the input kernels' support is p1 = 16, and the output support is pL = 8. We use two hidden layers (i.e. L = 3), each with 512 units. As stated earlier, the middle layer kernels have spatial size 1 × 1; thus W1 applies 512 kernels of size 16 × 16 × 3, W2 applies 512 kernels of size 1 × 1 × 512, and W3 applies 3 kernels of size 8 × 8 × 512.
Fig. 4.2 shows examples of weights learned for the rain data.
¹ m = 1_K ∗ 1_I, where 1_K is a kernel of size p_L × p_L filled with ones, and 1_I is a 2D array of ones with as many pixels as the last layer input.
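A minimal PyTorch sketch of this three-layer architecture is given below. It is my own reconstruction rather than the original Matlab implementation; the tanh nonlinearity, the padding used to emulate the “full” output convolution, and the way the overlap mask is computed follow the description above but should be treated as assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WindowCleaningNet(nn.Module):
    """Three layers: 16x16 'valid' input conv, 1x1 middle conv, 8x8 'full' output conv."""
    def __init__(self, hidden=512):
        super().__init__()
        self.conv1 = nn.Conv2d(3, hidden, kernel_size=16)             # "valid" convolution
        self.conv2 = nn.Conv2d(hidden, hidden, kernel_size=1)         # linear map at each location
        self.conv3 = nn.Conv2d(hidden, 3, kernel_size=8, padding=7)   # "full" convolution

    def forward(self, x):
        h1 = torch.tanh(self.conv1(x))
        h2 = torch.tanh(self.conv2(h1))
        out = self.conv3(h2)                                          # sums overlapping 8x8 patch outputs
        # Overlap mask m: how many 8x8 patch predictions cover each output pixel.
        ones = torch.ones(1, 1, h2.shape[2], h2.shape[3])
        m = F.conv2d(ones, torch.ones(1, 1, 8, 8), padding=7)
        return out / m

net = WindowCleaningNet()
restored = net(torch.rand(1, 3, 64, 64))    # a 64x64 input yields a 56x56 restored output
```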
4.2.2 Training
We train the weights W_l and biases b_l by minimizing the mean squared error over a dataset D = {(x_i, y_i*)} of corresponding noisy and clean image pairs. The loss is

$$J(\theta) = \frac{1}{2|D|} \sum_{i \in D} \|F(x_i) - y_i^*\|^2$$
where θ = (W1 , ..., WL , b1 , ..., bL ) are the model parameters. The pairs in the dataset D
are random 64 × 64 pixel subregions of training images with and without corruption (see
Fig. 4.4 for samples). Because the input and output kernel sizes of our network differ, the network output is smaller than the 64 × 64 input, and we compare it against the corresponding center region of the clean target patch.
We minimize the loss using Stochastic Gradient Descent (SGD). The update for a single
step at time t is
θ^{t+1} ← θ^t − η_t (F(x_i) − y_i^*)^T \frac{∂ F(x_i)}{∂θ}
where ηt is the learning rate hyper-parameter and i is a randomly drawn index from the training set.
We initialize the weights at all layers by randomly drawing from a normal distribution
with mean 0 and standard deviation 0.001. The biases are initialized to 0. The learning rate is kept fixed during training, and we do not use momentum or weight regularization.
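A correspondingly minimal training step, assuming the network sketch above and that each 64 × 64 clean target is center-cropped to the network's smaller output size, might look like this (names and batching are assumptions for the example):

```python
import torch

def train_step(net, optimizer, x_batch, y_batch):
    """One SGD step on the mean squared error of the final image prediction."""
    optimizer.zero_grad()
    pred = net(x_batch)
    # crop the target to the valid output region (input and output sizes differ)
    dh = (y_batch.shape[2] - pred.shape[2]) // 2
    dw = (y_batch.shape[3] - pred.shape[3]) // 2
    target = y_batch[:, :, dh:dh + pred.shape[2], dw:dw + pred.shape[3]]
    loss = 0.5 * ((pred - target) ** 2).mean()
    loss.backward()
    optimizer.step()
    return loss.item()
```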
A key improvement of our method over [9] is that we minimize the error of the final
image prediction, whereas [9] minimizes the error only of individual patches. We found
this difference to be crucial to obtaining good performance.
Since the middle layer convolution in our network has 1 × 1 spatial support, the network
Figure 4.2: A subset of rain model network weights, sorted by l2 -norm. Left: first
layer filters which act as detectors for the rain drops. Right: top layer filters used to
reconstruct the clean patch.
can be viewed as first patchifying the input, applying a fully-connected neural network
to each patch, and averaging the resulting output patches. More explicitly, we can split
the input image x into stride-1 overlapping patches {xp} = patchify(x), and predict a clean output patch yp = f(xp) for each one; the final image is then formed by taking the
average of the patch predictions at pixels where they overlap. In this context, the division by the overlap mask m performs this averaging.
In contrast to [9], our method trains the full network F, including patchification and the final averaging. This drives the individual patch predictions to decorrelate, which helps
both to remove occluders as well as reduce blur in the final output. To see this, consider
two adjacent patches y1 and y2 with overlap regions yo1 and yo2 , and desired output
yo∗ . If we were to train according to the individual predictions, the loss would minimize
(y_{o1} − y_o^*)^2 + (y_{o2} − y_o^*)^2, the sum of their errors. However, if we minimize the error of their average, the loss becomes

\left( \frac{y_{o1} + y_{o2}}{2} - y_o^* \right)^2 = \frac{1}{4} \left[ (y_{o1} - y_o^*)^2 + (y_{o2} - y_o^*)^2 + 2 (y_{o1} - y_o^*)(y_{o2} - y_o^*) \right].

The new mixed term pushes the individual patch errors in opposing directions, encouraging the errors to decorrelate.
Fig. 4.3 depicts this for a real example. When trained at the patch level, as in the system
described by [9], each prediction leaves the same residual trace of the noise, which their
Figure 4.3: Denoising near a piece of noise. (a) shows a 64 × 64 image region with
dirt occluders (top), and target ground truth clean image (bottom). (b) and (c) show
the results obtained using non-convolutional and convolutionally trained networks, re-
spectively. The top row shows the full output after averaging. The bottom row shows
the signed error of each individual patch prediction for all 8 × 8 patches obtained us-
ing a sliding window in the boxed area, displayed as a montage. The errors from the
convolutionally-trained network (c) are less correlated with one another compared to
(b), and cancel to produce a better average.
average then maintains (b). When trained with our convolutional network, however, the
predictions decorrelate where not perfect, and average to a better output (c).
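The effect of the mixed term can be illustrated numerically. The short NumPy example below uses made-up error statistics (not data from our experiments) to compare averaging two predictions that share the same residual against averaging two predictions whose residuals partially oppose each other:

```python
import numpy as np

rng = np.random.default_rng(0)
target = np.zeros(1000)
e = rng.normal(0.0, 1.0, 1000)
# correlated case: both patch predictions leave the same residual trace
avg_corr = 0.5 * ((target + e) + (target + e))
# decorrelated case: the second prediction's error partially opposes the first
e2 = -0.7 * e + rng.normal(0.0, 0.3, 1000)
avg_decorr = 0.5 * ((target + e) + (target + e2))
print(((avg_corr - target) ** 2).mean())    # ~1.0: averaging does not help
print(((avg_decorr - target) ** 2).mean())  # much smaller: the errors cancel
```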
By restricting the middle layer kernels to have 1 × 1 spatial support, our method requires
no synchronization until the final summation in the last layer convolution. This makes
our method natural to parallelize, and it can easily be run in sections on large input
images by adding the outputs from each section into a single image output buffer. Our
Matlab GPU implementation is able to restore a 3888 × 2592 color image in 60s using a single nVidia GTX 580 GPU.
4.3 Training Data Collection
The network has 753,664 weights and 1,216 biases which need to be set during training.
This requires a large number of training patches to avoid over-fitting. We now describe
the procedures used to gather the corrupted/clean patch pairs2 used to train each of the two networks.
4.3.1 Dirt
To train our network to remove dirt noise, we generated clean/noisy image pairs by
synthesizing dirt on images. Similarly to [47], we also found that dirt noise was well-
modeled by an opacity mask and additive component, which we extract from real dirt-on-
glass panes in a lab setup. Once we have the masks, we generate noisy images according
to
I′ = pαD + (1 − α)I
Here, I and I′ are the original clean and generated noisy image, respectively. α is a
transparency mask the same size as the image, and D is the additive component of the
dirt, also the same size as the image. p is a random perturbation vector in RGB space,
and the factors pαD are multiplied together element-wise. p is drawn from a uniform
distribution over (0.9, 1.1) for each of red, green and blue, then multiplied by another
random number between 0 and 1 to vary brightness. These random perturbations are
necessary to capture natural variation in the corruption and make the network robust
to these changes.
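A sketch of this corruption model in NumPy is shown below; the function name and the clipping to [0, 1] are illustrative choices, while the perturbation ranges follow the text:

```python
import numpy as np

def add_synthetic_dirt(image, alpha, dirt, rng):
    """Corruption model I' = p * alpha * D + (1 - alpha) * I.
    image, dirt: HxWx3 float arrays in [0, 1]; alpha: HxWx1 opacity mask."""
    # random RGB perturbation in (0.9, 1.1), times a random brightness factor in (0, 1)
    p = rng.uniform(0.9, 1.1, size=3) * rng.uniform(0.0, 1.0)
    return np.clip(p * alpha * dirt + (1.0 - alpha) * image, 0.0, 1.0)
```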
To find α and αD, we took pictures of several slide-projected backgrounds, both with
and without a dirt-on-glass pane placed in front of the camera. We then solved a linear
least-squares system for α and αD at each pixel; further details are included in the
supplementary material.
2 The corrupt patches still have many unaffected pixels; thus even without clean/clean patch pairs in the training set, the network will still learn to preserve clean input regions.
Figure 4.4: Examples of clean (top row) and corrupted (bottom row) patches used for
training. The dirt (left column) was added synthetically, while the rain (right column)
was obtained from real image pairs.
4.3.2 Rain

Unlike the dirt, water droplets refract light around them and are not well described by a simple additive model. We considered using the more sophisticated rendering model of [46], but accurately simulating outdoor illumination made this inviable. Thus,
instead of synthesizing the effects of water, we built a training set by taking photographs
of multiple scenes with and without the corruption present. For corrupt images, we sprayed water onto a pane of anti-reflective MgF2-coated glass, taking care to produce drops that closely resemble real rain. To limit motion differences between clean and rainy shots, all scenes contained only static objects.

4.4 Baseline Methods

We compare our network against a nonconvolutional patch-based network similar to [9], as well as three baseline approaches: median filtering, bilateral filtering [130, 97], and BM3D [17]. In each case, we tuned the algorithm parameters to yield the
Figure 4.5: Example image containing dirt, and the restoration produced by our network.
Note the detail preserved in high-frequency areas like the branches. The nonconvolutional
network leaves behind much of the noise, while the median filter causes substantial
blurring.
best qualitative performance in terms of visibly reducing noise while keeping clean parts
of the image intact. On the dirt images, we used an 8 × 8 window for the median filter,
parameters σs = 3 and σr = 0.3 for the bilateral filter, and σ = 0.15 for BM3D. For the
rain images, we used similar parameters, but adjusted for the fact that the images were
downsampled by half: 5 × 5 for the median filter, σs = 2 and σr = 0.3 for the bilateral filter, and a correspondingly adjusted σ for BM3D.
4.5 Experiments
4.5.1 Dirt
We tested dirt removal by running our network on pictures of various scenes taken behind
dirt-on-glass panes. Both the scenes and glass panes were not present in the training
set, ensuring that the network did not simply memorize and match exact patterns. We
tested restoration of both real and synthetic corruption. Although the training set was
composed entirely of synthetic dirt, it was representative enough for the network to perform well on real dirt as well.
The network was trained using 5.8 million examples of 64 × 64 image patches with
synthetic dirt, paired with ground truth clean patches. We trained only on examples
where the variance of the clean 64 × 64 patch was at least 0.001, and also required that
at least 1 pixel in the patch had a dirt α-mask value of at least 0.03. To compare to [9], we also trained a nonconvolutional patch-based network on the same data.
We first measure quantitative performance using synthetic dirt. The results are shown
in Table 4.1. Here, we generated test examples using images and dirt masks held out
from the training set, using the process described in Section 4.3.1. Our convolutional network outperforms the nonconvolutional patch-based network, and both networks are much better than the three baselines, which do not make use of the structure in the corruption.
We also applied our network to two types of artificial noise absent from the training set:
synthetic “snow” made from small white line segments, and “scratches” of random cubic
splines. An example region is shown in Fig. 4.6. In contrast to the gain of +6.50 dB
for dirt, the network leaves these corruptions largely intact, producing near-zero PSNR
gains of -0.10 and +0.30 dB, respectively, over the same set of images. This demonstrates that the network has learned to detect and remove dirt specifically, rather than acting as a general smoothing filter.
Dirt Results
Fig. 4.5 shows a real test image along with our output and the output of the patch-based
network and median filter. Because of illumination changes and movement in the scenes,
PSNR Input Ours Nonconv Median Bilateral BM3D
Mean 28.93 35.43 34.52 31.47 29.97 29.99
Std.Dev. 0.93 1.24 1.04 1.45 1.18 0.96
Gain - 6.50 5.59 2.53 1.04 1.06
Table 4.1: PSNR for our convolutional neural network, nonconvolutional patch-based
network, and baselines on a synthetically generated test set of 16 images (8 scenes with
2 different dirt masks). Our approach significantly outperforms the other methods.
Figure 4.6: Our dirt-removal network applied to an image with (a) no corruption, (b)
synthetic dirt, (c) artificial “snow” and (d) random “scratches.” Because the network
was trained to remove dirt, it successfully restores (b) while leaving the corruptions in
(c,d) largely untouched. Top: Original images. Bottom: Output.
we were not able to capture ground truth images for quantitative evaluation. Our method
is able to remove most of the corruption while retaining details in the image, particularly
around the branches and shutters. The non-convolutional network leaves many pieces of
dirt behind, while the median filter loses much detail present in the original. Note also
that the neural networks leave already-clean parts of the image mostly untouched.
Two common causes of failure of our model are large corruption, and very oddly-shaped
or unusually colored corruption. Our 16 × 16 input kernel support limits the size of
corruption recognizable by the system, leading to the former. The latter is caused by a
lack of generalization: although we trained the network to be robust to shape and color
by supplying it a range of variations, it will not recognize cases too far from those seen in
training. Another interesting failure of our method appears in the bright orange cones in
Fig. 4.5, which our method reduces in intensity — this is due to the fact that the training
dataset did not contain any examples of such fluorescent objects. More examples are
provided in the supplementary material.
4.5.2 Rain
We ran the rain removal network on two sets of test data: (i) pictures of scenes taken
through a pane of glass on which we sprayed water to simulate rain, and (ii) pictures
of scenes taken while it was actually raining, from behind an initially clean glass pane.
Both sets were composed of real-world outdoor scenes not in the training set.
We trained the network using 6.5 million examples of 64×64 image patch pairs, captured
as described in Section 4.3.2. Similarly to the dirt case, we used a variance threshold
of 0.001 on the clean images and required each training pair to have at least 1 pixel noticeably affected by rain.
Examples of our network removing sprayed-on water are shown in Fig. 4.7. As was the
case for the dirt images, we were not able to capture accurate ground truth due to
illumination changes and subject motion. Since we also do not have synthetic water examples, we present qualitative results only for this task.
As before, our network is able to remove most of the water droplets, while preserving
finer details and edges reasonably well. The non-convolutional network leaves behind
additional droplets, e.g. by the subject’s face in the top image; it performs somewhat
better in the bottom image, but blurs the subject’s hand. The median filter must blur
the image substantially before visibly reducing the corruption. However, the neural
networks mistake the boltheads on the bench for raindrops, and remove them.
Despite the fact that our network was trained on static scenes to limit object motion
between clean/noisy pairs, it still preserves animate parts of the images well: The face
and body of the subject are reproduced with few visible artifacts, as are grass, leaves
Figure 4.7: Our network removes most of the water while retaining image details; the
non-convolutional network leaves more droplets behind, particularly in the top image,
and blurs the subject’s fingers in the bottom image. The median filter blurs many details,
but still cannot remove much of the noise.
Figure 4.8: Shot from the rain video sequence (see supplementary video), along with the
output of our network. Note each frame is processed independently, without using any
temporal information or background subtraction.
and branches (which move from wind). Thus the network can be applied to many scenes containing people or moving objects.

A picture taken using actual rain is shown in Fig. 4.8. We include more pictures of this time series as well as a video in the supplementary material. Each frame of the video was processed independently by the network. To capture the sequence, we set a clean glass pane on a tripod and allowed rain to fall onto it, taking pictures at 20s intervals. The camera was placed 0.5m behind the glass, and was focused on the scene behind the glass.
Even though our network was trained using sprayed-on water, it was still able to remove
much of the actual rain. The largest failures appear towards the end of the sequence,
when the rain on the glass is very heavy and starts to agglomerate, forming droplets larger
than our network can handle. Although this is a limitation of the current approach, we believe it could be addressed with larger training patches or additional input scales.
Lastly, in addition to pictures captured with a DSLR, in Fig. 4.9 we apply our network
to a picture taken using a smartphone on a train. While the scene and reflections are
preserved, raindrops on the window are removed, though a few small artifacts do remain.
This demonstrates that our model is able to restore images taken by a variety of camera
types.
Figure 4.9: Top: Smartphone shot through a rainy window on a train. Bottom: Output
of our algorithm.
4.6 Discussion
In this chapter we introduced a method for removing rain or dirt artifacts from a single
image. Although the problem appears underconstrained, the artifacts have a distinctive
appearance which we are able to learn with a specialized convolutional network and
a carefully constructed training set. Results on real test examples show most artifacts
being removed without undue loss of detail, unlike previous approaches such as median or
bilateral filtering. Using a convolutional network accounts for the error in the final averaged image prediction, yielding better results than a patch-level network.
The quality of the results does however depend on the statistics of test cases being
similar to those of the training set. In cases where this does not hold, we see significant
artifacts in the output. This can be alleviated by expanding the diversity and size of
the training set. A second issue is that the corruption cannot be much larger than the
training patches. This means the input image may need to be downsampled, e.g. as was done for the rain images.
Although we have only considered day-time outdoor shots, the approach could be ex-
tended to other settings such as indoor or night-time, given suitable training data. It
could also be extended to other problem domains such as scratch removal or color shift
correction. Our algorithm also provides the underlying technology for a number of po-
tential applications such as a digital car windshield to aid driving in adverse weather, or enhancement of footage from outdoor cameras mounted in exposed locations.
While a local field of view is sufficient to detect and remove compact noise structures,
a more global view is needed for many other tasks in order to incorporate context and
cues from the larger image area. We now turn to a more challenging task that requires such global context: predicting depth from a single image.
Chapter 5

Depth Map Prediction from a Single Image

The work presented in this chapter appeared in NIPS 2014 [25], and was a collaboration with Christian Puhrsch and Rob Fergus.
5.1 Introduction
In this chapter we develop a convolutional network that integrates both global and local
views of an input image together to generate a depth map; this map contains the depth
from the camera for each pixel of the input. While for stereo images local correspondence
suffices for estimation, finding depth relations from a single image is less straightforward,
and requires integrating information from both global and local scales.
Depth relations help provide richer representations of objects and their environment, often leading to improvements in existing recognition tasks [115], and can enable many further applications such as 3D modeling [108, 62], physics and support models [115], robotics [50, 88], and potentially reasoning about occlusions.
While there is much prior work on estimating depth based on stereo images or motion
[110], there has been relatively little on estimating depth from a single image. Yet
the monocular case often arises in practice: Potential applications include better under-
standings of the many images distributed on the web and social media outlets, real estate
listings, and shopping sites. These include many examples of both indoor and outdoor
scenes.
There are likely several reasons why the monocular case has not yet been tackled to
the same degree as the stereo one. Provided accurate image correspondences, depth
can be recovered deterministically in the stereo case [53]. Thus, stereo depth estimation
can be reduced to developing robust image point correspondences — which can often
be found using local appearance features. By contrast, estimating depth from a single
image requires the use of monocular depth cues such as line angles and perspective,
object sizes, image position, and atmospheric effects. Furthermore, a global view of the
scene may be needed to relate these effectively, whereas local disparity is sufficient for
stereo.
Moreover, the task is inherently ambiguous, and a technically ill-posed problem: Given
an image, an infinite number of possible world scenes may have produced it. Of course,
most of these are physically implausible for real-world spaces, and thus the depth may
still be predicted with considerable accuracy. At least one major ambiguity remains,
though: the global scale. Although extreme cases (such as a normal room versus a
dollhouse) do not exist in the data, moderate variations in room and furniture sizes
are present. We address this using a scale-invariant error in addition to more common
scale-dependent errors. This focuses attention on the spatial relations within a scene
rather than general scale, and is particularly apt for applications such as 3D modeling, where the scene may be rescaled afterward.
In this chapter we present a new approach for estimating depth from a single image. We
directly regress on the depth using a neural network with two components: one that first
estimates the global structure of the scene, then a second that refines it using local infor-
mation. The network is trained using a loss that explicitly accounts for depth relations
between pixel locations, in addition to pointwise error. Our system achieves state-of-the
art estimation rates on NYU Depth and KITTI, as well as improved qualitative outputs.

5.2 Related Work

Directly related to our work are several approaches that estimate depth from a single
image. Saxena et al. [107] predict depth from a set of image features using linear re-
gression and a MRF, and later extend their work into the Make3D [108] system for 3D
model generation. However, the system relies on horizontal alignment of images, and
suffers in less controlled settings. Hoiem et al. [62] do not predict depth explicitly, but
instead categorize image regions into geometric structures (ground, sky, vertical), which
More recently, Ladicky et al. [74] show how to integrate semantic object labels with
features and use superpixels to segment the image. Karsch et al. [68] use a kNN transfer
mechanism based on SIFT Flow [81] to estimate depths of static backgrounds from
single images, which they augment with motion information to better estimate moving
foreground subjects in videos. This can achieve better alignment, but requires the entire dataset to be available at runtime and performs expensive alignment procedures. By contrast, our method learns an easier-to-store set of network parameters, and can be applied to images in real time.
More broadly, stereo depth estimation has been extensively investigated. Scharstein
et al. [110] provide a survey and evaluation of many methods for 2-frame stereo correspondence, organized by matching, aggregation and optimization techniques. In a creative application of multiview stereo, Snavely et al. [118] match across views of many uncalibrated consumer photographs of the same scene to create accurate 3D reconstructions of common landmarks.
Machine learning techniques have also been applied in the stereo case, often obtaining
better results while relaxing the need for careful camera alignment [70, 87, 144, 117].
Most relevant to this work is Konda et al. [70], who train a factored autoencoder on
image patches to predict depth from stereo sequences; however, this relies on the local displacements provided by stereo.
There are also several hardware-based solutions for single-image depth estimation. Levin
et al. [78] perform depth from defocus using a modified camera aperture, while the
Kinect and Kinect v2 use active stereo and time-of-flight to capture depth. Our method
makes indirect use of such sensors to provide ground truth depth targets during training;
however, at test time our system is purely software-based, predicting depth from RGB
images.
5.3 Approach
Our network is made of two component stacks, shown in Fig. 5.1. A coarse-scale network
first predicts the depth of the scene at a global level. This is then refined within local
regions by a fine-scale network. Both stacks are applied to the original input, but in
addition, the coarse network’s output is passed to the fine network as additional first-
layer image features. In this way, the local network can edit the global prediction to incorporate finer-scale details.
                  Coarse                                         Fine
Layer             input     1      2,3,4   5      6     7        1,2,3,4
Size (NYUDepth)   304x228   37x27  18x13   8x6    1x1   74x55    74x55
Size (KITTI)      576x172   71x20  35x9    17x4   1x1   142x27   142x27
Ratio to input    /1        /8     /16     /32    –     /4       /4

Figure 5.1: Model architecture. The coarse stack applies convolution/pooling layers (an 11x11 conv at stride 4 with 2x2 pooling, a 5x5 conv with 2x2 pooling, and three 3x3 convs, producing 96, 256, 384, 384 and 256 feature maps) followed by two fully connected layers (4096 units, then the coarse output). The fine stack applies a 9x9 conv at stride 2 with 2x2 pooling (63 maps), concatenates the coarse output as a 64th feature map, and applies two further 5x5 convs to produce the refined prediction.
Global Coarse-Scale Network

The task of the coarse-scale network is to predict the overall depth map structure using
a global view of the scene. The upper layers of this network are fully connected, and thus
contain the entire image in their field of view. Similarly, the lower and middle layers are
designed to combine information from different parts of the image through max-pooling
operations to a small spatial dimension. In so doing, the network is able to integrate a
global understanding of the full scene to predict the depth. Such an understanding is
needed in the single-image case to make effective use of cues such as vanishing points,
object locations, and room alignment. A local view (as is commonly used for stereo
matching) is insufficient to notice important features such as these.

As illustrated in Fig. 5.1, the global, coarse-scale network contains five feature extraction
layers of convolution and max-pooling, followed by two fully connected layers. The input,
feature map and output sizes are also given in Fig. 5.1. The final output is at 1/4-
resolution compared to the input (which is itself downsampled from the original dataset
by a factor of 2), and corresponds to a center crop containing most of the input (as we
describe later, we lose a small border area due to the first layer of the fine-scale network and image transformations).
Note that the spatial dimension of the output is larger than that of the topmost con-
volutional feature map. Rather than limiting the output to the feature map size and
relying on hardcoded upsampling before passing the prediction to the fine network, we
allow the top full layer to learn templates over the larger area (74x55 for NYU Depth).
These are expected to be blurry, but will be better than the upsampled output of an 8x6
prediction (the top feature map size); essentially, we allow the network to learn its own
upsampling based on the features. Sample output weights are shown in Fig. 5.2.
All hidden layers use rectified linear units for activations, with the exception of the
coarse output layer 7, which is linear. Dropout is applied to the fully-connected hidden
layer 6. The convolutional layers (1-5) of the coarse-scale network are pretrained on the
ImageNet classification task [19] — while developing the model, we found pretraining
on ImageNet worked better than initializing randomly, although the difference was not
very large1 .
Figure 5.2: Weight vectors from layer Coarse 7 (coarse output), for (a) KITTI and (b)
NYUDepth. Red is positive (farther) and blue is negative (closer); black is zero. Weights
are selected uniformly and shown in descending order by l2 norm. KITTI weights often
show changes in depth on either side of the road. NYUDepth weights often show wall
positions and doorways.
1 When pretraining, we stack two fully connected layers with 4096 - 4096 - 1000 output units each,
with dropout applied to the two hidden layers, as in [73]. We train the network using random 224x224
crops from the center 256x256 region of each training image, rescaled so the shortest side has length 256.
This model achieves a top-5 error rate of 18.1% on the ILSVRC2012 validation set, voting with 2 flips
and 5 translations per image.
Local Fine-Scale Network
After taking a global perspective to predict the coarse depth map, we make local re-
finements using a second, fine-scale network. The task of this component is to edit the
coarse prediction it receives to align with local details such as object and wall edges.
The fine-scale network stack consists of convolutional layers only, along with one pooling stage for the first layer edge features.
While the coarse network sees the entire scene, the field of view of an output unit in the
fine network is 45x45 pixels of input. The convolutional layers are applied across feature
maps at the target output size, allowing a relatively high-resolution output at 1/4 the
input scale.
More concretely, the coarse output is fed in as an additional low-level feature map. By
design, the coarse prediction is the same spatial size as the output of the first fine-
scale layer (after pooling), and we concatenate the two together (Fine 2 in Fig. 5.1). Subsequent layers maintain this size using zero-padded convolutions.
All hidden units use rectified linear activations. The last convolutional layer is linear, as
it predicts the target depth. We train the coarse network first against the ground-truth
targets, then train the fine-scale network keeping the coarse-scale output fixed (i.e. when
training the fine network, we do not backpropagate through the coarse one).
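To make the two-stack structure concrete, the following PyTorch sketch shows a drastically reduced version of the idea: a coarse stack with a global fully connected view whose output is concatenated as an extra feature map into a fine convolutional stack. Layer counts, sizes and names here are placeholders rather than the exact architecture of Fig. 5.1.

```python
import torch
import torch.nn as nn

class CoarseFineDepth(nn.Module):
    """Toy two-stack depth predictor for a 3x228x304 input (NYUDepth size)."""
    def __init__(self):
        super().__init__()
        self.coarse_conv = nn.Sequential(
            nn.Conv2d(3, 64, 11, stride=4), nn.ReLU(), nn.MaxPool2d(2))
        self.coarse_full = nn.Sequential(
            nn.Flatten(), nn.LazyLinear(4096), nn.ReLU(), nn.Dropout(),
            nn.LazyLinear(74 * 55))                       # learned coarse-output "templates"
        self.fine_head = nn.Sequential(
            nn.Conv2d(3, 63, 9, stride=2), nn.MaxPool2d(2))   # -> 63 x 55 x 74
        self.fine_rest = nn.Sequential(
            nn.Conv2d(64, 64, 5, padding=2), nn.ReLU(),
            nn.Conv2d(64, 1, 5, padding=2))               # linear output: log depth

    def forward(self, x):
        coarse = self.coarse_full(self.coarse_conv(x)).view(-1, 1, 55, 74)
        # feed the coarse prediction in as an additional first-layer feature map
        fine_in = torch.cat([self.fine_head(x), coarse], dim=1)
        return coarse, self.fine_rest(fine_in)
```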
Scale-Invariant Error

The global scale of a scene is a fundamental ambiguity in depth prediction. Indeed,
much of the error accrued using current elementwise metrics may be explained simply
by how well the mean depth is predicted. For example, Make3D trained on NYUDepth
obtains 0.41 error using RMSE in log space (see Table 5.1). However, using an oracle to
substitute the mean log depth of each prediction with the mean from the corresponding
ground truth reduces the error to 0.33, a 20% relative improvement. Likewise, for our
system, these error rates are 0.28 and 0.22, respectively. Thus, just finding the average
scale of the scene accounts for a large fraction of the total error.
To address this, we use a scale-invariant error that measures the relationships between
points in the scene, irrespective of the absolute global scale. For a predicted depth map
y and ground truth y∗, each with n pixels indexed by i, we define the scale-invariant
mean squared error (in log space) as
D(y, y^*) = \frac{1}{n} \sum_{i=1}^{n} \left( \log y_i - \log y_i^* + \alpha(y, y^*) \right)^2,    (5.1)

where α(y, y^*) = \frac{1}{n} \sum_i (\log y_i^* - \log y_i) is the value of α that minimizes the error for a
given (y, y ∗ ). For any prediction y, eα is the scale that best aligns it to the ground truth.
All scalar multiples of y have the same error, hence the scale invariance.
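A direct NumPy implementation of this error (assuming the inputs are restricted to valid pixels) is straightforward:

```python
import numpy as np

def scale_invariant_error(pred_depth, gt_depth):
    """Scale-invariant MSE in log space (Eqn. 5.1 / 5.3) for positive depth maps."""
    d = np.log(pred_depth) - np.log(gt_depth)
    n = d.size
    return (d ** 2).mean() - (d.sum() ** 2) / (n ** 2)
```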
Two additional ways to view this metric are provided by the following equivalent forms.
Setting di = log yi − log yi∗ to be the difference between the prediction and ground truth
at pixel i, we have
D(y, y^*) = \frac{1}{2n^2} \sum_{i,j} \left( (\log y_i - \log y_j) - (\log y_i^* - \log y_j^*) \right)^2    (5.2)

         = \frac{1}{n} \sum_i d_i^2 - \frac{1}{n^2} \sum_{i,j} d_i d_j = \frac{1}{n} \sum_i d_i^2 - \frac{1}{n^2} \Big( \sum_i d_i \Big)^2    (5.3)
Eqn. 5.2 expresses the error by comparing relationships between pairs of pixels i, j in
the output: to have low error, each pair of pixels in the prediction must differ in depth
by an amount similar to that of the corresponding pair in the ground truth. Eqn. 5.3
relates the metric to the original l2 error, but with an additional term, −\frac{1}{n^2} \sum_{i,j} d_i d_j,
that credits mistakes if they are in the same direction and penalizes them if they oppose.
Thus, an imperfect prediction will have lower error when its mistakes are consistent with
one another. The last part of Eqn. 5.3 rewrites this as a linear-time computation.
In addition to the scale-invariant error, we also measure the performance of our method
according to several error metrics that have been proposed in prior works, as described in
Section 5.4.
Training Loss

In addition to the error metrics above, we also use a scale-invariant term in our
training loss. Inspired by Eqn. 5.3, we set the per-sample training loss to
L(y, y^*) = \frac{1}{n} \sum_i d_i^2 - \frac{\lambda}{n^2} \Big( \sum_i d_i \Big)^2    (5.4)
where di = log yi − log yi∗ and λ ∈ [0, 1]. Note the output of the network is log y; that is,
the final linear layer predicts the log depth. Setting λ = 0 reduces to elementwise l2 , while
λ = 1 is the scale-invariant error exactly. We use the average of these, i.e. λ = 0.5, finding
that this produces good absolute-scale predictions while slightly improving qualitative
output.
During training, most of the target depth maps will have some missing values, particu-
larly near object boundaries, windows and specular surfaces. We deal with these simply
by masking them out and evaluating the loss only on valid points, i.e. we replace n in
Eqn. 5.4 with the number of pixels that have a target depth, and perform the sums over valid pixels only.
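A sketch of this masked training loss in PyTorch follows; the tensor shapes and the helper name are assumptions for the example:

```python
import torch

def depth_training_loss(pred_log_depth, gt_log_depth, valid_mask, lam=0.5):
    """Per-sample loss of Eqn. 5.4 evaluated only on pixels with a target depth."""
    d = (pred_log_depth - gt_log_depth) * valid_mask
    n = valid_mask.sum().clamp(min=1.0)          # number of valid pixels
    return (d ** 2).sum() / n - lam * (d.sum() ** 2) / (n ** 2)
```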
Data Augmentation

We augment the training data with random online transformations (values shown are for
NYUDepth; for KITTI, s ∈ [1, 1.2], and rotations are not performed since the images are horizontally aligned):
• Scale: Input and target images are scaled by s ∈ [1, 1.5], and the depths are divided by s.
• Rotation: Input and target are rotated in-plane by r ∈ [−5, 5] degrees.
• Translation: Input and target are randomly cropped to the sizes indicated in
Fig. 5.1.
• Color: Input values are multiplied globally by a random RGB value c ∈ [0.8, 1.2]^3.
• Flips: Input and target are horizontally flipped with 0.5 probability.
Note that image scaling and translation do not preserve the world-space geometry of
the scene. This is easily corrected in the case of scaling by dividing the depth values
by the scale s (making the image s times larger effectively moves the camera s times
closer). Although translations are not easily fixed (they effectively change the camera
to be incompatible with the depth values), we found that the extra data they provided
benefited the network even though the scenes they represent were slightly warped. The
other transforms, flips and in-plane rotation, are geometry-preserving. At test time, we
use a single center crop at scale 1.0 with no rotation or color transforms.
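As an illustration of the geometry-consistent scaling, the simplified sketch below enlarges the image and divides the depths by the same factor; the nearest-neighbor resampling is a simplification chosen for brevity, not the interpolation actually used:

```python
import numpy as np

def scale_augment(rgb, depth, s):
    """Zoom the image by s and divide depths by s to keep world geometry consistent.
    rgb: HxWx3 array, depth: HxW array."""
    h, w = depth.shape
    ys = (np.arange(int(h * s)) / s).astype(int).clip(max=h - 1)
    xs = (np.arange(int(w * s)) / s).astype(int).clip(max=w - 1)
    return rgb[ys][:, xs], depth[ys][:, xs] / s
```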
5.4 Experiments
We train our model on the raw versions of both NYU Depth v2 [115] and KITTI [36]. The
raw distributions contain many additional images collected from the same scenes as in
the more commonly used small distributions, but with no preprocessing; in particular,
points for which there is no depth value are left unfilled. However, our model’s natural
ability to handle such gaps as well as its demand for large training sets make these fitting
sources of data.
5.4.1 NYU Depth

The NYU Depth dataset [115] is composed of 464 indoor scenes, taken as video sequences
using a Microsoft Kinect camera. We use the official train/test split, using 249 scenes
for training and 215 for testing, and construct our training set using the raw data for
these scenes. RGB inputs are downsampled by half, from 640x480 to 320x240. Because
the depth and RGB cameras operate at different variable frame rates, we associate each
depth image with its closest RGB image in time, and throw away frames where one
RGB image is associated with more than one depth (such a one-to-many mapping is not
predictable). We use the camera projections provided with the dataset to align RGB
and depth pairs; pixels with no depth value are left missing and are masked out. To
remove many invalid regions caused by windows, open doorways and specular surfaces
we also mask out depths equal to the minimum or maximum recorded for each image.
The training set has 120K unique images, which we shuffle into a list of 220K after
evening the scene distribution (1200 per scene). We test on the 694-image NYU Depth
v2 test set (with filled-in depth values). We train the coarse network for 2M samples using
SGD with batches of size 32. We then hold it fixed and train the fine network for 1.5M
samples (given outputs from the already-trained coarse one). Learning rates are: 0.001
for coarse convolutional layers 1-5, 0.1 for coarse full layers 6 and 7, 0.001 for fine layers 1
and 3, and 0.01 for fine layer 2. These ratios were found by trial-and-error on a validation
set (folded back into the training set for our final evaluations), and the global scale of all the rates was tuned on the same validation set.
5.4.2 KITTI
The KITTI dataset [36] is composed of several outdoor scenes captured while driving with
car-mounted cameras and depth sensor. We use 56 scenes from the “city,” “residential,”
and “road” categories of the raw data. These are split into 28 for training and 28 for
testing. The RGB images are originally 1224x368, and downsampled by half to form the
network inputs.
The depth for this dataset is sampled at irregularly spaced points, captured at different
times using a rotating LIDAR scanner. When constructing the ground truth depths for
training, there may be conflicting values; since the RGB cameras shoot when the scanner
points forward, we resolve conflicts at each pixel by choosing the depth recorded closest
to the RGB capture time. Depth is only provided within the bottom part of the RGB
image, however we feed the entire image into our model to provide additional context to
the global coarse-scale network (the fine network sees the bottom crop corresponding to the target region).
The training set has 800 images per scene. We exclude shots where the car is stationary
(acceleration below a threshold) to avoid duplicates. Both left and right RGB cameras
are used, but are treated as unassociated shots. The training set has 20K unique im-
ages, which we shuffle into a list of 40K (including duplicates) after evening the scene
distribution. We train the coarse model first for 1.5M samples, then the fine model for 1M samples.

5.4.3 Baselines and Comparisons

We compare our method against Make3D trained on the same datasets, as well as the published results of other current methods [74, 68]. As an additional reference, we
also compare to the mean depth image computed across the training set. We trained
Make3D on KITTI using a subset of 700 images (25 per scene), as the system was unable
to scale beyond this size. Depth targets were filled in using the colorization routine in the
NYUDepth development kit. For NYUDepth, we used the common distribution training
set of 795 images. We evaluate each method using several error metrics from prior works, as listed below:

• Threshold: % of y_i s.t. max(y_i / y_i^*, y_i^* / y_i) = δ < thr
• Abs Relative difference: \frac{1}{|T|} \sum_{y \in T} |y - y^*| / y^*
• Squared Relative difference: \frac{1}{|T|} \sum_{y \in T} \|y - y^*\|^2 / y^*
• RMSE (linear): \sqrt{ \frac{1}{|T|} \sum_{y \in T} \|y_i - y_i^*\|^2 }
• RMSE (log): \sqrt{ \frac{1}{|T|} \sum_{y \in T} \|\log y_i - \log y_i^*\|^2 }
• RMSE (log, scale-invariant): the RMSE (log) after removing the mean log-difference, i.e. the error of Eqn. 5.1
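For reference, a compact NumPy sketch of these metrics (restricted to a single threshold value and assuming the inputs are flattened arrays of valid depths) is:

```python
import numpy as np

def depth_metrics(pred, gt):
    """Evaluation metrics listed above, computed over valid pixels."""
    ratio = np.maximum(pred / gt, gt / pred)
    return {
        "delta<1.25": (ratio < 1.25).mean(),
        "abs_rel": (np.abs(pred - gt) / gt).mean(),
        "sqr_rel": ((pred - gt) ** 2 / gt).mean(),
        "rmse_lin": np.sqrt(((pred - gt) ** 2).mean()),
        "rmse_log": np.sqrt(((np.log(pred) - np.log(gt)) ** 2).mean()),
    }
```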
Note that the predictions from Make3D and our network correspond to slightly differ-
ent center crops of the input. We compare them on the intersection of their regions,
and upsample predictions to the full original input resolution using nearest-neighbor.
5.5 Results
5.5.1 NYU Depth

Results for the NYU Depth dataset are provided in Table 5.1. As explained in Section 5.4.3,
we compare against the data mean and Make3D as baselines, as well as Karsch et al. [68]
and Ladicky et al. [74]. (Ladicky et al. uses a joint model which is trained using both
depth and semantic labels). Our system achieves the best performance on all metrics,
obtaining an average 35% relative gain compared to the runner-up. Note that our system
is trained using the raw dataset, which contains many more example instances than the
data used by other approaches, and is able to effectively leverage it to learn relevant features and their associations.
This dataset breaks many assumptions made by Make3D, particularly horizontal align-
ment of the ground plane; as a result, Make3D has relatively poor performance in this
task. Importantly, our method improves over it on both scale-dependent and scale-
invariant metrics, showing that our system is able to predict better relations as well as
better means.
2 On NYUDepth, log RMSE is 0.285 vs 0.286 for upsampling and downsampling, respectively, and
scale-invariant RMSE is 0.219 vs 0.221. The intersection is 86% of the network region and 100% of
Make3D for NYUDepth, and 100% of the network and 82% of Make3D for KITTI.
Qualitative results are shown on the left side of Fig. 5.4, sorted top-to-bottom by scale-
invariant MSE. Although the fine-scale network does not improve in the error measure-
ments, its effect is clearly visible in the depth maps — surface boundaries have sharper
transitions, aligning to local details. However, some texture edges are sometimes also
included. Fig. 5.3 compares Make3D as well as outputs from our network trained with
losses using λ = 0 and λ = 0.5. While we did not observe numeric gains using λ = 0.5
over λ = 0, it did produce slight qualitative improvements in the more detailed outputs.
                      Mean    Make3D  Ladicky&al  Karsch&al  Coarse  Coarse+Fine
threshold δ < 1.25    0.418   0.447   0.542       –          0.618   0.611        higher
threshold δ < 1.25^2  0.711   0.745   0.829       –          0.891   0.887        is
threshold δ < 1.25^3  0.874   0.897   0.940       –          0.969   0.971        better
abs relative diff.    0.408   0.349   –           0.350      0.228   0.215
sqr relative diff.    0.581   0.492   –           –          0.223   0.212        lower
RMSE (linear)         1.244   1.214   –           1.2        0.871   0.907        is
RMSE (log)            0.430   0.409   –           –          0.283   0.285        better
RMSE (log, sc.inv.)   0.304   0.325   –           –          0.221   0.219

Table 5.1: Depth prediction results on NYU Depth v2, compared against the training set mean, Make3D, Ladicky et al. [74] and Karsch et al. [68].
Figure 5.3: Qualitative comparison of Make3D, our method trained with l2 loss (λ = 0), and our method trained with both l2 and scale-invariant loss (λ = 0.5). Panels show the input, ground truth, Make3D output, the l2-trained output, the coarse-only output, and the scale-invariant-trained output.
5.5.2 KITTI
We next examine results on the KITTI driving dataset. Here, the Make3D baseline is
well-suited to the dataset, being composed of horizontally aligned images, and achieves
relatively good results. Still, our method improves over it on all metrics, by an average
31% relative gain. Just as importantly, there is a 25% gain in both the scale-dependent
and scale-invariant RMSE errors, showing there is substantial improvement in the pre-
dicted structure. Again, the fine-scale network does not improve much over the coarse
one in the error metrics, but differences between the two can be seen in the qualitative
outputs.
                          Mean    Make3D  Coarse  Coarse + Fine
threshold δ < 1.25        0.556   0.601   0.679   0.692          higher
threshold δ < 1.25^2      0.752   0.820   0.897   0.899          is
threshold δ < 1.25^3      0.870   0.926   0.967   0.967          better
abs relative difference   0.412   0.280   0.194   0.190
sqr relative difference   5.712   3.012   1.531   1.515          lower
RMSE (linear)             9.635   8.734   7.216   7.156          is
RMSE (log)                0.444   0.361   0.273   0.270          better
RMSE (log, scale inv.)    0.359   0.327   0.248   0.246

Table 5.2: Depth prediction results on KITTI, compared against the training set mean and Make3D.
The right side of Fig. 5.4 shows examples of predictions, again sorted by error. The
fine-scale network produces sharper transitions here as well, particularly near the road
edge. However, the changes are somewhat limited. This is likely caused by uncorrected
alignment issues between the depth map and input in the training data, due to the rotat-
ing scanner setup. This dissociates edges from their true position, causing the network
to average over their more random placements. Fig. 5.3 shows Make3D performing much better on this data, as expected, while using the scale-invariant error as a loss seems to make relatively little difference here.
5.6 Discussion
Predicting depth estimates from a single image is a challenging task. Yet by combining
information from both global and local views, it can be performed reasonably well. Our
system accomplishes this through the use of two deep networks, one that estimates the
global depth structure, and another that refines it locally at finer resolution. We achieve
a new state-of-the-art on this task for NYU Depth and KITTI datasets, having effectively leveraged the full raw data distributions for training.
In the next chapter, we extend our method to also predict surface normals and seman-
tic labels, thus providing even richer geometric outputs and object class information.
We also apply successively finer-scaled networks to increase the output map resolution, yielding more detailed predictions.
Figure 5.4: Example predictions from our algorithm. NYUDepth on left, KITTI on right.
For each image, we show (a) input, (b) output of coarse network, (c) refined output of
fine network, (d) ground truth. Examples are sorted from best (top) to worst (bottom).
Chapter 6

Predicting Depth, Surface Normals and Semantic Labels with a Common Multi-Scale Convolutional Architecture
The work presented in this chapter was a collaboration with Rob Fergus, and is currently under review.
6.1 Introduction
In this chapter we address three different computer vision tasks using a single multiscale convolutional architecture: depth prediction, surface normal estimation, and semantic
labeling. Our new model builds upon the approach we took in the previous chapter on
depth map prediction, and contains enhancements that both enable generalization to
new tasks, as well as help performance. All three tasks use the same core architecture, differing only in their output layers and loss functions.
Our new method generates pixel-maps directly from an input image, without the need
for low-level superpixels or contours, and is able to align to many image details by using
a series of network scales. At test time, all three outputs can be generated in real time (∼30Hz). We achieve state-of-the-art
results on all three tasks we investigate, demonstrating the versatility of our approach.
There are several advantages in developing a general model for pixel-map regression.
First, applications to new tasks may be quickly developed, with much of the new work
lying in defining an appropriate training set and loss function; in this light, our work
is a step towards building off-the-shelf regressor models that can be used for many applications. Second, a single general model simplifies the implementation
of systems that require multiple modalities, e.g. robotics or augmented reality, which
in turn can help enable research progress in these areas. Lastly, in the case of depth
and normals in our system, much of the computation can be shared between modalities, making the system more efficient.

6.2 Related Work

Single-image surface normal estimation has been addressed by Fouhey et al. [30, 31],
Ladicky et al. [75], and most recently by Wang et al. [140], the latter in work concurrent
with ours. Fouhey et al. match to discriminative local templates [30] followed by a global
optimization on a grid drawn from vanishing point rays [31], while Ladicky et al. learn
a regression from over-segmented regions to a discrete set of normals and mixture co-
efficients. Wang et al. [140] use convolutional networks to combine normals estimates
from local and global scales, while also employing cues from room layout, edge labels
and vanishing points. Importantly, we achieve as good or superior results with a more
general multiscale architecture that can naturally be used to perform many different
tasks.
Prior work on semantic segmentation includes many different approaches, both using
RGB-only data [129, 11, 28] as well as RGB-D [115, 101, 89, 16, 51, 56, 48]. Most of these classify over-segmented regions using local features, then enforce consistency with a CRF or other smoothing step. By comparison, our method takes the opposite
approach: We make a consistent global prediction first, then follow it with iterative local
refinements. In so doing, the local networks are made aware of their place within the
global scene, and can use this information in their refined predictions.
Gupta et al. [48, 49] create semantic segmentations first by generating contours, then
classifying regions using either hand-generated features and SVM [48], or a convolutional
network for object detection [49]. Notably, [48] also performs amodal completion, which
transfers labels between disparate regions of the image by comparing planes from the
depth.
Most related to our method in semantic segmentation are other approaches using con-
volutional networks. Farabet et al. [28] and Couprie et al. [16] each use a convolutional
network applied at multiple scales to find local predictions, then aggregate the predic-
tions using superpixels. Our method differs in several important ways. First, our model
has a large, full-image field of view at the coarsest scale; as we demonstrate, this is of
critical importance, particularly for depth and normals tasks. Second, we do not use su-
perpixels or any post-process smoothing — instead, our network produces fairly smooth
outputs on its own, allowing us to take a simple pixel-wise maximum. Moreover, our
model can naturally be applied both to piecewise-constant targets (e.g. labels) as well
Pinheiro et al. [98] use a recurrent convolutional network in which each application
predicts labels at the center location of an input region, given predicted labels from
the previous scale and a rescaled input patch. In contrast to our model, the scales
progress from local to global, incorporating progressively more context — precisely the
reverse of our approach. In addition, they apply the same network parameters at all
scales, while we learn distinct networks that can specialize in the edits appropriate to
their stage, and communicate between the first two scales with more flexible feature
maps rather than constraining to the classes; these choices are also consistent with our
findings in Chapter 7, in which we find that not tying weights between layers generally performs better. One tradeoff is that our networks operate on fixed-size images, whereas [98] can in theory be repeatedly applied to cover regions of arbitrary size.
In concurrent work, Long et al. [83] adapt the recent VGG ImageNet model [116] to perform semantic segmentation with a fully convolutional network, applying 1x1 convolutional label classifiers at feature maps
from different layers, corresponding to different scales, and averaging the outputs. By
contrast, we apply networks for different scales in series, which allows them to make
more complex edits and refinements, starting from a full image field of view. Thus our
architecture easily adapts to many tasks, whereas by using fields of view always centered on the output pixel, their approach is more limited in the global context it can use.
Some recent works have applied related architectures to object segmentation. Wang
et al. [141] perform salient object segmentation using a single-scale ConvNet applied
jointly with bounding box detection, but segment only one object per image and are
limited to 50x50 single-channel bitmaps. Huang and Jain [63] segment neurons by recur-
sively applying an affinity graph generator, but apply their model exclusively to neuron
segmentation, use a VQ/SVM pipeline, and make different use of scale. By contrast, we apply a general-purpose architecture to full scenes with many classes.

6.3 Model Architecture

As in the previous chapter, our model is a multi-scale deep network that first predicts a coarse global output based on the entire image area, then
refines it using finer-scale local networks. This scheme is illustrated in Fig. 6.1. Our new
model has several architectural improvements: First, we make the model deeper (more
convolutional layers). Second, we add a third scale at higher resolution, bringing the
final output resolution up to half the input, or 109 × 147 for NYUDepth. Third, instead
of passing output predictions from scale 1 to scale 2, we pass multichannel feature maps;
in so doing, we found we could also train the first two scales of the network jointly from
the start, somewhat simplifying the training procedure and yielding performance gains.
Scale 1: Full-Image View The first scale in the network predicts a coarse but
spatially-varying set of features for the entire image area, based on a large, full-image
field of view. We accomplish this through the use of two fully-connected layers — the
output of the last full layer is reshaped to 1/16-scale in its spatial dimensions by 64
features, then upsampled by a factor of 4 to 1/4-scale. Note since the feature upsam-
pling is linear, this corresponds to a decomposition of a big fully connected layer from
layer 1.6 to the larger 74 × 55 map; since such a matrix would be prohibitively large and
only capable of producing a blurry output given the more constrained input features,
we constrain the resolution and upsample. Note, however, that the 1/16-scale output is
still large enough to capture considerable spatial variation, and in fact is twice as large as the final convolutional feature map.
Since the top layers are fully connected, each spatial location in the output connects
to all of the image features, incorporating a very large field of view. This stands in
contrast to the multiscale approach of [16, 28], who apply convolutions and pooling alone
to downsampled versions of the image, producing maps whose output locations’ fields
of view are always centered on the output pixel. This full-view connection is especially important, as we demonstrate later in this chapter.
As shown in Fig. 6.1, we trained two different sizes of our model: One where this scale
is based on an ImageNet-trained AlexNet [73], and one where it is initialized using the
Oxford VGG network [116]. We report differences in performance between the two models in our experiments.
(Scale 1: convolution/pooling layers followed by fully connected layers, upsampled. Scale 2: a convolution/pooling layer whose output is concatenated with the upsampled Scale 1 features, followed by further convolutions, upsampled. Scale 3: another convolution/pooling layer concatenated with the upsampled Scale 2 output, followed by final convolutions.)
Figure 6.1: Model architecture. C is the number of output channels in the final predic-
tion, which depends on the task. The input to the network is 320x240.
Scale 2: Predictions The job of the second scale is to produce predictions at a
mid-level resolution, by incorporating a more detailed but narrower view of the image
along with the full-image information supplied by the coarse network. We accomplish
this by concatenating the feature maps of the coarse network with those from a single
layer of convolution and pooling, performed at finer stride (see Fig. 6.1). The output of
the second scale is a 74 × 55 prediction (for NYUDepth), with the number of channels
depending on the task. We train Scales 1 and 2 of the model together jointly, using SGD on the loss for each task.
Scale 3: Higher Resolution The final scale of our model refines the predictions
to higher resolution. We concatenate the Scale-2 outputs with feature maps generated
from the original input at yet finer stride, thus incorporating a more detailed view. The
further refinement aligns the output to higher-resolution details in the image, producing
spatially coherent yet quite detailed outputs. The final output resolution is half the
network input.
6.4 Tasks
We apply this same architecture structure to each of the three tasks we investigate:
depths, normals and semantic labeling. Each makes use of a different loss function and target data defining the task.
6.4.1 Depth
For depth prediction, we use a loss function comparing the predicted and ground-truth
log depth maps D and D∗ . Letting d = D − D∗ be their difference, we set the loss to
L_{depth}(D, D^*) = \frac{1}{n} \sum_i d_i^2 - \frac{1}{2n^2} \Big( \sum_i d_i \Big)^2 + \frac{1}{n} \sum_i \left[ (\nabla_x d_i)^2 + (\nabla_y d_i)^2 \right]    (6.1)
where the sums are over valid1 pixels i, and n is the number of valid pixels. Here, ∇x di
and ∇y di are the horizontal and vertical image gradients of the difference.
This loss combines the l2 and scale-invariant terms we used in Chapter 5 with a first-order
matching term (∇x di )2 +(∇y di )2 , which compares image gradients of the prediction with
the ground truth. This encourages predictions to have not only close-by values, but also
similar local structure. We found it indeed produces outputs that better follow depth boundaries in the scene.
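A sketch of Eqn. 6.1 in PyTorch is given below; the missing-value mask is omitted for brevity and the function name is illustrative:

```python
import torch

def depth_loss_with_gradients(pred_log, gt_log):
    """l2 + scale-invariant term + first-order gradient matching, shapes (B, 1, H, W)."""
    d = pred_log - gt_log
    n = d.numel()
    grad_x = d[:, :, :, 1:] - d[:, :, :, :-1]   # horizontal gradient of the difference
    grad_y = d[:, :, 1:, :] - d[:, :, :-1, :]   # vertical gradient of the difference
    return ((d ** 2).sum() / n
            - (d.sum() ** 2) / (2 * n ** 2)
            + (grad_x ** 2).mean() + (grad_y ** 2).mean())
```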
6.4.2 Surface Normals

To predict surface normals, we change the output from one channel to three, and predict
the x, y and z components of the normal at each pixel. We also normalize the vector
at each pixel to unit l2 norm, and backpropagate through this normalization. We then
employ a simple elementwise loss comparing the predicted normal at each pixel to the ground truth:

L_{normals}(N, N^*) = -\frac{1}{n} \sum_i N_i \cdot N_i^* = -\frac{1}{n} N \cdot N^*    (6.2)
where N and N ∗ are predicted and ground truth normal vector maps, and the sums
again run over valid pixels (i.e. those with a ground truth normal).
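A sketch of this loss in PyTorch, with assumed tensor layouts, is:

```python
import torch
import torch.nn.functional as F

def normals_loss(pred, gt, valid_mask):
    """Negative mean dot product between unit normals (Eqn. 6.2).
    pred, gt: (B, 3, H, W); valid_mask: (B, 1, H, W), 1 where a ground truth normal exists."""
    pred = F.normalize(pred, dim=1)                      # normalize predictions to unit length
    dot = (pred * gt).sum(dim=1, keepdim=True)
    return -(dot * valid_mask).sum() / valid_mask.sum().clamp(min=1.0)
```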
For ground truth targets, we compute the normal map using the same method as in
Silberman et al. [115], which estimates normals from depth by fitting least-squares planes to neighboring sets of points in the depth map.

6.4.3 Semantic Labels

For semantic labeling, we use a pixelwise softmax classifier to predict a class label for
each pixel. The final output then has as many channels as there are classes. We use a
1 We mask out pixels where the ground truth is missing.
simple pixelwise cross-entropy loss,
L_{semantic}(C, C^*) = -\frac{1}{n} \sum_i C_i^* \log(C_i)    (6.3)

where C_i = e^{z_i} / \sum_c e^{z_{i,c}} is the class prediction at pixel i given the output z of the final linear output layer.
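A corresponding sketch of the pixelwise cross-entropy with masking (tensor shapes and names are assumptions for the example):

```python
import torch
import torch.nn.functional as F

def semantic_loss(logits, labels, valid_mask):
    """Pixelwise cross-entropy (Eqn. 6.3).
    logits: (B, C, H, W); labels: (B, H, W) long class ids; valid_mask: (B, H, W) float."""
    loss = F.cross_entropy(logits, labels, reduction="none")  # softmax + log handled internally
    return (loss * valid_mask).sum() / valid_mask.sum().clamp(min=1.0)
```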
When labeling the NYUDepth RGB-D dataset, we use the ground truth depth and
normals as additional input channels. We convolve each of the three input types (RGB,
depth and normals) with a different set of 32×9×9 filters, then concatenate the resulting
three feature sets along with the network output from the previous scale to form the input
to the next. 2 Note the first scale is initialized using ImageNet, and we keep it RGB-
only. Applying convolutions to each input type separately, rather than concatenating all
the channels together in pixel space and filtering the joint input, enforces independence
between the features at the lowest filter level, which we found helped performance.
6.5 Training
We train our model in two phases using SGD: First, we jointly train both Scales 1 and
2. Second, we fix the parameters of these scales and train Scale 3. Since Scale 3 contains
four times as many pixels as Scale 2, it is expensive to train using the entire image area
for each gradient step. To speed up training, we instead use random crops of size 74x55:
We first forward-propagate the entire image through scales 1 and 2, upsample, and crop
the resulting Scale 3 input, as well as the original RGB input at the corresponding
location. The cropped image and Scale 2 prediction are forward- and back-propagated
through the Scale 3 network, and the weights updated. We find this speeds up training
2 We also tried the “HHA” encoding proposed by [49], but did not see a benefit in our case, thus we
opt for the simpler approach of using the depth and xyz-normals directly.
by about a factor of 3, including the overhead for inference of the first two scales, and
results in about the same if not slightly better error from the increased stochasticity.
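A sketch of this cropping step is shown below; for simplicity it assumes the RGB input has already been resampled to the Scale-3 grid, and the function name is illustrative:

```python
import random

def random_crop_pair(scale2_feats, rgb_s3, ch=55, cw=74):
    """Crop the upsampled Scale-2 features and the (resampled) RGB input
    at the same random location; layouts are (B, C, H, W)."""
    H, W = scale2_feats.shape[2], scale2_feats.shape[3]
    top, left = random.randint(0, H - ch), random.randint(0, W - cw)
    return (scale2_feats[:, :, top:top + ch, left:left + cw],
            rgb_s3[:, :, top:top + ch, left:left + cw])
```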
All three tasks use the same initialization and learning rates in nearly all layers, indi-
cating that hyperparameter settings are in fact fairly robust to changes in task. These
values were first tuned using the depth task, then verified to be an appropriate order
of magnitude for each other task using a small validation set. The only differences are:
(i) The learning rate for the normals task is 10 times larger than depth or labels. (ii)
Relative learning rates of layers 1.6 and 1.7 are 0.1 each for depth/normals, but 1.0 and
0.01 for semantic labeling. (iii) The dropout rate of layer 1.6 is 0.5 for depth/normals,
but 0.8 for semantic labels, as there are fewer training images.
We initialize the convolutional layers of Scale 1 with the ImageNet-trained weights, and
randomly initialize the fully connected layers of Scale 1 and all layers in Scales 2 and 3.
We train using batches of size 32 for the AlexNet-initialized model but batches of size
16 for the VGG-initialized model due to memory constraints. In each case we step down
the global learning rate by a factor of 10 after approximately 2M gradient steps, and continue training at the reduced rate.
In all cases, we apply random data transforms to augment the training data. We use
random scaling, in-plane rotation, translation, color, flips and contrast. When transforming
the image, we apply corresponding transformations to the target depths, normals and labels.
Note the normal vector transformation is the inverse-transpose of
the worldspace transform: Flips and in-plane rotations require flipping or rotating the
normals, while to scale the image by a factor s, we divide the depths by s but multiply
the z component of each normal by s and renormalize.
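A sketch of the scaling case under the inverse-transpose rule described above; the exact transformation used in the thesis may differ in detail, and the image resize itself is assumed to happen elsewhere.

```python
import numpy as np

def scale_augmentation(rgb, depth, normals, s):
    """Joint scale augmentation of targets, assuming the image has already been
    resized by factor s: depths are divided by s, and (by the inverse-transpose
    rule, an assumption here) the z component of each normal is multiplied by s,
    followed by renormalization. normals: HxWx3 float array."""
    depth_s = depth / s
    n = normals.copy()
    n[:, :, 2] *= s
    n /= np.linalg.norm(n, axis=2, keepdims=True) + 1e-8
    return rgb, depth_s, n
```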
6.5.3 Combining Depth and Normals
We combine both depths and normals networks together to share computation, creating
a network using a single scale 1 stack, but separate scale 2 and 3 stacks. Thus we predict
both depth and normals at the same time, given an RGB image. This produces a 1.6x
speedup compared to running two separate models.
This shared model also enabled us to try enforcing compatibility between predicted nor-
mals and those obtained via finite difference of the predicted depth (predicting normals
directly performs considerably better than using finite difference). However, while this
constraint was able to improve the normals from finite difference, it failed to improve
either task individually. Thus, while we make use of the shared model for computational
efficiency, we do not enforce this compatibility constraint in the results we report.
6.6.1 Depth
We first apply our method to depth prediction on NYUDepth v2. We train using the
entire NYUDepth v2 raw data distribution, using the scene split specified in the official
train/test distribution. We then test on the common distribution depth maps, including
filled-in areas, but constrained to the axis-aligned rectangle where there is a valid
depth map projection. Since the network output is a lower resolution than the original
NYUDepth images, and excludes a small border, we bilinearly upsample our network
outputs to the original 640x480 image scale, and extrapolate the missing border using a
cross-bilateral filter. We compare our method to prior works Ladicky et al. [74], Karsch
et al. [68], Baig et al. [2], Liu et al. [82], and our previous system from Chapter 5 (Eigen
et al. [25]).
Results are shown in Table 6.1. Our model obtains best performance in every metric, due
to our larger architecture. Qualitative results in Fig. 6.2 show considerable improvement
in detail over our previous method.
Figure 6.2: Example depth results. (a) RGB input; (b) Result from Chapter 5 [25]; (c)
Our result (Scale 1: AlexNet); (d) Our result (Scale 1: VGG); (e) Ground Truth. Note
the color range of each image is individually scaled.
Depth Prediction
              Ladicky [74]  Karsch [68]  Baig [2]  Liu [82]  Eigen [25]  Ours (A)  Ours (VGG)
δ < 1.25         0.542          –         0.597     0.614      0.614      0.697      0.769
δ < 1.25²        0.829          –           –       0.883      0.888      0.912      0.950
δ < 1.25³        0.940          –           –       0.971      0.972      0.977      0.988
abs rel            –          0.350       0.259     0.230      0.214      0.198      0.158
sqr rel            –            –           –         –        0.204      0.180      0.121
RMS (lin)          –           1.2        0.839     0.824      0.877      0.753      0.641
RMS (log)          –            –           –         –        0.283      0.255      0.214
sc-inv.            –            –         0.242       –        0.219      0.202      0.171

Table 6.1: Depth prediction results on NYU Depth v2. Higher is better for the δ threshold
metrics; lower is better for the error metrics.
Next we apply our method to surface normals prediction. We compare against the
3D Primitives (3DP) and “Indoor Origami” works of Fouhey et al. [30, 31]; Ladicky
et al. [75]; and Wang et al. [140]. As with the depth network, we used the full raw
dataset for training, since ground-truth normal maps can be generated for all images.
Since different systems have different ways of calculating ground truth normal maps, we
compare using both the ground truth as constructed in [30, 31] as well as the method
used in [115], using precomputed predictions supplied by the authors of each method. Note
that Wang et al. use a method similar to [30] to construct training targets, while we use
the method of [115] for this purpose. We measure performance with the same metrics as
in [30]: The mean and median angle from the ground truth across all unmasked pixels,
as well as the percent of vectors whose angle falls within a series of three thresholds.
Results are shown in Table 6.2. The smaller version of our model performs similarly
or slightly better than Wang et al. , while the larger version substantially outperforms
all comparison methods. Note that of the ground truths, [30] is somewhat more pre-
processed compared to [115], and thus [30] tends to present flatter areas, while [115] is
noisier but preserves more detail.
Figures 6.3 and 6.4 show example predictions. Note the details captured by our method,
such as the curvature of the blanket on the bed in the first row and the sofas in the second row.
Surface Normal Estimation (GT [30])
                      Angle Distance        Within t° Deg.
                      Mean    Median     11.25°   22.5°    30°
3DP [30]              34.2     30.0       18.5    38.6    50.0
Ladicky et al. [75]   32.5     22.3       27.4    50.2    60.1
Fouhey et al. [31]    35.1     19.2       37.6    53.3    58.9
Wang et al. [140]     26.6     15.3       40.1    61.4    69.0
Ours (AlexNet)        23.1     15.1       39.4    63.6    72.7
Ours (VGG)            20.5     13.2       44.0    68.5    77.2

Surface Normal Estimation (GT [115])
                      Angle Distance        Within t° Deg.
                      Mean    Median     11.25°   22.5°    30°
3DP [30]              37.7     34.1       14.0    32.7    44.1
Ladicky et al. [75]   35.5     25.5       24.0    45.6    55.9
Wang et al. [140]     28.8     17.9       35.2    57.1    65.5
Ours (AlexNet)        25.9     18.2       33.2    57.5    67.7
Ours (VGG)            22.2     15.3       38.6    64.0    73.9

Table 6.2: Surface normals prediction measured against different types of ground truth
acquisition. Each sub-table shows results for a different ground truth construction.
Figure 6.3: Example surface normals results. Columns: RGB input, 3DP [30], Ladicky et al. [75],
Wang et al. [140], Ours (VGG), and ground truth.
Figure 6.4: Example surface normals results.
6.6.3 Semantic Labels
NYU Depth
We finally apply our method to semantic segmentation, first also on NYUDepth. Because
this data provides a depth channel, we use the ground-truth depth and normals as input
into the semantic network, as described in Section 6.4.3. We evaluate our method on
semantic class sets with 4, 13 and 40 labels, described in [115], [16] and [48], respectively.
The 4-class segmentation task uses high-level category labels “floor”, “structure”, “furni-
ture” and “props”, while the 13- and 40-class tasks use different sets of more fine-grained
categories. We compare with several recent methods, using the metrics commonly used
to evaluate each task: For the 4- and 13-class tasks we use pixelwise and per-class ac-
curacy; for the 40-class task, we also compare using the mean pixel-frequency weighted
Jaccard index of each class, and the flat mean Jaccard index.
Results are shown in Table 6.3. We decisively outperform the comparison methods on
the 4- and 13-class tasks. In the 40-class task, our model outperforms Gupta et al. ’14
for both model sizes, and Long et al. with the larger size. Qualitative results are shown
in Fig. 6.7. Even though our method does not use superpixels or any piecewise constant
assumptions, it nevertheless tends to produce large constant regions most of the time.
Sift Flow
We confirm our method can be applied to additional scene types by evaluating on the Sift
Flow dataset [81], which contains images of outdoor cityscapes and landscapes segmented
into 33 categories. All images are 256x256, rather than 320x240 for NYUDepth, and so
our model outputs images of a different size. Note that we do not adjust any of the
convolutional kernel sizes or learning rates for this dataset — we simply transfer the
values used for NYUDepth directly; however, we adjust the random crop augmentations
by a few pixels so that feature maps can be combined evenly between scales.
We compare against Tighe et al. [129], Farabet et al. [28], Pinheiro [98] and Long
et al. [83], as well as the weighted kNN system we presented in Chapter 3. Note that
Farabet et al. train two models, using either empirical or rebalanced class distributions;
to compare against both, we also train a reweighted version of our model by reweighting
each class in the cross-entropy loss. We weight each pixel using weight \alpha_c = median\_freq / freq(c),
where freq(c) is the number of pixels of class c divided by the total number of pixels in
images where c is present, and median\_freq is the median of these frequencies over all classes.
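A sketch of this reweighting computed from a list of per-image label maps; the handling of unlabeled pixels (here marked with negative labels) is an assumption.

```python
import numpy as np

def class_weights(label_maps, n_classes):
    """Median-frequency reweighting: freq(c) is the number of pixels of class c
    divided by the total pixel count of images containing c, and each class
    weight is median(freq) / freq(c)."""
    class_pixels = np.zeros(n_classes)
    pixels_in_images_with_c = np.zeros(n_classes)
    for labels in label_maps:                 # one HxW label map per image
        for c in np.unique(labels):
            if c < 0:                         # e.g. skip unlabeled pixels (assumed convention)
                continue
            class_pixels[c] += (labels == c).sum()
            pixels_in_images_with_c[c] += labels.size
    freq = class_pixels / np.maximum(pixels_in_images_with_c, 1)
    return np.median(freq[freq > 0]) / np.maximum(freq, 1e-12)
```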
Results are shown in Table 6.4; we compare regular (1) and reweighted (2) versions of
our model against comparison methods. Our model outperforms all but Long et al. by
substantial margins using our smaller ImageNet model, and performs similarly to or better
than Long et al. with our larger model. We also greatly improve upon the weighted
kNN system we first developed in Chapter 3. Examples are shown in Fig. 6.5. This
demonstrates our model’s adaptability not just to different tasks but also different data.
Pascal VOC
In addition, we also verify our method using the Pascal VOC 2011 validation set. Sim-
ilarly to Long et al. [83], we train using the 2011 training set augmented with 8498
training images collected by Hariharan et al. [52], and evaluate using the 736 images
from the 2011 validation set not also in the Hariharan extra set. We perform online data
augmentations as in our NYUDepth and Sift Flow models, and use the same learning
rates. Because these images have arbitrary aspect ratio, we train our model on square
inputs, and scale the smaller side of each image to 256; at test time we apply the model
with a stride of 128 to cover the image (two applications are usually sufficient).
Results are shown in Table 6.5 and Fig. 6.6. Our model performs comparably to Long et al. [83].
Figure 6.5: Example semantic labeling results for Sift Flow. Columns: input, Model (1),
Model (2), ground truth; rows show the pixelwise maximum label and the labels blended
according to softmax outputs.
Figure 6.6: Example semantic labeling results for Pascal VOC 2011. For each image, we
show RGB input, our prediction, and ground truth.
Figure 6.7: Example semantic labeling results. Top: Maximum predicted label shown
for each pixel; Bottom: Label colors blended according to softmax outputs. For each
row, we show: (a) input image; (b) 4-class labeling result; (c) 13-class result; (d) 13-class
ground truth. Note we feed the ground-truth depth and normals along with the RGB
image as input to our labeling network.
4-Class (pixel acc.)
                   floor   struct   furntr   prop
Couprie et al.      87.3    86.1     45.3    35.5
Khan et al.         87.1    88.2     54.7    32.6
Stuckler et al.     90.7    81.4     68.1    19.8
Mueller et al.      94.9    78.9     71.1    42.7
Ours (AlexNet)      93.9    87.9     79.7    55.1
6.7.1 Contributions of Scales

We compare performance broken down according to the different scales in our model in
Table 6.7. For depth, normals and 4- and 13-class semantic labeling tasks, we train and
evaluate the model using just scale 1, just scale 2, both, or all three scales 1, 2 and 3.
For the coarse scale-1-only prediction, we replace the last fully connected layer of the
coarse stack with a fully connected layer that outputs directly to target size, i.e. a pixel
map of either 1, 3, 4 or 13 channels depending on the task. The spatial resolution is the
same as is used for the coarse features in our model, and is upsampled in the same way.
We report the “abs relative difference” measure (i.e. |D − D∗ |/D∗ ) to compare depth,
mean angle distance for normals, and pixelwise accuracy for semantic segmentation.
Contributions of Scales
                        Depth    Normals        4-Class               13-Class
                                             RGB+D+N    RGB        RGB+D+N    RGB
                        Pixelwise Error            Pixelwise Accuracy
                        (lower is better)           (higher is better)
Scale 1 only            0.218     29.7        71.5     71.5         58.1     58.1
Scale 2 only            0.290     31.8        77.4     67.2         65.1     53.1
Scales 1 + 2            0.216     26.1        80.1     74.4         69.8     63.2
Scales 1 + 2 + 3        0.198     25.9        80.6     75.3         70.5     64.0

Table 6.7: Comparison of networks for different scales for depth, normals and semantic
labeling tasks with 4 and 13 categories. Largest single contributing scale is underlined.
Effect of Depth/Normals Inputs
                      Scale 2 only             Scales 1 + 2
                   Pix. Acc.  Per-class     Pix. Acc.  Per-class
RGB only              53.1      38.3           63.2      50.6
RGB + pred. D&N       58.7      43.8           65.0      49.5
RGB + g.t. D&N        65.1      52.3           69.8      58.9

Table 6.8: Effect of ground truth versus predicted depth and normals inputs for 13-class
semantic labeling (see text).
First, we note there is progressive improvement in all tasks as scales are added (rows 1, 3,
and 4 of Table 6.7). In addition, we find the largest single contribution to performance is the
coarse Scale 1 for depth and normals, but the more local Scale 2 for the semantic tasks;
however, this is only because the depth and normals channels are introduced at Scale 2 for
the semantic labeling task. Looking at the labeling network with RGB-only inputs, we find
that the coarse scale is again the larger contributor, indicating the
importance of the global view. (Of course, this scale was also initialized with ImageNet
convolution weights that are closely related to the semantic task; however, even initializing
randomly achieves 54.5% for 13-class scale 1 only, still the largest contribution, albeit
by a smaller amount).
The fact that we can recover much of the depth and normals information from the RGB
image naturally leads to two questions: (i) How important are the depth and normals
inputs relative to RGB in the semantic labeling task? (ii) What might happen if we were
to replace the true depth and normals inputs with the predictions made by our network?
To study this, we trained and tested our network using either Scale 2 alone or both Scales
1 and 2 for the 13-class semantic labeling task under three input conditions: (a) the RGB
image only, (b) the RGB image along with predicted depth and normals, or (c) RGB plus
true depth and normals. Results are in Table 6.8. Using ground truth depth/normals has
little effect when using both scales, but yields a tangible improvement when using only
Scale 2. This indicates that much of the information the depth and normals provide for
labeling can also be extracted from the input image; thus the labeling
network can learn this same information itself, just from the label targets. However, this
supposes that the network structure is capable of learning these relations: If this is not
the case, e.g. when using only Scale 2, we do see improvement. This is also consistent
with Section 6.7.1, where we found the coarse network was important for prediction in
all tasks; indeed, supplying the predicted depth/normals to Scale 2 is able to recover part
of the performance otherwise obtained from the coarse scale.
6.8 Discussion
Together, depth, surface normals and semantic labels provide a rich account of a scene.
We have proposed a simple and fast multiscale architecture using convolutional networks
that gives excellent performance on all three modalities. The models beat existing meth-
ods on the vast majority of benchmarks we explored. This is impressive given that many
of these methods are specific to a single modality, and are often slower and more complex
than ours. As such, our model also provides a convenient new baseline for these tasks.
One drawback of this approach is that it currently requires a large number of relatively
dense pixel maps for training. Possible future extensions of this approach include adapt-
ing it to be able to use more sparsely labeled targets, as well as its application to further
tasks and domains.
In the past four chapters, we have applied convolutional networks to multiple different
pixel-map prediction tasks: denoising, depth prediction, surface normals, and seman-
tic labeling. We also saw the ConvNet approach soundly improve over the kNN scene
parsing method first described in Chapter 3. The next two chapters look into convolu-
tional architecture sizing patterns in some more detail, then explore some ideas that use
convolutional networks for unsupervised learning.
Chapter 7
Understanding Deep
Architectures using a
Recursive Convolutional Network
The work presented in this chapter appeared at the ICLR Workshops 2014 [26], and was
a joint work with Jason Rolfe, Rob Fergus and Yann LeCun.
7.1 Introduction
The previous chapters in this thesis developed convolutional network models for use
in four pixel map prediction tasks: denoising, depth-from-camera, surface normals, and
semantic labels. Each used multiple layers of convolution to make these predictions based
on the input. However, many sizing factors needed to be set in order to define the models,
including the numbers of layers, feature maps, kernel pixel width, pooling, etc. Moreover,
each of these adjusted both the size of the activation units as well as the total number of
parameters. Are there any intuitions for how different configurations affect performance
and the system’s capabilities? This chapter aims to characterize some of these in the
context of a classification task, by evaluating the independent contributions of three
interlinked variables: The numbers of layers, feature maps, and parameters.
To help separate these factors, we employ a recursive convolutional network model. This
model is equivalent to a deep convolutional network where all layers have the same number
of feature maps and the filters (weights) are tied across layers. By comparing tied and untied
models of matched sizes, we are able to tease apart these three factors that determine
performance. For example,
adding another layer increases the number of parameters, but it also puts an additional
non-linearity into the system. But would the extra parameters be better used expanding
the size of the existing layers? To provide a general answer to this type of issue is difficult
since multiple factors are conflated: the capacity of the model (and of each layer) and
its degree of non-linearity. However, we can design a recursive model to have the same
number of layers and parameters as the standard convolutional model, and thereby see
whether the number of feature maps (which differs) is important. Or we can match the
number of feature maps and parameters to see if the number of layers (and number of
non-linearities) matters.
Several recent works have found that stacks of multiple unpooled convolution layers are
essential to obtain high performance on image classification tasks, including all ImageNet
challenge winners for the past three years [73, 149, 113, 125, 116]. Hence the use of
multiple convolution layers is vital and the development of superior models relies on
understanding their properties. Our investigation in this chapter has particular bearing
in characterizing these layers, and our results corroborate the recent trend toward deeper
architectures.
We find that while increasing the numbers of layers and parameters each have clear bene-
fit, the number of feature maps (and hence dimensionality of the representation) appears
ancillary, and finds most of its benefit through the introduction of more weights. Our re-
sults (i) empirically confirm the notion that adding layers alone increases computational
power, within the context of convolutional layers, and (ii) suggest that precise sizing of
convolutional feature map dimensions is itself of little concern; more attention should be
paid to the numbers of layers and parameters.
In addition to unpooled stacks of convolutional maps, the model we employ also has
relations to recurrent neural networks. These are well-studied models [60, 111, 124],
naturally suited to temporal and sequential data. For example, they have recently been
shown to deliver excellent performance for phoneme recognition [44] and cursive hand-
writing recognition [43]. However, they have seen limited use on image data. Socher
et al. [121] showed how image segments could be recursively merged to perform scene
parsing. More recently [120], they used a convolutional network in a separate stage to
first learn features on RGB-Depth data, prior to hierarchical merging. In these models
the input dimension is twice that of the output. This contrasts with our model, in which
the input and output dimensions of each recursive layer are the same.
Our network also has links to several auto-encoder models. Sparse coding [93] uses
iterative algorithms, such as ISTA [5], to perform inference. Rozell et al. [105] showed
how the ISTA scheme can be unwrapped into a repeated series of network layers, which
can be viewed as a recursive net. Gregor & LeCun [45] showed how to backpropagate
through such a network to give fast approximations to sparse coding known as LISTA.
Rolfe & LeCun [103] then showed in their DrSAE model how a discriminative term can
be added. Our network, by contrast, uses simple rectified convolutions and does not include
the shrinkage operations of LISTA or DrSAE.
7.3 Approach
Our investigation is based on a multilayer Convolutional Network [77], for which all layers
beyond the first have the same size and connection topology. All layers use rectified linear
units (ReLU) [13, 37, 90]. We perform max-pooling with non-overlapping windows after
the first layer convolutions and rectification; however, layers after the first use no explicit
pooling. We refer to the number of feature maps per layer as M , and the number of
layers after the first as L. To emphasize the difference between the pooled first layer
and the unpooled higher layers, we denote the first convolution kernel by V and the
higher-layer kernels by W; per-map biases b are added after the convolutions. A final
classification matrix C maps the last hidden layer to softmax
inputs.
Since all hidden layers have the same size, the transformations at all layers beyond the
first have the same number of parameters (and the same connection topology). In addi-
tion to the case where all layers are independently parameterized, we consider networks
for which the parameters of the higher layers are tied between layers, so that Wi = Wj
and bi = bj for all i, j. As shown in Fig. 7.1, tying the parameters across layers renders
the deep network dynamics equivalent to a recurrence: rather than projecting through a
stack of distinct transformations, the network applies the same transformation repeatedly.
The convolutional weight sharing within each layer implies another set of ties, enforcing
translation-invariance among the parameters. This novel recursive, convolutional architecture
is reminiscent of LISTA [45], but is trained discriminatively for classification rather than to
approximate sparse coding.
We describe our models for the CIFAR-10 [72] and SVHN [91] datasets used in our
experiments; we drop the superscript n indicating the index in the dataset for notational
simplicity.
The first layer applies a set of M kernels Vm of size 8 × 8 × 3 via spatial convolution
with stride one (denoted as ∗), and per-map bias b^0_m, followed by an element-wise
rectification.
Figure 7.1: Our model architecture prior to the classification layer, as applied to CIFAR
and SVHN datasets. (a): Version with un-tied weights in the upper layers. (b): Version
with tied weights. Kernels connected by dotted lines are constrained to be identical. (c):
The network with tied weights from (b) can be represented as a recursive network.
The rectified response is then max-pooled within each feature map with non-overlapping
4 × 4 windows, producing a hidden layer Z^1 of size 8 × 8 × M:
P_m = \max\left(0,\, b^0_m + V_m * X\right), \qquad Z^1_{i,j,m} = \max_{i',j' \in \{0,\dots,3\}} P_{4i+i',\, 4j+j',\, m}
All L succeeding hidden layers maintain this size, applying a set of M kernels W^l_m of
size 3 × 3 × M, also via “same” spatial convolution with stride one, and per-map bias,
followed by rectification:

Z^l_m = \max\left(0,\, b^{l-1}_m + W^{l-1}_m * Z^{l-1}\right)
In the case of the tied model (see Fig. 7.1(b)), the kernels W l (and biases bl ) are con-
strained to be the same. The final hidden layer is subject to pixel-wise L2 normalization
Y_k = \frac{\exp(Y'_k)}{\sum_{k'} \exp(Y'_{k'})} \quad \text{where} \quad Y'_k = \sum_{i,j,m} C_{k,i,j,m} \cdot Z^{L+1}_{i,j,m} \,/\, \|Z^{L+1}_{i,j}\|
The first-layer kernels Vm are initialized from a zero-mean Gaussian distribution with
standard deviation 0.1 for CIFAR-10 and 0.001 for SVHN. The kernels of the higher layers
W^l_m are initialized to the identity transformation W_{i',j',m',m} = \delta_{i',0} \cdot \delta_{j',0} \cdot \delta_{m',m}, where
\delta is the Kronecker delta. The network is trained to minimize the logistic loss function
L = -\sum_n \log(Y^n_{k(n)}), where k(n) is the true class of the nth element of the dataset. The
parameters are not subject to explicit regularization. Training is performed via stochastic
gradient descent with minibatches of size 128, learning rate 10^{-3}, and momentum 0.9:
g \leftarrow 0.9 \cdot g + \sum_{n \in \text{minibatch}} \frac{\partial L^n}{\partial \{V, W, b\}}; \qquad \{V, W, b\} \leftarrow \{V, W, b\} - 10^{-3} \cdot g
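To make the sizing concrete, the following numpy sketch runs a forward pass of the tied model on a 32x32 RGB input, using scipy's correlate2d as a stand-in for the convolutions; shapes, padding and the flattening order for the classifier are illustrative assumptions.

```python
import numpy as np
from scipy.signal import correlate2d

def forward_tied(x, V, b0, W, b, C, L):
    """Forward pass of the tied recursive model (sketch).
    x: 32x32x3 image; V: (M, 3, 8, 8) first-layer kernels; b0: M biases;
    W: (M, M, 3, 3) shared higher-layer kernels; b: M shared biases;
    C: (10, 8*8*M) classification matrix. All names are illustrative."""
    M = V.shape[0]
    # First layer: 8x8 "same" convolution, ReLU, then 4x4 non-overlapping max pooling.
    P = np.zeros((32, 32, M))
    for m in range(M):
        acc = sum(correlate2d(x[:, :, c], V[m, c], mode='same') for c in range(3))
        P[:, :, m] = np.maximum(0.0, acc + b0[m])
    Z = P.reshape(8, 4, 8, 4, M).max(axis=(1, 3))          # 8 x 8 x M
    # L recursive layers: the same 3x3 convolution and bias applied repeatedly.
    for _ in range(L):
        Znew = np.zeros_like(Z)
        for m in range(M):
            acc = sum(correlate2d(Z[:, :, k], W[m, k], mode='same') for k in range(M))
            Znew[:, :, m] = np.maximum(0.0, acc + b[m])
        Z = Znew
    # Pixel-wise L2 normalization, then linear classifier and softmax.
    Z = Z / (np.linalg.norm(Z, axis=2, keepdims=True) + 1e-8)
    y = C @ Z.reshape(-1)
    e = np.exp(y - y.max())
    return e / e.sum()
```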
7.4 Experiments
We first provide an overview of the model’s performance at different sizes, with both
untied and tied weights, in order to examine basic trends and compare with other current
systems. For CIFAR-10, we tested the models using M = 32, 64, 128, or 256 feature
maps per layer, and L = 1, 2, 4, 8, or 16 layers beyond the first. For SVHN, we used
M = 32, 64, 128, or 256 feature maps and L = 1, 2, 4, or 8 layers beyond the first. That
we were able to train networks at these large depths is due to the initialization of all
W^l_m to the identity: this initially copies activations at the first layer up to the last layer,
and gradients from the last layer to the first. Both untied and tied models had trouble
converging at substantially greater depths, however.
Results are shown in Figs. 7.2 and 7.3. Here, we plot each condition on a grid according
to numbers of feature maps and layers. To the right of each point, we show the test error
(top) and training error (bottom). Contours show curves with a constant number of
parameters: in the untied case, the number of parameters is determined by the number
of feature maps and layers, while in the tied case it is determined solely by the number
of feature maps; Section 7.4.2 examines the behavior along these lines in more detail.
First, we note that despite the simple architecture of our model, it still achieves compet-
itive performance on both datasets, relative to other models that, like ours, do not use
any image transformations or other regularizations such as dropout [58, 139], stochastic
pooling [146] or maxout [40] (see Table 7.1). Thus our simplifications do not entail a
large loss in performance.
We also see roughly how the numbers of layers, feature maps and parameters affect
performance of these models at this range. In particular, increasing any of them tends to
improve performance (the exception is CIFAR-10 test error at 16 layers in the tied case,
which goes up slightly). We now examine the independent contributions of each of these
factors.
The numbers of feature maps M, layers L and parameters P are interrelated: Increasing
the number of feature maps or layers increases the total number of parameters in addition
to the representational power gained by higher dimensionality (more feature maps) or
greater nonlinearity (more layers). But by using the tied version of our model, we can
investigate the effects of each of these variables independently.
Figure 7.3: Classification performance on Street View House Numbers as a function of
network size, for untied (left) and tied (right) models.
Table 7.1: Comparison of our largest model architecture (measured by number of pa-
rameters) against other approaches that do not use data transformations or stochastic
regularization methods.
To accomplish this, we consider the following three cases, each of which we investigate
in the experiments below:
1. Control for M and P, vary L: Using the tied model (constant M and P), we vary
the number of layers L.
2. Control for M and L, vary P: Compare pairs of tied and untied models with the
same numbers of feature maps M and layers L. The number of parameters P
increases when going from the tied to the untied model in each pair.
3. Control for P and L, vary M: Compare pairs of untied and tied models with
the same number of parameters P and layers L. The number of feature maps M
increases when going from the untied to the tied model in each pair.
Note the number of parameters P is equal to the total number of independent weights
and biases over all layers, including initial feature extraction and classification. This is
given by the formula below for the untied model (for the tied case, substitute L = 1):

P = 8 \cdot 8 \cdot 3 \cdot M \;+\; 3 \cdot 3 \cdot M^2 \cdot L \;+\; M \cdot (L + 1) \;+\; 64 \cdot M \cdot 10 \;+\; 10
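This count is easy to reproduce in code; the sketch below implements the formula directly, with the tied case obtained by counting the shared higher-layer kernels and biases once.

```python
def num_params(M, L, tied=False):
    """Parameter count for the models in this chapter: an 8x8x3 first-layer
    kernel bank, L higher 3x3xM layers (counted once if tied), per-map biases,
    and the classification matrix over the 8x8xM final hidden layer."""
    L_eff = 1 if tied else L
    return (8 * 8 * 3 * M            # first-layer kernels V
            + 3 * 3 * M * M * L_eff  # higher-layer kernels W
            + M * (L_eff + 1)        # biases
            + 64 * M * 10 + 10)      # classifier C and its biases
```

For instance, it returns 195473 for an untied model with L = 3, M = 71 and 195058 for a tied model with L = 3, M = 108, matching the example pair quoted later in this chapter.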
We examine the first of these cases in Fig. 7.4. Here, we plot classification performance
at different numbers of layers using the tied model only, which controls for the number
of parameters. A different curve is shown for different numbers of feature maps. For
both CIFAR-10 and SVHN, performance gets better as the number of layers increases,
although there is an upward tick at 8 layers for CIFAR-10 test error. The predominant
cause of this appears to be overfitting, since the training error still goes down. At these
depths, therefore, adding more layers alone tends to increase performance, even though
no additional parameters are introduced. This is because additional layers allow the
network to perform additional nonlinear computation.
This conclusion is further supported by Fig. 7.5, which shows performance of the untied
model according to numbers of parameters and layers. Note that vertical cross-sections
of this figure correspond to the constant-parameter contours of Fig. 7.2. Here, we can
also see that for any given number of parameters, the best performance is obtained with
a deeper model. The exception to this is again the 8-layer models on CIFAR-10 test
error, where overfitting appears.
Experiment 1a: Error by Layers and Features (tied model); test error (top) and training error (bottom).
Figure 7.4: Comparison of classification error for different numbers of layers in the tied
model. This controls for the number of parameters and features. We show results for
both (a) CIFAR-10 and (b) SVHN datasets.
To vary the number of parameters P while holding fixed the number of feature maps
M and layers L, we consider pairs of tied and untied models where M and L remain
the same within each pair. The number of parameters P is then greater for the untied
model.
The result of this comparison is shown in Fig. 7.6. Each point corresponds to a model
pair; we show classification performance of the tied model on the x axis, and performance
of the untied model on the y axis. Since the points fall below the y = x line, classification
performance is better for the untied model than it is for the tied. This is not surprising,
since the untied model has more total parameters and thus more flexibility. Note also
that the two models converge to the same test performance as classification gets better —
this is because for the largest numbers of L and M, both models have enough flexibility
to fit the data well.
Figure 7.5: Experiment 1b: Error by Parameters and Layers (untied model); test error (top) and training error (bottom).
We now consider the third condition from above, the effect of varying the number of
feature maps M while holding fixed the numbers of layers L and parameters P .
For a given L, we find model pairs whose numbers of parameters P are very close by
varying the number of feature maps. For example, an untied model with L = 3 layers
and M = 71 feature maps has P = 195473 parameters, while a tied model with L = 3
layers and M = 108 feature maps has P = 195058 parameters — a difference of only
0.2%. In this experiment, we randomly sampled model pairs having the same number
of layers, and where the numbers of parameters were within 1.0% of each other.
Experiment 2: Same Feature Maps and Layers, Varied Parameters; test error (top) and training error (bottom).
Figure 7.6: Comparison of classification error between tied and untied models, control-
ling for the number of feature maps and layers. Linear regression coefficients are in the
bottom-right corners.
We considered models where the number of layers beyond the first was between 2 and 8, and
the number of feature maps was between 16 and 256 (for CIFAR-10) or between 16 and
Fig. 7.7 shows the results. As before, we plot a point for each model pair, showing
classification performance of the tied model on the x axis, and of the untied model on
the y axis. This time, however, each pair has fixed P and L, and tied and untied models
differ in their number of feature maps M . We find that despite the different numbers of
feature maps, the tied and untied models perform about the same in each case. Thus,
the number of feature maps itself appears to have little effect on performance.
Experiment 3: Same Parameters and Layers, Varied Feature Maps; test error (top) and training error (bottom).
Figure 7.7: Comparison of classification error between tied and untied models, con-
trolling for the number of parameters and layers. Linear regression coefficients in the
bottom-right corners.
7.5 Discussion
Above we have demonstrated that while the numbers of layers and parameters each
have clear effects on performance, the number of feature maps has little effect once the
numbers of layers and parameters are accounted for. One might expect the dimensionality
of the representation to be important for performance; instead we find that convolutional
layers are insensitive to this size.
This observation is also consistent with Fig. 7.5: Allocating a fixed number of parameters
across multiple layers tends to increase performance compared to putting them in few
layers, even though this comes at the cost of decreasing the feature map dimension.
This is precisely what one might expect if the number of feature maps had little effect
on performance.
Our analysis employed a special tied architecture and comes with some important caveats,
however. First, while the tied architecture serves as a useful point of comparison leading
to several interesting conclusions, it is new and thus its behaviors are still relatively un-
known compared to the common untied models. This may particularly apply to models
with a large number of layers (L > 8), or very small numbers of feature maps (M < 16),
which have been left mostly unexamined. Second, our experiments all used a simplified
architecture, with just one layer of pooling. While we believe the principles found in
our experiments are likely to apply in more complex cases as well, this is unclear and
requires further investigation to confirm. Nevertheless, many recent systems make heavy
use of unpooled convolution stacks like the ones studied here, suggesting our conclusions
are relevant to this case.
The results we have presented provide empirical confirmation within the context of con-
volutional layers that increasing layers alone can yield performance benefits (Experiment
1a). They also indicate that filter parameters may be best allocated in multilayer stacks
(Experiments 1b and 3), even at the expense of having fewer feature maps. In conjunc-
tion with this, we find the feature map dimension itself has little effect on convolutional
layers’ performance, with most sizing effects coming from the numbers of layers and
parameters (Experiments 2 and 3). Thus, focus is best placed on these variables when
designing convolutional network architectures.
Chapter 8
Convolutional Unsupervised
Methods
The work presented in this chapter is currently unpublished. Section 8.2 is a joint work
with Rob Fergus; Sections 8.3 and 8.4 are collaborations with Jason Rolfe, Rob Fergus
and Yann LeCun.
8.1 Introduction
First, we adapt sparse coding inference to a convolutional setting and approximate it with
a fast feed-forward network. Second, we apply an entropy cost that causes convolutional
features to organize into two different types, prototype templates and deformations; this
factors many higher-level edge features away from the lowest pixel-level information
necessary for reconstruction. Third, we outline how ZCA whitening can be adapted for
efficient convolutional application to large images.
8.2 Convolutional LISTA Autoencoder
The Convolutional LISTA Autoencoder combines ideas from Gregor et al., “Learned
ISTA” [45] and Zeiler et al., “Deconvolutional Networks” [147, 148]. We first adapt the
LISTA network described in [45] to be convolutional, and use it as the encoder half of
an autoencoder: Rather than training the encoder to predict true sparse codes found by
a separate iterative algorithm, we train the encoder and decoder jointly with reconstruction
and sparsity objectives.
8.2.1 Background
Before describing our LISTA autoencoder network, we first review the basics of ISTA and
LISTA. Given an input x and (convolutional) filter dictionary W , the Iterative Shrinkage
and Thresholding Algorithm (ISTA) [18, 105, 5] finds codes z that minimize the sparse
coding objective \frac{1}{2}\|W^T * z - x\|_2^2 + \lambda|z|_1, producing codes that are both sparse (have few nonzero
elements) and reconstruct the input well. The procedure is based on a gradient descent
on z:

z^0 = 0, \qquad z^{k+1} = \mathrm{sh}_\alpha\!\left(z^k + \eta\, W * (x - W^T * z^k)\right)          (8.1)

where sh_\alpha is a shrink and threshold operation, i.e. sh_\alpha(z) = \max(0, z - \alpha) if the codes z are constrained
to be nonnegative. The scalars \alpha and \eta come from the strength \lambda of the sparsity term
and the gradient step size.
Each step of ISTA adds to the code z a small amount of the input filtered by W , thus
increasing the activations of matching filters, while at the same time subtracting out
activations of features similar to one another via the lateral inhibition term S (equal to
the identity minus \eta W * W^T in the unrolled form). Converging to the optimal
activations, however, often takes a fairly large (e.g. 100 or more) number of iterations.
Learned ISTA (LISTA) [45] modifies this by decoupling the inhibition weights S from the
dictionary W , allowing each to be trained separately. The inference procedure Eqn. 8.1
is unrolled for a fixed small number of steps (e.g. five), and trained to predict the true
sparse codes z of a training dataset via backpropagation. The network can use the
weight relaxations to learn to “shortcut” many of the steps needed for inference, so
that it quickly produces codes with few iterations. This comes at the cost of restricted
generalizability: the inference model is trained on a training set and may perform well
only on inputs from a small subset of the full unrestricted space, e.g. the set of natural
images versus the set of all possible pixel arrays.
Our convolutional LISTA autoencoder, depicted in Fig. 8.1(a), uses a LISTA network
as the encoder half of an autoencoder. Rather than train a LISTA model to predict
true sparse codes, we further extend the network with separate encoder and decoder
filters W_e and W_d, along with the lateral inhibition matrix S. The encoder runs the
LISTA iteration for a small fixed number of steps n, and the decoder then performs a
linear convolution back to the input space using W_d:

z^0 = 0, \qquad z^{k+1} = \mathrm{sh}_\alpha\!\left(W_e * x + S * z^k\right), \qquad x' = W_d^T * z^n
Figure 8.1: Convolutional LISTA Autoencoder architecture. (a) Single layer. (b) A stack
of two layers, with max-pooling switches transfered between encoder and decoder.
The operations performed by this network are thus similar to ISTA sparse coding
inference, but with separately trainable encoder, decoder and lateral inhibition kernels.
For training, we use an l2 reconstruction and l1 sparsity-inducing loss, L(x) = \frac{1}{2}\|x' -
x\|_2^2 + \lambda|z_n|_1. We also constrain the decoder weights so that each filter is unit norm.
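The sketch below runs such an encoder-decoder pass, assuming the rectified iteration written above and a plain transposed-filter decoder; kernel shapes and the absence of kernel flipping in the "transpose" step are simplifying assumptions.

```python
import numpy as np
from scipy.signal import correlate2d

def conv(img, filters):
    """'Same' correlation of an HxWxCin array with (Cout, Cin, k, k) filters."""
    out = np.zeros((img.shape[0], img.shape[1], filters.shape[0]))
    for m in range(filters.shape[0]):
        out[:, :, m] = sum(correlate2d(img[:, :, c], filters[m, c], mode='same')
                           for c in range(img.shape[2]))
    return out

def lista_autoencoder(x, We, S, Wd, alpha, n_iters=5):
    """Sketch of the convolutional LISTA autoencoder forward pass, assuming the
    encoder update z <- max(0, We*x + S*z - alpha) and the linear decoder
    x' = Wd^T * z; the exact update in the thesis may differ in detail."""
    drive = conv(x, We)                      # filtered input, computed once
    z = np.zeros_like(drive)
    for _ in range(n_iters):
        z = np.maximum(0.0, drive + conv(z, S) - alpha)
    # Decoder: swap the (out, in) axes of Wd to map codes back to pixel space.
    x_rec = conv(z, Wd.transpose(1, 0, 2, 3))
    return z, x_rec
```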
Another variation we also tried uses an inhibition kernel S_p that projects back
to the input (pixel) space, rather than operating between features in the code space. This
generally has fewer connections and is faster, and the inhibition kernels can be visualized
directly in pixel space.
Filters learned using n = 5 iterations on MNIST and CIFAR-10 are shown in Fig. 8.2.
Figure 8.2: Filters learned using MNIST and CIFAR-10 (for MNIST, red is positive and
blue negative). (a) Decoder Wd , (b) Encoder We , (c) Inhibition Sp . The encoder has
negative “shadows” around positive stroke centers that help turn off the activation as soon
as the filter becomes unaligned; the inhibition kernels subtract out the stroke as it is
explained away.
We can stack multiple LISTA autoencoders together, combined with spatial max-pooling
layers, to form a deep autoencoder network (Fig. 8.1(b)). This stacking is similar to that
in [147, 148], but uses LISTA autoencoders instead of iterative sparse coding inference.
Since the feed-forward encoder network performs many fewer iterations (about one one-
hundredth the number), it is much faster. Yet since it has a trained inference network,
it can still produce good codes. To build a stack, we first train a single autoencoder as
described in the previous section on the input images, using five z-iterations. This produces
first-layer codes, which we max-pool. We then
sequentially train a stack of two additional autoencoders and pooling layers, by minimizing the
reconstruction error of the pooled maps immediately below. This forms a stack of three
autoencoder applications.
Classification Error
Layer 1 0.42
Layer 2 0.37
Layer 3 0.34
Table 8.1: Classification error on CIFAR-10 using features from each layer of a Convo-
lutional LISTA Autoencoder. We train a single-layer softmax on top of fixed features
from each layer to evaluate their relation with semantic class labels. By comparison, a
similar fully-supervised network achieves an error of around 0.15.
We show reconstructions from the top 10 activations of each feature map in Figures 8.3
and 8.4, when trained on CIFAR-10 with ZCA whitening. We see that the network is
able to capture extended edges, as well as corners and edge or color combinations, at a
range of spatial scales.
Fixing the features at each layer, we also trained a 10-unit softmax classifier on top of
each layer’s features to check their relationship to predicting semantic labels. Results are
in Table 8.1: Classification performance improves between the first and second layers,
and the third layer is slightly better yet, though not by much.
Figure 8.3: Layer 2 reconstructions. We show reconstructions for the top 10 activations
for each second layer unit across the test set. The decoder unpools using the pooling
locations determined by the encoder.
Figure 8.4: Layer 3 reconstructions. We show reconstructions for the top 10 activations for each third layer unit across the test set.
The decoder unpools through two pooling layers, using the locations determined in the encoder pooling layers.
8.3 Entropy Prototypes
A closely related model, although not convolutional, is that of Rolfe et al. [103], which adds
a classification objective in addition to reconstruction and sparsity. They find that with all
three objective terms,
weights are learned so that the activation units organize themselves into “prototypes”
and “deformation” units. Prototypes capture a mean template for each class, while
deformations edit the template to better reconstruct the input. The prototype units
display several characteristics that differentiate them from the deformations: They are
fewer in number, activate with higher magnitude, and turn on later in the network stack.
In this section, we describe a system that learns units with similar characteristics
convolutionally, substituting an unsupervised per-location entropy cost for
the classification cost in [103]. Using this entropy cost, we can find convolutional “pro-
totypes” and “deformations”; applied with strided convolutions, these can factor out
larger low-level structures from small pixel details at each input window.
Given an input x, we use the convolutional LISTA autoencoder described in Section 8.2.2
to find a code z and reconstruction x'; however, the loss including the entropy term is now

L(x) = \frac{1}{2}\|x' - x\|_2^2 + \lambda|z|_1 + \mu H\!\left(\frac{e^{z}}{\sum_k e^{z_k}}\right), \qquad \text{where } H(y) = -\sum_{i,j,k} y_{ijk} \log y_{ijk}
That is, the additional cost tries to minimize the entropy of the features (indexed by k) at
each spatial location i, j. This happens when each spatial location has a strong activation
on only one (or a few) of its features. The cost is thus similar to a
cross-entropy loss, substituting the feature activations themselves for the ground-truth
class labels.
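A small sketch of this per-location entropy term (the l2 bound on the encoder weights and the pre-softmax normalization described in the next paragraph are omitted here):

```python
import numpy as np

def entropy_cost(z):
    """Softmax the code z across the feature dimension at each spatial location,
    then sum the entropies H(y) over locations. z: HxWxK feature activations."""
    e = np.exp(z - z.max(axis=2, keepdims=True))   # numerically stable softmax
    y = e / e.sum(axis=2, keepdims=True)
    return -(y * np.log(y + 1e-12)).sum()
```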
We also bound the l2 norm of the encoder weights We , and normalize z across the
feature dimension at each location before passing it to the softmax in the entropy cost.
These prevent the entropy term being satisfied by a degenerate case where a very large
activation is always placed on a single z unit at each spatial location (the corresponding
decoder for that unit is a self-canceling convolution kernel with checkers of positive and
negative values).
We trained this model on both MNIST and NORB. In each case, we first pretrained the
autoencoder using just the reconstruction and sparsity terms, then added in the entropy
cost. We used 8x8 kernels with a stride of 4 for the convolutions, so that in fact they
encoded overlapping tiles. NORB was preprocessed using convolutional ZCA (described
in Section 8.4).
We show the filters the system learns in Fig. 8.5, sorted descending by mean nonzero
activation. For MNIST, the first few units (corresponding to prototypes) are all-positive
with uniform thickness. The deformation units that follow have both positive and nega-
tive values, which often edit the stroke thickness. For NORB, the prototypes are simple
edges, while deformation units are more complex. We also plot the mean nonzero acti-
vations for each unit in Fig. 8.6. For each dataset, there is a distinctive split between
prototype units, which have a large activation, and the deformations, which tend to be
smaller.
Reconstructions are shown in Figures 8.7 and 8.8. We show the pixel-space reconstruction
using only prototype units (chosen using a threshold on the mean nonzero activation),
only deformation units, and both. For MNIST, strokes tend to become more uniform
in thickness for prototype-only reconstruction, and are more limited in orientation due
to the more discrete set of prototypes. For NORB, prototype-only reconstructions also
use a more limited set of edges, appearing similar to line drawings. Thus many pixel-
level details are explained by the deformation units, allowing essential templates to be
captured by the prototypes.
(Panels show decoder and encoder weights for MNIST and for NORB.)
Figure 8.5: Autoencoder weights trained with per-location entropy cost, sorted descend-
ing by mean nonzero activation. Prototype units appear in the beginning. For MNIST,
these are all-positive with around uniform thickness; the deformation units that follow
have both positive and negative values, often editing the stroke thickness.
(Left: MNIST; right: NORB.)
Figure 8.6: Mean nonzero activations: For each unit, we plot the mean of its activation
out of times it is nonzero (blue, left axis). We also show the fraction of the time the
unit is active when any unit at the same location is active (red, right axis). Units are
sorted according to mean nonzero activation. We find that there are two distinct kinds
of units, corresponding to prototypes and deformations.
Figure 8.7: Reconstructions of MNIST with per-location entropy cost. For each image,
we show (i) reconstruction using prototypes, (ii) using only deformations, (iii) recon-
struction with all units, (iv) original input image.
Figure 8.8: Reconstructions of NORB with per-location entropy cost. For each image,
we show (i) reconstruction using prototypes, (ii) using only deformations, (iii) recon-
struction with all units.
8.4 Convolutional ZCA Whitening
Whitening ensures that low spatial frequencies do not dominate reconstruction costs; this enables unsupervised methods based
on reconstruction to learn codes that represent the full spectrum of the image, rather
than just constant regions. An effective whitening transformation is ZCA [72], which
constructs a linear transformation that explicitly scales each singular value to be 1. How-
ever, this transform is applied to the entire image, and requires the covariance between
all pixels. While reasonable for small images such as the 32x32 CIFAR, this is expensive
for larger images. Instead, we compute a ZCA transformation for a sample of local image
patches across the dataset, and then apply
this transform to every patch in a larger image. We then use the center pixel of each
ZCA’d patch to create the conv-ZCA output image. The operations of applying local
ZCA and selecting the center pixel can be combined into a single convolution kernel,
resulting in the following algorithm (explained using RGB inputs and 9x9 kernel):
1. Sample a set of 9x9 RGB patches from the dataset.
2. Compute the mean and covariance of the flattened patches.
3. Compute the PCA of the covariance, keeping the top components: eigenvectors V
(reshaped to components × colors × 9 × 9) and eigenvalues D.
4. For each pair of colors (c_i, c_j), set k[c_j, c_i, :, :] = V[:, c_j, x_0, y_0]^T D^{-1/2} V[:, c_i, :, :],
where (x_0, y_0) is the center pixel location (e.g. (5,5) for a 9x9 kernel).
Note the matrix multiplies in step 4 work on the PCA dimension of V and are “broad-
casted” over each spatial component of the V on the right (which maps the input to the
eigenspace).
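A numpy sketch of these steps, assuming patches are flattened in (color, row, column) order; the epsilon added to the eigenvalues is a safeguard not mentioned in the text.

```python
import numpy as np

def conv_zca_kernel(patches, n_components=50, ksize=9, n_colors=3):
    """Build the convolutional ZCA kernel from sampled patches.
    patches: (N, n_colors*ksize*ksize) flattened local patches, assumed to be
    ordered (color, row, col). Returns k of shape (n_colors, n_colors, ksize,
    ksize); a 'same' convolution with k approximates per-patch ZCA output at
    each center pixel."""
    X = patches - patches.mean(axis=0)
    cov = X.T @ X / X.shape[0]
    evals, evecs = np.linalg.eigh(cov)
    idx = np.argsort(evals)[::-1][:n_components]           # top components
    D = evals[idx]
    V = evecs[:, idx].T.reshape(n_components, n_colors, ksize, ksize)
    x0 = y0 = ksize // 2                                    # center pixel
    k = np.zeros((n_colors, n_colors, ksize, ksize))
    for ci in range(n_colors):
        for cj in range(n_colors):
            # k[cj, ci] = V[:, cj, x0, y0]^T D^{-1/2} V[:, ci, :, :]
            k[cj, ci] = np.tensordot(V[:, cj, x0, y0] / np.sqrt(D + 1e-8),
                                     V[:, ci], axes=(0, 0))
    return k
```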
We show the top 100 singular values for a random sample of 1M patches both before and
after convolutional ZCA processing in Fig. 8.9, using the RGB part of the NYU Depth
v2 dataset, rescaled to 320x240. We display plots for both 9x9 and 15x15 patches, but
used the same 9x9 ZCA kernel in each case. The kernel was computed using 10M 9x9 patches.
Whitening kernels found using this method on NORB (grayscale, 96x96) and NYU Depth
v2 (using RGB only, rescaled to 320x240) are shown in Fig. 8.10. In both cases we use
9x9 kernels and keep 50 PCA components. We show transformed images in Fig. 8.11.
Figure 8.9: Top 100 singular values for 9x9 and 15x15 patches, before and after applying
convolutional ZCA with 9x9 filters and 50 PCA components, using the RGB part of NYU
Depth v2 rescaled to 320x240.
Figure 8.10: Convolutional ZCA whitening kernels, trained on NORB (left) and the
RGB part of NYU Depth v2 (right).
Figure 8.11: Example images before and after convolutional ZCA processing, for NORB
(top) and NYU Depth RGB (bottom).
Chapter 9
Conclusion
In this thesis we have developed convolutional network models that infer 2D pixel maps
for a variety of tasks, as well as exploring several related systems. In particular, we have
made the following contributions:

1. We present a weighted kNN scene parsing system (Chapter 3) that serves as a point of
comparison for our later convolutional models.

2. We remove dirt and raindrops present on an intervening glass pane using a local
convolutional network, and show that training on corresponding pairs of corrupted and
clean images is effective for this task.

3. We introduce a new system for predicting depth from a single image that integrates
global and local views using a series of convolutional networks applied at different scales.

4. We generalize this architecture so that it is easily adapted to predict many types of 2D
outputs effectively from a single image, and apply it to predict depth, surface normals and
semantic labels. Our model uses a global-scale network whose field of view includes the
entire image area to find low-resolution feature maps, then refines predictions through a
series of progressively finer scales.

5. We characterize the contributions of depth, parameters and feature maps by employing
a recursive classification network, and find that higher depth alone can result in higher
performance while the number of feature maps is comparatively unimportant once depth
and parameters are accounted for.

6. We develop convolutional unsupervised methods: a convolutional LISTA autoencoder,
and an entropy objective that factors prototype template features away from reconstruction
details. We also construct a convolutional ZCA whitening operation that can be applied
efficiently to large images.
While the systems we presented are able to tackle a variety of problems, there are several
limitations and directions for future work.
Firstly, our systems require a relatively large amount of densely labeled data for train-
ing. While this is fairly inexpensive to acquire for depth and surface normals, it is more
difficult to obtain for semantic tasks, often requiring detailed human annotations. Some
of this might be relieved by handling sparser target maps, where many pixels are unla-
beled. While we can handle relatively small unlabeled regions by excluding them from
the training loss, it is unclear how well our methods would perform if most of the data
were unlabeled. Unlabeled regions bordering labeled ones are also a complication,
since the precise boundary may not be available in the label map. Furthermore, not all
unlabeled regions can be used directly as negative examples, since the model will eventually
need to label such regions, which may in fact contain positive instances.
In addition, our models have only a limited interchange between top-down and bottom-up
information, with top-down predictions influencing the bottom-up flow: The network looks
back to the original image when refining the
output of coarser layers, thus the coarse prediction is recombined with a bottom-up sig-
nal from the image to produce each finer-scale prediction. However, the new finer-scale
prediction cannot then be used to influence the coarser scale — there is no continual cir-
culation between the scales and layers. Such signals might help disambiguate cases with
multiple interpretations, and enable the system to predict each individually, rather than
an average that splits the difference in error. For example, Deep Boltzmann Machines
can achieve this through iterative inference [106], settling on single interpretations rather
than averaging between them. Such methods may help in similar ways for our case.
Another possible extension is to use losses different from pixelwise accuracy. Pixelwise
loss leads to predictions that essentially average nearby plausible outputs in pixel space.
Other losses might push these averages into a more complex space, leading to qualita-
tively different errors. For instance, adversarial networks [39] have very recently been
used to generate small images. One might imagine applying this as a loss, feeding the
output of our network concatenated with the RGB input to a second discriminator net-
work, which attempts to distinguish it from the true data. Our network might then be
trained to produce outputs the discriminator cannot tell apart from real examples.
Additionally, it may be possible to use data from device sensors as a means for “unsuper-
vised” feature learning. For instance, depth maps can be captured automatically, at less
expense than hand-annotated labels, yet predicting them still requires extracting many
relevant features. These representations may be able to generalize to other tasks better
than features found by reconstruction (although worse than labels created with the new
task in mind). Learning from sensor data may fall between “supervised” learning with
human annotations, and “unsupervised” learning in which only the inputs are available.
Finally, convolutional networks need not be restricted to images. There has already been
much work using ConvNets for speech [57], text [14, 15] and video [127], and the idea of a
convolution can even be generalized to non-grid topologies [8], showing promise for a range
of other graphs such as 3D meshes and other structured domains.
Bibliography
[1] G. Alain and Y. Bengio. What regularized auto-encoders learn from the data
[2] M. H. Baig and L. Torresani. Coarse-to-fine depth estimation from a single image
[3] P. Baldi and K. Hornik. Neural networks and principal component analysis: Learn-
ing from examples without local minima. Neural networks, 2(1):53–58, 1989.
[4] P. Barnum, S. Narasimhan, and K. Takeo. Analysis of rain and snow in frequency
[7] Y.-L. Boureau, J. Ponce, and Y. LeCun. A theoretical analysis of feature pooling
[8] J. Bruna, W. Zaremba, A. Szlam, and Y. LeCun. Spectral networks and locally
[9] H. Burger, C. Schuler, and S. Harmeling. Image denoising: Can plain neural
[10] H. Burger, C. Schuler, and S. Harmeling. Image denoising with multi-layer per-
IJCAI, 2011.
[13] A. Coates and A. Y. Ng. The importance of encoding versus training with sparse
[14] R. Collobert. Deep learning for efficient discriminative parsing. In AISTATS, 2011.
[17] K. Dabov, A. Foi, V. Katkovnik, and K. Egiazarian. Image denoising with block-
[19] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-fei. Imagenet: A large-scale
Hall, 1992.
[21] B. Dong, H. Ji, J. Li, Z. Shen, and Y. Xu. Wavelet frame based blind image
[22] D. Eigen and R. Fergus. Nonparametric image parsing using adaptive neighbor
[23] D. Eigen and R. Fergus. Predicting depth, surface normals and semantic labels with
2014.
[24] D. Eigen, D. Krishnan, and R. Fergus. Restoring an image taken through a window
[25] D. Eigen, C. Puhrsch, and R. Fergus. Depth map prediction from a single image
[27] M. Elad and M. Aharon. Image denoising via learned dictionaries and sparse
[28] C. Farabet, C. Couprie, L. Najman, and Y. LeCun. Scene parsing with multiscale
[30] D. F. Fouhey, A. Gupta, and M. Hebert. Data-driven 3d primitives for single image
ECCV, 2014.
[32] A. Frome, Y. Singer, and J. Malik. Image retrieval and classification using local
[33] K. Fukushima. Neocognitron: A self-organizing neural network model for a mech-
1980.
[34] K. Garg and S. Nayar. Detection and removal of rain from videos. In CVPR, pages
528–535, 2004.
[35] P. V. Gehler and S. Nowozin. On feature combination for multiclass object classi-
[36] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun. Vision meets robotics: The kitti
[37] X. Glorot, A. Bordes, and Y. Bengio. Deep sparse rectifier networks. In AISTATS,
[41] S. Gould, R. Fulton, and D. Koller. Decomposing a scene into geometric and
[44] A. Graves, A. Mohamed, and G. Hinton. Speech recognition with deep recurrent
[45] K. Gregor and Y. LeCun. Learning fast approximations of sparse coding. In ICML,
2010.
Due to Dirty Camera Lenses and Thin Occluders. SIGGRAPH Asia, Dec 2009.
[49] S. Gupta, R. Girshick, P. Arbeláez, and J. Malik. Learning rich features from rgb-d
and Y. LeCun. Learning long-range vision for autonomous off-road driving. Journal
[55] G. Heitz and D. Koller. Learning spatial context: using stuff to find things. In
CVPR, 2008.
[56] A. Hermans, G. Floros, and B. Leibe. Dense 3d semantic mapping of indoor scenes
[57] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A.-r. Mohamed, N. Jaitly, A. Senior,
modeling in speech recognition: The shared views of four research groups. Signal
arXiv:1207.0580, 2012.
9(8):1735–1780, 1997.
[61] D. Hoiem, A. Efros, and M. Hebert. Closing the loop on scene interpretation. In
CVPR, 2008.
[62] D. Hoiem, A. A. Efros, and M. Hebert. Automatic photo pop-up. In ACM SIG-
[63] G. B. Huang and V. Jain. Deep and wide multiscale recursive networks for robust
[64] D. H. Hubel and T. N. Wiesel. Receptive fields, binocular interaction and functional
architecture in the cat’s visual cortex. The Journal of physiology, 160(1), 1962.
[65] V. Jain and S. Seung. Natural image denoising with convolutional networks. In
NIPS, 2008.
[67] K. Jarrett, K. Kavukcuoglu, M. Ranzato, and Y. LeCun. What is the best multi-
[68] K. Karsch, C. Liu, S. B. Kang, and N. England. Depth extraction from video using
tures through topographic filter maps. In Computer Vision and Pattern Recogni-
tion, 2009. CVPR 2009. IEEE Conference on, pages 1605–1612. IEEE, 2009.
arXiv:1312.3429v2, 2013.
[72] A. Krizhevsky. Learning multiple layers of features from tiny images. Technical
[74] L. Ladicky, J. Shi, and M. Pollefeys. Pulling things out of perspective. In CVPR,
2014.
[75] L. Ladickỳ, B. Zeisl, and P. Marc. Discriminatively trained dense surface normal
[76] S. Lazebnik, C. Schmid, and J. Ponce. Beyond bags of features: Spatial pyramid
2006.
[78] A. Levin, R. Fergus, F. Durand, and W. T. Freeman. Image and depth from a
[79] A. Levin and B. Nadler. Natural image denoising: Optimality and inherent bounds.
In CVPR, 2011.
[80] C. Liu, J. Yuen, and A. Torralba. Nonparametric scene parsing: label transfer via
[81] C. Liu, J. Yuen, A. Torralba, J. Sivic, and W. Freeman. Sift flow: dense corre-
[82] F. Liu, C. Shen, and G. Lin. Deep convolutional neural fields for depth estimation
[83] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic
1999.
[87] R. Memisevic and C. Conrad. Stereopsis via deep learning. In NIPS Workshop on
[88] J. Michels, A. Saxena, and A. Y. Ng. High speed obstacle avoidance using monoc-
[90] V. Nair and G. E. Hinton. Rectified linear units improve restricted boltzmann
[91] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng. Reading digits
[92] A. Oliva and A. Torralba. Modeling the shape of the scene: a holistic representation
[93] B. A. Olshausen and D. J. Field. Sparse coding with an overcomplete basis set: A
[94] M. Osadchy, Y. Le Cun, and M. L. Miller. Synergistic face detection and pose es-
tional probabilities from raw speech signal using convolutional neural networks. In
Interspeech, 2013.
[97] S. Paris and F. Durand. A fast approximation of the bilateral filter using a signal
[98] P. Pinheiro and R. Collobert. Recurrent convolutional neural networks for scene
scale mixtures of Gaussians in the wavelet domain. IEEE Trans Image Processing,
[101] X. Ren, L. Bo, and D. Fox. Rgb-(d) scene labeling: Features and algorithms. In
CVPR, 2012.
[103] J. Rolfe and Y. LeCun. Discriminative recurrent sparse auto-encoders. In ICLR,
2013.
[104] M. Roser and A. Geiger. Video-based raindrop detection for improved image reg-
[107] A. Saxena, S. H. Chung, and A. Y. Ng. Learning depth from single monocular
[108] A. Saxena, M. Sun, and A. Y. Ng. Make3d: Learning 3-d scene structure from a
ICLR, 2013.
[114] J. Shotton, M. Johnson, and R. Cipolla. Semantic texton forests for image cate-
[115] N. Silberman, D. Hoiem, P. Kohli, and R. Fergus. Indoor segmentation and support
[116] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale
[118] N. Snavely, S. M. Seitz, and R. Szeliski. Photo tourism: Exploring photo collections
in 3d. 2006.
[121] R. Socher, C. C. Lin, A. Y. Ng, and C. D. Manning. Parsing natural scenes and
Dropout: A simple way to prevent neural networks from overfitting. The Journal
[124] I. Sutskever and G. Hinton. Temporal kernel recurrent neural networks. Neural
abs/1409.4842, 2014.
[126] C. Szegedy, A. Toshev, and D. Erhan. Deep neural networks for object detection.
Springer, 2010.
[129] J. Tighe and S. Lazebnik. Finding things: Image parsing with regions and per-
[130] C. Tomasi and R. Manduchi. Bilateral filtering for gray and color images. In
CVPR, 1998.
network and a graphical model for human pose estimation. NIPS, 2014.
large database for non-parametric object and scene recognition. IEEE PAMI,
[134] Z. Tu. Auto-context and its application to high-level vision tasks. In CVPR, 2008.
[135] P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol. Extracting and com-
3408, 2010.
[137] C. Vondrick, A. Khosla, T. Malisiewicz, and A. Torralba. Hoggles: Visualizing
abs/1411.5309, 2014.
[140] A. Wang, J. Lu, G. Wang, J. Cai, and T.-J. Cham. Multi-modal unsupervised
[141] X. Wang, L. Zhang, L. Lin, Z. Liang, and W. Zuo. Deep joint task learning for
2005.
[143] J. Xie, L. Xu, and E. Chen. Image denoising and inpainting with deep neural
[145] J. Zbontar and Y. LeCun. Computing the stereo matching cost with a convolutional
CVPR, 2010.
[148] M. Zeiler, G. Taylor, and R. Fergus. Adaptive deconvolutional networks for mid
[149] M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks.
In ECCV, 2014.
[150] S. Zhang and E. Salari. Image denosing using a neural network based non-linear
[151] C. Zhou and S. Lin. Removal of image artifacts due to sensor dust. In CVPR,
2007.
[152] S. C. Zhu and D. Mumford. Prior learning and gibbs reaction-diffusion. PAMI,
19(11):1236–1250, 1997.
[153] D. Zoran and Y. Weiss. From learning models of natural image patches to whole