
Context Encoders: Feature Learning by Inpainting

Deepak Pathak Philipp Krähenbühl Jeff Donahue Trevor Darrell Alexei A. Efros
University of California, Berkeley
{pathak,philkr,jdonahue,trevor,efros}@cs.berkeley.edu
arXiv:1604.07379v2 [cs.CV] 21 Nov 2016

Abstract

We present an unsupervised visual feature learning algorithm driven by context-based pixel prediction. By analogy with auto-encoders, we propose Context Encoders – a convolutional neural network trained to generate the contents of an arbitrary image region conditioned on its surroundings. In order to succeed at this task, context encoders need to both understand the content of the entire image, as well as produce a plausible hypothesis for the missing part(s). When training context encoders, we have experimented with both a standard pixel-wise reconstruction loss, as well as a reconstruction plus an adversarial loss. The latter produces much sharper results because it can better handle multiple modes in the output. We found that a context encoder learns a representation that captures not just appearance but also the semantics of visual structures. We quantitatively demonstrate the effectiveness of our learned features for CNN pre-training on classification, detection, and segmentation tasks. Furthermore, context encoders can be used for semantic inpainting tasks, either stand-alone or as initialization for non-parametric methods.

Figure 1: Qualitative illustration of the task. Given an image with a missing region (a), a human artist has no trouble inpainting it (b). Automatic inpainting using our context encoder trained with L2 reconstruction loss is shown in (c), and using both L2 and adversarial losses in (d). Panels: (a) Input context, (b) Human artist, (c) Context Encoder (L2 loss), (d) Context Encoder (L2 + Adversarial loss).
1. Introduction

Our visual world is very diverse, yet highly structured, and humans have an uncanny ability to make sense of this structure. In this work, we explore whether state-of-the-art computer vision algorithms can do the same. Consider the image shown in Figure 1a. Although the center part of the image is missing, most of us can easily imagine its content from the surrounding pixels, without having ever seen that exact scene. Some of us can even draw it, as shown in Figure 1b. This ability comes from the fact that natural images, despite their diversity, are highly structured (e.g. the regular pattern of windows on the facade). We humans are able to understand this structure and make visual predictions even when seeing only parts of the scene. In this paper, we show that it is possible to learn and predict this structure using convolutional neural networks (CNNs), a class of models that have recently shown success across a variety of image understanding tasks. (The code, trained models and more inpainting results are available at the author's project website.)

Given an image with a missing region (e.g., Fig. 1a), we train a convolutional neural network to regress to the missing pixel values (Fig. 1d). We call our model context encoder, as it consists of an encoder capturing the context of an image into a compact latent feature representation and a decoder which uses that representation to produce the missing image content. The context encoder is closely related to autoencoders [3, 20], as it shares a similar encoder-decoder architecture.

Autoencoders take an input image and try to reconstruct it after it passes through a low-dimensional "bottleneck" layer, with the aim of obtaining a compact feature representation of the scene. Unfortunately, this feature representation is likely to just compress the image content without learning a semantically meaningful representation. Denoising autoencoders [38] address this issue by corrupting the input image and requiring the network to undo the damage. However, this corruption process is typically very localized and low-level, and does not require much semantic information to undo. In contrast, our context encoder needs to solve a much harder task: to fill in large missing areas of the image, where it cannot get "hints" from nearby pixels. This requires a much deeper semantic understanding of the scene, and the ability to synthesize high-level features over large spatial extents. For example, in Figure 1a, an entire window needs to be conjured up "out of thin air." This is similar in spirit to word2vec [30], which learns word representations from natural language sentences by predicting a word given its context.

Like autoencoders, context encoders are trained in a completely unsupervised manner. Our results demonstrate that in order to succeed at this task, a model needs to both understand the content of an image, as well as produce a plausible hypothesis for the missing parts. This task, however, is inherently multi-modal as there are multiple ways to fill the missing region while also maintaining coherence with the given context. We decouple this burden in our loss function by jointly training our context encoders to minimize both a reconstruction loss and an adversarial loss. The reconstruction (L2) loss captures the overall structure of the missing region in relation to the context, while the adversarial loss [16] has the effect of picking a particular mode from the distribution. Figure 1 shows that using only the reconstruction loss produces blurry results, whereas adding the adversarial loss results in much sharper predictions.

We evaluate the encoder and the decoder independently. On the encoder side, we show that encoding just the context of an image patch and using the resulting feature to retrieve nearest neighbor contexts from a dataset produces patches which are semantically similar to the original (unseen) patch. We further validate the quality of the learned feature representation by fine-tuning the encoder for a variety of image understanding tasks, including classification, object detection, and semantic segmentation. We are competitive with the state-of-the-art unsupervised/self-supervised methods on those tasks. On the decoder side, we show that our method is often able to fill in realistic image content. Indeed, to the best of our knowledge, ours is the first parametric inpainting algorithm that is able to give reasonable results for semantic hole-filling (i.e. large missing regions). The context encoder can also be useful as a better visual feature for computing nearest neighbors in non-parametric inpainting methods.

2. Related work

Computer vision has made tremendous progress on semantic image understanding tasks such as classification, object detection, and segmentation in the past decade. Recently, Convolutional Neural Networks (CNNs) [13, 27] have greatly advanced the performance in these tasks [15, 26, 28]. The success of such models on image classification paved the way to tackle harder problems, including unsupervised understanding and generation of natural images. We briefly review the related work in each of the sub-fields pertaining to this paper.

Unsupervised learning. CNNs trained for ImageNet [37] classification with over a million labeled examples learn features which generalize very well across tasks [9]. However, whether such semantically informative and generalizable features can be learned from raw images alone, without any labels, remains an open question. Some of the earliest work in deep unsupervised learning are autoencoders [3, 20]. Along similar lines, denoising autoencoders [38] reconstruct the image from local corruptions, to make encoding robust to such corruptions. While context encoders could be thought of as a variant of denoising autoencoders, the corruption applied to the model's input is spatially much larger, requiring more semantic information to undo.

Weakly-supervised and self-supervised learning. Very recently, there has been significant interest in learning meaningful representations using weakly-supervised and self-supervised learning. One useful source of supervision is the temporal information contained in videos. Consistency across temporal frames has been used as supervision to learn embeddings which perform well on a number of tasks [17, 34]. Another way to use consistency is to track patches in frames of video containing task-relevant attributes and use the coherence of tracked patches to guide the training [39]. Ego-motion read off from non-vision sensors has been used as a supervisory signal to train visual features [1, 21].

Most closely related to the present paper are efforts at exploiting spatial context as a source of free and plentiful supervisory signal. Visual Memex [29] used context to non-parametrically model object relations and to predict masked objects in scenes, while [6] used context to establish correspondences for unsupervised object discovery. However, both approaches relied on hand-designed features and did not perform any representation learning. Recently, Doersch et al. [7] used the task of predicting the relative positions of neighboring patches within an image as a way to train an unsupervised deep feature representation.
We share the same high-level goals with Doersch et al. but fundamentally differ in the approach: whereas [7] are solving a discriminative task (is patch A above patch B or below?), our context encoder solves a pure prediction problem (what pixel intensities should go in the hole?). Interestingly, a similar distinction exists in using language context to learn word embeddings: Collobert and Weston [5] advocate a discriminative approach, whereas word2vec [30] formulates it as word prediction. One important benefit of our approach is that our supervisory signal is much richer: a context encoder needs to predict roughly 15,000 real values per training example, compared to just 1 option among 8 choices in [7]. Likely due in part to this difference, our context encoders take far less time to train than [7]. Moreover, context-based prediction is also harder to "cheat" since low-level image features, such as chromatic aberration, do not provide any meaningful information, in contrast to [7] where chromatic aberration partially solves the task. On the other hand, it is not yet clear if requiring faithful pixel generation is necessary for learning good visual features.

Image generation. Generative models of natural images have enjoyed significant research interest [16, 24, 35]. Recently, Radford et al. [33] proposed new convolutional architectures and optimization hyperparameters for Generative Adversarial Networks (GAN) [16], producing encouraging results. We train our context encoders using an adversary jointly with a reconstruction loss for generating inpainting results. We discuss this in detail in Section 3.2. Dosovitskiy et al. [10] and Rifai et al. [36] demonstrate that CNNs can learn to generate novel images of particular object categories (chairs and faces, respectively), but rely on large labeled datasets with examples of these categories. In contrast, context encoders can be applied to any unlabeled image database and learn to generate images based on the surrounding context.

Inpainting and hole-filling. It is important to point out that our hole-filling task cannot be handled by classical inpainting [4, 32] or texture synthesis [2, 11] approaches, since the missing region is too large for local non-semantic methods to work well. In computer graphics, filling in large holes is typically done via scene completion [19], involving a cut-paste formulation using nearest neighbors from a dataset of millions of images. However, scene completion is meant for filling in holes left by removing whole objects, and it struggles to fill arbitrary holes, e.g. amodal completion of partially occluded objects. Furthermore, previous completion relies on a hand-crafted distance metric, such as Gist [31], for nearest-neighbor computation, which is inferior to a learned distance metric. We show that our method is often able to inpaint semantically meaningful content in a parametric fashion, as well as provide a better feature for nearest neighbor-based inpainting methods.

Figure 2: Context Encoder. The context image is passed through the encoder to obtain features which are connected to the decoder using a channel-wise fully-connected layer as described in Section 3.1. The decoder then produces the missing regions in the image.

3. Context encoders for image generation

We now introduce context encoders: CNNs that predict missing parts of a scene from their surroundings. We first give an overview of the general architecture, then provide details on the learning procedure, and finally present various strategies for image region removal.

3.1. Encoder-decoder pipeline

The overall architecture is a simple encoder-decoder pipeline. The encoder takes an input image with missing regions and produces a latent feature representation of that image. The decoder takes this feature representation and produces the missing image content. We found it important to connect the encoder and the decoder through a channel-wise fully-connected layer, which allows each unit in the decoder to reason about the entire image content. Figure 2 shows an overview of our architecture.

Encoder. Our encoder is derived from the AlexNet architecture [26]. Given an input image of size 227×227, we use the first five convolutional layers and the following pooling layer (called pool5) to compute an abstract 6 × 6 × 256 dimensional feature representation. In contrast to AlexNet, our model is not trained for ImageNet classification; rather, the network is trained for context prediction "from scratch" with randomly initialized weights.

However, if the encoder architecture is limited only to convolutional layers, there is no way for information to directly propagate from one corner of the feature map to another. This is so because convolutional layers connect all the feature maps together, but never directly connect all locations within a specific feature map. In the present architectures, this information propagation is handled by fully-connected or inner product layers, where all the activations are directly connected to each other. In our architecture, the latent feature dimension is 6 × 6 × 256 = 9216 for both encoder and decoder.
This is so because, unlike autoencoders, we do not reconstruct the original input and hence need not have a smaller bottleneck. However, fully connecting the encoder and decoder would result in an explosion in the number of parameters (over 100M!), to the extent that efficient training on current GPUs would be difficult. To alleviate this issue, we use a channel-wise fully-connected layer to connect the encoder features to the decoder, described in detail below.

Channel-wise fully-connected layer. This layer is essentially a fully-connected layer with groups, intended to propagate information within the activations of each feature map. If the input layer has m feature maps of size n × n, this layer will output m feature maps of dimension n × n. However, unlike a fully-connected layer, it has no parameters connecting different feature maps and only propagates information within feature maps. Thus, the number of parameters in this channel-wise fully-connected layer is m·n^4, compared to m^2·n^4 parameters in a fully-connected layer (ignoring the bias term). This is followed by a stride 1 convolution to propagate information across channels.
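
A minimal PyTorch sketch of such a layer (the paper's implementation was in Caffe/Torch; the 1×1 kernel used here for the cross-channel convolution is our assumption, since the text only specifies "a stride 1 convolution"):

```python
import torch
import torch.nn as nn

class ChannelWiseFC(nn.Module):
    """Fully-connected layer applied independently to each feature map.

    For m feature maps of size n x n this uses m * n^4 weights (one n^2 x n^2
    matrix per channel) and no parameters connecting different channels;
    cross-channel mixing is left to a following stride-1 convolution.
    """
    def __init__(self, channels: int, size: int):
        super().__init__()
        self.channels, self.size = channels, size
        # One independent n^2 -> n^2 linear map per channel.
        self.weight = nn.Parameter(torch.randn(channels, size * size, size * size) * 0.01)
        self.bias = nn.Parameter(torch.zeros(channels, size * size))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        assert c == self.channels and h == w == self.size
        flat = x.view(b, c, h * w)                                  # (B, m, n^2)
        # Batched per-channel matrix multiply: no information crosses channels.
        out = torch.einsum('bcd,cde->bce', flat, self.weight) + self.bias
        return out.view(b, c, h, w)

# Example on the 6x6x256 pool5 features used in the encoder, followed by a
# stride-1 convolution to propagate information across channels.
cwfc = nn.Sequential(ChannelWiseFC(256, 6), nn.Conv2d(256, 256, kernel_size=1))
y = cwfc(torch.randn(2, 256, 6, 6))   # -> (2, 256, 6, 6)
```
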

Decoder. We now discuss the second half of our pipeline, the decoder, which generates pixels of the image using the encoder features. The "encoder features" are connected to the "decoder features" using a channel-wise fully-connected layer.

The channel-wise fully-connected layer is followed by a series of five up-convolutional layers [10, 28, 40] with learned filters, each with a rectified linear unit (ReLU) activation function. An up-convolution is simply a convolution that results in a higher resolution image. It can be understood as upsampling followed by convolution (as described in [10]), or convolution with fractional stride (as described in [28]). The intuition behind this is straightforward: the series of up-convolutions and non-linearities comprises a non-linear weighted upsampling of the feature produced by the encoder until we roughly reach the original target size.
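
A sketch of such an up-convolutional decoder in PyTorch, using stride-2 transposed convolutions; the channel widths and kernel sizes are illustrative assumptions, not the paper's exact configuration (the exact one is shown in the supplementary Figure 9):

```python
import torch
import torch.nn as nn

# Five up-convolutional (transposed-convolution) layers with ReLU, upsampling
# the 6x6x256 encoder feature toward the output resolution.
decoder = nn.Sequential(
    nn.ConvTranspose2d(256, 128, kernel_size=4, stride=2, padding=1), nn.ReLU(inplace=True),  # 6 -> 12
    nn.ConvTranspose2d(128, 64, kernel_size=4, stride=2, padding=1), nn.ReLU(inplace=True),   # 12 -> 24
    nn.ConvTranspose2d(64, 32, kernel_size=4, stride=2, padding=1), nn.ReLU(inplace=True),    # 24 -> 48
    nn.ConvTranspose2d(32, 16, kernel_size=4, stride=2, padding=1), nn.ReLU(inplace=True),    # 48 -> 96
    nn.ConvTranspose2d(16, 3, kernel_size=4, stride=2, padding=1),                            # 96 -> 192
)

x = torch.randn(1, 256, 6, 6)
print(decoder(x).shape)   # torch.Size([1, 3, 192, 192]); resized/cropped to the target size as needed
```
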
3.2. Loss function

We train our context encoders by regressing to the ground truth content of the missing (dropped out) region. However, there are often multiple equally plausible ways to fill a missing image region which are consistent with the context. We model this behavior by having a decoupled joint loss function to handle both continuity within the context and multiple modes in the output. The reconstruction (L2) loss is responsible for capturing the overall structure of the missing region and coherence with regards to its context, but tends to average together the multiple modes in predictions. The adversarial loss [16], on the other hand, tries to make the prediction look real, and has the effect of picking a particular mode from the distribution. For each ground truth image x, our context encoder F produces an output F(x). Let M̂ be a binary mask corresponding to the dropped image region, with a value of 1 wherever a pixel was dropped and 0 for input pixels. During training, those masks are automatically generated for each image and training iteration, as described in Section 3.3. We now describe the different components of our loss function.
Reconstruction Loss. We use a normalized masked L2 distance as our reconstruction loss function, L_rec:

    L_{rec}(x) = \| \hat{M} \odot (x - F((1 - \hat{M}) \odot x)) \|_2^2,    (1)

where \odot is the element-wise product operation. We experimented with both L1 and L2 losses and found no significant difference between them. While this simple loss encourages the decoder to produce a rough outline of the predicted object, it often fails to capture any high-frequency detail (see Fig. 1c). This stems from the fact that the L2 (or L1) loss often prefers a blurry solution over highly accurate textures. We believe this happens because it is much "safer" for the L2 loss to predict the mean of the distribution, because this minimizes the mean pixel-wise error, but results in a blurry averaged image. We alleviated this problem by adding an adversarial loss.
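
A minimal sketch of this masked reconstruction loss in PyTorch, assuming channel-first tensors and normalization by the number of dropped pixels (the exact normalization is not spelled out in the text):

```python
import torch

def reconstruction_loss(x, x_hat, mask):
    """Masked L2 loss in the spirit of Equation (1).

    x     : ground-truth images, shape (B, 3, H, W)
    x_hat : output of the context encoder, F((1 - M) * x)
    mask  : binary mask M, 1 where pixels were dropped, 0 elsewhere
    """
    diff = mask * (x - x_hat)
    return diff.pow(2).sum() / mask.sum().clamp(min=1)

# Usage sketch with a hypothetical context encoder F:
#   x_hat = F((1 - mask) * x)
#   loss = reconstruction_loss(x, x_hat, mask)
```
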
Figure 4: Semantic inpainting results on held-out images for the context encoder trained using reconstruction and adversarial loss. The first three rows are examples from ImageNet, and the bottom two rows are from the Paris StreetView dataset. See more results on the author's project website.

Adversarial Loss. Our adversarial loss is based on Generative Adversarial Networks (GAN) [16]. To learn a generative model G of a data distribution, GAN proposes to jointly learn an adversarial discriminative model D to provide loss gradients to the generative model. G and D are parametric functions (e.g., deep networks), where G : Z → X maps samples from a noise distribution Z to the data distribution X. The learning procedure is a two-player game where an adversarial discriminator D takes in both the prediction of G and ground truth samples and tries to distinguish them, while G tries to confuse D by producing samples that appear as "real" as possible. The objective for the discriminator is the logistic likelihood indicating whether the input is a real sample or a predicted one:

    \min_G \max_D \; E_{x \in \mathcal{X}}[\log(D(x))] + E_{z \in \mathcal{Z}}[\log(1 - D(G(z)))]

This method has recently shown encouraging results in generative modeling of images [33]. We thus adapt this framework for context prediction by modeling the generator by the context encoder; i.e., G ≜ F. To customize GANs for this task, one could condition on the given context information, i.e., the mask \hat{M} \odot x. However, conditional GANs do not train easily for the context prediction task, as the adversarial discriminator D easily exploits the perceptual discontinuity between generated regions and the original context to classify predicted versus real samples. We thus use an alternate formulation, conditioning only the generator (not the discriminator) on context. We also found results improved when the generator was not conditioned on a noise vector. Hence the adversarial loss for context encoders, L_adv, is

    L_{adv} = \max_D \; E_{x \in \mathcal{X}}[\log(D(x)) + \log(1 - D(F((1 - \hat{M}) \odot x)))],    (2)

where, in practice, both F and D are optimized jointly using alternating SGD. Note that this objective encourages the entire output of the context encoder to look realistic, not just the missing regions as in Equation (1).

Joint Loss. We define the overall loss function as

    L = \lambda_{rec} L_{rec} + \lambda_{adv} L_{adv}.    (3)

Currently, we use the adversarial loss only for inpainting experiments, as training the AlexNet [26] architecture diverged with the joint adversarial loss. Details follow in Sections 5.1 and 5.2.
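
A sketch of one alternating optimization step combining Equations (1)-(3), written in PyTorch with standard binary-cross-entropy GAN losses. The non-saturating generator objective and a sigmoid-output discriminator are common practical choices rather than details specified here, and the default λ values are the ones reported later in Section 5.1:

```python
import torch
import torch.nn.functional as F_nn   # aliased to avoid clashing with the paper's F

def train_step(context_encoder, discriminator, opt_F, opt_D, x, mask,
               lambda_rec=0.999, lambda_adv=0.001):
    """One alternating update; context_encoder, discriminator and the two
    optimizers are user-provided. Only the generator is conditioned on context,
    and the discriminator sees full images, as described above."""
    corrupted = (1 - mask) * x
    fake = context_encoder(corrupted)

    # --- discriminator step: real vs. generated (discriminator outputs in [0, 1]) ---
    opt_D.zero_grad()
    d_real = discriminator(x)
    d_fake = discriminator(fake.detach())
    loss_D = F_nn.binary_cross_entropy(d_real, torch.ones_like(d_real)) + \
             F_nn.binary_cross_entropy(d_fake, torch.zeros_like(d_fake))
    loss_D.backward()
    opt_D.step()

    # --- generator (context encoder) step: joint loss of Eq. (3) ---
    opt_F.zero_grad()
    rec = (mask * (x - fake)).pow(2).sum() / mask.sum().clamp(min=1)   # Eq. (1)
    d_out = discriminator(fake)
    adv = F_nn.binary_cross_entropy(d_out, torch.ones_like(d_out))     # non-saturating form of Eq. (2)
    loss_F = lambda_rec * rec + lambda_adv * adv                        # Eq. (3)
    loss_F.backward()
    opt_F.step()
    return loss_F.item(), loss_D.item()
```
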
3.3. Region masks

Figure 3: An example of image x with our different region masks M̂ applied, as described in Section 3.3. Panels: (a) Central region, (b) Random block, (c) Random region.

The input to a context encoder is an image with one or more of its regions "dropped out"; i.e., set to zero, assuming zero-centered inputs. The removed regions could be of any shape; we present three different strategies here.

Central region. The simplest such shape is the central square patch in the image, as shown in Figure 3a. While this works quite well for inpainting, the network learns low-level image features that latch onto the boundary of the central mask. Those low-level image features tend not to generalize well to images without masks, hence the features learned are not very general.
Random block. To prevent the network from latching onto the constant boundary of the masked region, we randomize the masking process. Instead of choosing a single large mask at a fixed location, we remove a number of smaller, possibly overlapping masks, covering up to 1/4 of the image. An example of this is shown in Figure 3b. However, the random block masking still has sharp boundaries that convolutional features could latch onto.

Random region. To completely remove those boundaries, we experimented with removing arbitrary shapes from images, obtained from random masks in the PASCAL VOC 2012 dataset [12]. We deform those shapes and paste them in arbitrary places in the other images (not from PASCAL), again covering up to 1/4 of the image. Note that we completely randomize the region masking process, and do not expect or want any correlation between the source segmentation mask and the image. We merely use those regions to prevent the network from learning low-level features corresponding to the removed mask. See the example in Figure 3c.

In practice, we found region and random block masks produce a similarly general feature, while significantly outperforming the central region features. We use the random region dropout for all our feature-based experiments.
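
A sketch of random-block mask generation along the lines described above; the block size and exact sampling procedure are illustrative assumptions, not the paper's settings:

```python
import numpy as np

def random_block_mask(height, width, max_cover=0.25, block=32, rng=None):
    """Drop a number of smaller, possibly overlapping square blocks until up to
    roughly a quarter of the image is covered. Returns a float mask M with 1
    where pixels are dropped and 0 elsewhere."""
    rng = rng or np.random.default_rng()
    mask = np.zeros((height, width), dtype=np.float32)
    while mask.mean() < max_cover:
        top = rng.integers(0, height - block)
        left = rng.integers(0, width - block)
        mask[top:top + block, left:left + block] = 1.0
    return mask

# The corrupted network input is then (1 - M) * x, with dropped pixels set to
# the constant mean value as described in Section 4.
mask = random_block_mask(128, 128)
```
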
4. Implementation details

The pipeline was implemented in Caffe [22] and Torch. We used the recently proposed stochastic gradient descent solver ADAM [23] for optimization. The missing region in the masked input image is filled with the constant mean value. Hyper-parameter details are discussed in Sections 5.1 and 5.2.

Pool-free encoders. We experimented with replacing all pooling layers with convolutions of the same kernel size and stride. The overall stride of the network remains the same, but it results in finer inpainting. Intuitively, there is no reason to use pooling for reconstruction-based networks. In classification, pooling provides spatial invariance, which may be detrimental for reconstruction-based training. To be consistent with prior work, we still use the original AlexNet architecture (with pooling) for all feature learning results.
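
A small sketch contrasting a pooled AlexNet-style block with its pool-free counterpart, where the pooling layer is swapped for a learned convolution of the same kernel size and stride (channel counts here are illustrative):

```python
import torch.nn as nn

# Pooled block: convolution followed by 3x3 / stride-2 max pooling.
pooled = nn.Sequential(
    nn.Conv2d(96, 256, kernel_size=5, padding=2), nn.ReLU(inplace=True),
    nn.MaxPool2d(kernel_size=3, stride=2),
)

# Pool-free variant: the pooling layer is replaced by a learned convolution
# with the same kernel size and stride, so the overall stride is unchanged.
pool_free = nn.Sequential(
    nn.Conv2d(96, 256, kernel_size=5, padding=2), nn.ReLU(inplace=True),
    nn.Conv2d(256, 256, kernel_size=3, stride=2),
)
```
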
Figure 5: Comparison with Content-Aware Fill (Photoshop feature based on [2]) on held-out images. Our method works better in semantic cases (top row) and slightly worse in textured settings (bottom row). Columns: Input Context, Context Encoder, Content-Aware Fill.

Method                        | Mean L1 Loss | Mean L2 Loss | PSNR (higher better)
NN-inpainting (HOG features)  | 19.92%       | 6.92%        | 12.79 dB
NN-inpainting (our features)  | 15.10%       | 4.30%        | 14.70 dB
Our Reconstruction (joint)    | 9.37%        | 1.96%        | 18.58 dB

Table 1: Semantic inpainting accuracy for the Paris StreetView dataset on held-out images. NN inpainting is the basis for [19].

5. Evaluation

We now evaluate the encoder features for their semantic quality and transferability to other image understanding tasks. We experiment with images from two datasets: Paris StreetView [8] and ImageNet [37], without using any of the accompanying labels. In Section 5.1, we present visualizations demonstrating the ability of the context encoder to fill in semantic details of images with missing regions. In Section 5.2, we demonstrate the transferability of our learned features to other tasks, using context encoders as a pre-training step for image classification, object detection, and semantic segmentation. We compare our results on these tasks with those of other unsupervised or self-supervised methods, demonstrating that our approach outperforms previous methods.

5.1. Semantic Inpainting

We train context encoders with the joint loss function defined in Equation (3) for the task of inpainting the missing region. The encoder and discriminator architecture is similar to that of the discriminator in [33], and the decoder is similar to the generator in [33]. However, the bottleneck is of 4000 units (in contrast to 100 in [33]); see supplementary material. We used the default solver hyper-parameters suggested in [33]. We use λ_rec = 0.999 and λ_adv = 0.001. However, a few things were crucial for training the model. We did not condition the adversarial loss (see Section 3.2), nor did we add noise to the encoder. We use a higher learning rate for the context encoder (10 times) than for the adversarial discriminator. To further emphasize the consistency of the prediction with the context, we predict a slightly larger patch that overlaps with the context (by 7px). During training, we use a higher weight (10×) for the reconstruction loss in this overlapping region.
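
A sketch of the corresponding per-pixel weight map for the center-region setup, assuming the 128-pixel images and 64-pixel hole used in the supplementary material; the 7px band that overlaps the known context gets 10× weight inside the masked L2 loss of Equation (1):

```python
import torch

def overlap_weight_map(img_size=128, hole=64, overlap=7, w_overlap=10.0):
    """Weight map: the predicted patch is (hole + 2*overlap) pixels wide; the
    overlap band gets w_overlap, the true hole interior gets 1, and pixels
    outside the predicted patch get 0. Sizes are illustrative assumptions."""
    w = torch.zeros(img_size, img_size)
    lo = img_size // 2 - hole // 2 - overlap
    hi = img_size // 2 + hole // 2 + overlap
    w[lo:hi, lo:hi] = w_overlap                                        # whole predicted patch
    w[lo + overlap:hi - overlap, lo + overlap:hi - overlap] = 1.0      # hole interior
    return w

weights = overlap_weight_map()   # used as a per-pixel multiplier in the reconstruction loss
```
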
The qualitative results are shown in Figure 4. Our model performs generally well in inpainting semantic regions of an image. However, if a region can be filled with low-level textures, texture synthesis methods, such as [2, 11], can often perform better (e.g. Figure 5).
Figure 6: Semantic inpainting using different methods on held-out images. Context Encoder with just L2 is well aligned, but not sharp. Using adversarial loss, results are sharp but not coherent. The joint loss alleviates the weaknesses of each of them. The last two columns are the results if we plug in the best nearest neighbor (NN) patch in the masked region. Columns: Image, Ours (L2), Ours (Adv), Ours (L2+Adv), NN-Inpainting w/ our features, NN-Inpainting w/ HOG.

For semantic inpainting, we compare against nearest neighbor inpainting (which forms the basis of Hays et al. [19]) and show that our reconstructions are well-aligned semantically, as seen in Figure 6. It also shows that the joint loss significantly improves the inpainting over both reconstruction and adversarial loss alone. Moreover, using our learned features in a nearest-neighbor style inpainting can sometimes improve results over hand-designed distance metrics. Table 1 reports quantitative results on the Paris StreetView dataset.
5.2. Feature Learning

For consistency with prior work, we use the AlexNet [26] architecture for our encoder. Unfortunately, we did not manage to make the adversarial loss converge with AlexNet, so we used just the reconstruction loss. The networks were trained with a constant learning rate of 10^-3 for the center-region masks. However, for random region corruption, we found a learning rate of 10^-4 to perform better. We apply dropout with a rate of 0.5 just for the channel-wise fully-connected layer, since it has more parameters than other layers and might be prone to overfitting. The training process is fast and converges in about 100K iterations: 14 hours on a Titan X GPU. Figure 7 shows inpainting results for a context encoder trained with random region corruption using reconstruction loss. To evaluate the quality of features, we find nearest neighbors to the masked part of the image just by using the features from the context; see Figure 8. Note that none of the methods ever see the center part of any image, whether a query or dataset image. Our features retrieve decent nearest neighbors just from context, even though the actual prediction is blurry with the L2 loss. AlexNet features also perform decently, as they were trained with 1M labels for semantic tasks; HOG, on the other hand, fails to get the semantics.

Figure 7: Arbitrary region inpainting for the context encoder trained with reconstruction loss on held-out images.

5.2.1 Classification pre-training

For this experiment, we fine-tune a standard AlexNet classifier on PASCAL VOC 2007 [12] from a number of supervised, self-supervised and unsupervised initializations. We train the classifier using random cropping, and then evaluate it using 10 random crops per test image. We average the classifier output over those random crops. Table 2 shows the standard mean average precision (mAP) score for all compared methods.

A random initialization performs roughly 25% below an ImageNet-trained model; however, it does not use any labels. Context encoders are competitive with concurrent self-supervised feature learning methods [7, 39] and significantly outperform autoencoders and Agrawal et al. [1].
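
A sketch of the 10-random-crop evaluation protocol just described; the classifier interface and crop size are assumptions for illustration:

```python
import torch

def ten_random_crop_predict(classifier, image, crop=227, n_crops=10):
    """Average classifier scores over random crops of a channel-first test image."""
    _, h, w = image.shape
    scores = []
    for _ in range(n_crops):
        top = torch.randint(0, h - crop + 1, (1,)).item()
        left = torch.randint(0, w - crop + 1, (1,)).item()
        patch = image[:, top:top + crop, left:left + crop].unsqueeze(0)
        with torch.no_grad():
            scores.append(classifier(patch))
    return torch.stack(scores).mean(dim=0)   # averaged class scores
```
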

5.2.2 Detection pre-training

Our second set of quantitative results involves using our features for object detection. We use the Fast R-CNN [14] framework (FRCN). We replace the ImageNet pre-trained network with our context encoders (or any other baseline model).
Figure 8: Context Nearest Neighbors. Center patches whose contexts (not shown here) are close in the embedding space of different methods (namely our context encoder, HOG and AlexNet). Note that the appearance of these center patches themselves was never seen by these methods, but our method brings them close just from their context.

Pretraining Method   | Supervision       | Pretraining time | Classification | Detection | Segmentation
ImageNet [26]        | 1000 class labels | 3 days           | 78.2%          | 56.8%     | 48.0%
Random Gaussian      | initialization    | < 1 minute       | 53.3%          | 43.4%     | 19.8%
Autoencoder          | -                 | 14 hours         | 53.8%          | 41.9%     | -
Agrawal et al. [1]   | egomotion         | 10 hours         | 52.9%          | 41.8%     | -
Wang et al. [39]     | motion            | 1 week           | 58.7%          | 47.4%     | -
Doersch et al. [7]   | relative context  | 4 weeks          | 55.3%          | 46.6%     | -
Ours                 | context           | 14 hours         | 56.5%          | 44.5%     | 30.0%

(The Autoencoder row reports 25.2% segmentation in the source text; column assignment for the Random Gaussian row follows the flattened layout.)

Table 2: Quantitative comparison for classification, detection and semantic segmentation. Classification and Fast-RCNN
Detection results are on the PASCAL VOC 2007 test set. Semantic segmentation results are on the PASCAL VOC 2012
validation set from the FCN evaluation described in Section 5.2.3, using the additional training data from [18], and removing
overlapping images from the validation set [28].

In particular, we take the pre-trained encoder weights up to the pool5 layer and re-initialize the fully-connected layers. We then follow the training and evaluation procedures from FRCN and report the accuracy (in mAP) of the resulting detector.

Our results on the test set of the PASCAL VOC 2007 [12] detection challenge are reported in Table 2. Context encoder pre-training is competitive with the existing methods, achieving a significant boost over the baseline. Recently, Krähenbühl et al. [25] proposed a data-dependent method for rescaling pre-trained model weights. This significantly improves the features in Doersch et al. [7], up to 65.3% for classification and 51.1% for detection. However, this rescaling doesn't improve results for other methods, including ours.
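
A sketch of the weight-transfer step described at the start of this subsection: convolutional weights up to pool5 are copied from a context-encoder checkpoint into an AlexNet, while the fully-connected layers are left randomly initialized. The checkpoint path and parameter key names are hypothetical assumptions:

```python
import torch
from torchvision.models import alexnet

def init_from_context_encoder(checkpoint_path):
    """Copy the convolutional (up-to-pool5) weights from a context-encoder
    checkpoint into an AlexNet; fc layers keep their random initialization."""
    model = alexnet(num_classes=21)                 # e.g. 20 PASCAL classes + background
    state = torch.load(checkpoint_path, map_location='cpu')
    # Keep only parameters belonging to the convolutional feature extractor
    # (assumed to be stored under 'features.*' keys in this hypothetical checkpoint).
    conv_state = {k: v for k, v in state.items() if k.startswith('features.')}
    model.load_state_dict(conv_state, strict=False)
    return model
```
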
5.2.3 Semantic Segmentation pre-training

Our last quantitative evaluation explores the utility of context encoder training for pixel-wise semantic segmentation. Fully convolutional networks [28] (FCNs) were proposed as an end-to-end learnable method of predicting a semantic label at each pixel of an image, using a convolutional network pre-trained for ImageNet classification. We replace the classification pre-trained network used in the FCN method with our context encoders, afterwards following the FCN training and evaluation procedure for direct comparison with their original CaffeNet-based result.

Our results on the PASCAL VOC 2012 [12] validation set are reported in Table 2. In this setting, we outperform a randomly initialized network as well as a plain autoencoder which is trained simply to reconstruct its full input.

6. Conclusion

Our context encoders, trained to generate images conditioned on context, advance the state of the art in semantic inpainting and at the same time learn feature representations that are competitive with other models trained with auxiliary supervision.

Acknowledgements. The authors would like to thank Amanda Buster for the artwork in Fig. 1b, as well as Shubham Tulsiani and Saurabh Gupta for helpful discussions. This work was supported in part by DARPA, AFRL, Intel, DoD MURI award N000141110688, NSF awards IIS-1212798, IIS-1427425, and IIS-1536003, the Berkeley Vision and Learning Center and Berkeley Deep Drive.
References

[1] P. Agrawal, J. Carreira, and J. Malik. Learning to see by moving. ICCV, 2015.
[2] C. Barnes, E. Shechtman, A. Finkelstein, and D. Goldman. PatchMatch: A randomized correspondence algorithm for structural image editing. ACM Transactions on Graphics, 2009.
[3] Y. Bengio. Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2009.
[4] M. Bertalmio, G. Sapiro, V. Caselles, and C. Ballester. Image inpainting. In Computer graphics and interactive techniques, 2000.
[5] R. Collobert and J. Weston. A unified architecture for natural language processing: Deep neural networks with multitask learning. In ICML, 2008.
[6] C. Doersch, A. Gupta, and A. A. Efros. Context as supervisory signal: Discovering objects with predictable context. In ECCV, 2014.
[7] C. Doersch, A. Gupta, and A. A. Efros. Unsupervised visual representation learning by context prediction. ICCV, 2015.
[8] C. Doersch, S. Singh, A. Gupta, J. Sivic, and A. Efros. What makes Paris look like Paris? ACM Transactions on Graphics, 2012.
[9] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell. DeCAF: A deep convolutional activation feature for generic visual recognition. ICML, 2014.
[10] A. Dosovitskiy, J. T. Springenberg, and T. Brox. Learning to generate chairs with convolutional neural networks. CVPR, 2015.
[11] A. Efros and T. K. Leung. Texture synthesis by non-parametric sampling. In ICCV, 1999.
[12] M. Everingham, S. A. Eslami, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman. The Pascal Visual Object Classes challenge: A retrospective. IJCV, 2014.
[13] K. Fukushima. Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biological Cybernetics, 1980.
[14] R. Girshick. Fast R-CNN. ICCV, 2015.
[15] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014.
[16] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In NIPS, 2014.
[17] R. Goroshin, J. Bruna, J. Tompson, D. Eigen, and Y. LeCun. Unsupervised learning of spatiotemporally coherent metrics. ICCV, 2015.
[18] B. Hariharan, P. Arbeláez, L. Bourdev, S. Maji, and J. Malik. Semantic contours from inverse detectors. In ICCV, 2011.
[19] J. Hays and A. A. Efros. Scene completion using millions of photographs. SIGGRAPH, 2007.
[20] G. E. Hinton and R. R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 2006.
[21] D. Jayaraman and K. Grauman. Learning image representations tied to ego-motion. In ICCV, 2015.
[22] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. B. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. In ACM Multimedia, 2014.
[23] D. Kingma and J. Ba. Adam: A method for stochastic optimization. ICLR, 2015.
[24] D. P. Kingma and M. Welling. Auto-encoding variational Bayes. ICLR, 2014.
[25] P. Krähenbühl, C. Doersch, J. Donahue, and T. Darrell. Data-dependent initializations of convolutional neural networks. ICLR, 2016.
[26] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
[27] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1989.
[28] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015.
[29] T. Malisiewicz and A. Efros. Beyond categories: The visual memex model for reasoning about object relationships. In NIPS, 2009.
[30] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In NIPS, 2013.
[31] A. Oliva and A. Torralba. Building the gist of a scene: The role of global image features in recognition. Progress in Brain Research, 2006.
[32] S. Osher, M. Burger, D. Goldfarb, J. Xu, and W. Yin. An iterative regularization method for total variation-based image restoration. Multiscale Modeling & Simulation, 2005.
[33] A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. ICLR, 2016.
[34] V. Ramanathan, K. Tang, G. Mori, and L. Fei-Fei. Learning temporal embeddings for complex video analysis. ICCV, 2015.
[35] M. Ranzato, V. Mnih, J. M. Susskind, and G. E. Hinton. Modeling natural images using gated MRFs. PAMI, 2013.
[36] S. Rifai, Y. Bengio, A. Courville, P. Vincent, and M. Mirza. Disentangling factors of variation for facial expression recognition. In ECCV, 2012.
[37] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet large scale visual recognition challenge. IJCV, 2015.
[38] P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol. Extracting and composing robust features with denoising autoencoders. In ICML, 2008.
[39] X. Wang and A. Gupta. Unsupervised learning of visual representations using videos. ICCV, 2015.
[40] M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. In ECCV, 2014.
Supplementary Material

In this section, we present the architectural details of our context encoders and show additional qualitative results. Context encoders are not only able to inpaint semantic details in the missing part of an input image, but also learn features transferable to other tasks. We discuss the implementation details for each of these in the following sections.
A. Semantic Inpainting
Context encoders for inpainting are trained jointly with reconstruction and adversarial loss, as discussed in Section 5.1. The inpainting results are slightly worse if we use 227 × 227 images directly, so we resize images to 128 × 128 and then train our joint loss on the resized images. The encoder and discriminator architecture is similar to that of the discriminator in [33], and the decoder is similar to the generator in [33]; the bottleneck is of 4000 units. We used batch normalization in both the context encoder and the discriminator. ReLU [26] non-linearity is used in the decoder, while leaky ReLU [33] is used in both the encoder and the discriminator.

In the case of arbitrary region inpainting, the adversarial discriminator compares the full real image and the full generated image. We do not condition the adversarial discriminator on the mask; see (2). If the discriminator sees the mask, it figures out the perceptual discontinuity of the generated part from the real part and easily classifies the real vs. the generated image, i.e., the process doesn't train. Moreover, particularly for center region inpainting, this process can be computationally simplified by producing the center only and not showing the discriminator the context boundary (or, in other words, not showing the mask). The exact architecture for center region dropout is shown in Figure 9a.
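
A PyTorch sketch in the spirit of the Figure 9a encoder/decoder: a DCGAN-style [33] stack of 4×4 convolutions with batch normalization and leaky ReLU down to a 4000-unit bottleneck, mirrored by up-convolutions with ReLU. Channel widths, the final Tanh, and the full-image output are illustrative assumptions read off the figure rather than the exact released architecture:

```python
import torch
import torch.nn as nn

def enc_block(cin, cout):
    return nn.Sequential(nn.Conv2d(cin, cout, 4, stride=2, padding=1),
                         nn.BatchNorm2d(cout), nn.LeakyReLU(0.2, inplace=True))

def dec_block(cin, cout):
    return nn.Sequential(nn.ConvTranspose2d(cin, cout, 4, stride=2, padding=1),
                         nn.BatchNorm2d(cout), nn.ReLU(inplace=True))

encoder = nn.Sequential(                       # 3 x 128 x 128 input
    enc_block(3, 64), enc_block(64, 64), enc_block(64, 128),
    enc_block(128, 256), enc_block(256, 512),  # -> 512 x 4 x 4
    nn.Conv2d(512, 4000, 4),                   # -> 4000 x 1 x 1 bottleneck
)

decoder = nn.Sequential(
    nn.ConvTranspose2d(4000, 512, 4),          # 1x1 -> 4x4
    nn.BatchNorm2d(512), nn.ReLU(inplace=True),
    dec_block(512, 256), dec_block(256, 128),  # 4 -> 8 -> 16
    dec_block(128, 64), dec_block(64, 64),     # 16 -> 32 -> 64
    nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1),  # 64 -> 128
    nn.Tanh(),
)

x = torch.randn(1, 3, 128, 128)
print(decoder(encoder(x)).shape)   # torch.Size([1, 3, 128, 128]); for center-region
# dropout the paper instead produces only the center patch plus the 7px overlap.
```
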

B. Feature Learning
We use the AlexNet [26] architecture for the encoder so that we can compare the learned features with prior works, which are trained using ImageNet labels and other un/self-supervised techniques. The encoder is AlexNet until pool5, followed by a channel-wise fully-connected layer, and the decoder is a series of up-convolutional layers until we reach the target size. The input image size is 227 × 227. Unfortunately, we couldn't train the adversary with the AlexNet encoder, so it is trained with reconstruction loss. See Figure 9b for exact architecture details. For the pre-training experiments in Section 5.2, we randomly initialize the fully-connected layers, i.e., fc6 and fc7, while starting from context encoder weights.

C. Additional Results
Finally, we show additional inpainting results using our context encoders in Figure 10. These results, in comparison to nearest-neighbor inpainting, show that: (a) the features learned by the context encoder are semantically meaningful and retrieve neighboring patches just by looking at the context, which is also verified quantitatively in Table 2; and (b) our context encoder doesn't memorize the examples from the training set, but rather produces realistic and coherent inpainting results which are much better than nearest neighbor inpainting both qualitatively (Figure 10) and quantitatively (Table 1).
Figure 9: Context encoder training architectures. (a) Context encoder trained with joint reconstruction and adversarial loss for semantic inpainting; the illustration is shown for center region dropout, and a similar architecture holds for arbitrary region dropout as well (see Section 3.2). The encoder, decoder, and adversarial discriminator are built from stacks of 4×4 convolutions and up-convolutions around a 4000-unit bottleneck. (b) Context encoder trained with reconstruction loss for feature learning by filling in arbitrary region dropouts in the input: AlexNet (until pool5), a channel-wise fully-connected layer over the 9216-dimensional feature, and a series of 5×5 up-convolutions resized back to 227 × 227.


Figure 10: Semantic inpainting using different methods on held-out images. Context Encoder with just L2 is well aligned, but not sharp. Using adversarial loss, results are sharp but not coherent. The joint loss alleviates the weaknesses of each of them. The last two columns are the results if we plug in the best nearest neighbor (NN) patch in the masked region. Columns: Image, Ours (L2), Ours (Adv), Ours (L2+Adv), NN-Inpainting w/ our features, NN-Inpainting w/ HOG.
