Image Transformer
Abstract
Image generation has been successfully cast as an autoregressive sequence generation or transformation problem.
Table 2. On the left are image completions from our best conditional generation model, where we sample the second half. On the right are samples from our four-fold super-resolution model trained on CIFAR-10. The images look realistic and plausible, show good diversity among the completion samples, and the super-resolution outputs carry surprising detail given the coarse inputs.
For ordinal values, we run a 1x3 window size, 1x3 strided convolution to combine the 3 channels per pixel to form an input representation with shape [h, w, d].

To each pixel representation, we add a d-dimensional encoding of the coordinates of that pixel. We evaluated two different coordinate encodings: sine and cosine functions of the coordinates, with different frequencies across different dimensions, following (Vaswani et al., 2017), and learned position embeddings. Since we need to represent two coordinates, we use d/2 of the dimensions to encode the row number and the other d/2 of the dimensions to encode the column and color channel.
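To make the coordinate encoding concrete, the following is a minimal NumPy sketch of the sinusoidal variant with the d/2 split described above. It is not the authors' implementation; the function names and the frequency base of 10000 (borrowed from Vaswani et al., 2017) are our own choices, and d is assumed to be divisible by 4.

```python
import numpy as np

def sinusoidal_encoding(positions, dims):
    """Sine/cosine encoding of integer coordinates, with different frequencies per dimension."""
    half = dims // 2
    freqs = 1.0 / (10000.0 ** (np.arange(half) / half))        # [half] frequencies
    angles = positions[:, None] * freqs[None, :]               # [n, half]
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)  # [n, dims]

def coordinate_encoding(h, w, d):
    """d-dimensional encoding per position: d/2 dims for the row, d/2 for the column."""
    rows = np.repeat(np.arange(h), w)   # row index of each of the h*w positions
    cols = np.tile(np.arange(w), h)     # column index of each position
    enc = np.concatenate([sinusoidal_encoding(rows, d // 2),
                          sinusoidal_encoding(cols, d // 2)], axis=-1)
    return enc.reshape(h, w, d)         # added to the [h, w, d] input representation
```

The learned-embedding variant could instead look up a trainable row table and column table of d/2 dimensions each and concatenate the two.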
3.2. Self-Attention

Figure 1. A slice of one layer of the Image Transformer, recomputing the representation q' of a single channel of one pixel q by attending to a memory of previously generated pixels m1, m2, .... After performing local self-attention we apply a two-layer position-wise feed-forward neural network with the same parameters for all positions in a given layer. Self-attention and the feed-forward networks are followed by dropout and bypassed by a residual connection with subsequent layer normalization. The position encodings pq, p1, ... are added only in the first layer.

For image-conditioned generation, as in our super-resolution models, we use an encoder-decoder architecture. The encoder generates a contextualized, per-pixel-channel representation of the source image. The decoder autoregressively generates an output image of pixel intensities, one channel per pixel at each time step. While doing so, it consumes the previously generated pixels and the input image representation generated by the encoder. For both the encoder and the decoder, the Image Transformer uses stacks of self-attention and position-wise feed-forward layers, similar to (Vaswani et al., 2017). In addition, the decoder uses an attention mechanism to consume the encoder representation. For unconditional and class-conditional generation, we employ the Image Transformer in a decoder-only configuration.

Before we describe how we scale self-attention to images comprised of many more positions than typically found in sentences, we give a brief description of self-attention.
Each self-attention layer computes a d-dimensional representation for each position, that is, each channel of each pixel. To recompute the representation for a given position, it first compares the position's current representation to other positions' representations, obtaining an attention distribution over the other positions. This distribution is then used to weight the contribution of the other positions' representations to the next representation for the position at hand.
Equations 1 and 2 outline the computation in our self-attention and fully-connected feed-forward layers; Figure 1 depicts it. W1 and W2 are the parameters of the feed-forward layer, and are shared across all the positions in a layer. These fully describe all operations performed in every layer, independently for each position, with the exception of multi-head attention. For details of multi-head self-attention, see (Vaswani et al., 2017).

qa = layernorm(q + dropout(softmax(Wq q (M Wk)^T / √d) M Wv))    (1)

q' = layernorm(qa + dropout(W1 ReLU(W2 qa)))    (2)
In more detail, following previous work, we call the current representation of the pixel's channel, or position, to be recomputed the query q. The other positions whose representations will be used in computing a new representation for q are m1, m2, ..., which together comprise the columns of the memory matrix M. Note that M can also contain q. We first transform q and M linearly by learned matrices Wq and Wk, respectively.

The self-attention mechanism then compares q to each of the pixel's channel representations in the memory with a dot-product, scaled by 1/√d. We apply the softmax function to the resulting compatibility scores, treating the obtained vector as attention distribution over the pixel channels in the memory. After applying another linear transformation Wv to the memory M, we compute a weighted average of the transformed memory, weighted by the attention distribution. In the decoders of our different models we mask the outputs of the comparisons appropriately so that the model cannot attend to positions in the memory that have not yet been generated.

To the resulting vector we then apply a single-layer fully-connected feed-forward neural network with rectified linear activation, followed by another linear transformation. The learned parameters of these are shared across all positions but different from layer to layer.

As illustrated in Figure 1, we perform dropout, merge in residual connections and perform layer normalization after each application of self-attention and the position-wise feed-forward networks (Ba et al., 2016; Srivastava et al., 2014).

The entire self-attention operation can be implemented using highly optimized matrix multiplication code and executed in parallel for all pixels' channels.
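As a concrete illustration of Equations 1 and 2, here is a minimal single-head NumPy sketch; it is not the reference Tensor2Tensor implementation, dropout is omitted, layer normalization is unparameterized, and all function and variable names are ours. The mask argument implements the decoder-side masking described above.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # simplified layer normalization without learned gain and bias
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def self_attention_layer(Q, M, params, mask=None):
    """Single-head sketch of Eq. 1 and 2.
    Q: [n, d] current representations (queries); M: [m, d] memory; mask: [n, m] of 0 / -inf."""
    Wq, Wk, Wv, W1, W2 = params                        # Wq, Wk, Wv: [d, d]; W2: [d, d_ff]; W1: [d_ff, d]
    d = Q.shape[-1]
    scores = (Q @ Wq) @ (M @ Wk).T / np.sqrt(d)        # scaled pairwise comparisons
    if mask is not None:                               # hide not-yet-generated positions
        scores = scores + mask
    q_a = layer_norm(Q + softmax(scores) @ (M @ Wv))   # Eq. 1 (dropout omitted)
    hidden = np.maximum(0.0, q_a @ W2)                 # position-wise feed-forward, ReLU
    return layer_norm(q_a + hidden @ W1)               # Eq. 2 (dropout omitted)
```

The multi-head variant used in the paper splits the d dimensions into several heads and concatenates their outputs, as in (Vaswani et al., 2017).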
3.3. Local Self-Attention

The number of positions included in the memory lm, or the number of columns of M, has tremendous impact on the scalability of the self-attention mechanism, which has a time complexity in O(h · w · lm · d).

The encoders of our super-resolution models operate on 8×8 pixel images and it is computationally feasible to attend to all of their 192 positions. The decoders in our experiments, however, produce 32×32 pixel images with 3072 positions, rendering attending to all positions impractical.

Inspired by convolutional neural networks, we address this by adopting a notion of locality, restricting the positions in the memory matrix M to a local neighborhood around the query position. Changing this neighborhood per query position, however, would prohibit packing most of the computation necessary for self-attention into two matrix multiplications - one for computing the pairwise comparisons and another for generating the weighted averages. To avoid this, we partition the image into query blocks and associate each of these with a larger memory block that also contains the query block. For all queries from a given query block, the model attends to the same memory matrix, comprised of all positions from the memory block. The self-attention is then computed for all query blocks in parallel. The feed-forward networks and layer normalizations are computed in parallel for all positions.

In our experiments we use two different schemes for choosing query blocks and their associated memory block neighborhoods, resulting in two different factorizations of the joint pixel distribution into conditional distributions. Both are illustrated in Figure 2.
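The following is a rough NumPy sketch of the 1D blocking scheme, assuming the image has been flattened into a sequence in generation order; blocks are processed sequentially here for clarity, whereas the actual computation batches all query blocks into single matrix multiplications. It reuses self_attention_layer from the sketch above, the default block sizes mirror the lq = 256 and total memory size of 512 reported in Section 5.1, and all names are ours.

```python
import numpy as np

def local_1d_attention(X, params, l_q=256, l_m=512):
    """Sketch of 1D-blocked local self-attention over a flattened image X: [n, d].
    Each contiguous query block of l_q positions attends to a memory block made of
    the query block itself plus the (l_m - l_q) positions generated before it."""
    n, _ = X.shape
    outputs = np.zeros_like(X)
    for start in range(0, n, l_q):
        end = min(start + l_q, n)
        mem_start = max(0, start - (l_m - l_q))
        Q = X[start:end]                      # query block
        M = X[mem_start:end]                  # memory block (contains the query block)
        # mask so that a query cannot attend to memory positions generated after it
        q_idx = np.arange(start, end)[:, None]
        m_idx = np.arange(mem_start, end)[None, :]
        mask = np.where(m_idx <= q_idx, 0.0, -1e9)
        outputs[start:end] = self_attention_layer(Q, M, params, mask)
    return outputs
```

The 2D scheme differs only in how the query and memory blocks are carved out of the image plane; both are illustrated in Figure 2.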
Table 3. Conditional image generations for all CIFAR-10 categories. Images on the left are from a model that achieves 3.03 bits/dim on the test set. Images on the right are from our best non-averaged model with 2.99 bits/dim. Both models are able to generate convincing cars, trucks, and ships. Generated horses, planes, and birds also look reasonable.

5.1. Generative Image Modeling

Our unconditioned and class-conditioned image generation models both use 1D local attention, with lq = 256 and a total memory size of 512. On CIFAR-10 our best unconditional models achieve a perplexity of 2.90 bits/dim on the test set using either DMOL or categorical. For categorical, we use 12 layers with d = 512, heads = 4, feed-forward dimension 2048 with a dropout of 0.3. In DMOL, our best config uses 14 layers, d = 256, heads = 8, feed-forward dimension 512 and a dropout of 0.2. This is a considerable improvement over two baselines: the PixelRNN (van den Oord et al., 2016a) and PixelCNN++ (Salimans et al.). Introduced after the Image Transformer, the PixelSNAIL model, which is also based on self-attention, reaches a significantly lower perplexity of 2.85 bits/dim on CIFAR-10 (Chen et al., 2017). On the more challenging ImageNet data set, however, the Image Transformer performs significantly better than PixelSNAIL.

We also train smaller 8-layer CIFAR-10 models which have d = 512, 1024 dimensions in the feed-forward layers, 8 attention heads and use dropout of 0.1, and achieve 3.03 bits/dim, matching the PixelCNN model (van den Oord et al., 2016a). Our best CIFAR-10 model with DMOL has d and feed-forward layer dimension of 256 and performs attention in 512 dimensions.
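Restated as a configuration object, the best categorical CIFAR-10 model above corresponds to something like the following; this is a hypothetical summary in our own key names, not actual Tensor2Tensor hyperparameters.

```python
# Hypothetical restatement of the best categorical CIFAR-10 configuration described above.
best_categorical_cifar10 = {
    "num_layers": 12,            # self-attention + feed-forward layers
    "hidden_size": 512,          # d
    "num_heads": 4,
    "feed_forward_dim": 2048,
    "dropout": 0.3,
    "query_block_length": 256,   # lq for 1D local attention
    "memory_length": 512,        # total memory size
}
```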
ImageNet is a much larger dataset, with many more categories than CIFAR-10, requiring more parameters in a generative model. Our ImageNet unconditioned generation model has 12 self-attention and feed-forward layers, d = 512, 8 attention heads, 2048 dimensions in the feed-forward layers, and dropout of 0.1. It significantly outperforms the Gated PixelCNN and establishes a new state of the art of 3.77 bits/dim with checkpoint averaging. We trained only unconditional generative models on ImageNet, since class labels were not available in the dataset provided by (van den Oord et al., 2016a).

Table 4 shows that growing the receptive field improves perplexity significantly. We believe this to highlight a key advantage of local self-attention over CNNs: namely that the number of parameters used by local self-attention is independent of the size of the receptive field. Furthermore, while d > receptive field, self-attention still requires fewer floating-point operations.

For experiments with the categorical distribution we evaluated both coordinate encoding schemes described in Section 3.3 and found no difference in quality. For DMOL we only evaluated learned coordinate embeddings.

5.2. Conditioning on Image Class

We represent the image classes as learned d-dimensional embeddings per class and simply add the respective embedding to the input representation of every input position, together with the positional encodings.
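A minimal sketch of this conditioning, reusing coordinate_encoding from the earlier sketch; the helper and its interface are our own.

```python
import numpy as np

def class_conditioned_input(pixel_repr, class_id, class_embeddings, coord_enc):
    """pixel_repr: [h, w, d] input representation; class_embeddings: [num_classes, d];
    coord_enc: [h, w, d] coordinate encodings. The learned class embedding is simply
    added at every position, together with the coordinate encodings."""
    return pixel_repr + coord_enc + class_embeddings[class_id]
```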
We trained the class-conditioned Image Transformer on CIFAR-10, achieving very similar log-likelihoods as in unconditioned generation. The perceptual quality of generated
Table 6. Images from our 1D and 2D local attention super-resolution models trained on CelebA, sampled with different temperatures. 2D
local attention with τ = 0.9 scored highest in our human evaluation study.
CIFAR-10. We also trained a super-resolution model on the CIFAR-10 data set. Our model reached a negative log-likelihood of 2.76 using 1D local attention and 2.78 using 2D local attention on the test set. As seen in Figure 2, our model commonly generates plausible looking objects even though the input images seem to barely show any discernible structure beyond coarse shapes.
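The temperatures τ in Table 6 refer to tempered sampling from the model's per-channel output distribution. The following is a minimal sketch for a categorical output over intensity values; the helper and its name are ours.

```python
import numpy as np

def sample_channel(logits, tau=0.9, rng=np.random.default_rng()):
    """Sample one pixel-channel intensity from categorical logits at temperature tau.
    tau < 1 concentrates mass on high-probability intensities; tau = 1 is plain sampling."""
    scaled = logits / tau
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    return rng.choice(len(logits), p=probs)
```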
6. Conclusion

In this work we demonstrate that models based on self-attention can operate effectively on modalities other than text, and through local self-attention scale to significantly larger structures than sentences. With fewer layers, its larger receptive fields allow the Image Transformer to significantly improve over the state of the art in unconditional, probabilistic image modeling of comparatively complex images from ImageNet as well as super-resolution.

We further hope to have provided additional evidence that, even in the light of generative adversarial networks, likelihood-based models of images are very much a promising area for further research - as is using network architectures such as the Image Transformer in GANs.

In future work we would like to explore a broader variety of conditioning information, including free-form text, as previously proposed (Mansimov et al., 2015), and tasks combining modalities such as language-driven editing of images.

Fundamentally, we aim to move beyond still images to video (Kalchbrenner et al., 2016) and towards applications in model-based reinforcement learning.

References

Ba, Jimmy Lei, Kiros, Jamie Ryan, and Hinton, Geoffrey E. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.

Bellemare, Marc G., Srinivasan, Sriram, Ostrovski, Georg, Schaul, Tom, Saxton, David, and Munos, Rémi. Unifying count-based exploration and intrinsic motivation. CoRR, abs/1606.01868, 2016. URL http://arxiv.org/abs/1606.01868.

Bengio, Yoshua and Bengio, Samy. Modeling high-dimensional discrete data with multi-layer neural networks. In Neural Information Processing Systems, pp. 400–406. MIT Press, 2000.
Berthelot, David, Schumm, Tom, and Metz, Luke. BEGAN: boundary equilibrium generative adversarial networks. CoRR, abs/1703.10717, 2017. URL http://arxiv.org/abs/1703.10717.

Chen, Xi, Mishra, Nikhil, Rohaninejad, Mostafa, and Abbeel, Pieter. PixelSNAIL: An improved autoregressive generative model. arXiv preprint arXiv:1712.09763, 2017.

Cheng, Jianpeng, Dong, Li, and Lapata, Mirella. Long short-term memory-networks for machine reading. arXiv preprint arXiv:1601.06733, 2016.

Dahl, Ryan, Norouzi, Mohammad, and Shlens, Jonathan. Pixel recursive super resolution. 2017. URL https://arxiv.org/abs/1702.00783.

Goodfellow, Ian, Pouget-Abadie, Jean, Mirza, Mehdi, Xu, Bing, Warde-Farley, David, Ozair, Sherjil, Courville, Aaron, and Bengio, Yoshua. Generative adversarial nets, 2014.

Kalchbrenner, Nal and Blunsom, Phil. Recurrent continuous translation models. In Proceedings of EMNLP 2013, pp. 1700–1709, 2013. URL http://nal.co/papers/KalchbrennerBlunsom_EMNLP13.

Kalchbrenner, Nal, van den Oord, Aäron, Simonyan, Karen, Danihelka, Ivo, Vinyals, Oriol, Graves, Alex, and Kavukcuoglu, Koray. Video pixel networks. CoRR, abs/1610.00527, 2016. URL http://arxiv.org/abs/1610.00527.

Kingma, Diederik and Ba, Jimmy. Adam: A method for stochastic optimization. In ICLR, 2015.

Larochelle, Hugo and Murray, Iain. The neural autoregressive distribution estimator. In Proceedings of the 14th International Conference on Artificial Intelligence and Statistics, volume 15 of JMLR: W&CP, pp. 29–37, 2011.

Ledig, Christian, Theis, Lucas, Huszar, Ferenc, Caballero, Jose, Aitken, Andrew, Tejani, Alykhan, Totz, Johannes, Wang, Zehan, and Shi, Wenzhe. Photo-realistic single image super-resolution using a generative adversarial network. arXiv:1609.04802, 2016.

Mansimov, Elman, Parisotto, Emilio, Ba, Lei Jimmy, and Salakhutdinov, Ruslan. Generating images from captions with attention. CoRR, abs/1511.02793, 2015. URL http://arxiv.org/abs/1511.02793.

Metz, Luke, Poole, Ben, Pfau, David, and Sohl-Dickstein, Jascha. Unrolled generative adversarial networks. CoRR, abs/1611.02163, 2016. URL http://arxiv.org/abs/1611.02163.

Parikh, Ankur, Täckström, Oscar, Das, Dipanjan, and Uszkoreit, Jakob. A decomposable attention model. In Empirical Methods in Natural Language Processing, 2016. URL https://arxiv.org/pdf/1606.01933.pdf.

Radford, Alec, Metz, Luke, and Chintala, Soumith. Unsupervised representation learning with deep convolutional generative adversarial networks. CoRR, abs/1511.06434, 2015. URL http://arxiv.org/abs/1511.06434.

Salimans, Tim, Karpathy, Andrej, Chen, Xi, Kingma, Diederik P., and Bulatov, Yaroslav. PixelCNN++: A PixelCNN implementation with discretized logistic mixture likelihood and other modifications. In International Conference on Learning Representations.

Srivastava, Nitish, Hinton, Geoffrey E., Krizhevsky, Alex, Sutskever, Ilya, and Salakhutdinov, Ruslan. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929–1958, 2014.

Theis, Lucas and Bethge, Matthias. Generative image modeling using spatial LSTMs. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 2, NIPS'15, pp. 1927–1935, Cambridge, MA, USA, 2015. MIT Press. URL http://dl.acm.org/citation.cfm?id=2969442.2969455.

van den Oord, Aäron and Schrauwen, Benjamin. The student-t mixture as a natural image patch prior with application to image compression. Journal of Machine Learning Research, 15:2061–2086, 2014. URL http://jmlr.org/papers/v15/vandenoord14a.html.

van den Oord, Aäron, Kalchbrenner, Nal, and Kavukcuoglu, Koray. Pixel recurrent neural networks. ICML, 2016a.

van den Oord, Aäron, Kalchbrenner, Nal, Vinyals, Oriol, Espeholt, Lasse, Graves, Alex, and Kavukcuoglu, Koray. Conditional image generation with PixelCNN decoders. NIPS, 2016b.

Vaswani, Ashish, Shazeer, Noam, Parmar, Niki, Uszkoreit, Jakob, Jones, Llion, Gomez, Aidan N., Kaiser, Lukasz, and Polosukhin, Illia. Attention is all you need. 2017. URL http://arxiv.org/abs/1706.03762.

Vaswani, Ashish, Bengio, Samy, Brevdo, Eugene, Chollet, Francois, Gomez, Aidan N., Gouws, Stephan, Jones, Llion, Kaiser, Łukasz, Kalchbrenner, Nal, Parmar, Niki, Sepassi, Ryan, Shazeer, Noam, and Uszkoreit, Jakob. Tensor2Tensor for neural machine translation. CoRR, abs/1803.07416, 2018. URL http://arxiv.org/abs/1803.07416.