0% found this document useful (0 votes)

73 views18 pages

Cordonnier 2020

This paper examines the relationship between self-attention and convolutional layers. It provides both theoretical and empirical evidence that self-attention layers can perform convolution and often learn to do so in practice. Specifically, the paper proves that a multi-head self-attention layer with sufficient number of heads is at least as expressive as any convolutional layer. Experiments also show that self-attention layers attend to pixel-grid patterns similarly to convolutional neural network layers, supporting the theoretical analysis.

Uploaded by

Victor Flores Benites

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

73 views18 pages

Cordonnier 2020

Uploaded by

Victor Flores Benites

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 18

Published as a conference paper at ICLR 2020

ON THE R ELATIONSHIP BETWEEN S ELF -ATTENTION

AND C ONVOLUTIONAL L AYERS

Jean-Baptiste Cordonnier, Andreas Loukas & Martin Jaggi

École Polytechnique Fédérale de Lausanne (EPFL)
{first.last}@epfl.ch

A BSTRACT
arXiv:1911.03584v2 [cs.LG] 10 Jan 2020

Recent trends of incorporating attention mechanisms in vision have led re-

searchers to reconsider the supremacy of convolutional layers as a primary build-
ing block. Beyond helping CNNs to handle long-range dependencies, Ramachan-
dran et al. (2019) showed that attention can completely replace convolution and
achieve state-of-the-art performance on vision tasks. This raises the question: do
learned attention layers operate similarly to convolutional layers? This work pro-
vides evidence that attention layers can perform convolution and, indeed, they of-
ten learn to do so in practice. Specifically, we prove that a multi-head self-attention
layer with sufficient number of heads is at least as expressive as any convolutional
layer. Our numerical experiments then show that self-attention layers attend to
pixel-grid patterns similarly to CNN layers, corroborating our analysis. Our code
is publicly available1 .

1 I NTRODUCTION
Recent advances in Natural Language Processing (NLP) are largely attributed to the rise of the trans-
former (Vaswani et al., 2017). Pre-trained to solve an unsupervised task on large corpora of text,
transformer-based architectures, such as GPT-2 (Radford et al., 2018), BERT (Devlin et al., 2018)
and Transformer-XL (Dai et al., 2019), seem to possess the capacity to learn the underlying structure
of text and, as a consequence, to learn representations that generalize across tasks. The key differ-
ence between transformers and previous methods, such as recurrent neural networks (Hochreiter &
Schmidhuber, 1997) and convolutional neural networks (CNN), is that the former can simultane-
ously attend to every word of their input sequence. This is made possible thanks to the attention
mechanism—originally introduced in Neural Machine Translation to better handle long-range de-
pendencies (Bahdanau et al., 2015). With self-attention in particular, the similarity of two words in
a sequence is captured by an attention score measuring the distance of their representations. The
representation of each word is then updated based on those words whose attention score is highest.
Inspired by its capacity to learn meaningful inter-dependencies between words, researchers have
recently considered utilizing self-attention in vision tasks. Self-attention was first added to CNN
by either using channel-based attention (Hu et al., 2018) or non-local relationships across the image
(Wang et al., 2018). More recently, Bello et al. (2019) augmented CNNs by replacing some convolu-
tional layers with self-attention layers, leading to improvements on image classification and object
detection tasks. Interestingly, Ramachandran et al. (2019) noticed that, even though state-of-the
art results are reached when attention and convolutional features are combined, under same com-
putation and model size constraints, self-attention-only architectures also reach competitive image
classification accuracy.
These findings raise the question, do self-attention layers process images in a similar manner to
convolutional layers? From a theoretical perspective, one could argue that transfomers have the
capacity to simulate any function—including a CNN. Indeed, Pérez et al. (2019) showed that a multi-
layer attention-based architecture with additive positional encodings is Turing complete under some
strong theoretical assumptions, such as unbounded precision arithmetic. Unfortunately, universality
results do not reveal how a machine solves a task, only that it has the capacity to do so. Thus, the
question of how self-attention layers actually process images remains open.
1
Code: github.com/epfml/attention-cnn. Website: epfml.github.io/attention-cnn.

1
Published as a conference paper at ICLR 2020

Contributions. In this work, we put forth theoretical and empirical evidence that self-attention
layers can (and do) learn to behave similar to convolutional layers:
I. From a theoretical perspective, we provide a constructive proof showing that self-attention
layers can express any convolutional layers.
Specifically, we show that a single multi-head self-attention layer using relative positional encoding
can be re-parametrized to express any convolutional layer.
II. Our experiments show that the first few layers of attention-only architectures (Ramachan-
dran et al., 2019) do learn to attend on grid-like pattern around each query pixel, similar to
our theoretical construction.

Strikingly, this behavior is confirmed both for our quadratic encoding, but also for relative encoding
that is learned. Our results seem to suggest that localized convolution is the right inductive bias
for the first few layers of an image classifying network. We provide an interactive website2 to
explore how self-attention exploits localized position-based attention in lower layers and content-
based attention in deeper layers. For reproducibility purposes, our code is publicly available.

2 BACKGROUND ON ATTENTION M ECHANISMS FOR V ISION

We here recall the mathematical formulation of self-attention layers and emphasize the role of posi-
tional encodings.

2.1 T HE M ULTI -H EAD S ELF -ATTENTION L AYER

Let X ∈ RT ×Din be an input matrix consisting of T tokens in of Din dimensions each. While in
NLP each token corresponds to a word in a sentence, the same formalism can be applied to any
sequence of T discrete objects, e.g. pixels. A self-attention layer maps any query token t ∈ [T ]
from Din to Dout dimensions as follows:
Self-Attention(X)t,: := softmax (At,: ) XWval , (1)
where we refer to the elements of the T × T matrix
>
A := XWqry Wkey X> (2)
as attention scores and the softmax output3 as attention probabilities. The layer is parametrized
by a query matrix Wqry ∈ RDin ×Dk , a key matrix Wkey ∈ RDin ×Dk and a value matrix Wval ∈
RDin ×Dout .For simplicity, we exclude any residual connections, batch normalization and constant
factors.
A key property of the self-attention model described above is that it is equivariant to reordering, that
is, it gives the same output independently of how the T input tokens are shuffled. This is problematic
for cases we expect the order of things to matter. To alleviate the limitation, a positional encoding
is learned for each token in the sequence (or pixel in an image), and added to the representation of
the token itself before applying self-attention
>
A := (X + P )Wqry Wkey (X + P )> , (3)
where P ∈ RT ×Din contains the embedding vectors for each position. More generally, P may be
substituted by any function that returns a vector representation of the position.
It has been found beneficial in practice to replicate this self-attention mechanism into multiple heads,
each being able to focus on different parts of the input by using different query, key and value
matrices. In multi-head self-attention, the output of the Nh heads of output dimension Dh are
concatenated and projected to dimension Dout as follows:

MHSA(X) := concat Self-Attentionh (X) Wout + bout (4)
h∈[Nh ]

and two new parameters are introduced: the projection matrix Wout ∈ RNh Dh ×Dout and a bias term
bout ∈ RDout .
2
epfml.github.io/attention-cnn
3 P
softmax (At,: )k = exp(At,k )/ p exp(At,p )

2
Published as a conference paper at ICLR 2020

2.2 ATTENTION FOR I MAGES

Convolutional layers are the de facto choice for building neural networks that operate on images.
We recall that, given an image tensor X ∈ RW ×H×Din of width W , height H and Din channels, the
output of a convolutional layer for pixel (i, j) is given by
X
Conv(X)i,j,: := Xi+δ1 ,j+δ2 ,: Wδ1 ,δ2 ,:,: + b, (5)
(δ1 ,δ2 )∈∆
∆K

where W is the K × K × Din × Dout weight tensor 4 , b ∈ RDout is the bias vector and the set

K K K K
∆K := −
∆ ,··· , × − ,··· ,
2 2 2 2
contains all possible shifts appearing when convolving the image with a K × K kernel.
In the following, we review how self-attention can be adapted from 1D sequences to images.
With images, rather than tokens, we have query and key pixels q, k ∈ [W ] × [H]. Accordingly, the
input is a tensor X of dimension W × H × Din and each attention score associates a query and a key
pixel.
To keep the formulas consistent with the 1D case, we abuse notation and slice tensors by using a 2D
index vector: if p = (i, j), we write Xp,: and Ap,: to mean Xi,j,: and Ai,j,:,: , respectively. With this
notation in place, the multi-head self attention layer output at pixel q can be expressed as follows:
X
Self-Attention(X)q,: = softmax (Aq,: )k Xk,: Wval (6)
k
and accordingly for the multi-head case.

2.3 P OSITIONAL E NCODING FOR I MAGES

There are two types of positional encoding that has been used in transformer-based architectures:
the absolute and relative encoding (see also Table 3 in the Appendix).
With absolute encodings, a (fixed or learned) vector Pp,: is assigned to each pixel p. The computa-
tion of the attention scores we saw in eq. (2) can then be decomposed as follows:
Aabs >
q,k = (Xq,: + Pq,: )Wqry Wkey (Xk,: + Pk,: )
>

> > > > > >

= Xq,: Wqry Wkey Xk,: + Xq,: Wqry Wkey Pk,: + Pq,: Wqry Wkey Xk,: + Pq,: Wqry Wkey Pk,: (7)
where q and k correspond to the query and key pixels, respectively.
The relative positional encoding was introduced by Dai et al. (2019). The main idea is to only
consider the position difference between the query pixel (pixel we compute the representation of)
and the key pixel (pixel we attend) instead of the absolute position of the key pixel:
> >
Arel > >c > >c
q,k := Xq,: Wqry Wkey Xk,: + Xq,: Wqry Wkey rδ + u Wkey Xk,: + v Wkey rδ (8)
In this manner, the attention scores only depend on the shift δ := k − q. Above, the learnable
vectors u and v are unique for each head, whereas for every shift δ the relative positional encoding
rδ ∈ RDp is shared by all layers and heads. Moreover, now the key weights are split into two types:
Wkey pertain to the input and W
ckey to the relative position of pixels.

3 S ELF -ATTENTION AS A C ONVOLUTIONAL L AYER

This section derives sufficient conditions such that a multi-head self-attention layer can simulate a
convolutional layer. Our main result is the following:
Theorem 1. A multi-head self-attention layer with Nh heads of dimension Dh , output dimen-
sion Dout and a relative
√ positional
√ encoding of dimension Dp ≥ 3 can express any convolutional
layer of kernel size Nh × Nh and min(Dh , Dout ) output channels.
4
To simplify notation, we index the first two dimensions of the tensor from −bK/2c to bK/2c.

3
Published as a conference paper at ICLR 2020

The theorem is proven constructively by selecting the parameters of the multi-head self-attention
layer so that the latter acts like a convolutional layer. In the proposed construction, the attention
scores of each self-attention head should attend to a different relative shift within the set ∆
∆K =
{−bK/2c, . . . , bK/2c}2 of all pixel shifts in a K × K kernel. The exact condition can be found in
the statement of Lemma 1.
Then, Lemma 2 shows that the aforementioned condition is satisfied for the relative positional en-
coding that we refer to as the quadratic encoding:
(h) (h)
v (h) := −α(h) (1, −2∆1 , −2∆2 ) rδ := (kδk2 , δ1 , δ2 ) Wqry = Wkey := 0 W
dkey := I (9)

(h) (h)
The learned parameters ∆(h) = (∆1 , ∆2 ) and α(h) determine the center and width of attention
of each head, respectively. On the other hand, δ = (δ1 , δ2 ) is fixed and expresses the relative shift
between query and key pixels.
It is important to stress that the above encoding is not the only one for which the conditions of
Lemma 1 are satisfied. In fact, in our experiments, the relative encoding learned by the neural
network also matched the conditions of the lemma (despite being different from the quadratic en-
coding). Nevertheless, the encoding defined above is very efficient in terms of size, as only Dp = 3
dimensions suffice to encode the relative position of pixels, while also reaching similar or better
empirical performance (than the learned one).
The theorem covers the general convolution operator as defined in eq. (17). However, machine
learning practitioners using differential programming frameworks (Paszke et al., 2017; Abadi et al.,
2015) might question if the theorem holds for all hyper-parameters of 2D convolutional layers:

• Padding: a multi-head self-attention layer uses by default the "SAME" padding while a
convolutional layer would decrease the image size by K − 1 pixels. The correct way to
alleviate these boundary effects is to pad the input image with bK/2c zeros on each side.
In this case, the cropped output of a MHSA and a convolutional layer are the same.
• Stride: a strided convolution can be seen as a convolution followed by a fixed pooling
operation—with computational optimizations. Theorem 1 is defined for stride 1, but a
fixed pooling layer could be appended to the Self-Attention layer to simulate any stride.
• Dilation: a multi-head self-attention layer can express any dilated convolution as each head
can attend a value at any pixel shift and form a (dilated) grid pattern.

Remark for the 1D case. Convolutional layers acting on sequences are commonly used in the lit-
erature for text (Kim, 2014), as well as audio (van den Oord et al., 2016) and time series (Franceschi
et al., 2019). Theorem 1 can be straightforwardly extended to show that multi-head self-attention
with Nh heads can also simulate a 1D convolutional layer with a kernel of size K = Nh with
min(Dh , Dout ) output channels using a positional encoding of dimension Dp ≥ 2. Since we have
not tested empirically if the preceding construction matches the behavior of 1D self-attention in
practice, we cannot claim that it actually learns to convolve an input sequence—only that it has the
capacity to do so.

P ROOF OF M AIN T HEOREM

The proof follows directly from Lemmas 1 and 2 stated below:

Lemma 1. Consider a multi-head self-attention layer consisting of Nh = K 2 heads, Dh ≥ Dout
and let f : [Nh ] → ∆ ∆K be a bijective mapping of heads onto shifts. Further, suppose that for
every head the following holds:

(h) 1 if f (h) = q − k
softmax(Aq,: )k = (10)
0 otherwise.

Then, for any convolutional layer with a K × K kernel and Dout output channels, there exists
(h)
{Wval }h∈[Nh ] such that MHSA(X) = Conv(X) for every X ∈ RW ×H×Din .

4
Published as a conference paper at ICLR 2020

Multi-Head Self-Attention Layer

concatenate
a key pixel the query pixel
at position at position

Filter matrices

Attention maps for pixel

Figure 1: Illustration of a Multi-Head Self-Attention layer applied to a tensor image X. Each head h
(h)
attends pixel values around shift ∆(h) and learn a filter matrix Wval . We show attention maps
computed for a query pixel at position q.

(h)
Note that each head’s value matrix Wval ∈ RDin ×Dh and each block of the projection matrix Wout
of dimension Dh × Dout are learned. Assuming that Dh ≥ Dout , we can replace each pair of
matrices by a learned matrix W (h) for each head. We consider one output pixel of the multi-head
self-attention:
!
X X (h)
MHSA(X)q,: = softmax(Aq,: )k Xk,: W (h) + bout (12)
h∈[Nh ] k

Due to the conditions of the Lemma, for the h-th attention head the attention probability is one when
k = q − f (h) and zero otherwise. The layer’s output at pixel q is thus equal to
X
MHSA(X)q = Xq−f (h),: W (h) + bout (13)
h∈[Nh ]
√
For K = Nh , the above can be seen to be equivalent to a convolutional layer expressed in eq. 17:
there is a one to one mapping (implied by map f ) between the matrices W (h) for h = [Nh ] and the
matrices Wk1 ,k2 ,:,: for all (k1 , k2 ) ∈ [K]2 .

Remark about Dh and Dout . It is frequent in transformer-based architectures to set

Dh = Dout /Nh , hence Dh < Dout . In that case, W (h) can be seen to be of rank Dout − Dh ,
which does not suffice to express every convolutional layer with Dout channels. Nevertheless, it can
be seen that any Dh out of Dout outputs of MHSA(X) can express the output of any convolutional
layer with Dh output channels. To cover both cases, in the statement of the main theorem we assert
that the output channels of the convolutional layer should be min(Dh , Dout ). In practice, we advise
to concatenate heads of dimension Dh = Dout instead of splitting the Dout dimensions among heads
to have exact re-parametrization and no “unused” channels.
Lemma 2. There exists a relative encoding scheme {rδ ∈ RDp }δ∈Z2 with Dp ≥ 3 and parame-
ckey , u with Dp ≤ Dk such that, for every ∆ ∈ ∆
ters Wqry , Wkey , W ∆K there exists some vector v
(conditioned on ∆) yielding softmax(Aq,: )k = 1 if k − q = ∆ and zero, otherwise.
Proof. We show by construction the existence of a Dp = 3 dimensional relative encoding scheme
yielding the required attention probabilities.

5
Published as a conference paper at ICLR 2020

As the attention probabilities are independent of the input tensor X, we set Wkey = Wqry = 0 which
ckey ∈ RDk ×Dp to the identity matrix (with appropriate
leaves only the last term of eq. (8). Setting W
>
row padding), yields Aq,k = v rδ where δ := k − q. Above, we have assumed that Dp ≤ Dk
such that no information from rδ is lost.
Now, suppose that we could write:
Aq,k = −α(kδ − ∆k2 + c) (14)
for some constant c. In the above expression, the maximum attention score over Aq,: is −αc and it
is reached for Aq,k with δ = ∆. On the other hand, the α coefficient can be used to scale arbitrarily
the difference between Aq,∆ and the other attention scores.
In this way, for δ = ∆, we have
2
e−α(kδ−∆k +c)
lim softmax(Aq,: )k = lim P −α(k(k−q0 )−∆k2 +c)
k0 e
α→∞ α→∞
2
e−αkδ−∆k 1
= lim P −αk(k−q 0 )−∆k2
= P −αk(k−q 0 )−∆k2
=1
α→∞
k 0 e 1 + lim α→∞ k 6=k e
0

and for δ 6= ∆, the equation becomes limα→∞ softmax(Aq,: )k = 0, exactly as needed to satisfy
the lemma statement.
What remains is to prove that there exist v and {rδ }δ∈Z2 for which eq. (14) holds. Expanding the
RHS of the equation, we have −α(kδ − ∆k2 + c) = −α(kδk2 + k∆k2 − 2hδ, ∆i + c) . Now if we
set v = −α (1, −2∆1 , −2∆2 ) and rδ = (kδk2 , δ1 , δ2 ), then

Aq,k = v > rδ = −α(kδk2 − 2∆1 δ1 − 2∆2 δ2 ) = −α(kδk2 − 2hδ, ∆i) = −α(kδ − ∆k2 − k∆k2 ),
which matches eq. (14) with c = −k∆k2 and the proof is concluded.

Remark on the magnitude of α. The exact representation of one pixel requires α (or the matrices
Wqry and Wkey ) to be arbitrary large, despite the fact that the attention probabilities of all other
pixels converge exponentially to 0 as α grows. Nevertheless, practical implementations always rely
on finite precision arithmetic for which a constant α suffices to satisfy our construction. For instance,
since the smallest positive float32 scalar is approximately 10−45 , setting α = 46 would suffice
to obtain hard attention.

4 E XPERIMENTS

The aim of this section is to validate the applicability of our theoretical results—which state that
self-attention can perform convolution—and to examine whether self-attention layers in practice
do actually learn to operate like convolutional layers when trained on standard image classification
tasks. In particular, we study the relationship between self-attention and convolution with quadratic
and learned relative positional encodings. We find that, for both cases, the attention probabilities
learned tend to respect the conditions of Lemma 1, supporting our hypothesis.

4.1 I MPLEMENTATION D ETAILS

We study a fully attentional model consisting of six multi-head self-attention layers. As it has already
been shown by Bello et al. (2019) that combining attention features with convolutional features
improves performance on Cifar-100 and ImageNet, we do not focus on attaining state-of-the-art
performance. Nevertheless, to validate that our model learns a meaningful classifier, we compare
it to the standard ResNet18 (He et al., 2015) on the CIFAR-10 dataset (Krizhevsky et al.). In all
experiments, we use a 2 × 2 invertible down-sampling (Jacobsen et al., 2018) on the input to reduce
the size of the image. As the size of the attention coefficient tensors (stored during forward) scales
quadratically with the size of the input image, full attention cannot be applied to bigger images.
The fixed size representation of the input image is computed as the average pooling of the last layer
representations and given to a linear classifier.

6
Published as a conference paper at ICLR 2020

1.0

0.9
Models accuracy # of params # of FLOPS
Test accuracy

ResNet18 0.938 11.2M 1.1B

0.8 SA quadratic emb. 0.938 12.1M 6.2B
SA learned emb. 0.918 12.3M 6.2B
SA learned emb. + content 0.871 29.5M 15B
0.7 ResNet18
SA quadratic emb.
SA learned emb.
SA learned emb. + content-based att.
0.6
0 50 100 150 200 250 300
Epoch
Figure 2: Test accuracy on CIFAR-10. Table 1: Test accuracy on CIFAR-10 and model
sizes. SA stands for Self-Attention.

Figure 3: Centers of attention of each attention head (different colors) at layer 4 during the training
with quadratic relative positional encoding. The central black square is the query pixel, whereas
solid and dotted circles represent the 50% and 90% percentiles of each Gaussian, respectively.

We used the PyTorch library (Paszke et al., 2017) and based our implementation on PyTorch Trans-
formers5 . We release our code on Github6 and hyper-parameters are listed in Table 2 (Appendix).
Remark on accuracy. To verify that our self-attention models perform reasonably well, we dis-
play in Figure 6 the evolution of the test accuracy on CIFAR-10 over the 300 epochs of training
for our self-attention models against a small ResNet (Table 1). The ResNet is faster to converge,
but we cannot ascertain whether this corresponds to an inherent property of the architecture or an
artifact of the adopted optimization procedures. Our implementation could be optimized to exploit
the locality of Gaussian attention probabilities and reduce significantly the number of FLOPS. We
observed that learned embeddings with content-based attention were harder to train probably due to
their increased number of parameters. We believe that the performance gap can be bridged to match
the ResNet performance, but this is not the focus of this work.
4.2 Q UADRATIC E NCODING
As a first step, we aim to verify that, with the relative position encoding introduced in equation (9),
attention layers learn to behave like convolutional layers. We train nine attention heads at each layer
to be on par with the 3 × 3 kernels used predominantly by the ResNet architecture. The center of
attention of each head h is initialized to ∆(h) ∼ N (0, 2I2 ).
Figure 3 shows how the initial positions of the heads (different colors) at layer 4 changed during
training. We can see that after optimization, the heads attend on specific pixel of the image forming a
grid around the query pixel. Our intuition that Self-Attention applied to images learns convolutional
filters around the queried pixel is confirmed.
Figure 4 displays all attention head at each layer of the model at the end of the training. It can be
seen that in the first few layers the heads tend to focus on local patterns (layers 1 and 2), while deeper
layers (layers 3-6) also attend to larger patterns by positioning the center of attention further from
the queried pixel position. We also include in the Appendix a plot of the attention positions for a
higher number of heads (Nh = 16). Figure 14 displays both local patterns similar to CNN and long
range dependencies. Interestingly, attention heads do not overlap and seem to take an arrangement
maximizing the coverage of the input space.
5
github.com/huggingface/pytorch-transformers
6
github.com/epfml/attention-cnn

7
Published as a conference paper at ICLR 2020

Figure 4: Centers of attention of each attention head (different colors) for the 6 self-attention layers
using quadratic positional encoding. The central black square is the query pixel, whereas solid and
dotted circles represent the 50% and 90% percentiles of each Gaussian, respectively.

4.3 L EARNED R ELATIVE P OSITIONAL E NCODING

We move on to study the positional encoding used in practice by fully-attentional models on images.
We implemented the 2D relative positional encoding scheme used by (Ramachandran et al., 2019;
Bello et al., 2019): we learn a bDp /2c position encoding vector for each row and each column pixel
shift. Hence, the relative positional encoding of a key pixel at position k with a query pixel at posi-
tion q is the concatenation of the row shift embedding δ1 and the column shift embedding δ2 (where
δ = k − q). We chose Dp = Dout = 400 in the experiment. We differ from their (unpublished)
implementation in the following points: (i) we do not use convolution stem and ResNet bottlenecks
for downsampling, but only a 2 × 2 invertible downsampling layer (Jacobsen et al., 2018) at input,
(ii) we use Dh = Dout instead of Dh = Dout /Nh backed by our theory that the effective number of
learned filters is min(Dh , Dout ).
At first, we discard the input data and compute the attention scores solely as the last term of eq. (8).
The attention probabilities of each head at each layer are displayed on Figure 5. The figure confirms
our hypothesis for the first two layers and partially for the third: even when left to learn the positional
encoding scheme from randomly initialized vectors, certain self-attention heads (depicted on the left)
learn to attend to individual pixels, closely matching the condition of Lemma 1 and thus Theorem
1. At the same time, other heads pay attention to horizontally-symmetric but non-localized patterns,
as well as to long-range pixel inter-dependencies.
We move on to a more realistic setting where the attention scores are computed using both positional
and content-based attention (i.e., q > k + q > r in (Ramachandran et al., 2019)) which corresponds to
a full-blown standalone self-attention model.
The attention probabilities of each head at each layer are displayed in Figure 6. We average the
attention probabilities over a batch of 100 test images to outline the focus of each head and remove
the dependency on the input image. Our hypothesis is confirmed for some heads of layer 2 and 3:
even when left to learn the encoding from the data, certain self-attention heads only exploit position-
based attention to attend to distinct pixels at a fixed shift from the query pixel reproducing the
receptive field of a convolutional kernel. Other heads use more content-based attention (see Figures 8
to 10 in Appendix for non-averaged probabilities) leveraging the advantage of Self-Attention over
CNN which does not contradict our theory. In practice, it was shown by Bello et al. (2019) that
combining CNN and self-attention features outperforms each taken separately. Our experiments
shows that such combination is learned when optimizing an unconstrained fully-attentional model.
The similarity between convolution and multi-head self-attention is striking when the query pixel is
slid over the image: the localized attention patterns visible in Figure 6 follow the query pixel. This
characteristic behavior materializes when comparing Figure 6 with the attention probabilities at a
different query pixel (see Figure 7 in Appendix). Attention patterns in layers 2 and 3 are not only
localized but stand at a constant shift from the query pixel, similarly to convolving the receptive
field of a convolutional kernel over an image. This phenomenon is made evident on our interactive
website7 . This tool is designed to explore different components of attention for diverse images with
or without content-based attention. We believe that it is a useful instrument to further understand
how MHSA learns to process images.

7
epfml.github.io/attention-cnn

8
Published as a conference paper at ICLR 2020

Figure 5: Attention probabilities of each head (column) at each layer (row) using learned relative
positional encoding without content-based attention. The central black square is the query pixel. We
reordered the heads for visualization and zoomed on the 7x7 pixels around the query pixel.
layer 1
layer 2
layer 3
layer 4
layer 5
layer 6

Figure 6: Attention probabilities for a model with 6 layers (rows) and 9 heads (columns) using
learned relative positional encoding and content-content based attention. Attention maps are aver-
aged over 100 test images to display head behavior and remove the dependence on the input content.
The black square is the query pixel. More examples are presented in Appendix A.

9
Published as a conference paper at ICLR 2020

5 R ELATED W ORK
In this section, we review the known differences and similarities between CNNs and transformers.
The use of CNN networks for text—at word level (Gehring et al., 2017) or character level (Kim,
2014)—is more seldom than transformers (or RNN). Transformers and convolutional models have
been extensively compared empirically on tasks of Natural Language Processing and Neural Ma-
chine Translation. It was observed that transformers have a competitive advantage over convolu-
tional model applied to text (Vaswani et al., 2017). It is only recently that Bello et al. (2019);
Ramachandran et al. (2019) used transformers on images and showed that they achieve similar ac-
curacy as ResNets. However, their comparison only covers performance and number of parameters
and FLOPS but not expressive power.
Beyond performance and computational-cost comparisons of transformers and CNN, the study of
expressiveness of these architectures has focused on their ability to capture long-term dependencies
(Dai et al., 2019). Another interesting line of research has demonstrated that transformers are Turing-
complete (Dehghani et al., 2018; Pérez et al., 2019), which is an important theoretical result but is
not informative for practitioners. To the best of our knowledge, we are the first to show that the class
of functions expressed by a layer of self-attention encloses all convolutional filters.
The closest work in bridging the gap between attention and convolution is due to Andreoli
(2019). They cast attention and convolution into a unified framework leveraging tensor outer-
product. In this framework, the receptive field of a convolution is represented by a “basis” tensor
A ∈ RK×K×H×W ×H×W . For instance, the receptive field of a classical K × K convolutional
kernel would be encoded by A∆,q,k = 1{k − q = ∆} for ∆ ∈ ∆ ∆K . The author distinguishes
this index-based convolution with content-based convolution where A is computed from the value
of the input, e.g., using a key/query dot-product attention. Our work moves further and presents
sufficient conditions for relative positional encoding injected into the input content (as done in prac-
tice) to allow content-based convolution to express any index-based convolution. We further show
experimentally that such behavior is learned in practice.

6 C ONCLUSION
We showed that self-attention layers applied to images can express any convolutional layer (given
sufficiently many heads) and that fully-attentional models learn to combine local behavior (similar
to convolution) and global attention based on input content. More generally, fully-attentional mod-
els seem to learn a generalization of CNNs where the kernel pattern is learned at the same time as
the filters—similar to deformable convolutions (Dai et al., 2017; Zampieri, 2019). Interesting di-
rections for future work include translating existing insights from the rich CNNs literature back to
transformers on various data modalities, including images, text and time series.

ACKNOWLEDGMENTS
Jean-Baptiste Cordonnier is thankful to the Swiss Data Science Center (SDSC) for funding this
work. Andreas Loukas was supported by the Swiss National Science Foundation (project Deep
Learning for Graph Structured Data, grant number PZ00P2 179981).

10
Published as a conference paper at ICLR 2020

R EFERENCES
Martı́n Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S.
Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew
Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath
Kudlur, Josh Levenberg, Dan Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah,
Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vin-
cent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Watten-
berg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: Large-scale machine learning
on heterogeneous systems, 2015. Software available from tensorflow.org.
Jean-Marc Andreoli. Convolution, attention and structure embedding. NeurIPS 2019 workshop on
Graph Representation Learning, Dec 13, 2019, Vancouver, BC, Canada, 2019.
Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly
learning to align and translate. In 3rd International Conference on Learning Representations,
ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015.
Irwan Bello, Barret Zoph, Ashish Vaswani, Jonathon Shlens, and Quoc V. Le. Attention Augmented
Convolutional Networks. arXiv:1904.09925 [cs], April 2019.
Jifeng Dai, Haozhi Qi, Yuwen Xiong, Yi Li, Guodong Zhang, Han Hu, and Yichen Wei. Deformable
convolutional networks. CoRR, abs/1703.06211, 2017.
Zihang Dai, Zhilin Yang, Yiming Yang, Jaime G. Carbonell, Quoc V. Le, and Ruslan Salakhut-
dinov. Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context. CoRR,
abs/1901.02860, 2019.
Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, and Lukasz Kaiser. Universal
transformers. CoRR, abs/1807.03819, 2018.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: pre-training of deep
bidirectional transformers for language understanding. CoRR, abs/1810.04805, 2018.
Jean-Yves Franceschi, Aymeric Dieuleveut, and Martin Jaggi. Unsupervised scalable representation
learning for multivariate time series. In NeurIPS 2019, 2019.
Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N. Dauphin. Convolutional
sequence to sequence learning. CoRR, abs/1705.03122, 2017.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recog-
nition. CoRR, abs/1512.03385, 2015.
Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):
1735–1780, 1997.
Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. In 2018 IEEE Conference on
Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22,
2018, pp. 7132–7141, 2018.
Jrn-Henrik Jacobsen, Arnold W.M. Smeulders, and Edouard Oyallon. i-revnet: Deep invertible
networks. In International Conference on Learning Representations, 2018.
Yoon Kim. Convolutional neural networks for sentence classification. In Proceedings of the
2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1746–
1751, Doha, Qatar, October 2014. Association for Computational Linguistics. doi: 10.3115/v1/
D14-1181.
Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton. Cifar-10 (canadian institute for advanced re-
search).
Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito,
Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in
PyTorch. In NIPS Autodiff Workshop, 2017.

11
Published as a conference paper at ICLR 2020

Jorge Pérez, Javier Marinkovic, and Pablo Barceló. On the turing completeness of modern neural
network architectures. CoRR, abs/1901.03429, 2019.
Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language
models are unsupervised multitask learners. 2018.
Prajit Ramachandran, Niki Parmar, Ashish Vaswani, Irwan Bello, Anselm Levskaya, and Jonathon
Shlens. Stand-alone self-attention in vision models. CoRR, abs/1906.05909, 2019.
Aron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alexander
Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. Wavenet: A generative
model for raw audio. arXiv preprint arXiv:1609.03499, 2016.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez,
Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. CoRR, abs/1706.03762, 2017.
Xiaolong Wang, Ross B. Girshick, Abhinav Gupta, and Kaiming He. Non-local neural networks. In
2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City,
UT, USA, June 18-22, 2018, pp. 7794–7803, 2018.
Zhilin Yang, Zihang Dai, Yiming Yang, Jaime G. Carbonell, Ruslan Salakhutdinov, and Quoc V.
Le. Xlnet: Generalized autoregressive pretraining for language understanding. CoRR,
abs/1906.08237, 2019.
Luca Zampieri. Geometric deep learning for volumetric computational fluid dynamics. pp. 67,
2019.

12
Published as a conference paper at ICLR 2020

A PPENDIX
A M ORE E XAMPLES WITH C ONTENT- BASED ATTENTION

We present more examples of attention probabilities computed by self-attention model. Figure 7

shows average attention at a different query pixel than Figure 6. Figures 8 to 10 display attention for
single images.
layer 1
layer 2
layer 3
layer 4
layer 5
layer 6

Figure 7: Attention probabilities for a model with 6 layers (rows) and 9 heads (columns) using
learned relative positional encoding and content-content attention. We present the average of 100
test images. The black square is the query pixel.
original
layer 1
layer 2
layer 3
layer 4
layer 5
layer 6

Figure 8: Attention probabilities for a model with 6 layers (rows) and 9 heads (columns) using
learned relative positional encoding and content-content based attention. The query pixel (black
square) is on the frog head.

13
Published as a conference paper at ICLR 2020

original
layer 1
layer 2
layer 3
layer 4
layer 5
layer 6

Figure 9: Attention probabilities for a model with 6 layers (rows) and 9 heads (columns) using
learned relative positional encoding and content-content based attention. The query pixel (black
square) is on the horse head.
original
layer 1
layer 2
layer 3
layer 4
layer 5
layer 6

Figure 10: Attention probabilities for a model with 6 layers (rows) and 9 heads (columns) using
learned relative positional encoding and content-content based attention. The query pixel (black
square) is on the building in the background.

14
Published as a conference paper at ICLR 2020

B H YPER - PARAMETERS USED IN OUR E XPERIMENTS

Hyper-parameters
number of layers 6
number of heads 9
hidden dimension 400
intermediate dimension 512
invertible pooling width 2
dropout probability 0.1
layer normalization epsilon 10−12
number of epochs 300
batch size 100
learning rate 0.1
weight decay 0.0001
momentum 0.9
cosine decay X
linear warm up ratio 0.05

Table 2: Self-attention network parameters

C P OSITIONAL E NCODING R EFERENCES

type of positional encoding

Model relative
sinusoids learned quadratic
Vaswani et al. (2017) X
Radford et al. (2018) X
Devlin et al. (2018) X
Dai et al. (2019) X X
Yang et al. (2019) X X
Bello et al. (2019) X X
Ramachandran et al. (2019) X X
Our work X X X

Table 3: Types of positional encoding used by transformers models applied to text (top) and images
(bottom). When multiple encoding types have been tried, we report the one advised by the authors.

D G ENERALIZED L EMMA 1
We present a generalization of Lemma 1 that replaces the necessity of hard attention (to single
pixels) by a milder assumption: the attention probabilities should span the grid receptive field. The
conditions of this Lemma are still satisfied by Lemma 2, hence Theorem 1 follows.
Lemma 3. Consider a multi-head self-attention layer consisting of Nh ≥ K 2 heads, Dh ≥ Dout
and let ω : [H]×[W ] → [HW ] be a pixel indexing. Then, for any convolutional layer with a K ×
(h)
K kernel and Dout output channels, there exists {Wval }h∈[Nh ] and Wout such that MHSA(X) =
Conv(X) for every X ∈ RW ×H×Din if and only if, for all q ∈ [H] × [W ], 8

∆K }) ⊆ span({vect(softmax(A(h)
span({eω(q+∆) ∈ RHW : ∆ ∈ ∆ q,: )) : h ∈ [Nh ]}) .

8
the vectorization operator vect(·) flattens a matrix into a vector

15
Published as a conference paper at ICLR 2020

1
1
1
1

0 0 0

Figure 11: Factorization of the vectorized weight matrices Vqconv and VqSA used to compute the
output at position q for an input image of dimension H × W . On the left: a convolution of kernel
2 × 2, on the right: a self-attention with Nh = 5 heads. Din = 2, Dout = 3 in both cases.

Proof. Our first step will be to rework the expression of the Multi-Head Self-Attention operator from
equation (1) and equation (4) such that the effect of the multiple heads becomes more transparent:
(h)
X
MHSA(X) = bout + softmax(A(h) )X Wval Wout [(h − 1)Dh + 1 : hDh + 1] (15)
| {z }
h∈[Nh ]
W (h)
(h)
Note that each head’s value matrix ∈ RDin ×Dh and each block of the projection matrix Wout
Wval
of dimension Dh × Dout are learned. Assuming that Dh ≥ Dout , we can replace each pair of
matrices by a learned matrix W (h) for each head. We consider one output pixel of the multi-head
self-attention and drop the bias term for simplicity:
X X (h) X X
(h)

MHSA(X)q,: = aq,k Xk,: W (h) = Xk,: aq,k W (h) , (16)
h∈[Nh ] k k h∈[Nh ]
| {z }
SA ∈RDin ×Dout
Wq,k
(h)
with aq,k = softmax(A(h)
q,: )k . We rewrite the output of a convolution at pixel q in the same manner:

Xk,: 1{k−q∈∆
X X
Conv(X)q,: = Xq+∆,: W∆,:,: = ∆K } Wk−q,:,: . (17)
∆∈∆ ∆ k∈[H]×[W ]
| {z }
K
conv ∈RDin ×Dout
Wq,k

Equality between equations (16) and (17) holds for any input X if and only if the linear transfor-
conv SA
mations for each pair of key/query pixels are equal, i.e. Wq,k = Wq,k ∀q, k. We vectorize the
conv conv
weight matrices into matrices of dimension Din Dout × HW as Vq := [vect(Wq,k )]k∈[H]×[W ]
SA SA
and Vq := [vect(Wq,k )]k∈[H]×[W ] . Hence, to show that Conv(X) = MHSA(X) for all X, we
must show that Vqconv = VqSA for all q.
The matrix Vqconv has a restricted support: only the columns associated with a pixel shift ∆ ∈ ∆ ∆K
in the receptive field of pixel q can be non-zero. This leads to the factorization Vqconv = W conv Eq
2 2
displayed in Figure 11 where W conv ∈ RDin Dout ×K and Eq ∈ RK ×HW . Given an ordering of
conv
the shifts ∆ ∈ ∆ ∆K indexed by j, set (W ):,j = vect(W∆,:,: ) and (Eq )j,: = eω(q+∆) . On the
(h)
other hand, we decompose VqSA = W SA Aq with (W SA ):,h = vect(W (h) ) and (Aq )h,i = aq,ω(i) .
The proof is concluded by showing that row(Eq ) ⊆ row(Aq ) is a necessary and sufficient condition
for the existence of a W SA such that any Vqconv = W conv Eq can be written as W SA Aq .
2
Sufficient. Given that row(Eq ) ⊆ row(Aq ), there exists Φ ∈ RK ×Nh such that Eq = ΦAq and a
valid decomposition is W SA = W conv Φ which gives W SA Aq = Vqconv .
Necessary. Assume there exists x ∈ RHW such that x ∈ row(Eq ) and x 6∈ row(Aq ) and set x>
to be a row of Vqconv . Then, W SA Aq 6= Vqconv for any W SA and there is no possible decomposition.

16
Published as a conference paper at ICLR 2020

E G ENERALIZED Q UADRATIC P OSITIONAL E NCODING

We noticed the similarity of the attention probabilities in the quadratic positional encoding (Sec-
tion 3) to isotropic bivariate Gaussian distributions with bounded support:
2
e−αk(k−q)−∆k
softmax(Aq,: )k = P −αk(k0 −q)−∆k2
. (18)
k0 ∈[W ]×[H] e

Building on this observation, we further extended our attention mechanism to non-isotropic Gaus-
sian distribution over pixel positions. Each head is parametrized by a center of attention ∆ and a
covariance matrix Σ to obtain the following attention scores,
1 1 1
Aq,k = − (δ − ∆)> Σ−1 (δ − ∆) = − δ > Σ−1 δ + δ > Σ−1 ∆ − ∆> Σ−1 ∆ , (19)
2 2 2
where, once more, δ = k − q. The last term can be discarded because the softmax is shift invariant
and we rewrite the attention coefficient as a dot product between the head target vector v and the
relative position encoding rδ (consisting of the first and second order combinations of the shift in
pixels δ):
1
v = (2(Σ−1 ∆)1 , 2(Σ−1 ∆)2 , −Σ−1 −1 −1 > 2 2 >
1,1 , −Σ2,2 , −2 · Σ1,2 ) and rδ = (δ1 , δ2 , δ1 , δ2 , δ1 δ2 ) .
2
Evaluation. We trained our model using this generalized quadratic relative position encoding. We
were curious to see if, using the above encoding the self-attention model would learn to attend to
non-isotropic groups of pixels—thus forming unseen patterns in CNNs. Each head was parametrized
by ∆ ∈ R2 and Σ−1/2 ∈ R2×2 to ensure that the covariance matrix remained positive semi-definite.
We initialized the center of attention to ∆(h) ∼ N (0, 2I2 ) and Σ−1/2 = I2 + N (0, 0.01I2 ) so that
initial attention probabilities were close to an isotropic Gaussian. Figure 12 shows that the network
did learn non-isotropic attention probability patterns, especially in high layers. Nevertheless, the fact
that we do not obtain any performance improvement seems to suggest that attention non-isotropy is
not particularly helpful in practice—the quadratic positional encoding suffices.

Figure 12: Centers of attention of each attention head (different colors) for the 6 self-attention layers
using non-isotropic Gaussian parametrization. The central black square is the query pixel, whereas
solid and dotted circles represent the 50% and 90% percentiles of each Gaussian, respectively.

Pruning degenerated heads. Some non-isotropic attention heads attend on “non-intuitive”

patches of pixels: either attending a very thin stripe of pixels, when Σ−1 was almost singular, or
attending all pixels uniformly, when Σ−1 was close to 0 (i.e. constant attention scores). We asked
ourselves, are such attention patterns indeed useful for the model or are these heads degenerated and
unused? To find out, we pruned all heads having largest eigen-values smaller than 10−5 or condition
number (ratio of the biggest and smallest eigen-values) greater than 105 . Specifically in our model
with 6-layer and 9-heads each, we pruned [2, 4, 1, 2, 6, 0] heads from the first to the last layer. This
means that these layers cannot express a 3 × 3 kernel anymore. As shown in yellow on fig. 2, this
ablation initially hurts a bit the performance, probably due to off biases, but after a few epochs of
continued training with a smaller learning rate (divided by 10) the accuracy recovers its unpruned
value. Hence, without sacrificing performance, we reduce the size of the parameters and the number
of FLOPS by a fourth.

F I NCREASING THE N UMBER OF H EADS

For completeness, we also tested increasing the number of heads of our architecture from 9 to 16.

17
Published as a conference paper at ICLR 2020

1.0

0.9
Models accuracy # of params # of FLOPS
Test accuracy

ResNet18 0.938 11.2M 1.1B

0.8 SA quadratic emb. 0.938 12.1M 6.2B
ResNet18 SA quadratic emb. gen. 0.934 12.1M 6.2B
SA quadratic emb. SA quadratic emb. gen. pruned 0.934 9.7M 4.9B
0.7 SA quadratic emb. gen. SA learned emb. 0.918 12.3M 6.2B
SA quadratic emb. gen. pruned SA learned emb. + content 0.871 29.5M 15B
SA learned emb.
SA learned emb. + content-based att.
0.6
0 50 100 150 200 250 300 350 400
Epoch

Figure 13: Evolution of test accuracy on CIFAR- Table 4: Number of parameters and accuracy
10. Pruned model (yellow) is continued training on CIFAR-10 per model. SA stands for Self-
of the non-isotropic model (orange). Attention.

Figure 14: Centers of attention for 16 attention heads (different colors) for the 6 self-attention layers
using quadratic positional encoding. The central black square is the query pixel, whereas solid and
dotted circles represent the 50% and 90% percentiles of each Gaussian, respectively.

Similar to Figure 4, we see that the network distinguishes two main types of attention patterns.
Localized heads (i.e., those that attend to nearly individual pixels) appear more frequently in the first
few layers. The self-attention layer uses these heads to act in a manner similar to how convolutional
layers do. Heads with less-localized attention become more common at higher layers.

Quantitative Techniques For Management PDF
88% (8)
Quantitative Techniques For Management PDF
507 pages
Computer Engineering and Systems Group Orientation 2013
No ratings yet
Computer Engineering and Systems Group Orientation 2013
29 pages
Remote Sensing Image Processing
100% (1)
Remote Sensing Image Processing
137 pages
Cse Vi Computer Graphics and Visualization 10CS65 Notes PDF
100% (1)
Cse Vi Computer Graphics and Visualization 10CS65 Notes PDF
97 pages
AF12 Chapter 5 Solutions
No ratings yet
AF12 Chapter 5 Solutions
111 pages
5 Three Phase System1
No ratings yet
5 Three Phase System1
28 pages
Lock in Amplifiers Applications
No ratings yet
Lock in Amplifiers Applications
198 pages
2023-24 Physics Lab Manual Class 12
No ratings yet
2023-24 Physics Lab Manual Class 12
294 pages
Converse, Inverse, Contrapositive, and Biconditional: Welcome, Grade 8
No ratings yet
Converse, Inverse, Contrapositive, and Biconditional: Welcome, Grade 8
20 pages
Levelling and Profile Ploting PDF
100% (4)
Levelling and Profile Ploting PDF
5 pages
Le Maitre 1976
No ratings yet
Le Maitre 1976
10 pages
Focal Self-Attention For Local-Global Interactions in Vision Transformers
No ratings yet
Focal Self-Attention For Local-Global Interactions in Vision Transformers
21 pages
3 3 1 Optical Applications With CST MICROWAVE STUDIO
No ratings yet
3 3 1 Optical Applications With CST MICROWAVE STUDIO
36 pages
Yan 2022
No ratings yet
Yan 2022
22 pages
LP and SP of Microteaching
No ratings yet
LP and SP of Microteaching
4 pages
Yang 2022
No ratings yet
Yang 2022
20 pages
The Impact of Reward and Recognition On Employee Engagement at Pt. Bank Sulutgo, Manado
No ratings yet
The Impact of Reward and Recognition On Employee Engagement at Pt. Bank Sulutgo, Manado
13 pages
NFA To DFA Conversion: Rabin and Scott (1959)
No ratings yet
NFA To DFA Conversion: Rabin and Scott (1959)
14 pages
Cultrera 2020
No ratings yet
Cultrera 2020
10 pages
Bundesen 2015
No ratings yet
Bundesen 2015
9 pages
Attention Augmented Convolutional Networks
No ratings yet
Attention Augmented Convolutional Networks
12 pages
Fibonacci Sequence
No ratings yet
Fibonacci Sequence
6 pages
Duman Keles23a
No ratings yet
Duman Keles23a
23 pages
Mehta, Rastegari - 2022 - Separable Self-Attention For Mobile Vision Transformers
No ratings yet
Mehta, Rastegari - 2022 - Separable Self-Attention For Mobile Vision Transformers
18 pages
C Bitwise Operators
No ratings yet
C Bitwise Operators
24 pages
What Is Language?: Medium of Communication
No ratings yet
What Is Language?: Medium of Communication
3 pages
Attention Mechanism in Neural Networks
No ratings yet
Attention Mechanism in Neural Networks
22 pages
PSLT
No ratings yet
PSLT
16 pages
General Physics 2 Module 2
No ratings yet
General Physics 2 Module 2
9 pages
Autodesk Constraints
No ratings yet
Autodesk Constraints
16 pages
Visual Attention Network
No ratings yet
Visual Attention Network
21 pages
Presentation 2nd
No ratings yet
Presentation 2nd
26 pages
The Transformer Family
No ratings yet
The Transformer Family
25 pages
Transformer
No ratings yet
Transformer
59 pages
QSPM 1
No ratings yet
QSPM 1
4 pages
Vision Transformers: Revolutionizing Computer Vision
No ratings yet
Vision Transformers: Revolutionizing Computer Vision
14 pages
Neurocomputing: Zhaoyang Niu, Guoqiang Zhong, Hui Yu
No ratings yet
Neurocomputing: Zhaoyang Niu, Guoqiang Zhong, Hui Yu
15 pages
ME 2016 Spring 24 Homework 3
No ratings yet
ME 2016 Spring 24 Homework 3
4 pages
5 Attention
No ratings yet
5 Attention
50 pages
Binary Arithmetic: Example of Binary Addition
No ratings yet
Binary Arithmetic: Example of Binary Addition
22 pages
XCS224N Module6 Slides
No ratings yet
XCS224N Module6 Slides
99 pages
Algebra - DPP
No ratings yet
Algebra - DPP
25 pages
Transformer 1803.02155
No ratings yet
Transformer 1803.02155
5 pages
SimpleTron NeurIPS 2022
No ratings yet
SimpleTron NeurIPS 2022
15 pages
Transformers
No ratings yet
Transformers
102 pages
Attention Is All You Need Paper - Removed
No ratings yet
Attention Is All You Need Paper - Removed
9 pages
Attention Is All You Need
67% (3)
Attention Is All You Need
11 pages
Lecture Notes - Advanced Language Model - BERT, GPT
No ratings yet
Lecture Notes - Advanced Language Model - BERT, GPT
24 pages
Dong 21 A
No ratings yet
Dong 21 A
11 pages
495 Lecture 10 Attall
No ratings yet
495 Lecture 10 Attall
18 pages
Study of Spatial Attention Mechanisms
No ratings yet
Study of Spatial Attention Mechanisms
10 pages
Attention Is All You Need
No ratings yet
Attention Is All You Need
15 pages
Unit 11 Area and Its Boundary
No ratings yet
Unit 11 Area and Its Boundary
2 pages
Bello Attention Augmented Convolutional Networks ICCV 2019 Paper
No ratings yet
Bello Attention Augmented Convolutional Networks ICCV 2019 Paper
10 pages
(2020TACL) E Cient Content-Based Sparse Attention With Routing Transformers
No ratings yet
(2020TACL) E Cient Content-Based Sparse Attention With Routing Transformers
24 pages
Mid Term Last Year
No ratings yet
Mid Term Last Year
4 pages
Attention - Self-Expression Is All You Need
No ratings yet
Attention - Self-Expression Is All You Need
12 pages
Example File
No ratings yet
Example File
3 pages
Attention Is Not You Need: Pure Attention Loses Rank Doubly Exponentially With Depth
No ratings yet
Attention Is Not You Need: Pure Attention Loses Rank Doubly Exponentially With Depth
22 pages
Visual Attention Methods in Deep Learning An In-De
No ratings yet
Visual Attention Methods in Deep Learning An In-De
20 pages
NeurIPS 2020 Fast Transformers With Clustered Attention Paper
No ratings yet
NeurIPS 2020 Fast Transformers With Clustered Attention Paper
10 pages
2296 Treeformer Dense Gradient Tree
No ratings yet
2296 Treeformer Dense Gradient Tree
15 pages
Transformers
No ratings yet
Transformers
102 pages
Attn Is All You Need
No ratings yet
Attn Is All You Need
15 pages
11 Transformers Notes
No ratings yet
11 Transformers Notes
25 pages
CNN-LSTM Facial Expression Recognition Method Fuse
No ratings yet
CNN-LSTM Facial Expression Recognition Method Fuse
9 pages
Screenshot 2024-05-27 at 7.54.47 PM
No ratings yet
Screenshot 2024-05-27 at 7.54.47 PM
29 pages
Chap6 Transformer (20240219) - DL4H Practioner Guide
No ratings yet
Chap6 Transformer (20240219) - DL4H Practioner Guide
36 pages
Transformer
No ratings yet
Transformer
10 pages
2024 Transformer Master
No ratings yet
2024 Transformer Master
50 pages
Lecture15 Transformer
No ratings yet
Lecture15 Transformer
26 pages
Beyond Self-Attention - External Attention Using Two Linear Layers For Visual Tasks
No ratings yet
Beyond Self-Attention - External Attention Using Two Linear Layers For Visual Tasks
11 pages
A Survey On Efficient Vision Transformers Algorithms Techniques and Performance Benchmarking
No ratings yet
A Survey On Efficient Vision Transformers Algorithms Techniques and Performance Benchmarking
19 pages
Data Analysis Exercises
No ratings yet
Data Analysis Exercises
4 pages
BASTRCSX Learning-Activity-2 With Answers
No ratings yet
BASTRCSX Learning-Activity-2 With Answers
4 pages
02-Transformer Based NLP Applications
No ratings yet
02-Transformer Based NLP Applications
57 pages
Paper 2
No ratings yet
Paper 2
8 pages
Intra-Neuronal Attention Within Language Models: Relationships Between Activation and Semantics
No ratings yet
Intra-Neuronal Attention Within Language Models: Relationships Between Activation and Semantics
42 pages
NLP Lecture 01-15-Attnmechanism
No ratings yet
NLP Lecture 01-15-Attnmechanism
13 pages
Attention Is All You Need - NIPS-2017-attention-is-all-you-need-Paper
No ratings yet
Attention Is All You Need - NIPS-2017-attention-is-all-you-need-Paper
11 pages
Chapter 4
No ratings yet
Chapter 4
24 pages
Attention
No ratings yet
Attention
15 pages
Attention Is All You Need
No ratings yet
Attention Is All You Need
15 pages
7181 Attention Is All You Need
No ratings yet
7181 Attention Is All You Need
11 pages
Attention Mechanism, Max-Affine Partition, and Universal Approximation
No ratings yet
Attention Mechanism, Max-Affine Partition, and Universal Approximation
83 pages
20190630transformer 210110081057
No ratings yet
20190630transformer 210110081057
32 pages
Attention Is All You Need
No ratings yet
Attention Is All You Need
11 pages
3.1 Language Models and Attention
No ratings yet
3.1 Language Models and Attention
22 pages
Attention02. Nystromformer
No ratings yet
Attention02. Nystromformer
10 pages
Lec 12
No ratings yet
Lec 12
30 pages
NLP 8
No ratings yet
NLP 8
42 pages
Transformers From Scratch PoliTO - Ipynb Colab
No ratings yet
Transformers From Scratch PoliTO - Ipynb Colab
17 pages
Deep Learning: Fundamentals and Applications
From Everand
Deep Learning: Fundamentals and Applications
Fouad Sabry
No ratings yet
Perceptrons: Fundamentals and Applications for The Neural Building Block
From Everand
Perceptrons: Fundamentals and Applications for The Neural Building Block
Fouad Sabry
No ratings yet

Cordonnier 2020

Uploaded by

Cordonnier 2020

Uploaded by

Published as a conference paper at ICLR 2020

ON THE R ELATIONSHIP BETWEEN S ELF -ATTENTION

Jean-Baptiste Cordonnier, Andreas Loukas & Martin Jaggi

Recent trends of incorporating attention mechanisms in vision have led re-

2 BACKGROUND ON ATTENTION M ECHANISMS FOR V ISION

2.1 T HE M ULTI -H EAD S ELF -ATTENTION L AYER

2.2 ATTENTION FOR I MAGES

2.3 P OSITIONAL E NCODING FOR I MAGES

> > > > > >

3 S ELF -ATTENTION AS A C ONVOLUTIONAL L AYER

P ROOF OF M AIN T HEOREM

The proof follows directly from Lemmas 1 and 2 stated below:

Multi-Head Self-Attention Layer

Attention maps for pixel

Remark about Dh and Dout . It is frequent in transformer-based architectures to set

4.1 I MPLEMENTATION D ETAILS

ResNet18 0.938 11.2M 1.1B

4.3 L EARNED R ELATIVE P OSITIONAL E NCODING

We present more examples of attention probabilities computed by self-attention model. Figure 7

B H YPER - PARAMETERS USED IN OUR E XPERIMENTS

Table 2: Self-attention network parameters

C P OSITIONAL E NCODING R EFERENCES

type of positional encoding

E G ENERALIZED Q UADRATIC P OSITIONAL E NCODING

Pruning degenerated heads. Some non-isotropic attention heads attend on “non-intuitive”

F I NCREASING THE N UMBER OF H EADS

ResNet18 0.938 11.2M 1.1B

You might also like