Attention Mechanism in Neural Networks
Derya Soydaner
Abstract A long time ago in the machine learning literature, the idea of incorporating a mechanism inspired by the human visual system into neural networks was introduced. This idea is known as the attention mechanism, and it has gone through a long development period. Today, many works are devoted to this idea in a variety of tasks, and remarkable performance has recently been demonstrated. The goal of this paper is to provide an overview of the field, from the early work on searching for ways to implement the attention idea with neural networks up to the recent trends. This review emphasizes the important milestones of this progress across different tasks. In this way, the study aims to provide a road map for researchers to explore the current developments and to get inspired for novel approaches beyond attention.
1 Introduction
The human eye sees the world in an interesting way. We suppose that we see the entire scene at once, but this is an illusion created by the subconscious part of our brain [1]. According to the Scanpath theory [2, 3], when the human eye looks at an image, it sees only a small patch in high resolution through the region called the fovea, while the rest of the image is seen in low resolution through the periphery. To recognize the entire scene, the eye performs feature extraction based on the fovea. The eye is moved to different parts of the image until the information obtained from the fovea is sufficient for recognition [4]. These eye movements are called saccades. The eye makes successive fixations
concept of selective tuning is proposed [26]. Over the years, several studies that use the attention idea in different ways have been presented for visual perception and recognition [27, 28, 29, 30].
By the 2000s, studies on making attention mechanisms more useful for neural networks continued. In the early years, a model that integrates an attentional orienting ("where") pathway and an object recognition ("what") pathway is presented [31]. A computational model of human eye movements is proposed for an object class detection task [32]. A serial model combining Markov models and neural networks with selective attention is presented for visual pattern recognition, applied to handwritten digit recognition and face recognition [33]. In that study, a neural network analyses image parts and generates posterior probabilities as observations for the Markov model. The attention idea is also used for object recognition [34] and for the analysis of a scene [35]. An interesting study proposes to learn sequential attention in real-world visual object recognition using a Q-learner [36]. Besides, a computational model of visual selective attention is described to automatically detect the most relevant parts of a color picture displayed on a television screen [37]. Attention is also applied to identifying and tracking objects in multi-resolution digital video of partially cluttered environments [38].
In 2010, the first implemented system inspired by the fovea of the human retina was presented for image classification [39]. This system jointly trains a restricted Boltzmann machine (RBM) and an attentional component called the fixation controller. Similarly, a novel attentional model, driven by gaze data, is implemented for simultaneous object tracking and recognition [40]. By taking advantage of reinforcement learning, a novel recurrent neural network (RNN) is described for image classification [41]. The Deep Attention Selective Network (DasNet), a deep neural network with feedback connections that are learned through reinforcement learning to direct selective attention to certain features extracted from images, is presented [42]. Additionally, a deep learning based framework using attention has been proposed for generative modeling [43].
It can be said that 2015 is the golden year of attention mechanisms, because the number of attention studies grew like an avalanche after three main studies presented that year. The first one proposed a novel approach for neural machine translation (NMT) [44]. Most NMT models belong to a family of encoder-decoders [45, 46], with an encoder and a decoder for each language. However, compressing all the necessary information of a source sentence into a fixed-length vector is an important disadvantage of this encoder-decoder approach: it usually makes it difficult for the neural network to capture all the semantic details of a very long sentence [1].
Fig. 1 The extension to the conventional NMT models proposed by [44]. It generates the t-th target word $y_t$ given a source sentence $(x_1, x_2, \ldots, x_T)$.

The idea introduced in [44] is an extension to the conventional NMT models. This extension is composed of an encoder and a decoder, as shown in Fig. 1. The first part, the encoder, is a bidirectional RNN (BiRNN) [47] that takes word vectors as input. The forward and backward states of the BiRNN are computed. Then, an annotation $a_j$ for each word $x_j$ is obtained by concatenating these forward and backward hidden states. Thus, the encoder maps the input sentence to a sequence of annotations $(a_1, \ldots, a_{T_x})$. By using a BiRNN rather than a conventional RNN, the annotation of each word can summarize both the preceding and the following words. Besides, the annotation $a_j$ focuses on the words around $x_j$, because of the inherent tendency of RNNs to represent recent inputs better.
In the decoder, a weight $\alpha_{ij}$ for each annotation $a_j$ is obtained from its associated energy $e_{ij}$, which is computed by a feedforward neural network $f$ as in Eq. (1). This neural network $f$ is defined as an alignment model that can be jointly trained with the proposed architecture. In order to reduce the computational burden, a multilayer perceptron (MLP) with a single hidden layer is proposed as $f$. The alignment model tells us about the relation between the inputs around position $j$ and the output at position $i$. In this way, the decoder applies an attention mechanism. As seen in Eq. (2), $\alpha_{ij}$ is the output of a softmax function:
$$\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T_x} \exp(e_{ik})} \qquad (2)$$
$$c_i = \sum_{j=1}^{T_x} \alpha_{ij}\, a_j \qquad (3)$$
Based on the decoder state, the context and the last generated word, the target word $y_t$ is predicted. In order to generate a word in a translation, the model searches for the most relevant information in the source sentence to concentrate on. When it finds the appropriate source positions, it makes the prediction. In this way, the input sentence is encoded into a sequence of vectors, and the decoder adaptively selects a subset of these vectors that is relevant to predicting the target [44]. Thus, it is no longer necessary to compress all the information of a source sentence into a fixed-length vector.
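To make this attention step more concrete, the following is a minimal NumPy sketch of Eqs. (2) and (3) together with the alignment MLP, assuming the BiRNN annotations and the previous decoder state are already available. The array names, the sizes, and the exact parameterization of the alignment network (a tanh hidden layer scored by a vector) are illustrative assumptions, not the notation of [44].

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def additive_attention(annotations, prev_state, W_a, U_a, v_a):
    """One decoder step of additive (Bahdanau-style) attention.

    annotations : (T_x, 2*H) concatenated forward/backward BiRNN states a_j
    prev_state  : (S,)       previous decoder hidden state
    W_a, U_a, v_a            parameters of the single-hidden-layer alignment MLP
    Returns the attention weights alpha_ij and the context vector c_i.
    """
    # Energies e_ij from a small MLP over the decoder state and each annotation.
    energies = np.tanh(prev_state @ W_a + annotations @ U_a) @ v_a   # (T_x,)
    alphas = softmax(energies)                                       # Eq. (2)
    context = alphas @ annotations                                   # Eq. (3)
    return alphas, context

# Toy usage with random annotations and decoder state (illustrative sizes).
rng = np.random.default_rng(0)
T_x, H, S, A = 6, 8, 10, 12        # sentence length, RNN size, decoder size, MLP size
annotations = rng.normal(size=(T_x, 2 * H))
prev_state = rng.normal(size=(S,))
W_a = rng.normal(size=(S, A))
U_a = rng.normal(size=(2 * H, A))
v_a = rng.normal(size=(A,))
alphas, context = additive_attention(annotations, prev_state, W_a, U_a, v_a)
print(alphas.sum())                # the weights sum to 1 over the source positions
```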
The second study is the first visual attention model in image captioning [48]. Different from the previous study [44], it uses a deep convolutional neural network (CNN) as the encoder. This architecture is an extension of the neural network [49] that encodes an image into a compact representation, followed by an RNN that generates a corresponding sentence. Here, the annotation vectors $a_i \in \mathbb{R}^D$ are extracted from a lower convolutional layer, each of which is a $D$-dimensional representation corresponding to a part of the image. Thus, the decoder selectively focuses on certain parts of an image by weighting a subset of all the feature vectors [48]. Rather than compressing the entire image into a static representation, this extended architecture uses attention so that salient features dynamically come to the forefront.
The context vector $c_t$ represents the relevant part of the input image at time $t$. The weight $\alpha_i$ of each annotation vector is computed similarly to Eq. (2), whereas its associated energy is computed similarly to Eq. (1) by using an MLP conditioned on the previous hidden state $h_{t-1}$. The remarkable point of this study is a new mechanism $\phi$ that computes $c_t$ from the annotation vectors $a_i$ corresponding to the features extracted at different image locations:

$$c_t = \phi(\{a_i\}, \{\alpha_i\}) \qquad (4)$$
The definition of the $\phi$ function gives rise to two variants of attention mechanism: the hard (stochastic) attention mechanism is trainable by maximizing an approximate variational lower bound, i.e., by REINFORCE [50], whereas the soft (deterministic) attention mechanism is trainable by standard backpropagation. Hard attention defines a location variable $s_t$ and uses it to decide where to focus attention when generating the $t$-th word. When hard attention is applied, the attention locations are treated as intermediate latent variables. A multinoulli distribution parametrized by $\{\alpha_i\}$ is assigned to them, and $c_t$ becomes a random variable. Here, $s_{t,i}$ is defined as a one-hot indicator that is set to 1 if the $i$-th location is used to extract visual features [48]:
$$c_t = \sum_{i} s_{t,i}\, a_i \qquad (6)$$

$$\mathbb{E}_{p(s_t \mid \alpha)}[c_t] = \sum_{i=1}^{L} \alpha_{t,i}\, a_i \qquad (7)$$
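As a rough illustration of the two variants, the sketch below computes a soft context vector as the expectation in Eq. (7) and a hard context vector by sampling one location as in Eq. (6). The feature shapes, the random weights, and the way annotations are obtained from a convolutional feature map are illustrative assumptions, not the exact setup of [48].

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative CNN feature map: a 14x14 grid of locations with D = 512 channels,
# flattened into L = 196 annotation vectors a_i of dimension D.
feature_map = rng.normal(size=(14, 14, 512))
annotations = feature_map.reshape(-1, 512)          # (L, D)
L, D = annotations.shape

# Attention weights alpha_{t,i} over the L locations (random here for brevity;
# in [48] they come from an MLP conditioned on the previous decoder state).
logits = rng.normal(size=L)
alphas = np.exp(logits - logits.max())
alphas /= alphas.sum()

# Soft (deterministic) attention: expected context vector, Eq. (7).
soft_context = alphas @ annotations                  # (D,)

# Hard (stochastic) attention: sample one location from Multinoulli(alpha),
# then take the corresponding annotation, Eq. (6).
location = rng.choice(L, p=alphas)
hard_context = annotations[location]                 # (D,)

print(soft_context.shape, hard_context.shape)
```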
Fig. 2 Examples of the attention mechanism in vision tasks. (Top) Attending to the correct object in neural image caption generation [48]. (Bottom) Visualization of original image and question pairs, and co-attention maps, namely word-level, phrase-level and question-level, respectively [52].
In the two years following 2015, attention mechanisms were used for different tasks, and novel neural network architectures applying these mechanisms were presented. After memory networks [53], which require a supervision signal instructing them how to use their memory cells, the introduction of the neural Turing machine [54] allows end-to-end training without this supervision signal, via the use of a content-based soft attention mechanism [1]. Then, the end-to-end memory network [55], a form of memory network based on a recurrent attention mechanism, is proposed.
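As a rough sketch of what content-based soft addressing looks like in this setting, the snippet below scores a key vector against every memory slot by cosine similarity, sharpens the scores with a parameter beta, and normalizes them with a softmax, in the spirit of the content-based addressing described for the neural Turing machine [54]. The memory size and the value of beta are illustrative assumptions.

```python
import numpy as np

def content_based_weights(memory, key, beta=5.0):
    """Soft attention over memory slots via cosine similarity.

    memory : (N, M) matrix, one row per memory slot
    key    : (M,)  query vector emitted by the controller
    beta   : sharpening parameter (> 0); larger values give a peakier focus
    """
    # Cosine similarity between the key and every memory row.
    sims = memory @ key / (np.linalg.norm(memory, axis=1) * np.linalg.norm(key) + 1e-8)
    # Softmax over the sharpened similarities gives the addressing weights.
    scores = np.exp(beta * sims - np.max(beta * sims))
    return scores / scores.sum()

rng = np.random.default_rng(0)
memory = rng.normal(size=(8, 16))             # 8 slots of size 16
key = memory[3] + 0.1 * rng.normal(size=16)   # a noisy copy of slot 3
weights = content_based_weights(memory, key)
print(weights.argmax())                       # focuses on slot 3
```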
In these years, an attention mechanism called self-attention, sometimes called intra-attention, was successfully implemented within a neural network architecture, namely the Long Short-Term Memory-Network (LSTMN) [56]. It modifies the standard LSTM structure by replacing the memory cell with a memory network [53], motivated by the fact that memory networks maintain a set of key vectors and a set of value vectors, whereas LSTMs maintain a single hidden vector and a single memory vector [56]. In contrast to the attention idea in [44], memory and attention are added within a sequence encoder in the LSTMN. Self-attention is described as relating different positions of a sequence in order to compute a representation of that sequence [19]. One of the first approaches to self-attention was applied to natural language inference [57].
Many attention-based models have been proposed for neural image cap-
tioning [58], abstractive sentence summarization [59], speech recognition [60,
61], automatic video captioning [62], neural machine translation [63], and rec-
ognizing textual entailment [64]. Different attention-based models perform vi-
sual question answering [65, 66, 67]. An attention-based CNN is presented for
modeling sentence pairs [68]. A recurrent soft attention based model learns to
focus selectively on parts of the video frames and classifies videos [69].
On the other hand, several neural network architectures have been presented for a variety of tasks. For instance, the Stacked Attention Network (SAN) is presented for image question answering [70].
Fig. 3 The Transformer architecture and the attention mechanisms it uses in detail [19].
(Left) The Transformer with one encoder-decoder stack. (Center) Multi-head attention.
(Right) Scaled dot-product attention.
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V \qquad (8)$$
This calculation is performed for every word against all the other words, which yields a score for each word relative to every other word. For instance, if the word $x_2$ is not relevant to the word $x_1$, the softmax assigns it a low probability, and the corresponding value vector is down-weighted. As a result, the values of relevant words are emphasized while those of the others are suppressed. In the end, every word obtains a new representation of itself.
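A minimal NumPy sketch of Eq. (8) is given below; the sequence length and dimensions are illustrative, and the projection of the input embeddings into queries, keys and values is assumed to have already happened.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Eq. (8): softmax(Q K^T / sqrt(d_k)) V.

    Q : (n, d_k) queries, K : (m, d_k) keys, V : (m, d_v) values.
    Returns the attended outputs of shape (n, d_v).
    """
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (n, m) similarity scores
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V

rng = np.random.default_rng(0)
n, d_k, d_v = 5, 16, 16                 # 5 tokens attending to each other
Q = rng.normal(size=(n, d_k))
K = rng.normal(size=(n, d_k))
V = rng.normal(size=(n, d_v))
print(scaled_dot_product_attention(Q, K, V).shape)   # (5, 16)
```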
As seen in Fig. 3, the Transformer does not apply scaled dot-product attention directly to a single set of queries, keys and values; rather, the attention mechanism it uses is built on this calculation. The second proposed mechanism, called multi-head attention, linearly projects the queries, keys and values $h$ times with different, learned linear projections to $d_q$, $d_k$ and $d_v$ dimensions, respectively [19]. The attention function is performed in parallel on each of these projected versions of queries, keys and values, i.e., the heads, yielding $d_v$-dimensional output values. To obtain the final values, the heads are concatenated and projected one last time, as shown in the center of Fig. 3. In this way, self-attention is calculated multiple times using different sets of query, key and value vectors, and the model can jointly attend to information at different positions [19].
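The sketch below illustrates this multi-head construction on top of scaled dot-product attention, following the formulation described above [19]; the number of heads, the dimensions and the random projection matrices are illustrative assumptions, not trained parameters.

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention, Eq. (8)."""
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

def multi_head_attention(X, params):
    """Project X into per-head (query, key, value) triples, attend per head,
    then concatenate the heads and apply a final output projection."""
    heads = []
    for W_q, W_k, W_v in params["heads"]:
        heads.append(attention(X @ W_q, X @ W_k, X @ W_v))
    return np.concatenate(heads, axis=-1) @ params["W_o"]

# Toy setup: 5 tokens of model dimension 32, 4 heads of size 8.
rng = np.random.default_rng(0)
n, d_model, h = 5, 32, 4
d_head = d_model // h
params = {
    "heads": [(rng.normal(size=(d_model, d_head)),
               rng.normal(size=(d_model, d_head)),
               rng.normal(size=(d_model, d_head))) for _ in range(h)],
    "W_o": rng.normal(size=(d_model, d_model)),
}
X = rng.normal(size=(n, d_model))
print(multi_head_attention(X, params).shape)   # (5, 32)
```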
Research on improving self-attention continues as well [100, 101, 102]. Besides, important studies that modify the self-attention mechanisms proposed in the Transformer have been presented. Some of the most recent and prominent ones are summarized below.
The Reinforced Self-Attention Network (ReSA) selects a subset of head tokens and relates each head token to a small subset of dependent tokens in order to generate their context-aware representations [105]. For this purpose, a novel hard attention mechanism called reinforced sequence sampling (RSS) is proposed, which selects tokens from an input sequence in parallel and is trained via policy gradient. Given an input sequence, RSS generates an equal-length sequence of binary random variables indicating which tokens are selected and which are discarded. In turn, the soft attention provides reward signals for training the hard attention. The proposed RSS provides a sparse mask to self-attention, and ReSA uses two RSS modules to extract the sparse dependencies between each pair of selected tokens.
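To illustrate how a binary token-selection mask of this kind can sparsify self-attention, the sketch below masks out discarded positions before the softmax by setting their scores to a large negative value. This is a generic masking pattern, not the exact RSS/ReSA implementation of [105], and the mask here is fixed rather than learned by policy gradient.

```python
import numpy as np

def masked_self_attention(X, keep):
    """Self-attention over X restricted to the positions flagged in `keep`.

    X    : (n, d) token representations (used as queries, keys and values)
    keep : (n,) binary vector; positions with 0 cannot be attended to
    """
    scores = X @ X.T / np.sqrt(X.shape[-1])
    scores = np.where(keep[None, :] == 1, scores, -1e9)   # mask discarded tokens
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ X

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 16))
keep = np.array([1, 0, 1, 1, 0, 1])     # a fixed binary selection mask
out = masked_self_attention(X, keep)
print(out.shape)                        # (6, 16); positions 1 and 4 get ~zero weight
```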
In BERT, pre-training is carried out by jointly conditioning on both left and right context; in this respect, BERT differs from left-to-right language model pre-training.
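The difference can be pictured as a choice of attention mask: a left-to-right language model applies a causal (lower-triangular) mask so that each token attends only to earlier positions, whereas bidirectional pre-training of the BERT kind leaves self-attention unmasked over the whole sequence and instead masks some input tokens. The snippet below only contrasts the two mask shapes over toy representations; it is an illustration, not BERT's training code.

```python
import numpy as np

def attention_weights(X, mask):
    """Row-wise softmax attention weights under a binary attendability mask."""
    scores = X @ X.T / np.sqrt(X.shape[-1])
    scores = np.where(mask == 1, scores, -1e9)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return w / w.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))

causal = np.tril(np.ones((5, 5), dtype=int))     # left-to-right: attend to j <= i
bidirectional = np.ones((5, 5), dtype=int)       # bidirectional: attend everywhere

print(np.round(attention_weights(X, causal), 2))        # upper triangle is zero
print(np.round(attention_weights(X, bidirectional), 2)) # full matrix of weights
```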
Recently, the BERT model has been examined in detail. For instance, the behaviour of its attention heads has been analysed [141]. Various methods have been investigated for compressing [142, 143], pruning [144], and quantizing [145] the model. BERT has also been considered for different tasks such as coreference resolution [146], and a novel method has been proposed to accelerate BERT training [147].
Since 2017, when the Transformer was presented, research directions have generally focused on novel self-attention mechanisms, on adapting the Transformer to various tasks, or on making these models more understandable. In one of the most recent studies, the Lite Transformer makes NLP possible in the mobile setting. It applies long-short range attention, where some heads specialize in local context modeling while the others specialize in long-distance relationship modeling [159]. The deep and light-weight Transformer DeLighT [160] and the hypernetwork-based HyperGrid Transformers [161] perform well with fewer parameters. The Graph Transformer Network is introduced for learning node representations on heterogeneous graphs [162], and further applications address molecular data [163] and textual graph representations [164]. Also, Transformer-XH applies eXtra Hop attention for structured text data [165], and AttentionXML is a tree-based model for extreme multi-label text classification [166]. Besides, the attention mechanism has been handled in a Bayesian framework [167]. For a better understanding of Transformers, an identifiability analysis of self-attention weights has been conducted, in addition to presenting effective attention to improve explanatory interpretations [168]. Lastly, the Vision Transformer (ViT) processes an image with a standard Transformer encoder, as used in NLP, by interpreting the image as a sequence of patches, and it performs well on image classification tasks [169].
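As a rough sketch of the patch-as-token idea behind ViT, the snippet below splits an image into non-overlapping patches, flattens each patch, and linearly projects it to an embedding, producing the token sequence a standard Transformer encoder would consume. The image size, patch size and embedding dimension are illustrative assumptions, and the class token and position embeddings used in [169] are omitted.

```python
import numpy as np

def image_to_patch_tokens(image, patch=16, d_model=64, rng=None):
    """Split an image (H, W, C) into non-overlapping patches and embed them.

    Returns a (num_patches, d_model) sequence of patch embeddings.
    """
    rng = rng or np.random.default_rng(0)
    H, W, C = image.shape
    patches = (image.reshape(H // patch, patch, W // patch, patch, C)
                    .transpose(0, 2, 1, 3, 4)
                    .reshape(-1, patch * patch * C))          # (N, patch*patch*C)
    W_embed = rng.normal(size=(patch * patch * C, d_model))   # learned in practice
    return patches @ W_embed

image = np.random.default_rng(0).normal(size=(224, 224, 3))
tokens = image_to_patch_tokens(image)
print(tokens.shape)    # (196, 64): a 14x14 grid of patch tokens
```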
Inspired by the human visual system, attention mechanisms in neural networks have been developing for a long time. In this study, we examine this development from its roots up to the present time. Over this period, some mechanisms have been modified and novel mechanisms have emerged. Today, this journey has reached a very important stage. The idea of incorporating attention mechanisms into deep neural networks has led to state-of-the-art results for a large variety of tasks. Self-attention mechanisms and the GPT-n family of models have become a new hope for more advanced models. This promising progress raises the questions of whether attention can drive further development, whether it could replace the popular neural network layers, or whether an idea even better than the existing attention mechanisms could emerge. It is still an active research area and much to learn we still have, but it is obvious that more powerful systems are awaiting when neural networks and attention mechanisms join forces.
Conflict of interest
References
38. S. Gould, et al., International Joint Conference on Artificial Intelligence (IJCAI) pp.
2115–2121 (2007)
39. H. Larochelle, G. Hinton, Advances in Neural Information Processing Systems 23 pp.
1243–1251 (2010)
40. L. Bazzani, et al., International Conference on Machine Learning (2011)
41. V. Mnih, et al., Advances in Neural Information Processing Systems 27 pp. 2204–2212
(2014)
42. M. Stollenga, et al., Advances in Neural Information Processing Systems 27 pp. 3545–
3553 (2014)
43. Y. Tang, N. Srivastava, R. Salakhutdinov, Advances in Neural Information Processing
Systems 27 (2014)
44. D. Bahdanau, K. Cho, Y. Bengio, International Conference on Learning Representa-
tions (2015)
45. I. Sutskever, O. Vinyals, Q. Le, Advances in Neural Information Processing Systems
27 pp. 3104–3112 (2014)
46. K. Cho, et al., Proceedings of the 2014 Conference on Empirical Methods in Natural
Language Processing (EMNLP) pp. 1724–1734 (2014)
47. M. Schuster, K. Paliwal, IEEE Transactions on Signal Processing 45(11), 2673 (1997)
48. K. Xu, et al., International Conference on Machine Learning pp. 2048–2057 (2015)
49. O. Vinyals, et al., In Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition pp. 3156–3164 (2015)
50. R. Williams, Machine Learning 8(3-4), 229 (1992)
51. M.T. Luong, H. Pham, C.D. Manning, Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal pp. 1412–1421 (2015)
52. J. Lu, et al., Advances in Neural Information Processing Systems 29 (2016)
53. J. Weston, S. Chopra, A. Bordes, International Conference on Learning Representa-
tions (2014)
54. A. Graves, G. Wayne, I. Danihelka, arXiv preprint arXiv:1410.5401 (2014)
55. S. Sukhbaatar, et al., Advances in Neural Information Processing Systems 28 pp. 2440–
2448 (2015)
56. J. Cheng, L. Dong, M. Lapata, Proceedings of the 2016 Conference on Empirical
Methods in Natural Language Processing pp. 551–561 (2016)
57. A. Parikh, et al., Proceedings of the 2016 Conference on Empirical Methods in Natural
Language Processing, Austin, Texas pp. 2249–2255 (2016)
58. Q. You, et al., In Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition (CVPR), Las Vegas, NV pp. 4651–4659 (2016)
59. A. Rush, S. Chopra, J. Weston, Proceedings of the 2015 Conference on Empirical
Methods in Natural Language Processing, Lisbon, Portugal pp. 379–389 (2015)
60. D. Yu, et al., Interspeech pp. 17–21 (2016)
61. J. Chorowski, et al., Advances in Neural Information Processing Systems 28 pp. 577–
585 (2015)
62. M. Zanfir, E. Marinoiu, C. Sminchisescu, In Asian Conference on Computer Vision, Springer, Cham pp. 104–119 (2016)
63. Y. Cheng, et al., Proceedings of the 25th International Joint Conference on Artificial
Intelligence (2016)
64. T. Rocktäschel, et al., International Conference on Learning Representations (2016)
65. Y. Zhu, et al., Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition pp. 4995–5004 (2016)
66. K. Chen, et al., arXiv preprint arXiv:1511.05960 (2015)
67. H. Xu, K. Saenko, In European Conference on Computer Vision pp. 451–466 (2016)
68. W. Yin, et al., Transactions of the Association for Computational Linguistics 4, 259
(2016)
69. S. Sharma, R. Kiros, R. Salakhutdinov, International Conference on Learning Repre-
sentations (2016)
70. Z. Yang, et al., In Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition pp. 21–29 (2016)
71. I. Sorokin, et al., arXiv preprint arXiv:1512.01693 (2015)
72. J. Ba, et al., Advances in Neural Information Processing Systems 28 pp. 2593–2601
(2015)
73. K. Gregor, et al., International Conference on Machine Learning pp. 1462–1471 (2015)
74. E. Mansimov, et al., International Conference on Learning Representations (2016)
75. S. Reed, et al., Advances in Neural Information Processing Systems 29 pp. 217–225
(2016)
76. E. Voita, et al., In Proceedings of the 57th Annual Meeting of the Association for
Computational Linguistics, Florence, Italy pp. 5797–5808 (2019)
77. G. Kerg, et al., Advances in Neural Information Processing Systems 33 (2020)
78. J.B. Cordonnier, A. Loukas, M. Jaggi, International Conference on Learning Repre-
sentations (2020)
79. Z. Lin, et al., International Conference on Learning Representations (2017)
80. R. Paulus, C. Xiong, R. Socher, International Conference on Learning Representations
(2018)
81. N. Kitaev, D. Klein, In Proceedings of the 56th Annual Meeting of the Association for
Computational Linguistics (Long papers) pp. 2676–2686 (2018)
82. D. Povey, et al., IEEE International Conference on Acoustics, Speech and Signal Pro-
cessing (ICASSP), IEEE pp. 5874–5878 (2018)
83. A. Vyas, et al., Advances in Neural Information Processing Systems 33 (2020)
84. W. Chan, et al., IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Shanghai pp. 4960–4964 (2016)
85. M. Sperber, et al., In proceedings of Annual Conference of the International Speech
Communication Association (InterSpeech) pp. 3723–3727 (2018)
86. L. Kaiser, et al., arXiv preprint arXiv:1706.05137 (2017)
87. C. Xu, et al., Proceedings of the 56th Annual Meeting of the Association for Compu-
tational Linguistics (Short papers), Melbourne, Australia pp. 778–783 (2018)
88. S. Maruf, A. Martins, G. Haffari, Proceedings of NAACL-HLT, Minneapolis, Minnesota
pp. 3092–3102 (2019)
89. P. Ramachandran, et al., Advances in Neural Information Processing Systems 32 pp.
68–80 (2019)
90. Y. Li, et al., International Conference on Machine Learning (2019)
91. I. Goodfellow, et al., Advances in Neural Information Processing Systems 27 pp. 2672–
2680 (2014)
92. H. Zhang, et al., International Conference on Machine Learning pp. 7354–7363 (2019)
93. T. Xu, et al., In Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition (CVPR) pp. 1316–1324 (2018)
94. A. Yu, et al., International Conference on Learning Representations (2018)
95. J. Zhang, et al., Conference on Uncertainty in Artificial Intelligence (2018)
96. D. Romero, et al., International Conference on Machine Learning (2020)
97. R. Al-Rfou, et al., AAAI Conference on Artificial Intelligence 33, 3159 (2019)
98. J. Du, et al., Proceedings of the 2018 Conference on Empirical Methods in Natural
Language Processing pp. 2216–2225 (2018)
99. X. Li, et al., Advances in Neural Information Processing Systems 33 (2020)
100. B. Yang, et al., AAAI Conference on Artificial Intelligence 33, 387 (2019)
101. B. Yang, et al., Proceedings of the 2018 Conference on Empirical Methods in Natural
Language Processing, Brussels, Belgium pp. 4449–4458 (2018)
102. Proceedings of the IEEE International Conference on Computer Vision pp. 3286–3295
103. P. Shaw, J. Uszkoreit, A. Vaswani, Proceedings of NAACL-HLT, New Orleans,
Louisiana pp. 464–468 (2018)
104. T. Shen, et al., AAAI Conference on Artificial Intelligence pp. 5446–5455 (2018)
105. T. Shen, et al., In Proceedings of the 27th International Joint Conference on Artificial
Intelligence, (IJCAI-18) pp. 4345–4352 (2018)
106. H. Le, T. Tran, S. Venkatesh, International Conference on Machine Learning (2020)
107. T. Shen, et al., International Conference on Learning Representations (2018)
108. S. Bhojanapalli, et al., International Conference on Machine Learning (2020)
109. Y. Tay, et al., International Conference on Machine Learning (2020)
110. S. Sukhbaatar, et al., Proceedings of the 57th Annual Meeting of the Association for
Computational Linguistics, Florence, Italy pp. 331–335 (2019)
111. Y. Jernite, et al., International Conference on Learning Representations (2017)
112. R. Shu, H. Nakayama, In Proceedings of the First Workshop on Neural Machine Trans-
lation, Vancouver, Canada pp. 1–10 (2017)