Andrea Banino
Centre for Mathematics, Physics and Engineering in the Life Sciences and
Experimental Biology
University College London
May 4, 2020
I, Andrea Banino, confirm that the work presented in this thesis is my own.
Where information has been derived from other sources, I confirm that this has
been indicated in the work.
Abstract
In recent years, deep neural networks have enjoyed tremendous successes in a va-
riety of challenging tasks. Despite these breakthroughs, there remain key areas in
which humans are still strikingly superior: the ability to learn in a one-shot fashion
(episodic memory) and spatial navigation being two core examples. Fortuitously,
these areas are topics in neuroscience that have deep theoretical and empirical foun-
dations. Consequently, in this body of work we drew on this opportunity to develop
neuroscience-inspired architectures that support navigation and episodic memory
and in so doing, we also provided new neuroscientific insights. Specifically, we
identified architectural constraints in neural network models that allowed the
emergence of spatial representations resembling those found in the mammalian brain
(e.g. place cells, grid cells, head direction cells). Grid cells in particular are believed
to provide a multi-scale periodic representation that functions as a metric for coding
space, which is critical for planning direct trajectories to goals. To test this hypothesis
we used our artificial agent to show that emergent grid-like representations fur-
nish it with a Euclidean spatial metric and associated vector operations, providing
a foundation for proficient navigation. As such, our results supported neuroscien-
tific theories that see grid cells as critical for vector-based navigation. In a second
line of work we focused on episodic memory and in particular on the role of the
hippocampus in generalisation. We employed a classic associative inference task
from the human neuroscience literature - the paired associative inference task (PAI)
- to carefully probe the reasoning capacity of existing memory-augmented neural
networks. Surprisingly, we found that current architectures struggle to reason over
long distance associations. Consequently we developed a new memory architecture
Acknowledgements
First of all I would like to thank DeepMind for giving me the chance and the schol-
arship to pursue this PhD. I want to personally thank Demis for this and for being a
continuous source of inspiration. To Caswell, my supervisor, thanks for all the neu-
roscientific insights, the guidance and the kindness throughout these years. Also
a big thanks to Charles, for teaching me a lot of machine learning and for always
giving me a different point of view on things; this has been key for my professional
development. To my dear friend and mentor Dharsh, it is difficult to find the right
words to say how grateful I am; this would simply have been impossible
without you, thanks for believing in me!
I want to thank my wife: by walking side by side we made another step
together. To my son Lorenzo, without saying a single word you already gave me
so many lessons. My mum, sorry if getting here has not been straightforward, but
thanks for always being present and supportive. And finally, to my dad... well, you
are not here to see this, but I know you would have been over the moon. I’m not
sure if I will ever write something else so... this is not much, but it is dedicated to
you, and to your smile. Grazie papà.
Publications Arising
The following publications were produced from work undertaken as part of this
thesis:
• Benigno Uria, Borja Ibaraz, Andrea Banino, Caswell Barry, Charles Blundell.
"Memory Maps: an unsupervised framework for learning spatial
representations" (in preparation)
Impact Statement
In the first part of the work we used deep learning to investigate the role of
grid cells in supporting vector-based navigation, that is, calculating the direct route
between two places. In particular we developed a new deep learning architecture
to perform the task of self localisation to investigate whether grid cells emerged in
the network as a consequence of performing the task. Indeed the “grid units” that
spontaneously emerged in the network were remarkably similar to the ones found
in the mammalian brains. We then embedded this system into a deep reinforcement
learning agent that was able to find the shortest path to a goal even in complex
mazes with random blockages. None of the other agents we tested, all lacking
“grid-like” units, demonstrated the same performance. This work was impactful
for the neuroscience community as it was an example of how deep learning could
be used as a tool to validate neuroscientific theories. It seems likely that a similar
approach could be used by researchers interested in limb control. They could train
a neural network to control a robotic arm the way that the brain controls a living
arm, and then run experiments on the artificial system to generate further insights
into the living one. We believe that this aspect makes our approach a general pur-
pose neuroscience tool. At the same time our model is the first artificial agent that
solves complex navigation tasks, like finding shortcuts in unexplored mazes. By
solving such complicated tasks we strengthen the idea that, by taking inspiration
from neuroscience, we can design deep learning systems with the ability to tackle
problems that require high-level cognitive functions.
In the second part of this work we employed a classic associative inference task
from the memory-based reasoning neuroscience literature in order to more carefully
probe the reasoning capacity of existing neural network architectures. In particular
the associative inference task was used to capture the essence of reasoning, that is
the appreciation of distant relationships among elements distributed across multiple
facts or memories. In our analysis we found that current artificial neural network
architectures struggle to reason over long distance associations. Consequently we
designed a new neural network architecture endowed with the capacity to perform
multistep reasoning. Our new model was able to solve the associative inference
task and was also state of the art on a challenging question answering language
task. The architecture presented in this work has been patented, with potential uses in
commercial artificial intelligence systems.
Contents
1 Introduction 15
1.1 Cognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
1.2 Understanding complex systems . . . . . . . . . . . . . . . . . . . 17
1.3 Artificial neural networks . . . . . . . . . . . . . . . . . . . . . . . 18
1.3.1 A brief description of Artificial neural networks and back-
propagation . . . . . . . . . . . . . . . . . . . . . . . . . . 19
1.3.2 Feed-Forward and Recurrent Networks . . . . . . . . . . . 21
1.3.3 Frameworks used to train Neural Networks . . . . . . . . . 22
1.3.4 The longstanding relation between Neural Networks and
Neuroscience . . . . . . . . . . . . . . . . . . . . . . . . . 24
1.4 Attention . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
1.5 Memory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
1.5.1 Working Memory . . . . . . . . . . . . . . . . . . . . . . . 30
1.5.2 Episodic Memory . . . . . . . . . . . . . . . . . . . . . . . 33
1.5.3 A hybrid approach: Neural networks augmented with ex-
ternal memory . . . . . . . . . . . . . . . . . . . . . . . . 34
1.6 Using ANN to model Cognition . . . . . . . . . . . . . . . . . . . 35
Appendices 121
Bibliography 134
List of Figures
A.1 Linear layer spatial activity maps from the supervised learning ex-
periment. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
A.2 Grid-like units did not emerge in the linear layer when dropout was
not applied. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
A.3 Robustness of grid cell agent and performance of other agents. . . . 124
Introduction
1.1 Cognition
Our world is inherently complex, and all forms of life are constantly competing
against each other for scarce resources [1]. Despite the astonishing complexity of
life on earth, virtually all creatures are stuck in the same loop. First they observe a
state of the world, then they process the information to learn about that state, next
they decide which action to take to maximise some measure of success, and finally
they observe the new world state consequent to that action. Then this loop repeats,
tirelessly.
Cognition can be seen as the characterisation of this loop, and in this vein it
could be defined as the mechanism that processes sensory information to perform
goal directed behaviours. This definition is intriguing, because it does not exclude
any creature; instead it sees cognition as a continuum, where the possession of
particular cognitive abilities depends on the information processing capacity of an
agent, be it biological or artificial. One important term in the definition is
information, which is a non-physical quantity [2]; this implies that its content can
be handled by different physical substrates (e.g. brains, computers) with the same
resulting outcome. This is particularly important in the context of this work, where
we take on the endeavour of using recent techniques developed in artificial
intelligence to investigate how the brain supports navigation and inferential reasoning,
two key aspects of animal cognition. However, before diving into the details of our
work, it is worth briefly reviewing the origins of this view of the mind.
The idea of equating the mind to an information processing system lies at the
centre of the cognitive shift that happened in the field of psychology in the 1950s,
and that gave rise to the field of cognitive science [3]. At the beginning of that
decade, Norbert Wiener's ideas on cybernetics [4] were gaining popularity,
but it was in 1956 that a collection of ideas and findings started to shape the
study of cognition in ways that are still relevant today. In particular, in that year
Claude Shannon and John McCarthy edited a book that investigated the possibility
of designing a machine that could simulate a brain [5]. Also, Marvin Minsky started
to circulate a report that laid the foundations of the field of artificial intelligence [6].
Moreover, at about this time, the term "cognitive strategies" first appeared in the
book "A Study of Thinking" by Jerome Bruner, Jacqueline Goodnow and George
Austin [7]. In concert, critical work showing the computational limits of the human
mind was published by George Miller [8]. Finally, Noam Chomsky applied similar
ideas in linguistics, effectively revolutionising our understanding of language [9].
All these approaches were influenced by the work of Kurt Gödel, Alan Tur-
ing, Alonzo Church and John von Neumann in the 1940s and 1950s, which defined
the basic theories of computation that eventually gave rise to the invention of digi-
tal computers. Nowadays, the computer metaphor referred to as the von Neumann
computer architecture seems outdated, but the idea that the brain is a computer is
not. Indeed, a computer is just a device that can compute many different computable
functions. Our brain is one such device, so brains are literally computers. These
consideration are not merely superficial, instead they emphasise that, to understand
the link between brain and behaviour, we need an approach that can described both
the representations and computations capabilities of the mind (the functions), and
how these are grounded into the structures and functions of the brain (i.e. the de-
vice). Only by following this path we can aim for a general theory of brain function.
However, given the complexity and the breath of this venture, it rapidly became ev-
ident that a framework was required to validate the computational models. It was
the British computer scientist and psychologist David Marr, who defined perhaps
1.2. Understanding complex systems 17
the best know scheme to formalise what it means to understand the connection be-
tween brain and behaviour. Marr’s work plays a central role in the justification of
artificial neural networks (ANN) as a model cognition, it is to this that we now turn.
• level 1, the computation: this defines the goal achieved by the system - e.g.
navigating from A to B;
• level 2, the algorithm: these are the rules implemented by the system to
achieve the goal - e.g. calculating the shortest path between the two points;
• level 3, the implementation: this is the physical substrate - e.g. how neural
firing supports the calculation of the shortest path.
The intuition behind this scheme is that by only looking at the implementation it
could be too difficult, if not impossible, to deduce the algorithm that the brain im-
plements to successfully achieve a specific goal. In particular, Marr was objecting
that by only looking at neurophysiological data it could, at best, be possible to
describe some properties of the neurons. But, on its own, a mere description of the
parts composing such a complex system will probably not lead to a fundamental
understanding of its function. This should not be seen as a criticism of the analysis of
physiological data, but rather as a call to consider the whole spectrum, not just part
of it.
Further, the relationship between the three levels is not arbitrary: the computational
level should be used to suggest the possible algorithm supporting it, which
in turn predicts its mechanistic implementation. The process also works in reverse,
whereby the implementation provides feedback to the higher levels. This suggests
that a fruitful approach to modelling cognition should be neither top-down nor
bottom-up; instead the focus should be on a systemic method. However, defining a
traditional computational model that takes into account all the levels might be a
daunting exercise: too many assumptions would need to be made, and so by definition
the generality of the model would be limited. Nonetheless, we think that artificial
neural networks represent a promising way to overcome these limitations. As such,
the next section introduces ANNs and explains why they are a powerful mechanism
with which to model cognition.

1.3 Artificial neural networks
Figure 1.1: a. The equations for the forward pass in an ANN with two hidden layers and one
output layer. For each layer in the network, each unit calculates the weighted sum
of all the incoming connections from the layer below (biases are omitted for
simplicity). The weighted sum is then fed through a non-linear activation function
f(.), such as the sigmoid function or the rectified linear unit (ReLU). b. The
equations for the backward pass. At each layer the error derivative with respect to
each unit is computed by calculating a weighted sum of the error derivatives with
respect to all the connections entering that unit from the layer above. This is the
error derivative with respect to the output, $\partial E/\partial y$, which is then converted to the error
derivative with respect to the input, $\partial E/\partial o$, by multiplying it by the gradient of f(o), $\partial y/\partial o$.
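To make the passes of Fig. 1.1 concrete, the following is a minimal NumPy sketch of one step of forward propagation, backpropagation and gradient descent for a network with a single ReLU hidden layer and a squared-error objective; the layer sizes and learning rate are purely illustrative, and biases are omitted as in the figure.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def forward(x, W1, W2):
    # Forward pass: each layer computes a weighted sum of the layer below,
    # then applies the non-linearity f (biases omitted, as in Fig. 1.1a).
    o1 = W1 @ x            # weighted sum into the hidden layer
    y1 = relu(o1)          # hidden activations, y1 = f(o1)
    y2 = W2 @ y1           # linear output layer
    return o1, y1, y2

def backward(x, t, W2, o1, y1, y2):
    # Backward pass: error derivatives flow from the output towards the input.
    dE_dy2 = y2 - t                  # dE/dy for E = 0.5 * ||y2 - t||^2
    dE_dW2 = np.outer(dE_dy2, y1)    # gradient of the output weights
    dE_dy1 = W2.T @ dE_dy2           # weighted sum of derivatives from above
    dE_do1 = dE_dy1 * (o1 > 0)       # multiply by the gradient of f (ReLU)
    dE_dW1 = np.outer(dE_do1, x)     # gradient of the input weights
    return dE_dW1, dE_dW2

# One step of gradient descent on a single (input, target) example.
rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 3)) * 0.1
W2 = rng.normal(size=(2, 4)) * 0.1
x, t = rng.normal(size=3), rng.normal(size=2)
o1, y1, y2 = forward(x, W1, W2)
gW1, gW2 = backward(x, t, W2, o1, y1, y2)
W1 -= 0.01 * gW1
W2 -= 0.01 * gW2
```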
be trained. Finally, we will introduce the history of the field, and highlight how deep
learning and reinforcement learning share roots with neuroscience.
train very deep networks that could discover good hidden features needed to model
complex input-output problems, such as visual and speech recognition. However,
by the 1980s hand-engineering features had become the standard in many fields
of artificial intelligence, since experts knew from empirical studies which features
were important to solve a task. Nevertheless, hand-engineering successful features
requires a lot of knowledge and practice, and makes it difficult to scale these systems
to real-world tasks, due to the difficulty of knowing a large set of good features.
Interestingly, since the early days of artificial intelligence research the aim of
researchers has been to replace hand-engineered features with learnt ones. Despite
the simplicity of this idea, it took almost another twenty years before several groups
of independent researchers found the solution to train very deep neural networks (DNNs):
the backpropagation algorithm [22, 23, 24, 25]. Backpropagation is a general purpose
learning algorithm based on stochastic gradient descent that requires two central
elements. The first is an objective function, that is, a smooth mathematical formu-
lation used to evaluate the quality of the solution. The second is the need for the
layers in the network to be smooth functions of their inputs and internal weights.
The smoothness of both components is required because backpropagation is nothing
more than a practical application of the chain rule of derivatives, hence the need
for the functions to be differentiable [26]. The name backpropagation
comes from the fact that the algorithm calculates the gradients going backwards
down the network. Gradients are the multi-variable derivative of the loss function
with respect to all the network parameters. Specifically, it starts with the calculation
of the errors between the objective function and each of the output units, then it pro-
ceeds backwards down through all the layers to the input units (see Fig. 1.1 for an
example of forward and backward pass with the corresponding equations). The key
aspect is that the calculations performed for one layer are reused in the layer before,
allowing an efficient flow of the errors with respect to the weights of each layer.
Critically, each parameter (weight) is adjusted in relation to its contribution in re-
ducing the error (see Fig. 1.1b). The idea behind this gradient descent technique is
very general and can be seen as a way to optimise the behaviour of the network in
relation to its experience.
Figure 1.2: The recurrent network receives an input xt at time t, which is then fed through a set
of parameters U into the hidden layer, h. The hidden activations ht are calculated as
f(U xt + W ht−1), where W is another set of parameters and f(.) is the activation
function. In this way the network has a memory of the computations performed at
the previous time steps. Finally, the output, ot, is calculated as g(V ht), where g(.)
is the activation function and V a set of parameters. The same U, W, V are used at
each time step (biases are omitted from this figure for clarity).
The technique can also be extended to the reinforcement learning domain (see below).
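The recurrence of Fig. 1.2 can be written in a few lines. Below is a minimal sketch under the assumption that f is a tanh and g the identity; as in the figure, the same parameters U, W, V are reused at every time step and biases are omitted.

```python
import numpy as np

def rnn_forward(xs, U, W, V, h0):
    """Unroll the recurrence of Fig. 1.2 over a sequence of input vectors xs."""
    h, outputs = h0, []
    for x in xs:
        h = np.tanh(U @ x + W @ h)   # h_t = f(U x_t + W h_{t-1})
        outputs.append(V @ h)        # o_t = g(V h_t), with g the identity here
    return outputs, h
```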
[42, 37, 43]. Interestingly, not just RL but also ANNs have strong ties with work
done in psychology and neuroscience, so in the next section we will review how the
field of deep learning is directly connected with research done in cognitive science.
An early example of this cross-fertilisation is the Hebbian learning rule, which
strengthens the connection between two units in proportion to their coincident activity:

$w^{t+1}_{i,j} = w^{t}_{i,j} + \lambda\, x^{t}_{i} x^{t}_{j}$ (1.1)
It was roughly a decade later that single layer perceptrons, the basis of modern
ANNs, were introduced together with the rule to train them via supervisory feedback
[46]:

$w^{t+1}_{i} = w^{t}_{i} + r \cdot (d_j - y^{t}_{j})\, x_{j,i}$, (1.2)

where
• $w_i$ is the ith value in the weight vector, to be multiplied by the value of the ith
input feature,
• $x_{j,i}$ is the value of the ith feature of the jth training input vector,
• $y^{t}_{j}$ is the output produced by the perceptron for the jth training input,
• $d_j$ is the desired output for the jth training input,
• $r$ is the learning rate.
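For illustration, here is a minimal sketch of rule (1.2) applied to a toy problem; the learning rate and the thresholded output unit are assumptions consistent with the classic perceptron rather than details taken from this chapter.

```python
import numpy as np

def perceptron_update(w, x, d, r=0.1):
    """One application of rule (1.2) for a single training example:
    w: weight vector; x: input features; d: desired output; r: learning rate."""
    y = 1.0 if w @ x > 0 else 0.0    # thresholded perceptron output
    return w + r * (d - y) * x       # w_i <- w_i + r * (d_j - y_j) * x_{j,i}

# Toy usage: learning the AND function (the first input acts as a bias term).
X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]], dtype=float)
D = np.array([0, 0, 0, 1], dtype=float)
w = np.zeros(3)
for _ in range(20):
    for x, d in zip(X, D):
        w = perceptron_update(w, x, d)
```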
At that time AI research was dominated by the classical theory that saw cog-
nition as a series of manipulations made to symbolic representations similar to the
ones performed by computers [47]. More recently, ’symbolic AI’ has been chal-
lenged by the alternative view that the brain uses distributed representations which
are processed in parallel. In particular, the parallel distributed processing (PDP)
movement suggested that cognition could be explained by flexible and distributed
connections between the units of ANNs, which are learnt in an iterative fashion with
the goal of maximising an objective function [48]. Connectionist models based on
simple ANNs were successfully used to model several aspects of cognition, such as
language [e.g. 30, 25, 49], attention [e.g. 50, 51], motor control [52], memory [e.g.
53, 54], vision [e.g. 55, 56, 57] and neuropsychology [58]. Interestingly, not only
were these models able to mimic the input-output pattern observed in experiments,
but the behaviour of the artificial units resembled biological neurons [59].
The issue of representations is at the centre of the debate between logic-based and
neural network based AI models for understanding cognition. The former builds on
the naive idea that information is stored in local representations, that is, each neuron
codes for a specific pattern. The latter assumes information to be stored in a
distributed fashion; specifically, patterns are represented by the activity of several
neurons distributed across large parts of the cortex. It has been shown that one of
the main advantages of distributed representations is their ability to generalise
beyond the training examples much more easily than local ones [60, 61]. Distributed
representations, unlike symbols, are also likely to be more robust to damage or silencing
of units [62]. Moreover, the simple rules learnt by distributed representations can
be reused exponentially often as a function of the number of layers in the network
[63]. These aspects of connectionist models lie at the heart of current deep learn-
ing techniques that are the de-facto standard in most of modern machine learning
applications. For instance, current models of natural language processing are typi-
cally based on the idea of learnt word vectors [64]; that is, in the first layer of the
model each word is transformed into a vector, then the following layers learn how
to convert this vector into a set of probabilities to predict the following word in a
sentence. It has been proven that these vectors learn to code each word as a set of
distinct features that together represent the word itself [65]. Critically, these fea-
tures were not present in the input of the network, instead the learning procedure
found them to be a good basis set to factorise the input space in ways that maximise
the final objective. Distributed vector representations such as these are now the
most widely used embeddings in language modelling [66, 67, 68].
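As a minimal sketch of this idea, the first layer of such a model can be thought of as a table of learnable vectors indexed by word identity; the vocabulary and dimensionality below are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = {"the": 0, "cat": 1, "sat": 2}   # toy vocabulary
dim = 8                                   # illustrative embedding size

# The embedding matrix is a table of learnable weights: row i is the
# distributed feature vector representing word i.
E = rng.normal(size=(len(vocab), dim)) * 0.1

def embed(word):
    # A lookup is equivalent to multiplying a one-hot vector by E.
    return E[vocab[word]]

sentence = [embed(w) for w in ["the", "cat", "sat"]]
```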
1.4 Attention
Neural networks are generally characterised as universal function approximators –
they can approximate arbitrary functions with arbitrary precision [80]. However,
it is well understood that this is the case only in the limit of infinite capacity,
while in practice there are limitations due to the number of hidden units that one can
train. To overcome these limitations, and so increase the expressivity of ANNs, one
possibility is to use attention mechanisms [81], which mirror the ability of brains
to focus only on a subset of the inputs while ignoring the rest. Attention is an area
of active research in cognitive science, and the predominant view is that it can be
divided into three functional components: alerting, orienting, and executive atten-
tion. Alerting is the ability to maintain active vigilance during the performance
of a certain task. Orienting is the capability to prioritise a specific sensory input.
Finally, brains are normally able to process different streams of inputs, but when
interference arises, then executive attention mediates these conflicts. Executive at-
tention is assumed to reside in higher parts of the brain, like prefrontal cortex (PfC),
and it is thought to be the only one involving awareness [82].
relevant part of the inputs, even if these occur several steps back in the sequence
[67]. Subsequently, similar forms of attention were employed in networks
performing vision tasks, where attention was used to extract glimpses that were
particularly useful to solve the task [85]. Similarly attention was used to perform
image-to-caption generation [86], and in the context of generative modelling in both
variational autoencoders [87] and adversarial training [88].
More recently, the most prominent use of attention in ANNs has been ’self-
attention’, where elements of a single sequence are compared to each other to gen-
erate representations of the sequence itself [89]. This approach has been extended
to fully feed forward networks in a model called ’Transformer’[90]. Here the com-
bination of multiple self-attention heads with layer normalisation [91] and residual
connections [92], resulted in state-of-the-art performance across a large spectrum
of language tasks, such as next word prediction [68] and text generation [93].
Self-attention was also recently successfully applied to multi-agent reinforcement
learning [94]. The main advantage of the 'Transformer' model resides in its ability
to select relevant information from the inputs by performing a highly parallel all-
to-all comparison between the elements of the sequence. Nevertheless, the ability
of self-attention to model long-range dependencies has recently been questioned
[95], as its quadratic complexity in the input length makes the model almost
intractable for long sequences. An alternative attention scheme is dynamic
lightweight convolution which, by reusing the same weights for each contextual
element independently of the sequence time-step, reduces computational
complexity while maintaining the same performance as self-attention [96].
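A minimal sketch of single-head scaled dot-product self-attention, the basic operation described above, is given below; the T × T score matrix it builds is the source of the quadratic cost in sequence length mentioned in the text.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention over a sequence X (T x d)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv         # queries, keys and values
    scores = Q @ K.T / np.sqrt(Q.shape[-1])  # T x T all-to-all comparison
    return softmax(scores, axis=-1) @ V      # weighted mixture of values
```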
1.5 Memory
Learning, a central topic in modern AI, can be defined as the process of acquiring
knowledge about the world, or in other words, the process of building memories.
Decades of psychological studies have sought to characterise memory and understand
the processes that contribute to it. In neuroscience, a common distinction
can be made between short-term memory, working memory and long-term memory
[98]. Short-term memory is the storage of a limited amount of readily available
information for a brief delay. For instance, to understand the meaning of a sen-
tence, it is important to keep in mind the beginning of it while processing the rest;
likewise, to perform a subtraction, it could be important to remember the
carried-over number. These are examples of tasks supported by short-term memory, which
is believed to store pointers to knowledge that resides in other parts of the brain,
rather than complete concepts [99]. Short-term memory forms the basis for the
working memory system, a mental workspace where information from short-term
memory and long-term memory is retrieved and manipulated to support complex
tasks like reasoning, learning and decision-making. Working memory is believed
to depend on the interactions between a central "executive controller" and three
distinct, domain-specific buffers: the phonological loop, the visuospatial sketchpad and
the episodic buffer [100, 101]. Finally, long-term memory is a collection of systems
that support the ability to retain information over long periods of time. Long-term
memory can be further split into explicit and implicit memory. The former can be
intentionally retrieved, and can either be related to personal events – episodic mem-
ory – or to general facts about the world – semantic memory [102, 103, 104]. In
contrast, implicit memory (or non-declarative memory) is accessed without aware-
ness, and it is reflected through performance rather than explicit recollection. One
form of implicit memory is called procedural and enables us to carry out commonly
learned tasks without consciously thinking about them, like writing or riding a bi-
cycle. Implicit memory can also come about from priming, the process by which a
past experience increases the accuracy or quickness of a response. For instance, if a
word has been heard very recently, or many more times than another one, then that
word is likely to be retrieved more quickly. Another form of implicit memory
is conditioning, where the association of one thing with another is unconsciously
learnt [105].
Episodic memory is often defined as the capacity to remember specific events
with just a single exposure (i.e. one-shot); experiences are stored within the medial
temporal lobe, and the hippocampal system in particular, by remembering what,
where and when an event happened [106]. The retrieval of these episodes takes
place through so-called "mental time travel" [103], allowing an individual to go
back in time and re-live a specific moment, but equally it can be used to travel forward
and plan for future events. For instance, one could remember a chat with a friend the
night before at the pub, and agreeing to help paint the friend's new flat (travel
back); this memory can then be used to decide what to wear and what tools are
appropriate to help with the painting (travel forward).
The accumulation of separate events in memory is thought to be the main input
to a consolidation process that gradually extracts relevant facts and features into
a more general representation, the above mentioned semantic memory (even though
consolidation is not restricted to this, cf. [107]). Although the precise nature of the
interactions between episodic and semantic memory remains unclear, it has been
observed that patients with amnesia related to hippocampal damage struggle
to form new general knowledge, even though they maintain previously acquired
knowledge [103, 106], thus confirming the need for episodic memory in forming new
knowledge (even though the picture is not clear cut, as there has been some debate
on which memories can become semantic [e.g. 108]).
An early and influential network model of memory is the Hopfield network, in
which the state $s_i$ of each binary unit is updated according to the rule

$s_i = \begin{cases} +1 & \text{if } \sum_j w_{i,j}\, s_j > \theta_i, \\ -1 & \text{otherwise,} \end{cases}$ (1.3)

where

• $w_{i,j}$ is the strength of the connection weight from unit j to unit i,
• $s_j$ is the state of unit j,
• $\theta_i$ is the threshold of unit i.
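Rule (1.3) can be iterated until the network settles, as sketched below; the zero threshold in the usage example and the Hebbian outer-product storage rule are common conventions rather than details taken from this chapter.

```python
import numpy as np

def store_patterns(patterns):
    """Hebbian storage: sum of outer products of the patterns, zero diagonal."""
    n = patterns.shape[1]
    W = (patterns.T @ patterns) / n
    np.fill_diagonal(W, 0.0)
    return W

def hopfield_step(s, W, theta):
    """One sweep of the asynchronous update rule (1.3) over all units."""
    for i in np.random.permutation(len(s)):
        s[i] = 1 if W[i] @ s > theta[i] else -1
    return s

# Recover a stored pattern from a corrupted version of it.
patterns = np.array([[1, -1, 1, -1, 1, -1], [1, 1, 1, -1, -1, -1]])
W = store_patterns(patterns)
probe = np.array([-1, -1, 1, -1, 1, -1])   # first pattern with one bit flipped
for _ in range(5):
    probe = hopfield_step(probe, W, theta=np.zeros(len(probe)))
```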
In this process the nodes of the network are updated until the process eventually
converges to a stable state, at which point a pattern is read out. These networks are
useful for recovering stored patterns from degraded inputs [109]. Subsequently, RNNs
for supervised learning were introduced by Jordan [110]: this was effectively a
feed-forward network with a single hidden layer, but equipped with state units. These
latter units, which were self-connected, fed back the value of the output units at
the following time step, thus making the network recurrent. Interestingly, this way
of using the output as new input is also at the basis of modern machine translation
architectures [84]. Finally, Elman [30] introduced the equivalent of modern RNNs,
where each hidden unit had a single recurrent connection. Interestingly, some theo-
retical work on this class of models proved that fixed-sized RNNs, when equipped
with nonlinear activation function, can simulate a universal Turing machine [111].
Despite the potential of recurrent neural networks, learning long-range depen-
dencies with such models has proven to be difficult due to the problems of vanishing
and exploding gradients [112, 113]. These problems are related to gradient-based
learning, where the value of a weight in the network is adjusted based on its
contribution to the network's output (see section 1.3.1 for details on the backpropagation
learning rule). If a change in a weight's value causes a very small change in the
network's output, then the network is unable to learn that parameter effectively; this
is what is referred to as the vanishing gradient problem. Conversely, the exploding
gradient problem is due to large error gradients which accumulate over time and result in very
large updates to the neural network weights, thus resulting in saturation of the net-
work. In the context of recurrent networks, both problems have been mitigated with
the Long Short-Term Memory (LSTM) [114] network. LSTMs replace the units in
the hidden layer of traditional RNNs with a memory cell, which contains a set of
gated nodes. These self-connected nodes can learn to create paths through time that
ensure that gradients flow without vanishing or exploding. In its initial version, the
weights of the self-connected nodes in the memory cell were fixed; this limitation
was addressed by making these weights conditioned on the hidden state produced
by the network at the previous time step [115], thus making LSTMs more powerful in
dealing with longer sequences. Alongside LSTMs, a further advancement was made
with the Bidirectional Recurrent Neural Network [116], which is able to access
both past and future elements in the sequence to determine the output. The use of
LSTMs and bidirectional recurrent neural networks has been extremely successful
in several applications, such as phoneme classification [117], handwriting recognition
[118], speech recognition [119], machine translation [84], image captioning
[86] and syntactic parsing [120]. More recently, a simplified version of the LSTM,
the gated recurrent unit [121], uses a single gate to decide when to forget and when to
update the content of the cell. This architecture has shown performance that in most
cases matches the LSTM framework, raising doubts about which elements of the
architecture are needed. However, subsequent investigations [122, 123] of several
variants of the LSTM and gated recurrent unit reached the conclusion that learning
to forget is essential, but did not find a conclusive answer as to which architectural
components are universally necessary, leaving this research question open to future
investigation.
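For concreteness, a minimal sketch of one step of an LSTM memory cell is given below; the gate arrangement follows the common formulation with input, forget and output gates, and biases are omitted for simplicity.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_cell(x, h, c, Wi, Wf, Wo, Wg):
    """One step of an LSTM memory cell (biases omitted for simplicity).
    Each weight matrix maps the concatenation of input and hidden state."""
    z = np.concatenate([x, h])
    i = sigmoid(Wi @ z)      # input gate: how much new content to write
    f = sigmoid(Wf @ z)      # forget gate: how much old content to keep
    o = sigmoid(Wo @ z)      # output gate: how much of the cell to expose
    g = np.tanh(Wg @ z)      # candidate content
    c = f * c + i * g        # gated update creates a path for gradients over time
    h = o * np.tanh(c)       # new hidden state
    return h, c
```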
A further line of work that is worth mentioning in connection with the neu-
roscience literature is the so-called fast weights [124], which take inspiration from
short-term potentiation of biological neurons [125, 15]. With fast weights, each
weight of a NN is augmented with another, more plastic, one that is generally trained
with the Hebbian learning rule. This new weight can grow and decay rapidly as a
function of the current inputs, and so store traces of the recent past that are useful
for solving sequential memory tasks. More recently, two concurrent papers [126, 127]
augmented RNNs with fast Hebbian weights, allowing the networks to attend to the
recent past with better performance than traditional LSTMs, without the need to
explicitly store copies of the neural activations.
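A schematic sketch of the fast-weights idea is given below: a rapidly decaying matrix A is updated with a Hebbian outer product and then used to iteratively settle the hidden state. The decay and learning rates are illustrative, not values from [126, 127].

```python
import numpy as np

def fast_weight_update(A, h, decay=0.95, lr=0.5):
    """Hebbian update of the fast weight matrix: it grows with the outer
    product of the current hidden activity and decays back towards zero."""
    return decay * A + lr * np.outer(h, h)

def settle(A, h, steps=3):
    """Iteratively refine the hidden state through the fast weights, which
    acts like attention to recently stored activity patterns."""
    for _ in range(steps):
        h = np.tanh(A @ h)
    return h
```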
NNs where the optimal value of each connection is a function of the values of all the
others. This interdependence is especially critical at the beginning of training, when
there is virtually no structure in the weights ensemble, and so the error signal com-
ing from the gradients calculation is weak and noisy. In this scenario, it has been
shown that adjusting the behaviour of an agent by using single experiences stored
in an episodic-like memory, a process called episodic control [135], is beneficial in
the case of complex tasks with high inferential noise, especially at the beginning
of learning, when the statistical knowledge provided by the parametric system has
not set in yet [136]. The idea of episodic control has been used recently in deep
reinforcement learning [137, 138], where an agent was furnished with an episodic
buffer to store visual observations together with the actions taken and the associated
reward. The agent was then trained to select a novel action based on the similarity
between the current visual observation and the one stored in memory, by critically
taking into account the reward observed before. By using such a training regime
the agent was able to substantially increase data efficiency and reach much higher
scores on specific Atari games which had proven to be hard for simple reactive al-
gorithms like DQN. This success was mainly due to the ability of the algorithm
to perform one-shot learning [138], in a vein similar to the one supported by the
hippocampus [135]. A similar approach was also investigated in the domain of
supervised learning, where a network was trained to classify images rapidly from only
a few examples, by employing an episodic buffer in which a set of common
representations for each ImageNet class was stored and then used to match new instances
using cosine similarity [139].
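A minimal sketch of such an episodic buffer with cosine-similarity read-out is shown below; the class and method names are hypothetical, and the sketch ignores details such as buffer capacity and the reward weighting used in episodic control.

```python
import numpy as np

class EpisodicBuffer:
    """Store (key, value) pairs and retrieve the values whose keys are most
    similar to a query under cosine similarity."""
    def __init__(self):
        self.keys, self.values = [], []

    def write(self, key, value):
        self.keys.append(np.asarray(key) / np.linalg.norm(key))
        self.values.append(value)

    def read(self, query, k=1):
        q = np.asarray(query) / np.linalg.norm(query)
        sims = np.array([q @ key for key in self.keys])
        nearest = np.argsort(sims)[-k:]      # indices of the k closest memories
        return [self.values[i] for i in nearest]

# Toy usage: one-shot recall of a label from a stored representation.
buffer = EpisodicBuffer()
buffer.write(key=[1.0, 0.0, 0.0], value="class A")
buffer.write(key=[0.0, 1.0, 0.0], value="class B")
print(buffer.read([0.9, 0.1, 0.0]))          # -> ['class A']
```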
like the interference of items in memory, the Neural Turing Machine was extended
with the inclusion of memory usage statistics, in the so-called Differentiable Neural
Computer [141]. A more recent extension incorporated sparsity into the Differentiable
Neural Computer [142], allowing the model to perform similarly to the original
while employing a fraction of the computation. However, despite its ability to solve
complex reasoning tasks, the Differentiable Neural Computer proved to be sometimes
difficult to train. Another architecture, termed Memory Networks, avoided some of
these limitations by using a read-only memory storage that scaled with the size of
the inputs [143, 144], thus avoiding the problem of learning to write. This approach
was also successfully used to tackle some complex navigation tasks in DRL [145].
Since these initial architectures were introduced, a number of alternatives have been
developed. For instance, the Dynamic Memory Network [146] shares some
similarities with Memory Networks, but instead of using a pre-filled memory buffer
it works with sequential inputs, thus making the method more general. The Working
Memory Network [147] is again heavily based on Memory Networks, but it integrates a
working memory buffer and the ability to explicitly perform relational reasoning using
a RelationNet [148]. Finally, the Recurrent Entity Network [149] has analogies with
the DNC, but by employing a parallel architecture it exploits simultaneous updates
across several memory locations. However, all these models are based on the same
differentiable attention read-out function [67], or analogous variants [150], which,
despite several successes, suffers from non-trivial scaling issues, thus making this
approach difficult to generalise to useful scenarios.
The work presented in this chapter was previously published as [161]. The current
text is based on the published manuscript but has been expanded and elaborated
upon.
2.1 A Brief Outline of the Neural Mechanisms of Self-Localisation
unique subset of cells, with an overlap of about 15% / 25% between any two of
these subsets. Other studies have shown that place cell firing vectors seem to be
effectively decorrelated between different environments, a process called remapping,
with individual cells changing their firing locations and rates [174, 175, 176].
Theoretical studies have described remapping in terms of pattern separation, a process
that likely depends on the dentate gyrus and CA3 [177, 128, 178, 179, 180, 181, 182]
(see 3.1 for a thorough explanation of pattern separation).
Head direction cells signal the azimuthal angle of an animal’s head via di-
rectionally tuned receptive fields - being maximally active when the animal’s head
occupies a certain allocentric orientation spectrum (∼ 100°). These cells have been
discovered outside of the hippocampus, in the presubiculum [167], thalamic nuclei
[183] and mammillary bodies [184] as well as in the entorhinal cortex [185]. Im-
portantly, between environments these cells maintain their relative angular offset,
hence two cells that fire ∼ 45° apart in one environment will do the same in another,
even though their absolute firing fields have shifted.
Grid cells, first identified in medial entorhinal cortex (mEC) and predominately
in layer II, are distinguished by an interesting periodic pattern of activation com-
posed of multiple firing fields arranged in a triangular lattice tiling the whole envi-
ronment. Some of these cells also present a specific directional tuning reminiscent
of head direction cells [185]. Grid cells are clustered in modules where all cells
share the same orientation and scale, being essentially a translated version of each
other. Grid scale, defined from the periodicity of pattern, increases in discrete steps,
following a geometric progression along the mEC dorso-ventral axis [186, 187].
Empirical data from grid cell recordings have provided fertile ground for
computational models seeking to characterise the neural activity. In particular
models have centred around two possible mechanisms, oscillatory interference
[188, 189, 190, 191] and continuous attractors [192, 193, 194, 195]. In turn,
these models, which initially focused on the mechanism by which self-motion
might be integrated to update allocentric location, have led to the suggestion that
grid cells provide a Euclidean spatial metric framework, one that could support the
planning of direct trajectories to goals.
The work in this chapter aimed to address this point in two stages. First we
trained a deep recurrent neural network to perform a path integration task to in-
vestigate whether grid cells could emerge as a consequence of minimising the ob-
jective of self-localisation. The network was trained in a virtual square arena of
2.2m×2.2m, using simulated trajectories modelled on those of foraging rodents.
The network was required to update its estimate of location and head direction based
on translational and angular velocity signals, mirroring those available to the mam-
malian brain [199, 200, 201] (see Methods and Fig.2.7a&b). The network used was
a Long Short-Term Memory (LSTM) [114], which, as explained in section 1.5.1, is a
model particularly well suited to dealing with sequential data of the kind we are
interested in here. The LSTM received velocity inputs and was trained using backprop-
agation through time, allowing the network to dynamically combine current input
signals with activity patterns reflecting past events (see Fig.2.7a). Importantly, the
network was subject to regularisation, in particular dropout [202] and gradient clip-
ping (see Methods 2.2.1.7). The vector of activities in the place and head direction
units, corresponding to the current position, was provided as a supervised training
signal at each time step (see Methods), following evidence that in mammals, place
and head direction representations exist in close anatomical proximity to entorhinal
grid cells [185] and emerge in rodent pups prior to the appearance of mature grid
cells [203, 204]. Equally, in adult rodents, entorhinal grid cells are known to project
to the hippocampus and appear to contribute to the neural activity of place cells
[205].
2.2 Methods
In this section we present the main ideas and models used to develop the grid cell
agent. First we present the details of the supervised learning experiments, detailing
the inputs, the network architectures, the objective function and the regularisation
techniques. We then describe the deep reinforcement learning experiments
by presenting the environments used, the grid cell agent architecture, the comparison
agents, the general training algorithm and the control experiments. Finally we
present the neuroscience-based analyses used to characterise the artificial cells.
where the M head direction centres $\mu^{(h)}_i \in [-\pi, \pi]$ were chosen uniformly at
random before training, and the concentration parameter $\kappa^{(h)}$ is a positive scalar
fixed for each experiment.
To clarify, in this motion model the head of the artificial agent was fixed, that is,
the agent always faced the direction of travel.
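As a sketch of how such supervised targets can be generated: the place cell activations can be computed as a softmax over (negative) squared distances to the place field centres, and the head direction activations as a normalised von Mises bump. The functional forms and parameter values below are assumptions consistent with the description in the text, not the exact implementation.

```python
import numpy as np

def place_cell_targets(pos, centres, sigma=0.1):
    """Target place cell activity: softmax over negative squared distances
    from the agent's 2D position to each place field centre."""
    d2 = np.sum((centres - pos) ** 2, axis=1)
    logits = -d2 / (2.0 * sigma ** 2)
    e = np.exp(logits - logits.max())
    return e / e.sum()

def head_direction_targets(phi, mu, kappa=20.0):
    """Target head direction activity: a normalised von Mises bump centred
    on the agent's facing direction phi, with concentration kappa."""
    logits = kappa * np.cos(phi - mu)
    e = np.exp(logits - logits.max())
    return e / e.sum()
```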
The recurrent layer of the grid cell network is an LSTM with 128 hidden units. The
recurrent layer receives as input the vector [vt , sin(ϕ̇t ), cos(ϕ̇t )]. The initial cell state and
hidden state of the LSTM, ~l0 and ~m0 respectively, were initialised by computing a linear
transformation of the ground truth place ~c0 and head-direction ~h0 activity at time 0. The
output of the LSTM is followed by a linear layer on which dropout is applied. The output
of the linear layer, ~gt , is linearly transformed and passed to two softmax functions that
calculate the predicted head direction cell activity,~zt , and place cell activity, ~yt ,
respectively. We found evidence of grid-like and head direction-like units in the linear
layer activations ~gt .
probabilistically respond to their inputs. The idea behind this technique is to reduce
correlation amongst units in the network, in turn making the model more robust to
overfitting. The recurrent LSTM layer consists of one cell of 128 hidden units. In-
put to the recurrent LSTM layer is the vector [vt , sin(ϕ̇t ), cos(ϕ̇t )]. The initial cell
state and hidden state of the LSTM, ~l0 and ~m0 respectively, were initialised by com-
puting a linear transformation of the ground truth place and head-direction cells at
time 0:
The parameters of these two linear transformations (W (cp) , W (cd) , W (hp) , and W (hd) )
were optimised during training. The output of the LSTM, ~mt is then used to produce
predictions of the place cells ~yt and head direction cells ~zt by means of a linear
decoder network.
The linear decoder consists of three sets of weights and biases: first, weights
and biases that map from the LSTM hidden state ~mt to the linear layer activations
~gt ∈ R512 . The other two sets of weights map from the linear layer activations ~gt
to the predicted head directions, ~zt , and predicted place cells, ~yt , respectively via
softmax functions [208]. Dropout [202] with drop probability 0.5 was applied to
each unit of ~gt, which means that at each time step in the sequence 50% of the units
in the linear layer were randomly silenced. Note that there is no intermediary non-linearity
in the linear decoder.
$L(\vec{y},\vec{z},\vec{c},\vec{h}) = -\sum_{i=1}^{N} c_i \log(y_i) - \sum_{j=1}^{M} h_j \log(z_j)$, (2.7)
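A minimal sketch of this read-out and of objective (2.7) is given below; the inverted-dropout rescaling is one common convention, and the parameter names are illustrative (the LSTM producing the hidden state is as sketched in section 1.5.1).

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def linear_decoder(m, Wg, Wy, Wz, rng, drop_p=0.5):
    """From the LSTM hidden state m: linear layer g (with dropout), then two
    softmax read-outs for place and head direction predictions."""
    g = Wg @ m                              # linear layer activations ("grid code")
    mask = rng.random(g.shape) >= drop_p    # silence 50% of the units per step
    g = g * mask / (1.0 - drop_p)           # inverted-dropout rescaling
    y = softmax(Wy @ g)                     # predicted place cell activity
    z = softmax(Wz @ g)                     # predicted head direction activity
    return y, z, g

def supervised_loss(y, z, c, h):
    """The cross-entropy objective of equation (2.7)."""
    return -(c @ np.log(y)) - (h @ np.log(z))
```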
Gradients of (2.7) with respect to the network parameters were calculated using
backpropagation through time [209], unrolling the network into blocks of 100 time
steps.
a. Example of the Goal-Driven environment where the goal is shown. b. Example of the
Goal-Doors environment where a closed door (in black) is shown on the left, and an open
door is shown on the right.
evaluation in the larger maze. Note that the weights of the agent were frozen during
evaluation on the larger maze. Evaluation was over 100 episodes of fixed duration
12,600 environment steps (210 seconds).
remaining corridors, at any time, on each side only one was accessible (top or mid-
dle, randomly determined). Each time the agent reached the goal, the doors were
randomly configured again (with the same constraints). The agent always started in
a random location in the central room with a random orientation. At test time, after
the agent reached the goal for the first time, all corridors were opened, allowing
potential shortcut behaviour (see 2.12g&h). During the test phase, the agent always
started in the centre of the central room facing north. Each agent was trained for
1e9 environment steps divided into episodes of 5,400 steps (90 seconds), and
subsequently tested for 100 episodes, each one lasting for a fixed duration of 5,400
environment steps (90 seconds).
For evaluating agent performance during training (as in Fig. 2.10f, Fig. 2.11e,f) we
selected the 30 replicas (out of 60) which had the highest average cumulative reward
across 100 episodes. We also assessed the robustness of the architecture over
different initial random seeds and the hyperparameters in Table A.2 by calculating the
area under the curve (AUC). To plot the AUC we ran 60 replicas with hyperparam-
eters sampled from the same interval (see Table A.2) and different initial random
seeds (see Fig.A.3a-c).
to grid cell functions to correct for drift and anchor grids to environmental cues [206, 186],
visual input was processed by a convolutional network to produce place cell (and head
direction cell) activity patterns which were used as input to the grid network. The output of
the vision module was only provided 5% of the time to the grid network, akin to
occasional observations made by behaving animals of salient environmental cues [206].
The agent architecture (see Fig.2.4) was composed of a visual module, the
grid cell network (described above), and an actor-critic learner [214]. The visual
module was a neural network with input consisting of a three channel (RGB) 64×64
image φ ∈ [−1, 1]^{3×64×64}. The image was processed by a convolutional neural
network (see below for the details of the convolutional neural network), which
produced embeddings, ~e, which in turn were used as input to a fully connected linear
layer trained in a supervised fashion to predict place and head-direction cell ensemble
activations, ~c and ~h (as specified above), respectively. The predicted place and
head direction cell activity patterns were provided as input to the grid network 5%
of the time on average, akin to occasional imperfect observations made by behaving
animals of salient environmental cues [206]. Specifically, the output of the
convolutional network ~e was passed through a masking layer which zeroed the units
with a probability of 95%. This was done to prevent the network from relying too much
on vision and discarding the velocity inputs.
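The masking layer admits a very small sketch; the 5% keep probability follows the text, while the function name and the all-or-nothing masking of the whole vector per time step are assumptions consistent with the description above.

```python
import numpy as np

def mask_vision(e, rng, keep_prob=0.05):
    """Zero the vision-derived input on ~95% of time steps, so the grid
    network only occasionally receives a visual (place, head direction) cue."""
    return e if rng.random() < keep_prob else np.zeros_like(e)
```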
The grid cell network of the agent was implemented as in the supervised learn-
ing set up except that the LSTM (“GRID LSTM”) was not initialised based upon
ground truth place cell activations but rather set to zero. The input to the grid cell
network consisted of the two translational velocities, u and v (as in DeepMind Lab it
is possible to move in a direction different from the facing direction), the sine
and cosine of the angular velocity, ϕ̇ (these velocities were provided by DeepMind
Lab), and additionally the ~y and ~z output by the vision module. In contrast to
the supervised learning case, here the grid cell network had to use ~y and ~z to learn
how to reset its internal state each time it was teleported to an arbitrary location in
the environment (e.g. after a visit to the goal). As in the supervised learning experiments
described above, the configuration of place fields (i.e. location of place field centres
For the actor-critic learner the input was a three channel (RGB) 64 × 64
image φt ∈ [−1, 1]^{3×64×64}, which was processed by a convolutional neural network
followed by a fully connected layer. The convolutional neural network had four
convolutional layers. The first convolutional layer had 16 filters of size 5 × 5 with
stride 2 and padding 2. The second convolutional layer had 32 filters of size 5 × 5
with stride 2 and padding 2. The third convolutional layer had 64 filters of size 5 × 5
with stride 2 and padding 2. Finally, the fourth convolutional layer had 128 filters
of size 5 × 5 with stride 2 and padding 2. All convolutional hidden layers were
followed by a rectifier nonlinearity. The last convolution was followed by a fully
connected layer with 256 hidden units. The same convolutional architecture was
used for the vision module and the actor-critic learner, but the weights of the two
networks were not shared.
The output of the fully connected layer of the convolutional network, ~et, was then
concatenated with the reward rt, the previous action at−1, the current “grid code”,
~gt, and the goal “grid code”, ~g∗ (i.e. the linear layer activations observed the last time
the goal was reached) — or zeros if the goal had not yet been reached in the episode.
Note we refer to these linear layer activations as “grid codes” for brevity, even though
this layer also comprises units resembling head direction cells, border cells, and
other non classified units (e.g. see Fig.2.9a). This concatenated input was provided
to an LSTM with 256 units. The LSTM had 2 different outputs. The first output,
the actor, is a linear layer with 6 units followed by a softmax activation function,
that represents a categorical distribution over the agent’s next action. The second
output, the critic, is a single linear unit that estimates the value function. Note that
we refer to this as the ”policy LSTM” for brevity, even though it also outputs the
value function.
The place cell agent architecture is shown in Fig.2.5b. In contrast
to the grid cell agent, the place cell agent used ground truth information: specifi-
cally, the ground-truth place,~ct , and head-direction, ~ht , cell activations (as described
above). These activity vectors were provided as input to the policy LSTM in an
analogous way to the provision of grid codes in the grid cell agent.
Specifically, the output of the fully connected layer of the convolutional net-
work ~et was concatenated with the reward rt , the previous action at−1 , the ground-
truth current place code, ~ct , and current head-direction code, ~ht — together with the
ground truth goal place code,~c∗ , and ground truth head direction code, ~h∗ , observed
last time the goal was reached — or zeros if the goal had not yet been reached in
the episode (see Fig.2.5b). The convolutional network had the same architecture
as described for the grid cell agent.
Figure 2.6: Architecture of the place cell prediction agent and of the NavMemNet agent.
a) The architecture of the place cell prediction agent is similar to the grid cell agent —
having a grid cell network with the same parameters as that of the grid cell agent. The key
difference is the nature of the input provided to the policy LSTM. Instead of using grid
codes from the linear layer of the grid network ~g, we used the predicted place cell
population activity vector ~y and the predicted head direction population activity vector ~z
(i.e. the activations present on the output place and head direction unit layers of the grid
cell network, corresponding to the current and goal position) as input for the policy LSTM.
As in the grid cell agent, the output of the fully connected layer of the convolutional
network, ~et, the reward rt, and the previous action at−1, were also input to the policy
LSTM. The convolutional network had the same architecture described for the grid cell
agent. b) NavMemNet agent. The architecture implemented is the one described in [145],
specifically FRMQN, but the Asynchronous Advantage Actor-Critic (A3C) algorithm was
used in place of Q-learning. The convolutional network had the same architecture
described for the grid cell agent and the memory was formed of 2 banks (keys and values),
each one composed of 1350 slots.
The architecture of the place cell prediction agent (see Fig.2.6a) is similar to
the grid cell agent described above: the key difference is the nature of the input
provided to the policy LSTM as described below. The place cell prediction agent
had a grid cell network — with the same parameters as that of the grid cell agent.
However, instead of using grid codes from the linear layer of the grid network ~g,
as input for the policy LSTM (i.e. as in the grid cell agent), we used the predicted
place cell population activity vector ~y and the predicted head direction population
activity vector ~z (i.e. the activations present on the output place and head direction
unit layers of the grid cell network at each timestep). Specifically, the output of the
fully connected layer of the convolutional network, ~et, was concatenated with the
reward rt, the previous action at−1, the current predicted place cell activity vector,
~yt, and the current predicted head direction cell activity vector, ~zt - and the goal
predicted place cell activity vector, ~y∗, and the goal predicted head direction activity
vector, ~z∗, observed the last time the agent had reached the goal - or zeros if the agent
had not yet reached the goal within the episode (see Fig.2.5). The convolutional network
had the same architecture described for the grid cell agent.
The critical difference between the place cell agent and the place cell predic-
tion agent (see Fig.2.5b and 2.6a respectively) is that the former used ground truth
information (i.e. place and head direction cell activations for current location and
goal location) - whereas the latter used the population activity produced across the
output place and head direction cell layers (i.e. for current location and goal loca-
tion) by the linear layer of the same grid network as utilised by the grid cell agent.
Place cell agent with heterogeneously sized place fields: to control for differ-
ences in the number and area of spatial fields between agents, we also generated
two further place cell agents that were explicitly matched to the grid cell agent.
Specifically, we used a watershed algorithm [215] to detect 660 individual grid
fields in the grid-like units of the grid cell agent. The distribution of the areas of
these fields was found to exhibit three peaks — based on a Gaussian fitting procedure
— having means equivalent to 2D Gaussians with standard deviations of 8.2cm,
15.0cm, and 21.7cm. Hence we generated a further control agent having 395 place
cells of size 8.2cm, 198 of size 15.0cm, and 67 of 21.7cm — 660 place cells in
total, the relative numbers reflecting the magnitudes of the Gaussians fit to the dis-
tribution. A final control agent was also generated having 256 place cell units in
total — the same number of linear layer units as the grid agent — distributed across
the same three scales in a similar ratio. Additionally, we note that from a machine learning perspective, the place cell and grid cell agents with the same number of linear layer units are in principle well matched, since they are provided with the same input information and have an identical number of parameters.
2.2.3.5 A3C
The n-step return is defined as the discounted sum of rewards, $R_{t:t+n} = \sum_{i=1}^{n} \gamma^{i-1} r_{t+i}$. The value function is the expected return from state s, $V^{\pi}(s) = \mathbb{E}[R_{t:\infty} \mid s_t = s, \pi]$, under actions selected according to a policy $\pi(a|s)$.
We used the Asynchronous Advantage Actor-Critic (A3C) algorithm [214], which implements a policy, $\pi(a|s,\theta)$, and an approximation to its value function, $V(s,\theta)$, using a neural network parameterised by $\theta$, which is trained by minimising the following loss function:
$$L_{A3C} = L_{\pi} + \alpha L_{V} + \beta L_{H},$$
where $L_{\pi} = -\mathbb{E}_{s_t \sim \pi}[\hat{R}_t]$, $L_{V} = \mathbb{E}_{s_t \sim \pi}\big[(\hat{R}_t - V(s_t,\theta))^2\big]$, and $L_{H} = -\mathbb{E}_{s_t \sim \pi}[H(\pi(\cdot|s_t,\theta))]$ is a policy entropy regularisation term. The grid cell network and the vision module were trained with the same loss reported for supervised learning:
$$\mathcal{L}(\vec{y},\vec{z},\vec{c},\vec{h}) = -\sum_{i=1}^{N} c_i \log(y_i) - \sum_{j=1}^{M} h_j \log(z_j).$$
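To make these objectives concrete, the following is a minimal numpy sketch of the two losses; the advantage weighting inside the policy term and all function names are assumptions (standard A3C practice), not details taken from the implementation above.

```python
import numpy as np

def a3c_loss(returns, values, policy_probs, actions, alpha=0.5, beta=1e-3):
    """Monte-Carlo estimate of L_A3C = L_pi + alpha*L_V + beta*L_H.

    returns:      n-step returns R_hat_t, shape [T]
    values:       value predictions V(s_t, theta), shape [T]
    policy_probs: action probabilities pi(a|s_t, theta), shape [T, A]
    actions:      indices of the actions taken, shape [T]
    """
    logp = np.log(policy_probs[np.arange(len(actions)), actions])
    adv = returns - values
    l_pi = -np.mean(logp * adv)            # policy term (assumed advantage weighting)
    l_v = np.mean(adv ** 2)                # value regression term
    entropy = -np.sum(policy_probs * np.log(policy_probs), axis=1)
    l_h = -np.mean(entropy)                # entropy regularisation term
    return l_pi + alpha * l_v + beta * l_h

def grid_network_loss(y, z, c, h):
    """Supervised cross-entropy loss of the grid network: predicted place cell (y)
    and head direction (z) activations against targets c and h."""
    return -np.sum(c * np.log(y)) - np.sum(h * np.log(z))
```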
grid network, as previously described (see Fig. 2.4 for details). Each of these networks was trained on its own single thread: one thread to train the vision module and another to train the grid network (so in total we used 34 threads). Also, there was no gradient sharing between the actor-critic learners, the vision module and the grid network.
The hyperparameters of the grid cell network were kept fixed across all the
simulations and were derived from the best performing network in the supervised
learning experiments. For the hyperparameter details of the vision module, the grid
network and the actor-critic learner please refer to Table A.2. For each of the agents
in this paper, 60 replicas were run with hyperparameters sampled from the same
interval (see Table A.2) and different initial random seeds.
To demonstrate that the goal grid code provided sufficient information to enable the agent to navigate to an arbitrary location, we took an agent trained in the square arena, froze its weights, and ran it in the same square arena for 5,400 steps. Critically, after the 6th time the agent reached the goal, we sampled the grid code from a random point that the agent had visited in the environment (the "fake goal grid code"). We then substituted the true goal grid code with this fake goal grid code, to show that this would be sufficient to direct the agent to a location where there was no actual goal.
Spatial (ratemaps) and directional activity maps were calculated by taking a fully
trained network and collecting data for 500 episodes of 1350 steps each. Then each
point in the trajectory was assigned to a specific spatial and directional bin based on its location and direction of facing. Spatial bins were defined as a 32×32 square grid spanning each environment, and directional bins as 20 equal-width intervals. Then, for each unit, the mean activity over all the trajectory points assigned to that bin was computed. These values were displayed and analysed further without additional smoothing.
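As an illustration of this binning procedure, a spatial ratemap for a single unit could be computed roughly as follows; the arena extent used here is an assumed value and the function is a sketch rather than the analysis code itself.

```python
import numpy as np

def ratemap(xy, activations, extent=2.2, n_bins=32):
    """Mean activity of one unit in each spatial bin of a square arena.

    xy:          trajectory positions, shape [T, 2], in metres
    activations: the unit's activation at each step, shape [T]
    extent:      side length of the arena (assumed value)
    """
    edges = np.linspace(0.0, extent, n_bins + 1)
    ix = np.clip(np.digitize(xy[:, 0], edges) - 1, 0, n_bins - 1)
    iy = np.clip(np.digitize(xy[:, 1], edges) - 1, 0, n_bins - 1)
    total = np.zeros((n_bins, n_bins))
    count = np.zeros((n_bins, n_bins))
    np.add.at(total, (iy, ix), activations)   # sum activity per bin
    np.add.at(count, (iy, ix), 1)             # visits per bin
    with np.errstate(invalid="ignore"):
        rm = total / count                    # NaN marks unvisited bins
    return rm
```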
For each unit the reliability of spatial firing between baseline trials was assessed by
calculating the spatial correlation between pairs of rate maps taken at 2 different logging steps in training (t = 2e5; t′ = 3e5). The total training time was 3e5 steps, so the points were selected with enough time difference to minimise the chances of finding spurious correlations. The Pearson product moment correlation coefficient was calculated between equivalent bins in the two trials, and unvisited bins were excluded from the measure.
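A minimal sketch of this stability measure, assuming unvisited bins are marked as NaN in the ratemaps (as in the sketch above):

```python
import numpy as np

def spatial_correlation(rm_a, rm_b):
    """Pearson correlation between two ratemaps, ignoring unvisited (NaN) bins."""
    mask = ~(np.isnan(rm_a) | np.isnan(rm_b))
    a, b = rm_a[mask], rm_b[mask]
    return np.corrcoef(a, b)[0, 1]
```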
where $\alpha_i$ and $\beta_i$ are, respectively, the centre and intensity of angular bin $i$ in the activity map. These vectors were averaged to generate a mean resultant vector:
$$\vec{r} = \frac{\sum_{i=1}^{N} \vec{r}_i}{\sum_{i=1}^{N} \beta_i}, \qquad (2.9)$$
and the length of the resultant vector was calculated as the magnitude of $\vec{r}$. We used 20 angular bins.
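In code, eq. (2.9) reduces to the following; the decomposition of each bin vector into cosine and sine components is an assumption about how the bin vectors ~ri are formed:

```python
import numpy as np

def resultant_vector_length(alphas, betas):
    """Length of the mean resultant vector for a directional activity map.

    alphas: centres of the N angular bins (radians)
    betas:  mean activation (intensity) in each bin
    """
    vx = np.sum(betas * np.cos(alphas))   # x component of the summed bin vectors
    vy = np.sum(betas * np.sin(alphas))   # y component
    return np.hypot(vx, vy) / np.sum(betas)
```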
Border score. To identify units that were preferentially active adjacent to the edges of the enclosure, we adopted a modified version of the border score [169]. For each of the four walls in the square enclosure, the average activation for that wall, $b_i$, was compared to the average centre activity $c$, yielding a border score for that wall; the maximum was used as the border score for the unit:
$$bs = \max_{i \in \{1,2,3,4\}} \frac{b_i - c}{b_i + c}, \qquad (2.10)$$
where $b_i$ is the mean activation for bins within $d_b$ bins of the $i$-th wall and $c$ is the average activity for bins further than $d_b$ bins from any wall. In all our experiments 20 by 20 bins were used and $d_b$ was set to 3.
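A sketch of the border score of eq. (2.10) on a 20x20 ratemap with d_b = 3; counting corner bins towards both adjacent walls is a simplifying assumption:

```python
import numpy as np

def border_score(rm, db=3):
    """Border score (eq. 2.10) for a square ratemap.

    For each wall, compare the mean activation within db bins of the wall
    (b_i) to the mean activation of the remaining central bins (c).
    """
    near_wall = np.zeros_like(rm, dtype=bool)
    near_wall[:db, :] = near_wall[-db:, :] = True
    near_wall[:, :db] = near_wall[:, -db:] = True
    c = np.nanmean(rm[~near_wall])                      # central activity
    walls = [rm[:db, :], rm[-db:, :], rm[:, :db], rm[:, -db:]]
    return max((np.nanmean(w) - c) / (np.nanmean(w) + c) for w in walls)
```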
Threshold setting for gridness, border score, and directional measures. The hexagonality of the spatial activity map (gridness), directional modulation (length of the resultant vector), and propensity to be active against environmental boundaries (border score) exhibited by units in the linear layer were benchmarked against the 95th percentile of null distributions obtained using a permutation procedure [217, 218] applied to each unit's ratemap. This shuffling procedure aimed to preserve the local topography of fields within each ratemap while distributing the fields themselves at random [218].
For the gridness measure and border score, null distributions were constructed
using a 'field shuffle' procedure equivalent to that specified by [218]. Briefly, a watershedding algorithm [215] was applied to the ratemap to segment spatial fields.
The peak bin of each field was found and allocated to a random position within the
ratemap. Bins around each peak were then incrementally replaced, retaining as far
as possible their proximity to the peak bin. This procedure was repeated 100 times
for each of the units present in the linear layer and the gridness and border score
of the shuffled ratemaps assessed as before. In each case the 95th percentile of the
resulting null distribution was found and used as a threshold to determine if that
unit exhibited significant grid or border-like activity.
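Schematically, the threshold computation reduces to the following; the field-shuffle routine itself (shuffler below) is assumed given, since its details are only summarised above:

```python
import numpy as np

def shuffle_threshold(score_fn, rm, shuffler, n=100, pct=95):
    """95th-percentile null threshold for one unit via a shuffling procedure.

    score_fn: gridness or border-score function for a ratemap
    shuffler: assumed field-shuffle routine that relocates fields at random
              while preserving their local topography
    """
    null_scores = [score_fn(shuffler(rm)) for _ in range(n)]
    return np.percentile(null_scores, pct)
```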
To validate the thresholds obtained using shuffling procedures, we calculated alternative null distributions by analysing the grid and border responses of linear units from 500 untrained networks. Again, in each case, a grid score and border score for each unit was calculated, these were pooled, and the 95th percentile found. In all cases the thresholds obtained by the first method were found to be the more stringent and these were used for all subsequent analyses. The means, over units, of the thresholds obtained were gridness > 0.37 and border score > 0.50. Units exceeding these thresholds were considered to be grid-like and border-like, respectively.
To establish a significance threshold for directional modulation we calculated
the length of the resultant vector that would demonstrate statistical significance un-
der a Rayleigh test of directional uniformity at α = 0.01. The resultant vector was obtained by first computing the average activation for each of 20 directional bins. A threshold length of 0.47 for the resultant vector was obtained. The more stringent of these two thresholds was used.
The distribution of scales from grid-like units was fit with Gaussian mixture distributions containing 1 to 8 components. Fits were made using an Expectation-Maximization approach implemented with fitgmdist (Matlab 2016b, Mathworks, MA). The efficiency of fits made with different numbers of components was compared using the Bayesian Information Criterion (BIC) [219]; the model with the lowest BIC score (3 components) was selected as the most efficient.
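A Python stand-in for this model-selection step (the original analysis used Matlab's fitgmdist); scikit-learn's GaussianMixture is used here purely for illustration:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def best_gmm(scales, max_components=8, seed=0):
    """Fit Gaussian mixtures with 1..8 components to the grid scales and
    return the fit with the lowest BIC."""
    x = np.asarray(scales).reshape(-1, 1)
    fits = [GaussianMixture(k, random_state=seed).fit(x)
            for k in range(1, max_components + 1)]
    return min(fits, key=lambda g: g.bic(x))
```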
We ran two separate decoding analyses, looking for evidence of each of
the two metric codes (i.e. Euclidean distance, allocentric goal direction). For each
decoding analysis we trained an L2-regularized (ridge) regression model on all data
apart from the first 21 time-steps of each episode. The model was then tested on the
four early sampling steps of interest, where accuracy was assessed as the Pearson
correlation between the predicted and actual values over the 200 episodes. The
penalization parameter was selected by randomly splitting the training data into
internal training and validation sets (90% and 10% of the episodes respectively).
The optimal parameter was selected from 30 values, evenly spaced on a log scale
between 0.001 and 1000, based on the best performance on the validation set. This
parameter was then used to train the model on the full training set, and the resulting model was evaluated on the fully independent test set. As the allocentric direction metric is circular, we
decomposed the vector into two target variables: the cosine and sine of the polar
angle. All reported allocentric decoding results are the average of the cosine and
sine results. For the purpose of comparing decoding accuracy across agents, we
report the difference in accuracy, along with a 95% bootstrapped confidence interval
on this difference, based on 10,000 samples.
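The decoding procedure can be sketched as follows, assuming scikit-learn's Ridge for the L2-regularised regression; the split proportions and the penalty grid follow the text, while everything else is illustrative:

```python
import numpy as np
from sklearn.linear_model import Ridge
from scipy.stats import pearsonr

def decode_metric(X_train, y_train, X_test, y_test, val_frac=0.1, seed=0):
    """Ridge decoding of a metric code; the penalty is chosen on a held-out
    validation split, then the model is refit on all training data."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X_train))
    n_val = int(val_frac * len(X_train))
    val, tr = idx[:n_val], idx[n_val:]
    alphas = np.logspace(-3, 3, 30)          # 30 values, 0.001 .. 1000
    scores = []
    for a in alphas:
        m = Ridge(alpha=a).fit(X_train[tr], y_train[tr])
        scores.append(pearsonr(m.predict(X_train[val]), y_train[val])[0])
    best = alphas[int(np.argmax(scores))]
    model = Ridge(alpha=best).fit(X_train, y_train)
    return pearsonr(model.predict(X_test), y_test)[0]
```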
We followed the guidelines outlined by [220], specifically reporting effect sizes and confidence intervals. Unless otherwise stated, effect sizes were calculated using the following formula:
$$\mathrm{effect\ size} = \frac{\mu_{\mathrm{group1}} - \mu_{\mathrm{group2}}}{\sigma_{\mathrm{pooled}}}. \qquad (2.11)$$
The confidence interval for the effect size was calculated according to [222] using:
$$ci_{\mathrm{effect\ size}} = \sqrt{\frac{N_{\mathrm{group1}} + N_{\mathrm{group2}}}{N_{\mathrm{group1}} \times N_{\mathrm{group2}}} + \frac{\mathrm{effect\ size}^2}{2 \times (N_{\mathrm{group1}} + N_{\mathrm{group2}})}}. \qquad (2.13)$$
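In code, eqs. (2.11) and (2.13) amount to the following; the pooled standard deviation uses the standard two-group formula, which is an assumption since the text does not define it explicitly:

```python
import numpy as np

def effect_size_with_ci(g1, g2):
    """Standardised effect size (eq. 2.11) and the half-width of its
    confidence interval (eq. 2.13) for two groups of scores."""
    n1, n2 = len(g1), len(g2)
    pooled = np.sqrt(((n1 - 1) * np.var(g1, ddof=1) +
                      (n2 - 1) * np.var(g2, ddof=1)) / (n1 + n2 - 2))
    es = (np.mean(g1) - np.mean(g2)) / pooled
    ci = np.sqrt((n1 + n2) / (n1 * n2) + es ** 2 / (2 * (n1 + n2)))
    return es, ci
```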
2.3 Results
In this section we present the results of the supervised learning experiment, in which the "grid network" developed grid-like units as a solution to the self-localisation problem posed by the objective function. We then present how the "grid network" was used in a deep learning agent, the "grid cell agent", to test the idea that grid cells support vector-based navigation. Finally, we present the performance of the "grid cell agent" in complex multi-compartment environments, which demonstrated the ability of this agent to take shortcuts through previously unvisited parts of the environment.
The "grid network" was able to path integrate accurately in this setting involving foraging behaviour (mean error after a 15s trajectory: 16cm vs. 91cm for an untrained
network, effect size = 2.83; 95% CI [2.80, 2.86], Fig. 2.7b&c). Strikingly, indi-
vidual units within the linear layer of the network developed stable spatial activity
profiles similar to neurons within the entorhinal network [159, 185] (Fig. 2.7d, and
Fig. A.1 for the whole population). Specifically, 129 of the 512 linear layer units
(25.2%) resembled grid cells, exhibiting significant hexagonal periodicity (gridness [185]) versus a null distribution generated by a conservative field-shuffling procedure (see 2.2.6.3 for details on the shuffling procedure), which resulted in a threshold of 0.37. The scale of the grid patterns, measured from the spatial autocorrelograms of the activity maps [185], varied between units (range 28cm to 115cm,
mean 66cm) and followed a multi-modal distribution, consistent with empirical re-
sults from rodent grid cells [186, 187] (Fig. 2.7e). To assess these clusters we
fit mixtures of Gaussians, finding the most parsimonious number by minimizing
the Bayesian Information Criterion (BIC, in Fig. 2.8 we report the distribution of
Figure 2.7: Emergence of Entorhinal Grid Cells in a Deep Neural Network Trained to Path Integrate.
a, Schematic of network architecture (see Extended Data Figure 1 for details). b, Example
trajectory (15s), self-location decoded from place cells resembles actual path (respectively,
dark and light-blue). c, Accuracy of decoded location before (blue) and after (green)
training. d, Linear layer units exhibit spatially tuned responses resembling grid, border,
and head direction cells. Ratemap shows activity over location (top), spatial
autocorrelogram of the ratemap with gridness indicated (middle), polar plot show activity
vs. head direction (bottom). e, Spatial scale of grid-like units (n = 129) is clustered.
Distribution is more discrete[187] than chance (effect size = 2.98, 95% CI [0.97, 4.91])
and best fit by a mixture of 3 Gaussians (centres 0.47, 0.70 & 1.06m, ratio=1.49 & 1.51).
f, Directional tuning of the most strongly directional units (n = 52). Lines indicate length
and orientation of resultant vector (see Methods), exhibiting a six-fold clustering
reminiscent of conjunctive grid cells[213]. g, Distribution of gridness and directional
tuning. Dashed lines indicate 95% confidence interval from null distributions (based on
500 data permutations), 14 (11%) grids exhibit directional modulation (see Methods).
Similar results were seen in a circular environment (Extended Data Figure 3).
BIC vs. the number of Gaussians fit). The distribution was best fit by 3 Gaussians
(means 47cm, 70cm, and 106cm), indicating the presence of scale clusters with a
ratio between neighbouring clusters of approximately 1.5, closely matching theo-
retical predictions [223] and lying within the range reported for rodents [186, 187]
(Fig. 2.7e, Fig. 2.8). Interestingly, when the network was retrained in a circular environment, the ratio adapted to 1.66. The linear layer also exhibited units
resembling head direction cells (10.2%), border cells (8.7%), and a small number
of place cells [200] as well as conjunctions of these representations (Fig. 2.7d,f&g,
Fig. A.1).
Figure 2.8: Characterization of grid-like units in the square and circular environments.
a) The scale (assessed from the spatial autocorrelogram of the ratemaps) of grid-like units
exhibited a tendency to cluster at specific values. The number of distinct scale clusters was
assessed by sequentially fitting Gaussian mixture models with 1 to 8 components. In each
case, the efficiency of the fit (likelihood vs. number of parameters) was assessed using
Bayesian information criterion (BIC). BIC was minimized with three Gaussian
components indicating the presence of three distinct scale clusters. b) Spatial stability of
units in the linear layer of the supervised network was assessed using spatial correlations
— bin-wise Pearson product moment correlation between spatial activity maps (32 spatial
bins in each map) generated at 2 different points in training, t = 2e5 and t′ = 3e5 training steps; that is, 2/3 of the way through training and the end of training, respectively. This
separation was imposed to minimise the effect of temporal correlations and to provide a
conservative test of stability. Grid-like units (gridness > 0.37) blue, directionally
modulated units (resultant vector length > 0.47) green. Grid-like units exhibit high spatial
stability, while directionally modulated units do not. c) Robustness of the grid
representation to starting conditions. The network was retrained 100 times with the same hyperparameters but different random seeds controlling the initialisation of network weights, ~c and ~h. Populations of grid-like units (gridness > 0.37) were found to appear in all cases,
the average proportion of grid-like units being 23% (SD of 2.8%). d) Circular
environment: the supervised network was also trained in a circular environment (diameter
= 2.2m). As before, units in the linear layer exhibited spatially tuned responses resembling
grid, border, and head direction cells. Eight units are shown. Top, ratemap displaying
activity binned over location. Middle, spatial autocorrelogram of the ratemap,
gridness[185] is indicated above. Bottom, polar plot of activity binned over head direction.
e) Spatial scale of grid-like units (n = 56 (21.9%)) is clustered. Distribution is best fit by a
mixture of 2 Gaussians (centres 0.58 & 0.96m, ratio = 1.66). f) Distribution of directional tuning for the 31 most directionally active units; a single line for each unit indicates the length and orientation of its resultant vector [216]. g) Distribution of gridness and directional tuning.
Dashed lines indicate 95% confidence interval derived from shuffling procedure (500
permutations), 5 grid units (9%) exhibit significant directional modulation.
50% best agent replicas (n=32) plotted (see Methods). The gray band displays the 68%
confidence interval based on 5000 bootstrapped samples. d) After locating the goal for the
first time during an episode the agent typically returned directly to it from each new
starting position, showing decreased latencies for subsequent visits, paralleling the
behaviour exhibited by rodents.
To benchmark the grid cell agent, we used a control place cell agent (Fig. 2.10f) with homogeneous place fields tuned to maximize
performance. This agent was chosen because place cells provide a robust represen-
tation of self-location but are not thought to provide a substrate for long range vec-
tor calculations [198]. Further, to additionally control for differences in the number
and area of spatial fields between agents, we also generated two place cell agents
– incorporating 256 and 660 heterogeneously sized place fields – that were explic-
itly matched to the grid cell agent (see section 2.2.3.4 in the Methods for details).
Again, the performance of the grid cell agent was found to be considerably better
than these additional place cell agents (Average score over 100 episodes: grid cell
agent = 289 vs. best place agent with 660 heterogeneous fields = 212, effect size
= 3.93, 95% CI [3.54, 4.31]; best place agent with 256 heterogeneous fields = 225,
effect size = 3.52, 95% CI [3.18, 3.87]).
Finally, we examined the units in the linear layer, again finding a heterogeneous
population resembling those found in entorhinal cortex, including 21.4% cells that
surpassed the threshold for being considered grid-like units (Fig. 2.10g, Fig 2.9)
— paralleling the dependence of mammalian grid cells on self-motion information
[201, 227] and spatial cues [159, 186].
We next applied targeted lesions to the goal grid code and re-examined performance and representa-
tion of the goal directed vector. When 25% of the most grid-like units were silenced
(see Methods 2.2.5.1), performance was worse than lesioning 25% of cells in the
linear layer at random (average score for 100 episodes: 126.1 vs. 152.5, respec-
tively; effect size = 0.38, 95% CI [0.34, 0.42]). Further, as expected, goal-directed
vector codes were more strongly degraded (Euclidean distance: random lesions de-
coding accuracy r = 0.45, top-grid lesions decoding accuracy r = 0.38, difference
in decoding accuracy = 0.08, 95% CI [0.03, 0.13]). We also performed an additional
experiment where the effect of the targeted grid lesion was compared to that of le-
sioning non-grid units with patchy firing. As explained in 2.2.5.1, the patchy multi-field spatial cells (non-grid units) were chosen amongst the units with a grid score lower than the 0.37 threshold. The chosen units also had a head-direction score lower than 0.47, and their number of spatial fields was in the same range as that of grid-like units (3 to 13). The number of fields in each ratemap was calculated by applying a watershedding algorithm [215] to the ratemap, ignoring fields with area
smaller than 4 bins. Our results show that the targeted grid cell lesion had a greater
effect than the patchy non-grid cell lesion (average score for 100 episodes: 126.1
vs. 151.7, respectively; effect size = 0.38, 95% CI [0.34, 0.42]). These results sup-
port a role for the grid-like units in vector-based navigation, with the relatively mild
impact on performance potentially accounted for by the difference in lesioning net-
works as compared to animals. Specifically, the procedure for lesioning networks
differs in important respects from experimental lesions in animals — which bears
upon the results observed. Briefly, networks have to be trained in the presence of
an incomplete goal grid code and thus have the opportunity to develop a degree of
robustness to the lesioning procedure – which would otherwise likely result in a
catastrophic performance drop (see Methods 2.2.5.1). This opportunity is not typi-
cally afforded to experimental animals. This, therefore, may explain the significant
but relatively small performance deficit observed in lesioned networks.
Our comparison agents for the grid cell agent included an agent specifically de-
signed to use a different representational scheme for space (i.e. place cell agent,
see Fig.2.5b and see Methods 2.2.3.2), and a baseline deep RL agent (A3C [214],
see Fig.2.5a). The place cell agent relates to theoretical models of goal-directed
navigation from the neuroscience literature (e.g. [230, 231]). A key difference be-
tween grid and place cell based models is that the former are proposed to enable the
computation of goal-directed vectors across large-scale spaces [196, 198, 197, 232], whereas place cell based models are inherently limited in terms of naviga-
tional range (i.e. to the largest place field) and do not support route planning across
unexplored spaces [198]. First, we tested these three agents in the “goal-driven”
maze (see Methods 2.2.2.2). The grid-cell agent exhibited high levels of perfor-
mance, and over the course of 100 episodes, attained an average score of 346.5,
beating both the place cell agent (average score 258.76; contrast effect size = 1.98,
95% CI [1.79, 2.18]) and the A3C agent (average score 137.00; contrast effect size
= 14.31, 95% CI [12.91, 15.71]). The grid cell agent showed markedly superior
performance compared to the other agents in the “goal-doors” maze (average score
over 100 episodes: grid cell agent = 284.30 vs place cell agent = 90.53, effect size
= 7.86, 95% CI [7.09, 8.63]; A3C agent = 48.69, effect size = 7.73, 95% CI [6.97,
8.48]). Interestingly, therefore, the enhanced performance of the grid cell agent was
particularly evident when it was necessary to recompute trajectories due to changes
in the door configuration, highlighting the flexibility of vector-based navigation in
exploiting ad hoc short-cuts (Fig. 2.11f).
The grid cell agent exhibited stronger performance than a professional human
player in both “goal-driven” (average score: grid cell agent = 346.50 vs. profes-
sional human player = 261, effect size = 4.00, 95% CI [3.50, 4.52]) and “goal-
doors” (average score: grid cell agent = 284.30 vs. professional human player =
240.5, effect size = 2.49, 95% CI [2.18, 2.81]). The human expert received 10
episodes worth of training in each environment before undergoing 20 episodes of
testing. This is considerably less training than that experienced by the network. Importantly, however, the mammalian brain has evolved to path integrate, and the human expert naturally had a lifetime's worth of relevant navigational experience. Hence, although drawing concrete conclusions from the relative performance of humans and agents is necessarily difficult, human-level performance is useful as a broad comparison and represents a commonly used benchmark in similar papers [36].
Further, decoding accuracy for Euclidean distance and goal direction was sub-
stantially and significantly higher in the grid cell agent than both the place cell
(Euclidean distance difference in r = 0.44; 95% CI [0.37, 0.51]; Goal direction
difference in r = 0.52; 95% CI [0.49, 0.56]) and deep RL (Euclidean distance dif-
ference in r = 0.57; 95% CI [0.5, 0.63]; Goal direction difference in r = 0.66; 95%
CI [0.62, 0.70]) control agents (Figure 2.11j&k).
We next assessed the ability of the grid cell agent and comparison agents to use novel shortcuts when they became
available in specifically configured probe mazes. First, agents trained in the goal-
doors environment were exposed to a linearised version of Tolman’s sunburst maze
with no further training. The maze contained 5 evenly spaced corridors, each of
which had a door at the end closest to the start position of the agent. The agent al-
ways started on one side of the corridors with the same heading orientation (North;
see Fig 2.12a) and the goal was always placed in the same location on the other side
of the corridors. Until the agent reached the goal for the first time, only one door was open (door 5, Fig. 2.12a), but after that all the doors were opened for the remain-
der of the episode. After reaching the goal, the agent was teleported to the original
position with the same heading orientation. The grid cell agent, but not compari-
son agents, was reliably able to exploit shortcuts, preferentially passing through the
doorways that offered a direct route towards the goal (Fig.2.12a-c). The average
testing score of the grid cell agent was higher than that of the place agent (124.1 vs
60.9, effect size = 1.46, 95% CI [1.32, 1.61]) and of the A3C agent (124.1 vs. 59.7,
effect size = 1.51, 95% CI [1.36, 1.66]), see Fig. 2.12d.
(grid cell agent vs. place cell agent, effect size = 1.89, 95% CI [1.69, 2.09]; grid cell
agent vs. A3C agent, effect size = 12.77, 95% CI [11.48, 14.07]; place cell agent
vs. A3C agent, effect size = 14.87, 95% CI [13.35, 16.38]).
2.4 Discussion
Several theoretical papers argue for a computational role of grid cells in provid-
ing an efficient and noise-tolerant representation for space [226, 196, 159, 193].
Here we use a novel approach to provide evidence that supports this view, by
demonstrating that grid-like representations can emerge in a generic deep neural network trained to optimise the objective of self-localisation. Notably, our work contrasts with previous approaches where grid cells have been hard-wired [188, 190, 233, 195, 192], derived through eigendecomposition of place fields [234, 235], or have arisen through self-organisation in the absence of an objective function [236]. It is worth noting that our experiments were not designed to provide
insights into the development of grid cells in the brain - due to the limitations of the
training algorithm used (i.e. backpropagation) in terms of biological plausibility
(although see [237]). More generally, however, our findings accord with the per-
spective that the internal representations of individual brain regions such as the en-
torhinal cortex arise as a consequence of optimizing for specific objective functions
(e.g. path integration), providing a parallel to the optimisation process in artificial
neural networks [154].
Our results also support models of navigation based on the computation of goal-directed vectors across large-scale spaces [196, 232, 198, 197]. Moreover, we demonstrate that vector-based navigation can
be effectively combined with a path-based barrier avoidance strategy enabling the
exploitation of optimal routes in complex multi-compartment environments.
Our model also departs from the traditional machine learning approach to navigation through simultaneous localisation and mapping (SLAM). Conven-
tional SLAM based techniques typically require extensive experience to construct
an accurate map of the environment in order to support navigation [225, 238], an
approach that requires considerable data to be collected and is inflexible to modifi-
cation of the environment. Recently deep reinforcement learning (RL) approaches
to navigation have been developed [239, 145, 240]; however, these primarily use
reactive route-based navigational strategies, and fail to traverse unexplored space or
exploit short-cuts. In contrast, at the start of each trajectory, our model is able to
rapidly develop a goal-directed vector, providing agents with the ability to exploit
novel short-cuts in complex, novel environments.
Finally it is worth pointing out that in this work we primarily set our focus
on grid cells and head direction cells, taking for granted the availability of place
cells. We believe that a possible future direction could be to augment the model
with an episodic buffer that can be used to store visual memories of visited places.
Then the network could be trained to anticipate, given a set of velocity inputs,
the similarity of its current visual input to past visual memories. We believe that
this augmented model would provide a framework within which place cells would develop; these in turn would become the targets on which the appearance of grid cells depends.
Also, a further limitation stems from the fact that we only explored the do-
main of spatial navigation, by considering the role of grid and head direction cells
in the integration of self-motion. However several recent reviews have provided
increasing evidence for the role of the hippocampus and surrounding areas in do-
mains other than spatial reasoning [241, 242, 243, 244]. In particular it has been
shown that the human hippocampus is involved in mapping abstract spaces by cre-
ating links between elements that were not experienced together, a process which is
believed to support inferential reasoning [160, 245]. Intriguingly, recent evidence also suggests that grid-like representations are used to navigate in this abstract space in support of non-spatial reasoning [246] and to create links between memories. By
building on this evidence the next chapter will investigate inferential reasoning, how
this is supported by the hippocampus and how to build a deep learning model with
similar capabilities.
Chapter 3
The work presented in this chapter was previously published as [247] at the ICLR
2020 conference. The current text is based on the published manuscript but has
been expanded and elaborated upon.
3.1 Introduction
During everyday life we often need to make judgements that require combinations
of facts which were acquired separately, possibly at different times and places. For
instance, imagine walking your daughter to a coding summer camp and encounter-
ing another little girl with a woman. You might conclude that the woman is the
mother of the little girl. A few weeks later, at a coffee shop near your house, you
see the same little girl, this time with a man. Based on these two separate episodes you might infer that there is a relationship between the woman and the man. This flexible recombination of single experiences in novel ways is called inferential reasoning, and this task is thought to capture the essence of reasoning: the appreciation of distant relationships among elements distributed across multiple facts or memories [248].
Interestingly, there is mounting evidence that the hippocampus is critical to the building of cognitive maps of abstract spaces that create links between associations experienced at different times, supporting the kind of inferential reasoning
outlined above. For instance, in one study rats were trained to perform the so called
paired associative inference task (PAI). Here the animals had to learn to link ran-
domly paired objects (e.g. A-B and B-C) and later they were required to infer the indirect relationship between objects that had never been experienced together (A-C). All animals were able to learn the direct associations, but only rats with intact hippocampi succeeded in the inference trial [249]. In a subsequent study, rats were trained on a set of overlapping choice trials (A vs. B, B vs. C, C vs. D and D vs. E) and then tested on transitive inference judgements (e.g. B vs. D). As before, all rats learned the direct judgements, but only the ones with intact hippocampi were able to perform the longer-range inference tests [250]. These results
have also been replicated in humans [160, 251] confirming that the hippocampus is
needed for creating the associations and hierarchical relations which form the basis
of our complex reasoning skills. However, the involvement of the hippocampus in
linking events experienced at different times seems to be in conflict with its role
in supporting the ability to distinguish similar events, a crucial aspect of episodic
memory [252, 106]. Indeed, several computational models highlight the contribu-
tion of the hippocampus in the so called pattern separation process, whereby similar
events are stored in orthogonal patterns to avoid interference [253, 177].
A recent line of research [254, 245, 255, 256] sheds light on this tension –
i.e. how can separated memories be chained together? In particular, it has been
shown that the integration of separated experiences emerges at the point of re-
trieval through a recurrent mechanism that allows multiple pattern separated codes
to interact, and therefore support inference. Notably, a recently published computational model, Recurrency and Episodic MEmory REsults in Generalization
(REMERGE) [254, 245], provides two clear principles for how the recurrent mech-
anism might be implemented in the hippocampal circuit. First, memories remain
stored in the hippocampus as separated codes to preserve information about their
constituent elements, allowing greater flexibility at the time of retrieval. Second, in
REMERGE the single retrieved memories are recirculated as a new input to the hip-
pocampal circuit, a process that continues until the network has settled into a fixed
point. Contingent on the difficulty of a given task, the number of memory re-circulations required for the network to settle to a fixed point varies: hence computation in REMERGE is of variable length. Recent studies employing a paired-associate inference
task found empirical support for this account in human behavior [245] and neural
data [256].
Based on these findings we set out to create a new neural network architecture,
called MEMO, which introduces a new multistep retrieval mechanism that supports
the flexible weighting of individual elements in memory. This is achieved through
a powerful recurrent attention mechanism which adapts the number of memory re-
trieval operations based on the difficulty of the task at hand. Both aspects will be described in detail in the next section.
We also used bAbI, a standard machine learning test for text understanding and reasoning (see details in 3.2.8), to investigate what kind of memory representations effectively support memory-based reasoning. End-to-end memory networks (EMN) and other similar models [144, 148, 147] have used fixed memory representations based on combining word embeddings with a positional encoding transformation. A similar approach has recently been implemented by current state-of-the-art language models [90, 68]. By contrast our approach, called MEMO, retains the full set of facts in
memory, and then learns a linear projection paired with a powerful recurrent at-
tention mechanism that enables greater flexibility in the use of these memories.
MEMO is based on the same basic structure of the external memory presented in
EMN [144]. However, its new architectural components can potentially allow for
flexible weighting of individual elements in memory, thereby supporting the form of inferential reasoning outlined above.
A related line of work concerns the use of REINFORCE [262] to learn a discrete latent variable which dynamically
adjusts the number of computation steps. This has been applied to recurrent neural
networks where each layer decides whether or not to activate the next layer [263].
REINFORCE has also been used to learn how many steps to "jump" in a sequence, thereby reducing the total number of processed inputs [264]. This jump technique has also been applied to recurrent neural networks without the need for REINFORCE [265].
3.2 Methods
In this section we recapitulate End-to-End Memory Networks [144]; we then introduce MEMO and two further baselines used for comparison: Differential Neural Computer [141] and Universal Transformer [257]. We then describe the tasks used in the experiments, Paired Associative Inference (PAI), Shortest Path Graph Traversal and bAbI [259], together with the training regimes.
An example where the network takes 3 hops. In MEMO the input sequence {x0, ..., xT} is embedded into the memory slots using a series of linear projections, where T is the length of the sequence. q0 represents the original query and is used to retrieve a slot h0 from memory. The retrieved slot is combined with the original query using a residual connection, and the result of this operation is a vector that defines a proposed answer q1. Based on this proposed answer the network decides, using REINFORCE [262], whether another memory query is necessary or the retrieved content is sufficient to answer; in the latter case the network stops the memory re-circulation and an answer a is given (see 3.2.2 for the mathematical details).
Following the notation of bAbI [259] (see Methods 3.2.8 and Fig. 3.4 for details on bAbI), let I be the number of stories and S the number of words in each sentence of a story; $x_{is}$ will be the word in position s of the sentence in the i-th story, an O-dimensional one-hot vector encoding one of O possible input words.
$$v_i = \sum_{s} W_v x_{is} \qquad (3.2)$$
$$q_0 = W_q q \qquad (3.3)$$
where $W_k, W_v \in \mathbb{R}^{d \times O}$ and $W_q \in \mathbb{R}^{d \times S}$ are embedding matrices for the keys, values and query, respectively. Here $l_s$ is a positional encoding column vector (as defined in [144]), "·" represents element-wise multiplication, and O is the size of the vocabulary.
At each step t, EMN calculates the vector of weights over the memory elements
ki and produces the output. Let K be the I × d matrix formed by taking ki as its rows,
and similarly V formed by vi as rows, then:
$$w_t = \mathrm{softmax}(K q_t) \qquad (3.4)$$
where $w_t \in \mathbb{R}^{I}$ are the weights over the memory slots, $W_{qv}, W_a \in \mathbb{R}^{d \times d}$ are linear mappings relating the query at the previous step to the current one, $q_{t+1}$ is the query to be used at the next step, and $a_t$ is the answer (usually only produced right at the end). EMN is trained via a cross-entropy loss on $a_t$ at the final step.
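A single EMN hop can be sketched as follows. Because the update equations for $q_{t+1}$ and $a_t$ are not reproduced above, the residual form of the query update is an assumption:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def emn_step(K, V, q_t, W_qv):
    """One EMN memory hop (eq. 3.4 plus the update described above).

    K, V:  key and value matrices, shape [I, d]
    q_t:   current query, shape [d]
    W_qv:  d x d matrix relating the previous query to the next one
    """
    w_t = softmax(K @ q_t)        # weights over the I memory slots
    o_t = w_t @ V                 # weighted read-out from the value slots
    q_next = W_qv @ (q_t + o_t)   # residual query update (assumed form)
    return q_next, w_t
```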
3.2.2 MEMO
MEMO embeds the input differently. First, a common embedding $c_i$, of size $S \times d_c$, is derived for each input matrix $x_i \in \mathbb{R}^{S \times O}$:
$$c_i = x_i W_c \qquad (3.7)$$
$$k_i^{(h)} = W_k^{(h)} \mathrm{vec}(c_i) \qquad (3.8)$$
$$v_i^{(h)} = W_v^{(h)} \mathrm{vec}(c_i) \qquad (3.9)$$
$$q_0^{(h)} = W_q^{(h)} q \qquad (3.10)$$
The attention mechanism used by MEMO also differs from that shown above for EMN. Firstly, it uses multi-head attention instead of a single head. Secondly, we use DropOut [202] and Layer Normalisation [91] to improve generalisation and learning dynamics. With Dropout, a fraction of layer units, in our case 50% (unless otherwise stated), are randomly silenced during the training phase.
Dropout has the effect of making the training process noisy, forcing units within
a layer to probabilistically respond to their inputs. The idea behind this technique
is to reduce correlation amongst units in the network, in turn making the model
more robust to overfitting. Layer normalisation (LayerNorm) is a technique to nor-
malise the distributions of intermediate layers. It enables smoother gradients, faster
training, and better generalisation accuracy.
Let $K^{(h)} \in \mathbb{R}^{I \times d}$ denote the matrix formed by taking each $k_i^{(h)}$ as a row, and $V^{(h)}$ be the matrix formed by taking each $v_i^{(h)}$ as a row. In contrast, let $Q_t \in \mathbb{R}^{H \times d}$ be the matrix formed by taking each $q_t^{(h)}$ as a row. The attention mechanism then becomes:
$$h_t^{(h)} = \frac{1}{\sqrt{d}} W_h K^{(h)} q_t^{(h)} \qquad (3.11)$$
$$w_t^{(h)} = \mathrm{DropOut}(\mathrm{softmax}(h_t^{(h)})) \qquad (3.12)$$
$$q_{t+1}^{(h)} = w_t^{(h)} V^{(h)} \qquad (3.13)$$
where $W_h \in \mathbb{R}^{I \times I}$ and $W_q \in \mathbb{R}^{Hd \times Hd}$ are matrices for transforming the logits and queries respectively, while $W_a \in \mathbb{R}^{O \times d_a}$ and $W_{qa} \in \mathbb{R}^{d_a \times Hd}$ are the matrices of the output MLP that produces the answer $a_t$. It is worth noting that even though our attention mechanism uses some of the features implemented in [90] (i.e. the normalisation factor $\sqrt{d}$ and multi-head attention), it differs in that, rather than performing self-attention, it keeps the query separate from the keys and values. This aspect is particularly important in terms of computational complexity: MEMO is linear with respect to the number of input sentences, whereas methods relying on self-attention scale quadratically with it.
$$h_t = \sigma(\pi_t) \qquad (3.18)$$
This network is trained using REINFORCE [262]. More concretely, the parameters $\theta$ are adjusted using n-step look-ahead values, $\hat{R}_t = \sum_{i=0}^{n-1} \gamma^{i} r_{t+i} + \gamma^{n} V(s_{t+n}, \theta)$, where $\gamma$ is the discount factor. The objective of this network is to minimise $L_{Hop\text{-}Net} = L_{\pi} + \alpha L_{V} + \beta L_{Hop}$, where $L_{\pi} = -\mathbb{E}_{s_t \sim \pi}[\hat{R}_t]$, $L_{V} = \mathbb{E}_{s_t \sim \pi}\big[(\hat{R}_t - V(s_t, \theta))^2\big]$ and $L_{Hop} = \mathbb{E}_{s_t \sim \pi}[\pi(\cdot|s_t, \theta)]$. Interestingly, $L_{Hop}$ follows directly from the fact that $\pi$ is a binary policy: the expectation of a binary random variable is its probability, and the expectation of a sum is the sum of the expectations. Consequently, the new term we introduce in the loss, $L_{Hop}$, allows us to directly minimise the expected number of hops.
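A sketch of the hop-network objective; the REINFORCE-style advantage weighting inside $L_{\pi}$ is an assumption, as the text gives only the high-level decomposition of the loss:

```python
import numpy as np

def hop_net_loss(hop_probs, rewards, values, gamma=0.99, alpha=0.5, beta=1e-2):
    """Sketch of L_HopNet = L_pi + alpha*L_V + beta*L_Hop.

    hop_probs: probability of taking another hop at each step, shape [T]
    rewards:   per-step rewards r_t, shape [T]
    values:    value estimates V(s_t, theta), shape [T]
    """
    T = len(rewards)
    returns = np.zeros(T)
    acc = values[-1]                   # bootstrap from the last value estimate
    for t in reversed(range(T)):
        acc = rewards[t] + gamma * acc
        returns[t] = acc
    adv = returns - values
    l_pi = -np.mean(np.log(hop_probs) * adv)   # REINFORCE term (assumed form)
    l_v = np.mean(adv ** 2)                    # value regression term
    l_hop = np.mean(hop_probs)                 # expected number of hops
    return l_pi + alpha * l_v + beta * l_hop
```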
where a is the target answer associated with the input and â is the prediction from the network. The final layer of MLP_R was initialised with bias_init, in order to increase the chances that π initially produces a probability of 1 (i.e. completes more than one hop). Finally, we set a maximum number of hops, N, that the network could take.
If N was reached, the network stopped performing additional hops. Critically, there
was no gradient sharing between the hop network and the main MEMO network
explained above. All model hyperparameters are reported in Appendix B.4.
3.2.4 Baselines
3.2.4.1 EMN
For EMN, please refer to 3.2.1; the hyper-parameters used are the same as for MEMO.
The panel on the left illustrates a memory store filled with random pairs of images. The panels to the right illustrate (from left to right) two 'direct' queries (A-B and B-C), where no inference is required, and an 'indirect' query (A-C), where inference is required.
In this work we introduced a task, derived from neuroscience [249], to carefully probe the reasoning capacity of neural networks. This task is thought to capture the essence of reasoning: the appreciation of distant relationships among elements distributed across multiple facts or memories. This process is formalized in a prototypical task widely used to study the role of the hippocampus in generalization: the paired associative inference (PAI) task [249, 245, see Fig. 3.2]. Here, two images are randomly associated together. For example, analogous to seeing a little girl with a woman as in the example in the introduction, in the PAI task the agent (a human participant or an artificial neural network) would be presented with an image A, e.g. a woman, and an image B, e.g. a girl, side by side. Later, in a separate event, the agent would be exposed to a second pair: the image B again, but this time paired with a new image C, e.g. a man. This is analogous to seeing the little girl a second time with a different person. During test time two types of query can be asked: direct
and indirect queries. Direct queries are a test of episodic memory as the answer relies on
retrieving an episode that was experienced. In contrast, indirect queries require inference
across multiple episodes. Here the network is presented with a CUE, image A, and two possible choices: image C, the MATCH, which was originally paired with B; or another image C′, the LURE, which was paired with B′ (i.e. forming part of a different triplet A′-B′-C′). The right answer, C, can only be produced by appreciating that A and C are linked because they both were paired with B. This is analogous to the insight that the two people walking with the same little girl are likely to have some form of association.
To make this task challenging for a neural network we started from the ImageNet
dataset [272]. We created three sets (training, validation, and test) which used the images from the respective three sets of ImageNet, to avoid any overlap. All images were embedded using a pre-trained ResNet [92]. We generated 3 distinct datasets with sequences
of length three (i.e. A − B −C), four (i.e. A − B −C − D) and five (i.e. A − B −C − D − E)
items. Each dataset contains 1e6 training images, 1e5 evaluation images and 2e5 testing
images. Each sequence was randomly generated with no repetition in each single dataset.
To explain how a batch was built, we take an example with sequences of length S = 3. Each batch entry is composed of a memory, a query and a target. In order
to create a single entry in the batch we selected N sequences from the pool, with N = 16.
First, we created the memory content with all the possible pairwise associations between
the items in the sequence, e.g. A1 B1 and B1 C1 , A2 B2 and B2 C2 , ..., AN BN and BN CN .
For S = 3, this resulted in a memory with 32 rows. Then we generated all the possible
queries. Each query consists of 3 images: the cue, the match, and the lure. The cue is
an image from the sequence (e.g. A1 ), as is the match (e.g. C1 ). The lure is an image
from the same memory set but from a different sequence (e.g. C7 ). There are two types of
queries - ’direct’ and ’indirect’. In ’direct’ queries the cue and the match can be found in
the same memory slot, so no inference is required. For example, the sequence A1 - B1 - C1
produces the pairs A1 - B1 and B1 - C1 which are stored in different slots in memory. An
example of a direct test trial would be A1 (cue) - B1 (match) - B3 (lure). Therefore, ’direct’
queries are a test of episodic memory as the answer relies on retrieving an episode that
was experienced. In contrast, ’indirect’ queries require inference across multiple episodes.
For the previous example sequence, the inference trial would be A1 (cue) - C1 (match)
- C3 (lure). The queries are presented to the network as a concatenation of three image
embedding vectors (the cue, the match, and the lure). The cue is always in the first position
in the concatenation, but to avoid any degenerate solution, the position of the match and
lure are randomised. It is worth noting that the lure image always has the same position in
the sequence (e.g. if the match image is a C the lure is also a C) but it is randomly drawn
from a different sequence that is also present in the current memory. This way the task can
only be solved by appreciating the correct connection between the images, and this needs to be done while avoiding the interference coming from other items in memory. For instance, in Fig. 3.2 the items C1 and C2 both come from the same memory store, but from different
slots. For each entry in the batch we generated all possible queries that the current memory
store could support, and then one was selected at random. The batch was balanced, i.e. half of the elements were direct queries and the other half indirect. The targets that the network needed to predict were the classes of the match images.
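The construction of a single batch entry for length-3 sequences can be sketched as follows; names, shapes and the use of a sequence index as a stand-in class label are illustrative assumptions:

```python
import numpy as np

def build_pai_entry(pool, n_seq=16, seed=0):
    """Build one PAI batch entry (memory, query, label) for sequences A-B-C.

    pool: embedded sequences, shape [num_sequences, 3, emb_dim]
    """
    rng = np.random.default_rng(seed)
    seqs = pool[rng.choice(len(pool), n_seq, replace=False)]
    # memory: all pairwise associations A_n-B_n and B_n-C_n -> 32 rows
    memory = np.concatenate([
        np.concatenate([seqs[:, 0], seqs[:, 1]], axis=1),   # A-B pairs
        np.concatenate([seqs[:, 1], seqs[:, 2]], axis=1)])  # B-C pairs
    # one indirect query: cue A_i, match C_i, lure C_j from another sequence
    i, j = rng.choice(n_seq, 2, replace=False)
    cue, match, lure = seqs[i, 0], seqs[i, 2], seqs[j, 2]
    if rng.random() < 0.5:            # randomise match/lure positions
        query = np.concatenate([cue, match, lure])
    else:
        query = np.concatenate([cue, lure, match])
    return memory, query, i           # i stands in for the match's class label
```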
It is worth mentioning that longer sequences provide more 'direct' queries, but also multiple 'indirect' queries that require different levels of inference: e.g. the sequence A1 - B1 - C1 - D1 - E1 produces the 'indirect' trial A1 (cue) - C1 (target) - C3 (lure) with 'distance' 1 (one pair apart) and A1 (cue) - E1 (target) - E3 (lure) with 'distance' 4 (4 pairs apart). The latter is obviously harder, as more inference steps are required to appreciate the overlap between memories. Finally, the inputs were used as follows:
• For EMN and MEMO, the memory content and the query are used as direct inputs to
the model (see 3.2.2 for details).
• In the case of Differential Neural Computer (section 3.2.4.2), we embed stories and
query in the same way as for MEMO. Memory and query are presented in sequence
to the model (in that order), followed by blank inputs as pondering steps to provide a
final prediction.
• For Universal Transformer, we embed stories and queries in the same way as for
MEMO. Then we used the encoder of Universal Transformer with architecture de-
scribed in Section 3.2.4.3.
Graph generation. In the shortest path experiments, we generate graphs in the same fashion as [141]: the graphs used to train the networks are generated by uniformly sampling a set of two-dimensional points from a unit square, each point corresponding to a node in the graph. For each node, the K nearest neighbours in the square are used as the K outbound connections, with K independently sampled from a uniform range for each node (see an example in Fig. 3.3).
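A sketch of this graph-sampling procedure; the node count and the range for K are illustrative values, not those used in the experiments:

```python
import numpy as np

def sample_graph(n_nodes=20, k_range=(2, 4), seed=0):
    """Sample a graph as in the shortest-path task: nodes are uniform 2-D
    points, each connected to its K nearest neighbours, with K drawn
    uniformly per node from k_range."""
    rng = np.random.default_rng(seed)
    pts = rng.uniform(size=(n_nodes, 2))
    d = np.linalg.norm(pts[:, None] - pts[None, :], axis=-1)
    np.fill_diagonal(d, np.inf)          # exclude self-connections
    edges = []
    for i in range(n_nodes):
        k = rng.integers(k_range[0], k_range[1] + 1)
        for j in np.argsort(d[i])[:k]:
            edges.append((i, int(j)))    # (source, destination) tuple
    return pts, edges
```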
Graph representation. We represent our task in three parts: a graph description, a query, and the target. The graph description is presented as a sequence of tuples of integers that represent a connection between nodes, holding a token for the source node and another for the destination node. The query is also represented as a tuple of integers, although, in that case, source and destination are simply the beginning and end of the path to find. The target is the sequence of node IDs that constitute the path between the source and destination of the query.
When training, we sample a mini-batch of 64 graphs, with associated queries and
target paths. Following our description above, queries are represented as a matrix of size
64 × 2, targets are of size 64 × (L − 1), and graph descriptions are of size 64 × M × 2, where
L is the length of the shortest path, and M is the number of maximum nodes we allow one
graph description to have. In our experiments, we fix the upper bound M to be the maximum
number of nodes that we have multiplied by the out-degree of the nodes in the graph.
All networks were trained for 2e4 epochs, each formed of 100 batch updates.
• For EMN and MEMO, we set the graph description to be the contents of their mem-
ory, and we use the query as input. In order to answer the sequence of nodes that
is used as the target, we keep the keys $k_i^{(h)}$ and values $v_i^{(h)}$ fixed, and we proceed to use our algorithm as described for each answer, with an independent number of hops for each one. The model then predicts the answers for the nodes sequentially: the first node is predicted before the second. However, one important difference between MEMO and EMN is that for EMN we use the ground-truth answer for the first node as the query for the second node, whereas for MEMO we use the answer predicted by the model for the first node as the query for the second node. This was done to enhance the performance of EMN while testing the real capability of MEMO to reason sequentially over multi-step problems. The weights that are used for each answer are not shared.
• For the Universal Transformer, we also embed the query and graph description as
done for EMN and MEMO. After that, we concatenate the embeddings of query
and graph description and use the encoder of the Universal Transformer architecture
(with the specific description in Section 3.2.4.3). We use its output as the answer. After providing an answer, that answer is used as the initial query for the following round of hops. The weights that are used for each answer are not shared.
• For Differential Neural Computer, we also embed the query and graph description as
done for EMN and MEMO. Since it is naturally a sequential model, the information
is presented differently: the tuples of the graph description are presented first, and
after that the query tuple is presented. Finally, the pondering steps are used to output the sequence of nodes that constitute the proposed shortest path.
The models were trained using Adam with a cross-entropy loss against all the sampled target sequences. Training is done for a fixed number of steps, detailed in
Appendix Section B.4.
For evaluation, we sample a batch of 600 graph descriptions, queries, and targets. We
evaluate the mean accuracy over all the nodes of the target path. We report average values and standard deviations over the best 5 hyper-parameter settings.
It is worth noting that given this training regime:
• Differential Neural Computer and Universal Transformer have a global view of the problem when providing an answer for the second node. This means that, to answer the second node in the path, they can still reason and work backwards from the end node while still having information about the initial node in the path. This makes it easier for them to achieve good performance on the second node, as it is closest to the end node of the path, so less reasoning is needed.
• On the contrary, MEMO has a local view of the problem: the answer for the second node depends on its answer for the first node. Therefore, it cannot exceed chance if the answer for the first node is not correct.
a. Example of a bAbI story from the task called "single supporting fact". To answer the query only one supporting sentence is needed, but the network is required to ignore distractors. b. Example of a bAbI story from the task called "counting".
One of the main purposes of research on natural language processing is to design a system that can generically perform a set of question-answering problems. Following this spirit, the bAbI (not an acronym) tasks are a synthetic dataset of 20 tasks, released by the Facebook AI Research team, that helps evaluate systems in this domain [259] (see an example of two stories from two different tasks in Fig. 3.4).
For this experiment we used the English Question Answer dataset [259]. We use the
training and test datasets that they provide with the following pre-processing:
• Commas only appear in answers, and they are not ignored. This means that, e.g.
for the path finding task, the answer ’n,s’ has its own independent label from the
answer ’n,w’. This also implies that every input (consisting of ’query’ and ’stories’)
corresponds to a single answer throughout the whole dataset.
• All the questions are stripped out from the text and provided separately (given as
”queries” to our system).
At training time, we sample a mini-batch of 128 queries from the training dataset, as well as their corresponding stories (which consist of the text prior to the question). As a result, the queries are a matrix of 128 × 11 tokens, and sentences are of size 128 × 320 × 11, where 128 is the batch size, 320 is the max number of stories, and 11 is the max sentence size.
We pad with zeros every query and group of stories that do not reach the max sentence and
stories size.
• For EMN and MEMO, the memory content and the query are used as direct inputs to
the model (see 3.2.2 for details).
• In the case of Differential Neural Computer, we embed stories and queries in the same
way as for MEMO. Stories and queries are presented in sequence to the model (in
that order), followed by blank inputs as pondering steps to provide a final prediction.
• For Universal Transformer, we embed stories and queries in the same way as for
MEMO. Then, we use the encoder of Universal Transformer with architecture de-
scribed in Section 3.2.4.3. We use its output as the output of the model.
After the mini-batch is sampled, we perform one optimization step using Adam for
all the models that we have run in our experiments. The hyper parameters are detailed in
Appendix Section B.4. We stop after a fixed number of time-steps, as also detailed in B.4.
Many of the tasks in bAbI require some notion of temporal context; to account for this in MEMO we added a column vector to the memory store. This vector adds a piece of information to each word embedding about its position in the sentence. It is called the positional encoding and, following [90], is defined in the following way. Let t be the desired position in an input sentence, $\vec{p}_t \in \mathbb{R}^{d}$ be its corresponding encoding, and d be the encoding dimension (i.e. the number of memory slots in MEMO). Then $f : \mathbb{N} \rightarrow \mathbb{R}^{d}$ is the function that produces the positional vector $\vec{p}_t$, defined as follows:
$$\vec{p}_t(i) = f(t)(i) := \begin{cases} \sin(\omega_k t), & \text{if } i = 2k \\ \cos(\omega_k t), & \text{if } i = 2k+1 \end{cases} \qquad (3.19)$$
where $\omega_k = \frac{1}{10000^{2k/d}}$.
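Equation (3.19) translates directly into code (assuming an even encoding dimension d):

```python
import numpy as np

def positional_encoding(t, d):
    """Sinusoidal positional vector p_t of eq. (3.19) for position t and
    (even) encoding dimension d."""
    p = np.zeros(d)
    k = np.arange(d // 2)
    omega = 1.0 / (10000 ** (2 * k / d))
    p[0::2] = np.sin(omega * t)   # even indices i = 2k
    p[1::2] = np.cos(omega * t)   # odd indices i = 2k + 1
    return p
```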
All networks were trained for 2e4 epochs, each formed of 100 batch updates. For evaluation, we sample a batch of 10,000 elements from the dataset and compute the forward pass in the same fashion as in training. With that, we compute the mean accuracy over those examples, as well as the accuracy per task for each of the 20 tasks of bAbI. We report average values and standard deviations over the best 5 hyper-parameter settings.
3.3 Results
In this section we present the results on three different tasks: Paired associative inference,
shortest path graph traversal and bAbI question answering.
Test results for the best 10 hyper-parameter settings (chosen by validation loss) for MEMO; for EMN, Differential Neural Computer (DNC) and Universal Transformer (UT), results are for the best run (chosen by validation loss). Scores represent percentage accuracy on the test set; chance level is 50%. For MEMO we also report, in brackets, the standard deviation over the best 10 hyper-parameter settings.
Table 3.1 reports the summary of results of our model (MEMO) and the other baselines on the hardest inference query for each of the PAI tasks. On the smallest set, i.e. A-B-C, MEMO was able to achieve the highest accuracy together with the Differential Neural Computer, whereas EMN, even with 10 hops, was not able to achieve the same level of accuracy; the Universal Transformer was also unable to solve this inference test accurately. For longer sequences, i.e. lengths 4 and 5, MEMO was the only architecture which successfully answered the most complex inference queries, with an accuracy of 84.54 (SD = 5.72), whereas all the other architectures performed almost at chance level.
Table 3.2: PAI task of length 3 (accuracy, %).

Trial Type   EMN     DNC     UT      MEMO
A-B          98.19   98.58   97.43   99.82 (0.30)
B-C          97.93   99.34   98.28   99.76 (0.38)
A-C          61.01   96.85   85.60   98.26 (0.67)

Table 3.3: PAI task of length 4 (accuracy, %).

Trial Type   EMN     DNC     UT      MEMO
A-B          96.31   94.26   99.32   99.57 (0.20)
B-C          97.57   84.94   88.31   99.33 (0.13)
C-D          96.59   95.68   93.37   99.58 (0.13)
A-C          48.71   49.38   54.87   98.93 (0.15)
B-D          47.42   49.89   51.92   99.14 (0.19)
A-D          48.63   58.63   44.16   97.22 (0.13)
Tables 3.2, 3.3, 3.4 report the full set of results on both direct and indirect trials for
length 3, 4, and 5 respectively.
Table 3.4: PAI task of length 5 (accuracy, %).

Trial Type   EMN     DNC     UT      MEMO
A-B          95.68   98.88   96.94   99.20 (0.43)
B-C          95.82   94.60   92.63   98.93 (0.17)
C-D          95.43   95.20   89.99   97.27 (0.21)
D-E          95.16   95.98   97.27   95.06 (0.12)
A-C          48.68   48.66   41.85   87.33 (0.12)
B-D          45.75   46.87   39.62   86.65 (1.27)
C-E          49.46   49.51   35.87   87.08 (0.92)
A-D          52.08   50.32   52.38   86.12 (0.57)
B-E          46.69   52.27   43.27   86.37 (0.77)
A-E          48.31   48.79   47.93   84.54 (5.72)
Tables 3.2-3.4: test results for the best 10 hyper-parameters (chosen by validation loss) for MEMO; for EMN, the Differential Neural Computer (DNC), and the Universal Transformer (UT), results are from the best run (chosen by validation loss). Scores represent the percentage accuracy on the test set; chance level is 50%. For MEMO we also report, in brackets, the standard deviation over the best 10 hyper-parameters.
The histogram reports the 36 different sets of hyper-parameters used for MEMO that solved the inference trials successfully (accuracy above 95%), out of a total of 48 hyper-parameter settings run.
To further investigate how these results were achieved, we ran further analyses on the length 3 PAI task. Interestingly, the Differential Neural Computer required 10 pondering steps to solve the inference trials, whereas MEMO converged on a median of 3 hops (see Fig. 3.5). To understand how MEMO approached this task we then analysed the attention weights of an inference query, where the goal was to associate a CUE with the MATCH and avoid the interference of the LURE (see 3.2.6 for task details); we consistently found the same pattern of results, of which an example is presented in Fig. 3.7.
For clarity we report here the original sequence A-B-C, composed of the following class IDs: 611-191-840 (this sequence was not directly experienced together by the network, as the two associations A-B and B-C were stored in slots 10 and 25, respectively). As depicted in Figure 3.7, in the first hop MEMO retrieved the memory in slot 10, which contained the CUE, ID 611, and the associated item, ID 191, forming an A-B association. In the following hop this slot remained partially active, but most of the mass was placed on slot 16, which contained the memory association B-C, that is, ID 191 and ID 840, the MATCH. Interestingly, slot 13, which was associated with the LURE, ID 943, also received some probability mass. Hence, in this second hop MEMO assigned appropriate probability masses to all the slots needed to support a correct inference decision, which was then confirmed in the last hop. This sequence of memory activations is reminiscent of those predicted by computational models of the hippocampus [254, 255] and observed in neural data [256]. Moreover, another instance of MEMO, which used 7 hops to solve the task with the same level of accuracy, presented a very different pattern of memory activation (see Fig. B.1 in Appendix B.1.1). This indicates that the algorithm used to solve the inference problem depends on how many hops the network takes. This aspect could also be related to knowledge distillation in neural networks [273, 274], whereby many hops are used initially to solve the task (i.e. over-parametrisation) and are then automatically reduced to use less computation (see Fig. 3.6).
a) Evaluation accuracy on the inference trial A-C. b) Number of hops taken during training. c) Distribution of evaluation accuracy obtained by averaging direct queries (A-B and B-C), over 100 different hyper-parameters and seeds. d) Same as c, but for the inference queries (A-C).
To better understand which parts of MEMO were critical to its performance, we also ran a set of ablation experiments, the results of which are presented in Table 3.5. This analysis confirmed that it is the combination of the specific memory representation (i.e. facts kept separated) and the recurrent attention mechanism that supports successful inference: employing these two components individually was not enough. Interestingly, this conclusion held only for inference queries, not for direct queries (see Fig. 3.6c,d). Indeed, by definition, direct queries are a pure test of episodic memory and so can be solved with a single memory look-up. Finally, we also compared our adaptive computation mechanism with ACT [260] and found that, for this task, our method was more data efficient (see Fig. B.2 in Appendix B.1.2).
Figure 3.7: Weights analysis of an inference query in the length 3 PAI task.
An example of memory content and the related inference query is reported in the first column on the left. For clarity we report image class IDs. Cue and Match are images from the same sequence, e.g. A10-C10, where 10 is the slot ID. The lure is an image present in the same memory store but associated with a different sequence, e.g. C13. The three rightmost columns report the weights associated with the 3 hops used by the network; for each probability mass we report the associated retrieved slot.
Table 3.5: test results for the best 10 hyper-parameters (chosen by validation loss) for MEMO. Scores represent the percentage accuracy on the test set for the A-C inference trial. Chance level is 50%. Standard deviations are reported in brackets.
Table 3.6: shortest path results. Test results for the best 5 hyper-parameters (chosen by training loss) for MEMO; mean and corresponding standard deviation are reported. For EMN, the Universal Transformer (UT), and the Differential Neural Computer (DNC) we report results from the best run.

                                Prediction of First Node              Prediction of Second Node
Nodes  Out-degree  Path length  EMN    UT      DNC     MEMO           EMN    UT     DNC    MEMO
10     2           2            94.99  100.00  100.00  100.00 (0.00)  n/a    n/a    n/a    n/a
20     3           3            31.99  39.00   97.00   94.40 (0.02)   66.00  84.80  98.00  93.00 (0.03)
20     5           3            23.99  28.00   30.00   69.20 (0.07)   43.00  61.51  40.99  68.80 (0.09)
Table 3.6 reports the results on the task of finding the shortest path between two nodes. On a small graph with 10 nodes, a path length of 2, and 2 outgoing edges per node, the Differential Neural Computer, the Universal Transformer, and MEMO achieved perfect accuracy in predicting the intermediate shortest path node. However, on more complicated graphs (20 nodes, 3 outgoing edges) with a path length of 3, MEMO outperformed EMN in predicting the first node of the path (94.40% vs. 31.99%, with chance level being 20%) and, like the Differential Neural Computer, almost completely solved the task. Additionally, MEMO outperformed the Differential Neural Computer on more complicated graphs with a high degree of connectivity (out-degree 5), being better by more than 20% at predicting both nodes in the shortest path. This showed the scalability of MEMO: the model was able to iterate and consider more paths as the number of hops increased.
To better compare the performance of MEMO against EMN, we ran additional experiments to test the models in two further conditions:
• The ground truth answer of the first node was used as the query for the second node.
• The answer predicted by the model for the first node was used as the query for the
second node.
The results are summarized in Table 3.7. In the case of 20 nodes with 5 outbound edges, we can see that if we give MEMO the ground truth for node 1 as the query for node 2, the performance increases relative to that obtained when predicting the first node (85.38% (0.05) vs. 69.20% (0.07)). Interestingly, if we apply to EMN the same training regime used for MEMO, i.e. the prediction is used to query the second node, then EMN performs almost at chance level (22.30% vs. a chance level of 20%), down from the 43.00% obtained when using the ground truth answer.
Table 3.7: comparing results on the second node based on using the ground truth or the predicted answer for the first node. Test results for the best 5 hyper-parameters (chosen by training loss) for MEMO; mean and corresponding standard deviation are reported. For EMN we report results from the best run.

                                Prediction of First Node   Prediction of Second Node
Nodes  Out-degree  Path length  EMN    MEMO                EMN             EMN                 MEMO            MEMO
                                                           (ground truth)  (predicted answer)  (ground truth)  (predicted answer)
20     5           3            23.99  69.20 (0.07)        43.00           22.30               85.38 (0.05)    68.80 (0.09)
Finally, we turn our attention to the bAbI question answering dataset [259], which consists of 20 different tasks. In particular, we trained our model on the joint 10k training set (training specifics are reported in 3.2.8).
Table 3.8: average error and, in parentheses, the number of failed tasks (> 5% error) out of 20 (lower is better in both cases) on the bAbI dataset. Results are shown for the best run (chosen by validation loss), as is standard practice for bAbI. The full set of results is presented in B.2.1. Differential Neural Computer (DNC) results are from [141]; Universal Transformer results are from [257].
Table 3.8 reports the average accuracy of our model (MEMO) and the other baselines on bAbI (the accuracy for each single task, averaged across the best set of hyper-parameters, is reported in Table B.1 in Appendix B.2.1). In the 10k training regime MEMO was able to solve all tasks, thereby matching the number of tasks solved by [258, 257], but with a lower error (for single task results refer to Appendix B.1).
Results for the best run (chosen on the validation set) on the bAbI task. The model was trained and tested jointly on all tasks, and all tasks received approximately equal training resources. ✗ = not present; ✓ = present.
3.4 Discussion
In this work we employed a classic associative inference task from the memory-based reasoning neuroscience literature, the Paired Associative Inference task [249], to more carefully probe the reasoning capacity of existing artificial neural networks. This task is thought to capture the essence of reasoning: the appreciation of distant relationships among elements distributed across multiple facts or memories. Surprisingly, we found that current architectures struggle to reason over long distance associations. Similar results were obtained on a more complex task involving finding the shortest path between nodes in a graph. We therefore developed MEMO, an architecture endowed with the capacity to reason over longer distances. This was accomplished with the addition of two novel components. First, MEMO introduces a separation between the facts stored in external memory and the individual items that compose those facts. Second, it makes use of an adaptive retrieval mechanism, allowing a variable number of memory hops before the answer is produced; a minimal sketch of both ideas follows.
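As a concrete illustration of these two ideas, here is a minimal sketch (ours, not the original implementation) of recurrent attention over pattern-separated memory slots; all names are illustrative, and the fixed hop count stands in for MEMO's learned halting mechanism, described in Appendix B.6:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def multi_hop_read(memory, query, n_hops=3):
    """memory: (I, d) array with each fact's items kept in separate slots;
    query: (d,) retrieval cue. Each hop attends over all slots, reads a
    weighted memory, and updates the cue, so separately stored items can
    be recombined across hops at retrieval time."""
    for _ in range(n_hops):
        weights = softmax(memory @ query)   # attention over the I slots
        read = weights @ memory             # weighted read-out
        query = query + read                # updated cue for the next hop
    return query

rng = np.random.default_rng(0)
state = multi_hop_read(rng.normal(size=(25, 16)), rng.normal(size=16))
```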
MEMO was capable of solving our novel reasoning tasks and matched state-of-the-art results on a challenging language task, bAbI (see Methods 3.2.8 and Fig. 3.4 for an example).
One important new feature of this model is that single items remain stored separately in memory; they are not combined together using, for instance, positional embedding techniques. The idea is that preserving pattern-separated items in memory allows more flexibility at retrieval time, as these items can be recombined in novel ways to solve previously unseen queries. This intuition was confirmed by ablation experiments, which showed that it was the combination of a powerful attention mechanism and the pattern-separated codes that gave rise to generalisation at test time. In contrast, when the items in memory were merged together, even attention was not enough to solve the inference trials. On the one hand, these results suggest that the role of the hippocampus in creating pattern-separated codes to support episodic memory is not in contrast with its role in supporting generalisation [254, 245, 255, 256]. On the other hand, these findings also challenge the current machine learning practice of using a hand-coded positional embedding to merge the items in memory: we showed that this greatly reduces model flexibility.
This work also complements that presented in Chapter 2, as it is focused on the more general role of the hippocampus in supporting declarative memory rather than being narrowly focused on navigation. Indeed, as shown in Figure 3.7, the attention mechanism implemented in MEMO learned to establish relations between events experienced at different points in time, and it did so by rapidly forming associations between its inputs and reactivated relational memories. This result is in line with the view that the hippocampus is a general relational processing machine that primarily supports declarative memory [275] by creating events based on the relations among the objects that occurred together in a certain context. Events are then chained together as they happen in time to form episodes, and finally relational networks are drawn as links between events and episodes to support inference between events that have not been experienced together [252].
The spatial and declarative views of the hippocampus need not be in contrast; on the contrary, they might be underlain by a common mechanism [248]. Indeed, navigation can be framed as just a special case of memory, rather than as a distinct process that requires the hippocampus to perform ad hoc computations [248]. This idea of a common mechanism is also more in line with the view that Tolman proposed in his original work [163], where he argued that the ability of humans and other animals to perform
General Conclusions
In this work we put forward the argument that artificial neural networks are a powerful class of models that can be used to understand and explain the mechanisms that give rise to cognition. We provided two streams of evidence for this: the first used deep reinforcement learning to understand the role of grid cells in spatial navigation, whereas the second investigated inferential reasoning under the lens of memory-augmented neural networks. In this final chapter we will again use the artificial neural network framework proposed in the introduction to shed some light on the tension between the spatial and declarative views of hippocampal-entorhinal function.
One recent proposal is that place and grid cells provide a metric code for abstract reasoning, and that there is a straightforward mapping from representations of two-dimensional physical space to those of abstract spaces. Although intriguing, we do not think that such an unequivocal mapping exists, and one reason to believe so comes from research on natural language processing. Language is a very clear example of a high-level cognitive function: it is based on abstractions, and it is generally considered one of the hallmarks of human intelligence. Language has also been an active topic of research in the artificial neural network community [e.g. 84, 90]. One of the most successful frameworks in this domain is the recurrent neural network trained to predict the probability of each word in the vocabulary being the next word in a sentence. In doing so, the network normally learns to represent words as distributed representations that contain information both about the individual word and about how words relate to each other [84]. These representations not only contrast with a sparse code like that of place cells, but they also do not map onto the two-dimensional manifold required to support the hypothesis presented by Bellmund and colleagues. Possibly more compelling is the connection with the results we presented in Chapter 2, where a network very similar to the one just described for language is used to predict which location, out of all the possible ones in a room, would be the most active (a task akin to next word prediction). In this case the representations developed by the network are indeed the ones found in the spatial literature (see Chapter 2). Hence, using these two examples, we can ask what the key differences between the two tasks are that lead the same network to produce such different solutions. In both examples we have the same learning rules, the same architectures, and the same objective functions; the main difference is the dataset used, so it is worth exploring this dimension to garner further insights.
tuned for [e.g. 287, 198]. So, in cases where the space on which a task is based can be uniformly sampled, it seems that evolution found a set of factorised representations between the hippocampus and the entorhinal cortex that can indeed be mapped onto a two-dimensional manifold. This line of reasoning would also explain two recent results in which place and grid cells have been related to more conceptual tasks (also cf. [288]). The first is a brain imaging study on conceptual learning, in which subjects learned arbitrary associations along trajectories in an abstract space defined by combinations of the neck and leg lengths of a bird [246]. In the second study, rats were required to use a joystick to manipulate sound along a continuous frequency axis [289]. In both cases the analysis of neural representations revealed that the neurons involved in representing the task overlapped with the cell types normally involved in spatial reasoning, such as place and grid cells. However, a closer look at the experimental methodologies reveals that in both cases subjects were trained extensively to sample the whole continuous space of neck-leg combinations and sound frequencies, respectively. Therefore, these results could simply be explained by the fact that the designs imposed by the experimenters, which reduced the dimensionality of the problem to a few factors of variation, made the tasks close enough to spatial reasoning that the same computations were employed.
However, if we consider language as our training domain, the picture is quite different. Language is based on a finite set of words, from which it is possible to generate an infinite number of meaningful sentences [290]. In other words, language is combinatorial, which makes it impossible to uniformly sample the space of all configurations. In this case, using a code with capacity O(n^2), like that of place cells, to represent each word would be pointless, as a typical combinatorial space has O(exp(exp(n))) configurations that cannot be identified by that capacity [291]. Instead, one way to represent such a space is through distributed embeddings, where each word is represented across a variety of units, and each unit participates in encoding many words. Indeed, this is the coding scheme developed by the LSTM network when trained in the language domain. This code not only has higher capacity, but it also approximates continuity in a high-dimensional vector space, which allows relationships to be inferred across words, thus supporting interpolation-based generalisation [48]. Critically, this vector space representation does not lie on the two-dimensional manifold postulated by several authors [244, 243, 284] but is higher dimensional, showing that a computation other than the one supported by place and grid cells needs to be employed to support abstract reasoning in combinatorial settings, which is where much of our high-level cognition resides.
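A toy illustration of this coding argument (ours, not an experiment from this thesis): a sparse, place-cell-like code assigns each item its own unit, so distinct items have zero similarity and nothing lies between them, whereas a distributed code supports graded similarity and meaningful interpolation:

```python
import numpy as np

rng = np.random.default_rng(0)

sparse_a, sparse_b = np.eye(1000)[3], np.eye(1000)[7]      # one-hot codes
dist_a, dist_b = rng.normal(size=50), rng.normal(size=50)  # distributed codes

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

print(cosine(sparse_a, sparse_b))               # 0.0: no graded similarity
print(cosine(dist_a, 0.5 * (dist_a + dist_b)))  # the midpoint is meaningful
```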
This conclusion does not mean that a high-level computational framework could not relate the spatial and declarative views of the hippocampus in memory. Indeed, a few years ago, Manns and Eichenbaum [292] proposed to view this issue under an evolutionary framework that we believe could be fruitful in this respect. In their proposal the hippocampus creates conjunctive representations by combining incoming information about structural relationships from the medial entorhinal and postrhinal cortices with incoming sensory information from the perirhinal and lateral entorhinal cortices. The explanatory power of this schematic model lies in the fact that it assumes a factorisation between the computation of structural relationships and the computation representing sensory stimuli, and, as has recently been proposed [243, 276, 277], this factorisation is key for flexible generalisation, which is what gives creatures an evolutionary advantage. However, the model does not make any specific assumption about the mechanistic implementation of the factorisation, nor does it suggest that a common mechanism underlies both spatial and declarative memory, as has been suggested recently [244, 277]. We believe that such a mechanism might not exist; instead, we argue that the same circuit learns ad hoc solutions by optimising for the statistics of the task at hand. More generally speaking, such a general mechanism, or optimal solution, or well-designed principle, might not even exist in the regime where biological brains operate [293], i.e. overparametrised circuits in a high-dimensional data regime. Instead, a more conservative explanation is to consider the fitting procedure performed by biological neural networks in the light of evolution, whereby they learn ad hoc solutions for the task they are trying to solve, solutions that might be reused, adapted, or combined with others depending on how closely one domain relates to another [294].
layer of the network could be misleading, as the structure of these embeddings is a direct consequence of all these components. Consequently, we should be very careful about interpreting the structure of these embeddings as a general property of the network; rather, we should think of it as a property of the training.
The second point follows directly from the first. We believe artificial neural networks can genuinely represent a shift from the traditional way of modelling cognitive functions. Most of the work done today uses simple, interpretable models based on data collected in very controlled experiments. Artificial neural networks, instead, are complex models that can work in real-world domains, sometimes approximating human-level capability. This means that by employing these models we might finally be able to shift away from the now common reductionist approach in neuroscience [296, 18] and work with real-world tasks. However, many think that this modelling capability comes at a cost: a lack of interpretability. Indeed, one common criticism of artificial neural networks is that they are a black box [297]; that is, the input-output mapping they learn is not straightforward to interpret, and so even if they can perform the same tasks as well as we do, we will not be able to understand how they do it. On the contrary, we believe that artificial neural networks are transparent models: their architectures, their objective functions, their learning rules, the data used to train them, and all their weights are easily accessible. The reason these models are called a black box is that, as scientists, we are tightly attached to the idea that a model must develop human-interpretable representations to be useful, otherwise it has no explanatory power. We strongly disagree with this view, and we believe that, as suggested by the school of ecological psychology [298] and recently beautifully restated [294], we should put less emphasis on representations and more on the ability of neural networks to model our behaviour in naturalistic environments. We believe we are just scratching the surface of what these models can do, as we are still in a low data and computation regime, but both of these aspects are growing, and so does the ability of artificial neural networks to match biological ones.
Appendix A
Figure A.1: Linear layer spatial activity maps from the supervised learning experiment.
Spatial activity plots for all 512 units in the linear layer g_t. Units exhibit spatial activity patterns resembling grid cells, border cells, and place cells; head direction tuning was also present but is not shown.
Figure A.2: Grid-like units did not emerge in the linear layer when dropout was not applied.
Linear layer spatial activity maps (n=512) generated from a supervised network
trained without dropout. The maps do not exhibit the regular periodic structure
diagnostic of grid cells.
Figure A.3: Robustness of grid cell agent and performance of other agents.
a-c) AUC performance gives the robustness to hyperparameters (i.e. learning rate, baseline
cost, entropy cost - see Table 2 in Supplementary Methods for details of the range) and seeds
(see Methods). For each environment we run 60 agent replicas (see Methods). Light purple
is the grid agent, blue is the place cell agent and dark purple is A3C. a) Square arena b) Goal-
driven c) Goal Doors. In all cases the grid cell agent shows higher robustness to variations in hyper-parameters and seeds. d-i) Performance of place cell prediction/NavMemNet/DNC agents (see Methods) against the grid cell agent. Dark blue is the grid cell agent (Extended
Data Figure 5), green is the place cell prediction agent (Extended Data Figure 9a), purple
is the DNC agent, light blue is the NavMemNet agent (Extended Data Figure 9b). The
gray band displays the 68% confidence interval based on 5000 bootstrapped samples. d-f)
Performance in goal-driven. g-i) Performance in goal-doors. Note that the performance of
the place cell agent (Extended Data Figure 8b, lower panel) is shown in Figure 3.
Figure B.2: Comparison between MEMO + REINFORCE and MEMO + ACT on the length 3 PAI task.
MEMO with REINFORCE shows greater data efficiency than the variant where the adaptive computation is done with ACT.
B.2 bAbI
B.2.1 Task-wise results
Further, it is worth noting that MEMO is linear with respect to the number of input sentences, whereas the Universal Transformer has quadratic complexity.
With respect to spatial complexity, MEMO holds all of the weights constant; only the context information needed to answer a particular query is input dependent. The spatial complexity is therefore O(I · S · d), the size of our memory. In all our experiments, this size is fixed.
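As a worked example (ours), taking I, S, and d from the bAbI column of Table B.2, and assuming d here denotes the embedding size listed there, the memory store holds:

```python
# Spatial complexity of MEMO's memory store, O(I * S * d), for the bAbI
# configuration of Table B.2 (assuming d is the embedding size listed there).
I, S, d = 320, 11, 512
n_entries = I * S * d                 # 1,802,240 entries
print(n_entries * 4 / 2**20)          # about 6.9 MiB at 4 bytes per float
```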
Table B.1: bAbI results, averaged over the 5 hyper-parameters with the lowest loss on the validation set.

Task                              MEMO      MEMO (top 5 seeds)
1 - Single Supporting Fact        100.00    100.00 (0.00)
2 - Two Supporting Facts          100.00    99.13 (1.78)
3 - Three Supporting Facts        97.05     94.15 (6.35)
4 - Two Arg. Relations            100.00    100.00 (0.00)
5 - Three Arg. Relations          100.00    100.00 (0.00)
6 - Yes/No Questions              100.00    100.00 (0.00)
7 - Counting                      100.00    96.69 (3.57)
8 - Lists/Sets                    100.00    99.13 (1.94)
9 - Simple Negation               100.00    100.00 (0.00)
10 - Indefinite Knowledge         100.00    99.35 (1.44)
11 - Basic Coreference            100.00    100.00 (0.00)
12 - Conjunction                  100.00    100.00 (0.00)
13 - Compound Coref               100.00    100.00 (0.00)
14 - Time Reasoning               100.00    100.00 (0.00)
15 - Basic Deduction              100.00    100.00 (0.00)
16 - Basic Induction              98.75     95.05 (5.12)
17 - Positional Reasoning         100.00    100.00 (0.00)
18 - Size Reasoning               100.00    99.13 (1.94)
19 - Path Finding                 100.00    100.00 (0.00)
20 - Agents Motivations           100.00    100.00 (0.00)
Mean error                        0.21      0.86 (1.11)
Solved tasks (>95% accuracy)      20/20     n/a

Mean and standard deviation of test errors for the best 5 hyper-parameters (chosen according to the validation loss).
The fixed parameters are reported in Table B.2, and the ones we sweep over are reported in Table B.3.
Table B.2: MEMO fixed hyper-parameters per task (PAI lengths 3-5; shortest path (SP) nodes-out-degree-path length; bAbI).

Parameter                PAI 3  PAI 4  PAI 5  SP 10-2-2  SP 20-3-3  SP 20-5-3  bAbI
I                        32     48     64     20         60         100        320
S                        3      3      3      2          2          2          11
O                        1000   1000   1000   1000       1000       1000       177
d_c                      128    128    128    128        128        128        128
d                        256    256    256    512        512        512        512
d_a                      128    128    128    128        256        256        256
DropOut_a                0.1    0.1    0.1    0.1        0.1        0.1        0.1
DropOut_o                0      0      0      0          0          0          0.5
GRU_R hidden size        256    256    256    256        256        256        256
MLP_R number of layers   1      1      1      1          1          1          1
MLP_R hidden size        64     64     64     64         64         64         64
B.6 ACT description
In our implementation of ACT, the halting unit is computed as:

h_t = \sigma(\pi_t) \qquad (B.1)
where \pi_t is the binary policy of MEMO. This is slightly different from the original ACT, which represents this unit as:
h_t = \sigma(W_h s_t + b_h) \qquad (B.2)
where W_h and b_h are trainable weights and biases, respectively, and s_t is the previously observed state. We argue that this slight change increases the fairness of the comparison, for two reasons: firstly, \pi_t(a|s_t, \theta) depends on s_t, but it uses several non-linearities to do so, rather than a simple linear projection, so it should enable more powerful representations. Secondly, it makes the mechanism much more similar to our model while still allowing us to evaluate the feasibility of this halting mechanism.
From this point we proceed as in the original work by defining the halting probability:
p_t =
\begin{cases}
R & \text{if } t = T\\
h_t & \text{otherwise}
\end{cases}
\qquad (B.3)

where

T = \min\left\{ t' : \sum_{t=1}^{t'} h_t \geq 1 - \varepsilon \right\} \qquad (B.4)

R = 1 - \sum_{t=1}^{T-1} h_t \qquad (B.5)

a = \sum_{t=1}^{T} p_t a_t \qquad (B.6)
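To make Eqs. B.3-B.6 concrete, here is a minimal numpy sketch of the halting computation; the function and variable names are ours, and we assume the threshold of Eq. B.4 is reached within the given steps:

```python
import numpy as np

def act_halt(h, a, eps=0.01):
    """h: (T_max,) halting unit outputs h_t in (0, 1); a: (T_max, dim)
    per-step outputs a_t. Stops at the first step T whose cumulative h
    reaches 1 - eps (Eq. B.4), assigns the remainder R to the last step
    (Eqs. B.3, B.5), and returns the p_t-weighted output (Eq. B.6)."""
    T = int(np.argmax(np.cumsum(h) >= 1.0 - eps)) + 1  # first qualifying step
    R = 1.0 - h[:T - 1].sum()                          # remainder (Eq. B.5)
    p = np.concatenate([h[:T - 1], [R]])               # halting probs (Eq. B.3)
    return (p[:, None] * a[:T]).sum(axis=0)            # final output (Eq. B.6)
```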
[2] Norbert Wiener. Cybernetics: Control and Communication in the Animal and the Machine, 2nd edition. MIT Press, 1961.
[3] Allen Newell, Herbert Alexander Simon, et al. Human problem solving, volume 104.
Prentice-hall Englewood Cliffs, NJ, 1972.
[4] Norbert Wiener. Cybernetics or Control and Communication in the Animal and the
Machine. Technology Press, 1948.
[5] Claude Elwood Shannon and John McCarthy. Automata Studies (AM-34), volume 34. Princeton University Press, 1956.
[6] Marvin Minsky. Steps toward artificial intelligence. Proceedings of the IRE, 49(1):8–
30, 1961.
[8] George A Miller. The magical number seven, plus or minus two: Some limits on our
capacity for processing information. Psychological review, 63(2):81, 1956.
[9] Noam Chomsky and David W Lightfoot. Syntactic structures. Walter de Gruyter,
1957.
[10] Christof Koch and Gilles Laurent. Complexity and the nervous system. Science,
284(5411):96–98, 1999.
[11] Frank H Eeckman and Walter J Freeman. Asymmetric sigmoid non-linearity in the
rat olfactory system. Brain Research, 557(1-2):13–21, 1991.
[13] Milan Palus. Nonlinearity in normal human eeg: Cycles and randomness, not chaos.
In Biological Cybernetics. Citeseer, 1994.
[17] Dean Buonomano. Your brain is a time machine: The neuroscience and physics of
time. WW Norton & Company, 2017.
[18] John W Krakauer, Asif A Ghazanfar, Alex Gomez-Marin, Malcolm A MacIver, and
David Poeppel. Neuroscience needs behavior: correcting a reductionist bias. Neuron,
93(3):480–490, 2017.
[21] David Marr. Vision: A computational investigation into the human representation
and processing of visual information. MIT press, 1982.
[22] Paul Werbos. Beyond regression: New tools for prediction and analysis in the behavioral sciences. PhD dissertation, Harvard University, 1974.
[23] David B Parker. Learning logic technical report tr-47. Center of Computational
Research in Economics and Management Science, Massachusetts Institute of Tech-
nology, Cambridge, MA, 1985.
[24] Yann LeCun. Une procédure d'apprentissage pour réseau à seuil asymétrique. Proceedings of Cognitiva 85, pages 599–604, 1985.
[25] David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. Learning representations by back-propagating errors. Nature, 323(6088):533–536, 1986.
[26] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444, 2015.
[27] Jürgen Schmidhuber. Deep learning in neural networks: An overview. Neural net-
works, 61:85–117, 2015.
[29] Paul J Werbos et al. Backpropagation through time: what it does and how to do it.
Proceedings of the IEEE, 78(10):1550–1560, 1990.
[30] Jeffrey L Elman. Finding structure in time. Cognitive science, 14(2):179–211, 1990.
[31] Hervé Bourlard and Yves Kamp. Auto-association by multilayer perceptrons and
singular value decomposition. Biological cybernetics, 59(4-5):291–294, 1988.
[32] Geoffrey E Hinton and Richard S Zemel. Autoencoders, minimum description length
and helmholtz free energy. In Advances in neural information processing systems,
pages 3–10, 1994.
[33] Geoffrey E Hinton, Terrence Joseph Sejnowski, and Tomaso A Poggio. Unsupervised
learning: foundations of neural computation. MIT press, 1999.
[34] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv
preprint arXiv:1312.6114, 2013.
[36] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness,
Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg
Ostrovski, et al. Human-level control through deep reinforcement learning. Nature,
518(7540):529, 2015.
[37] David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484, 2016.
[39] Wolfram Schultz, Peter Dayan, and P Read Montague. A neural substrate of predic-
tion and reward. Science, 275(5306):1593–1599, 1997.
[40] Wolfram Schultz. Predictive reward signal of dopamine neurons. Journal of neuro-
physiology, 80(1):1–27, 1998.
[41] Richard S Sutton and Andrew G Barto. Toward a modern theory of adaptive net-
works: expectation and prediction. Psychological review, 88(2):135, 1981.
[43] David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew
Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel,
et al. A general reinforcement learning algorithm that masters chess, shogi, and go
through self-play. Science, 362(6419):1140–1144, 2018.
[44] Warren S McCulloch and Walter Pitts. A logical calculus of the ideas immanent in
nervous activity. The bulletin of mathematical biophysics, 5(4):115–133, 1943.
[45] Donald Olding Hebb and DO Hebb. The organization of behavior, volume 65. Wiley
New York, 1949.
[46] Frank Rosenblatt. The perceptron: a probabilistic model for information storage and
organization in the brain. Psychological review, 65(6):386, 1958.
[47] John Haugeland. Artificial intelligence: The very idea. MIT press, 1989.
[48] James L McClelland, David E Rumelhart, PDP Research Group, et al. Parallel dis-
tributed processing, volume 2. MIT press Cambridge, MA:, 1987.
[49] Jay L McClelland, Mark St. John, and Roman Taraban. Sentence comprehension:
A parallel distributed processing approach. Language and cognitive processes, 4(3-
4):SI287–SI335, 1989.
[50] Asher Cohen, Richard I Ivry, and Steven W Keele. Attention and structure in se-
quence learning. Journal of Experimental Psychology: Learning, Memory, and Cog-
nition, 16(1):17, 1990.
[52] Daniel M Wolpert and Mitsuo Kawato. Multiple paired forward and inverse models
for motor control. Neural networks, 11(7-8):1317–1329, 1998.
[54] James L McClelland and David E Rumelhart. Distributed memory and the repre-
sentation of general and specific information. Journal of Experimental Psychology:
General, 114(2):159, 1985.
[55] Tomaso Poggio and Christian R Shelton. Machine learning, machine vision, and the
brain. AI Magazine, 20(3):37–37, 1999.
[56] Richard A Andersen, Greg K Essick, and Ralph M Siegel. Encoding of spatial loca-
tion by posterior parietal neurons. Science, 230(4724):456–458, 1985.
[58] Geoffrey E Hinton and Tim Shallice. Lesioning an attractor network: Investigations
of acquired dyslexia. Psychological review, 98(1):74, 1991.
[59] David Zipser. Identification models of the nervous system. Neuroscience, 47(4):853–
862, 1992.
[60] Yoshua Bengio et al. Learning deep architectures for ai. Foundations and trends® in
Machine Learning, 2(1):1–127, 2009.
[61] Yoshua Bengio, Olivier Delalleau, and Nicolas L Roux. The curse of highly variable
functions for local kernel machines. In Advances in neural information processing
systems, pages 107–114, 2006.
[62] Andy Clark et al. Associative engines: Connectionism, concepts, and representa-
tional change. MIT Press, 1993.
[63] Guido F Montufar, Razvan Pascanu, Kyunghyun Cho, and Yoshua Bengio. On the
number of linear regions of deep neural networks. In Advances in neural information
processing systems, pages 2924–2932, 2014.
[64] Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. A neural
probabilistic language model. Journal of machine learning research, 3(Feb):1137–
1155, 2003.
[65] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Dis-
tributed representations of words and phrases and their compositionality. In Advances
in neural information processing systems, pages 3111–3119, 2013.
[66] Richard Socher, Cliff C Lin, Chris Manning, and Andrew Y Ng. Parsing natu-
ral scenes and natural language with recursive neural networks. In Proceedings of
the 28th international conference on machine learning (ICML-11), pages 129–136,
2011.
[67] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine transla-
tion by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.
[68] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-
training of deep bidirectional transformers for language understanding. arXiv
preprint arXiv:1810.04805, 2018.
[69] Yann LeCun, Bernhard E Boser, John S Denker, Donnie Henderson, Richard E
Howard, Wayne E Hubbard, and Lawrence D Jackel. Handwritten digit recogni-
tion with a back-propagation network. In Advances in neural information processing
systems, pages 396–404, 1990.
[70] Daniel LK Yamins and James J DiCarlo. Using goal-driven deep learning models to
understand sensory cortex. Nature neuroscience, 19(3):356, 2016.
[71] David C Van Essen and John HR Maunsell. Hierarchical organization and functional
streams in the visual cortex. Trends in neurosciences, 6:370–375, 1983.
[72] Daniel LK Yamins, Ha Hong, Charles F Cadieu, Ethan A Solomon, Darren Seibert,
and James J DiCarlo. Performance-optimized hierarchical models predict neural
responses in higher visual cortex. Proceedings of the National Academy of Sciences,
111(23):8619–8624, 2014.
[73] Geoffrey E Hinton, Simon Osindero, and Yee-Whye Teh. A fast learning algorithm
for deep belief nets. Neural computation, 18(7):1527–1554, 2006.
[74] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. IEEE, 2009.
[75] Geoffrey E Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan R
Salakhutdinov. Improving neural networks by preventing co-adaptation of feature
detectors. arXiv preprint arXiv:1207.0580, 2012.
[76] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification
with deep convolutional neural networks. In Advances in neural information pro-
cessing systems, pages 1097–1105, 2012.
[77] Andrej Karpathy, George Toderici, Sanketh Shetty, Thomas Leung, Rahul Suk-
thankar, and Li Fei-Fei. Large-scale video classification with convolutional neural
networks. In Proceedings of the IEEE conference on Computer Vision and Pattern
Recognition, pages 1725–1732, 2014.
[78] Ossama Abdel-Hamid, Abdel-rahman Mohamed, Hui Jiang, Li Deng, Gerald Penn,
and Dong Yu. Convolutional neural networks for speech recognition. IEEE/ACM
Transactions on audio, speech, and language processing, 22(10):1533–1545, 2014.
[79] Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals,
Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. Wavenet:
A generative model for raw audio. arXiv preprint arXiv:1609.03499, 2016.
[80] Kurt Hornik, Maxwell Stinchcombe, and Halbert White. Multilayer feedforward
networks are universal approximators. Neural networks, 2(5):359–366, 1989.
[81] John K Kruschke. Toward a unified model of attention in associative learning. Jour-
nal of mathematical psychology, 45(6):812–863, 2001.
[82] Steven E Petersen and Michael I Posner. The attention system of the human brain:
20 years after. Annual review of neuroscience, 35:73–89, 2012.
[83] Marisa Carrasco. Visual attention: The past 25 years. Vision research, 51(13):1484–
1525, 2011.
[84] Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with
neural networks. In Advances in neural information processing systems, pages 3104–
3112, 2014.
[85] Volodymyr Mnih, Nicolas Heess, Alex Graves, et al. Recurrent models of visual
attention. In Advances in neural information processing systems, pages 2204–2212,
2014.
[86] Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan
Salakhudinov, Rich Zemel, and Yoshua Bengio. Show, attend and tell: Neural im-
age caption generation with visual attention. In International conference on machine
learning, pages 2048–2057, 2015.
[87] Karol Gregor, Ivo Danihelka, Alex Graves, Danilo Jimenez Rezende, and Daan
Wierstra. Draw: A recurrent neural network for image generation. arXiv preprint
arXiv:1502.04623, 2015.
[88] Scott Reed, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt Schiele,
and Honglak Lee. Generative adversarial text to image synthesis. arXiv preprint
arXiv:1605.05396, 2016.
[89] Jianpeng Cheng, Li Dong, and Mirella Lapata. Long short-term memory-networks
for machine reading. arXiv preprint arXiv:1601.06733, 2016.
[90] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N
Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances
in neural information processing systems, pages 5998–6008, 2017.
[91] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization.
arXiv preprint arXiv:1607.06450, 2016.
[92] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning
for image recognition. In Proceedings of the IEEE conference on computer vision
and pattern recognition, pages 770–778, 2016.
[93] Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. 2019.
[94] Oriol Vinyals, Igor Babuschkin, Wojciech M Czarnecki, Michaël Mathieu, Andrew
Dudzik, Junyoung Chung, David H Choi, Richard Powell, Timo Ewalds, Petko
Georgiev, et al. Grandmaster level in starcraft ii using multi-agent reinforcement
learning. Nature, 575(7782):350–354, 2019.
[95] Gongbo Tang, Mathias Müller, Annette Rios, and Rico Sennrich. Why self-attention?
a targeted evaluation of neural machine translation architectures. arXiv preprint
arXiv:1808.08946, 2018.
[96] Felix Wu, Angela Fan, Alexei Baevski, Yann N Dauphin, and Michael Auli.
Pay less attention with lightweight and dynamic convolutions. arXiv preprint
arXiv:1901.10430, 2019.
[98] Richard C Atkinson and Richard M Shiffrin. Human memory: A proposed system
and its control processes. In Psychology of learning and motivation, volume 2, pages
89–195. Elsevier, 1968.
[99] Nelson Cowan. Short-term memory based on activated long-term memory: A review in response to Norris (2017). 2019.
[100] Alan D Baddeley and Graham Hitch. Working memory. In Psychology of learning
and motivation, volume 8, pages 47–89. Elsevier, 1974.
[101] Alan Baddeley. The episodic buffer: a new component of working memory? Trends
in cognitive sciences, 4(11):417–423, 2000.
[102] Endel Tulving. How many memory systems are there? American psychologist,
40(4):385, 1985.
[103] Endel Tulving. Episodic memory: From mind to brain. Annual review of psychology,
53(1):1–25, 2002.
[104] Larry R Squire. Declarative and nondeclarative memory: Multiple brain systems
supporting learning and memory. Journal of cognitive neuroscience, 4(3):232–243,
1992.
[105] Daniel L Schacter. Implicit memory: History and current status. Journal of experi-
mental psychology: learning, memory, and cognition, 13(3):501, 1987.
[106] Larry R Squire, Craig EL Stark, and Robert E Clark. The medial temporal lobe.
Annu. Rev. Neurosci., 27:279–306, 2004.
[107] Larry R Squire, Lisa Genzel, John T Wixted, and Richard G Morris. Memory con-
solidation. Cold Spring Harbor perspectives in biology, 7(8):a021766, 2015.
[108] Edmond Teng and Larry R Squire. Memory for places learned long ago is intact after
hippocampal damage. Nature, 400(6745):675–677, 1999.
[109] John J Hopfield. Neural networks and physical systems with emergent collec-
tive computational abilities. Proceedings of the national academy of sciences,
79(8):2554–2558, 1982.
[110] Michael I Jordan. Serial order: A parallel distributed processing approach. In Ad-
vances in psychology, volume 121, pages 471–495. Elsevier, 1997.
[111] Hava T Siegelmann and Eduardo D Sontag. Turing computability with neural nets.
Applied Mathematics Letters, 4(6):77–80, 1991.
[112] Yoshua Bengio, Patrice Simard, Paolo Frasconi, et al. Learning long-term depen-
dencies with gradient descent is difficult. IEEE transactions on neural networks,
5(2):157–166, 1994.
[113] Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. On the difficulty of training
recurrent neural networks. ICML (3), 28:1310–1318, 2013.
[114] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural compu-
tation, 9(8):1735–1780, 1997.
[115] Felix A Gers, Jürgen Schmidhuber, and Fred Cummins. Learning to forget: Continual prediction with LSTM. 1999.
[116] Mike Schuster and Kuldip K Paliwal. Bidirectional recurrent neural networks. IEEE
Transactions on Signal Processing, 45(11):2673–2681, 1997.
[117] Alex Graves and Jürgen Schmidhuber. Framewise phoneme classification with bidi-
rectional lstm and other neural network architectures. Neural networks, 18(5-6):602–
610, 2005.
[118] Alex Graves and Jürgen Schmidhuber. Offline handwriting recognition with multidi-
mensional recurrent neural networks. In Advances in neural information processing
systems, pages 545–552, 2009.
[119] Alex Graves and Navdeep Jaitly. Towards end-to-end speech recognition with re-
current neural networks. In International conference on machine learning, pages
1764–1772, 2014.
[120] Oriol Vinyals, Łukasz Kaiser, Terry Koo, Slav Petrov, Ilya Sutskever, and Geoffrey
Hinton. Grammar as a foreign language. In Advances in neural information process-
ing systems, pages 2773–2781, 2015.
[121] Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi
Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representa-
tions using rnn encoder-decoder for statistical machine translation. arXiv preprint
arXiv:1406.1078, 2014.
[122] Klaus Greff, Rupesh K Srivastava, Jan Koutník, Bas R Steunebrink, and Jürgen Schmidhuber. LSTM: A search space odyssey. IEEE transactions on neural networks and learning systems, 28(10):2222–2232, 2016.
[123] Rafal Jozefowicz, Wojciech Zaremba, and Ilya Sutskever. An empirical exploration
of recurrent network architectures. In International Conference on Machine Learn-
ing, pages 2342–2350, 2015.
[124] Geoffrey E Hinton and David C Plaut. Using fast weights to deblur old memories. In
Proceedings of the ninth annual conference of the Cognitive Science Society, pages
177–186, 1987.
[125] Charles F Stevens and Yanyan Wang. Facilitation and depression at single central
synapses. Neuron, 14(4):795–802, 1995.
[126] Jimmy Ba, Geoffrey E Hinton, Volodymyr Mnih, Joel Z Leibo, and Catalin Ionescu.
Using fast weights to attend to the recent past. In Advances in Neural Information
Processing Systems, pages 4331–4339, 2016.
[127] Thomas Miconi. Learning to learn with backpropagation of hebbian plasticity. arXiv
preprint arXiv:1609.02228, 2016.
[128] James L McClelland, Bruce L McNaughton, and Randall C O’Reilly. Why there
are complementary learning systems in the hippocampus and neocortex: insights
from the successes and failures of connectionist models of learning and memory.
Psychological review, 102(3):419, 1995.
[132] Long-Ji Lin. Reinforcement learning for robots using neural networks. Technical
report, Carnegie-Mellon Univ Pittsburgh PA School of Computer Science, 1993.
[133] Tom Schaul, John Quan, Ioannis Antonoglou, and David Silver. Prioritized experi-
ence replay. arXiv preprint arXiv:1511.05952, 2015.
[134] Marcin Andrychowicz, Filip Wolski, Alex Ray, Jonas Schneider, Rachel Fong, Peter
Welinder, Bob McGrew, Josh Tobin, OpenAI Pieter Abbeel, and Wojciech Zaremba.
Hindsight experience replay. In Advances in Neural Information Processing Systems,
pages 5048–5058, 2017.
[135] Máté Lengyel and Peter Dayan. Hippocampal contributions to control: the third way.
In Advances in neural information processing systems, pages 889–896, 2008.
[136] Samuel J Gershman and Nathaniel D Daw. Reinforcement learning and episodic
memory in humans and animals: an integrative framework. Annual review of psy-
chology, 68:101–128, 2017.
[137] Charles Blundell, Benigno Uria, Alexander Pritzel, Yazhe Li, Avraham Ruderman,
Joel Z Leibo, Jack Rae, Daan Wierstra, and Demis Hassabis. Model-free episodic
control. arXiv preprint arXiv:1606.04460, 2016.
[138] Alexander Pritzel, Benigno Uria, Sriram Srinivasan, Adria Puigdomenech Badia,
Oriol Vinyals, Demis Hassabis, Daan Wierstra, and Charles Blundell. Neural
episodic control. In Proceedings of the 34th International Conference on Machine
Learning-Volume 70, pages 2827–2836. JMLR. org, 2017.
[139] Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Daan Wierstra, et al. Match-
ing networks for one shot learning. In Advances in neural information processing
systems, pages 3630–3638, 2016.
[140] Alex Graves, Greg Wayne, and Ivo Danihelka. Neural turing machines. arXiv
preprint arXiv:1410.5401, 2014.
[141] Alex Graves, Greg Wayne, Malcolm Reynolds, Tim Harley, Ivo Danihelka, Agnieszka Grabska-Barwińska, Sergio Gómez Colmenarejo, Edward Grefenstette, Tiago Ramalho, John Agapiou, et al. Hybrid computing using a neural network with dynamic external memory. Nature, 538(7626):471, 2016.
[142] Jack Rae, Jonathan J Hunt, Ivo Danihelka, Timothy Harley, Andrew W Senior, Gre-
gory Wayne, Alex Graves, and Timothy Lillicrap. Scaling memory-augmented neu-
ral networks with sparse reads and writes. In Advances in Neural Information Pro-
cessing Systems, pages 3621–3629, 2016.
[143] Jason Weston, Sumit Chopra, and Antoine Bordes. Memory networks. arXiv preprint
arXiv:1410.3916, 2014.
[144] Sainbayar Sukhbaatar, Jason Weston, Rob Fergus, et al. End-to-end memory net-
works. In Advances in neural information processing systems, pages 2440–2448,
2015.
[145] Junhyuk Oh, Valliappa Chockalingam, Satinder Singh, and Honglak Lee. Con-
trol of memory, active perception, and action in minecraft. arXiv preprint
arXiv:1605.09128, 2016.
[146] Ankit Kumar, Ozan Irsoy, Peter Ondruska, Mohit Iyyer, James Bradbury, Ishaan
Gulrajani, Victor Zhong, Romain Paulus, and Richard Socher. Ask me anything:
Dynamic memory networks for natural language processing. In Proceedings of The
33rd International Conference on Machine Learning, pages 1378–1387, 2016.
[147] Juan Pavez, Héctor Allende, and Héctor Allende-Cid. Working memory networks:
Augmenting memory networks with a relational reasoning module. In ACL, 2018.
[148] Adam Santoro, David Raposo, David G Barrett, Mateusz Malinowski, Razvan Pas-
canu, Peter Battaglia, and Timothy Lillicrap. A simple neural network module for
relational reasoning. In Advances in neural information processing systems, pages
4967–4976, 2017.
[149] Mikael Henaff, Jason Weston, Arthur Szlam, Antoine Bordes, and Yann Le-
Cun. Tracking the world state with recurrent entity networks. arXiv preprint
arXiv:1612.03969, 2016.
[150] Ankit Kumar, Ozan Irsoy, Peter Ondruska, Mohit Iyyer, James Bradbury, Ishaan
Gulrajani, Victor Zhong, Romain Paulus, and Richard Socher. Ask me anything:
Dynamic memory networks for natural language processing. In International con-
ference on machine learning, pages 1378–1387, 2016.
[151] R Evans, J Jumper, J Kirkpatrick, L Sifre, TFG Green, C Qin, A Zidek, A Nelson, A Bridgland, H Penedones, et al. De novo structure prediction with deep-learning based scoring. Annu Rev Biochem, 77:363–382, 2018.
[152] Hui Y Xiong, Babak Alipanahi, Leo J Lee, Hannes Bretschneider, Daniele Merico,
Ryan KC Yuen, Yimin Hua, Serge Gueroussov, Hamed S Najafabadi, Timothy R
Hughes, et al. The human splicing code reveals new insights into the genetic deter-
minants of disease. Science, 347(6218):1254806, 2015.
[153] Terry Anderson. Towards a theory of online learning. Theory and practice of online
learning, 2:109–119, 2004.
[154] Adam H Marblestone, Greg Wayne, and Konrad P Kording. Toward an integration
of deep learning and neuroscience. Frontiers in computational neuroscience, 10:94,
2016.
[155] Marcin Andrychowicz, Misha Denil, Sergio Gomez, Matthew W Hoffman, David
Pfau, Tom Schaul, Brendan Shillingford, and Nando De Freitas. Learning to learn by
gradient descent by gradient descent. In Advances in neural information processing
systems, pages 3981–3989, 2016.
[156] Thomas Elsken, Jan Hendrik Metzen, and Frank Hutter. Neural architecture search:
A survey. Journal of Machine Learning Research, 20(55):1–21, 2019.
[157] James L McClelland, Matthew M Botvinick, David C Noelle, David C Plaut, Tim-
othy T Rogers, Mark S Seidenberg, and Linda B Smith. Letting structure emerge:
connectionist and dynamical systems approaches to cognition. Trends in cognitive
sciences, 14(8):348–356, 2010.
[159] Torkel Hafting, Marianne Fyhn, Sturla Molden, May-Britt Moser, and Edvard I
Moser. Microstructure of a spatial map in the entorhinal cortex. Nature,
436(7052):801, 2005.
[160] Dagmar Zeithamova, Margaret L Schlichting, and Alison R Preston. The hippocam-
pus and inferential reasoning: building memories to navigate future decisions. Fron-
tiers in human neuroscience, 6:70, 2012.
[161] Andrea Banino, Caswell Barry, Benigno Uria, Charles Blundell, Timothy Lillicrap,
Piotr Mirowski, Alexander Pritzel, Martin J Chadwick, Thomas Degris, Joseph Mo-
dayil, et al. Vector-based navigation using grid-like representations in artificial
agents. Nature, 557(7705):429–433, 2018.
[162] Charles R Gallistel. The organization of learning. The MIT Press, 1990.
[163] Edward C Tolman. Cognitive maps in rats and men. Psychological review, 55(4):189,
1948.
[164] John O’Keefe and Jonathan Dostrovsky. The hippocampus as a spatial map: prelim-
inary evidence from unit activity in the freely-moving rat. Brain research, 1971.
[165] Arne D Ekstrom, Michael J Kahana, Jeremy B Caplan, Tony A Fields, Eve A Isham,
Ehren L Newman, and Itzhak Fried. Cellular networks underlying human spatial
navigation. Nature, 425(6954):184, 2003.
[166] Nachum Ulanovsky and Cynthia F Moss. Hippocampal cellular and network activity
in freely moving echolocating bats. Nature neuroscience, 10(2):224–233, 2007.
[167] Jeffrey S Taube, Robert U Muller, and James B Ranck. Head-direction cells recorded
from the postsubiculum in freely moving rats. i. description and quantitative analysis.
Journal of Neuroscience, 10(2):420–435, 1990.
[168] Colin Lever, Stephen Burton, Ali Jeewajee, John O’Keefe, and Neil Burgess. Bound-
ary vector cells in the subiculum of the hippocampal formation. Journal of Neuro-
science, 29(31):9771–9777, 2009.
[169] Trygve Solstad, Charlotte N Boccara, Emilio Kropff, May-Britt Moser, and Ed-
vard I Moser. Representation of geometric borders in the entorhinal cortex. Science,
322(5909):1865–1868, 2008.
[170] James R Hinman, G William Chapman, and Michael E Hasselmo. Neuronal repre-
sentation of environmental boundaries in egocentric coordinates. Nature communi-
cations, 10(1):1–8, 2019.
[171] Robert U Muller, John L Kubie, and James B Ranck. Spatial firing patterns of
hippocampal complex-spike cells in a fixed environment. Journal of Neuroscience,
7(7):1935–1950, 1987.
[172] Matthew A Wilson and Bruce L McNaughton. Dynamics of the hippocampal ensem-
ble code for space. Science, 261(5124):1055–1058, 1993.
[173] LT Thompson and PJ Best. Long-term stability of the place-field activity of single
units recorded from the dorsal hippocampus of freely behaving rats. Brain research,
509(2):299–308, 1990.
[174] Elizabeth Bostock, Robert U Muller, and John L Kubie. Experience-dependent mod-
ifications of hippocampal place cell firing. Hippocampus, 1(2):193–205, 1991.
[175] Michael I Anderson and Kathryn J Jeffery. Heterogeneous modulation of place cell
firing by changes in context. Journal of Neuroscience, 23(26):8827–8835, 2003.
[176] Stefan Leutgeb, Jill K Leutgeb, Alessandro Treves, May-Britt Moser, and Edvard I
Moser. Distinct ensemble codes in hippocampal areas CA3 and CA1. Science,
305(5688):1295–1298, 2004.
[177] David Marr, David Willshaw, and Bruce McNaughton. Simple memory: a theory for
archicortex. In From the Retina to the Neocortex, pages 59–128. Springer, 1991.
[178] Kenneth A Norman and Randall C O’Reilly. Modeling hippocampal and neocortical
contributions to recognition memory: a complementary-learning-systems approach.
Psychological review, 110(4):611, 2003.
[180] Matthew L Shapiro and David S Olton. Hippocampal function and interference.
Memory systems, 1994:141–146, 1994.
[182] Randall C O’Reilly and Kenneth A Norman. Hippocampal and neocortical contri-
butions to memory: Advances in the complementary learning systems framework.
Trends in cognitive sciences, 6(12):505–510, 2002.
[184] Robert W Stackman and Jeffrey S Taube. Firing properties of head direction cells
in the rat anterior thalamic nucleus: dependence on vestibular input. Journal of
Neuroscience, 17(11):4349–4358, 1997.
[186] Caswell Barry, Robin Hayman, Neil Burgess, and Kathryn J Jeffery. Experience-
dependent rescaling of entorhinal grids. Nature neuroscience, 10(6):682, 2007.
[187] Hanne Stensola, Tor Stensola, Trygve Solstad, Kristian Frøland, May-Britt Moser,
and Edvard I Moser. The entorhinal grid map is discretized. Nature, 492(7427):72–
78, 2012.
[188] Neil Burgess, Caswell Barry, and John O’Keefe. An oscillatory interference model
of grid cell firing. Hippocampus, 17(9):801–812, 2007.
[189] Lisa M Giocomo, Eric A Zilli, Erik Fransén, and Michael E Hasselmo. Temporal
frequency of subthreshold oscillations scales with entorhinal grid cell field spacing.
Science, 315(5819):1719–1722, 2007.
[190] Michael E Hasselmo, Lisa M Giocomo, and Eric A Zilli. Grid cell firing may arise
from interference of theta frequency membrane potential oscillations in single neu-
rons. Hippocampus, 17(12):1252–1271, 2007.
[191] Michael E Hasselmo and Mark P Brandon. Linking cellular mechanisms to behavior:
entorhinal persistent spiking and membrane potential oscillations may underlie path
integration, grid cell firing, and episodic memory. Neural plasticity, 2008, 2008.
[192] Mark C Fuhs and David S Touretzky. A spin glass model of path integration in rat
medial entorhinal cortex. Journal of Neuroscience, 26(16):4266–4276, 2006.
[193] Bruce L McNaughton, Francesco P Battaglia, Ole Jensen, Edvard I Moser, and May-
Britt Moser. Path integration and the neural basis of the ‘cognitive map’. Nature
Reviews Neuroscience, 7(8):663, 2006.
[194] Alexis Guanella, Daniel Kiper, and Paul Verschure. A model of grid cells based on
a twisted torus topology. International journal of neural systems, 17(04):231–240,
2007.
[195] Yoram Burak and Ila R Fiete. Accurate path integration in continuous attractor net-
work models of grid cells. PLoS computational biology, 5(2):e1000291, 2009.
[196] Ila R Fiete, Yoram Burak, and Ted Brookings. What grid cells convey about rat
location. Journal of Neuroscience, 28(27):6858–6871, 2008.
[197] Uğur M Erdem and Michael Hasselmo. A goal-directed spatial navigation model
using forward trajectory planning based on grid cells. European Journal of Neuro-
science, 35(6):916–931, 2012.
[198] Daniel Bush, Caswell Barry, Daniel Manson, and Neil Burgess. Using grid cells for
navigation. Neuron, 87(3):507–520, 2015.
[199] Joshua P Bassett and Jeffrey S Taube. Neural correlates for angular head velocity in
the rat dorsal tegmental nucleus. Journal of Neuroscience, 21(15):5740–5751, 2001.
[200] Caswell Barry and Neil Burgess. Neural mechanisms of self-location. Current Biol-
ogy, 24(8):R330–R339, 2014.
[201] Emilio Kropff, James E Carmichael, May-Britt Moser, and Edvard I Moser. Speed
cells in the medial entorhinal cortex. Nature, 523(7561):419–424, 2015.
[202] Nitish Srivastava, Geoffrey E Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan
Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting.
Journal of Machine Learning Research, 15(1):1929–1958, 2014.
[203] Tom J Wills, Francesca Cacucci, Neil Burgess, and John O’Keefe. Development of
the hippocampal cognitive map in preweanling rats. Science, 328(5985):1573–1576,
2010.
[204] Rosamund F Langston, James A Ainge, Jonathan J Couey, Cathrin B Canto, Tale L
Bjerknes, Menno P Witter, Edvard I Moser, and May-Britt Moser. Development of
the spatial representation system in the rat. Science, 328(5985):1576–1580, 2010.
[205] Sheng-Jia Zhang, Jing Ye, Chenglin Miao, Albert Tsao, Ignas Cerniauskas, Deb-
ora Ledergerber, May-Britt Moser, and Edvard I Moser. Optogenetic dissection of
entorhinal-hippocampal functional connectivity. Science, 340(6128):1232627, 2013.
[206] Kiah Hardcastle, Surya Ganguli, and Lisa M Giocomo. Environmental boundaries
as an error correction mechanism for grid cells. Neuron, 86(3):827–839, 2015.
[207] Florian Raudies and Michael E Hasselmo. Modeling boundary vector cell firing
given optic flow as a cue. PLoS computational biology, 8(6):e1002553, 2012.
[208] John S. Bridle. Training stochastic model recognition algorithms as networks can
lead to maximum mutual information estimation of parameters. In D. S. Touret-
zky, editor, Advances in Neural Information Processing Systems 2, pages 211–217.
Morgan-Kaufmann, 1990.
[209] Jeffrey L Elman and James L McClelland. Exploiting lawful variability in the speech
wave. Invariance and variability in speech processes, 1:360–380, 1986.
[210] T. Tieleman and G. Hinton. Lecture 6.5—RmsProp: Divide the gradient by a run-
ning average of its recent magnitude. COURSERA: Neural Networks for Machine
Learning, 2012.
[212] Charles Beattie, Joel Z. Leibo, Denis Teplyashin, Tom Ward, Marcus Wainwright,
Heinrich Küttler, Andrew Lefrancq, Simon Green, Víctor Valdés, Amir Sadik, Ju-
lian Schrittwieser, Keith Anderson, Sarah York, Max Cant, Adam Cain, Adrian
Bolton, Stephen Gaffney, Helen King, Demis Hassabis, Shane Legg, and Stig Pe-
tersen. DeepMind Lab. CoRR, abs/1612.03801, 2016.
[213] Christian F Doeller, Caswell Barry, and Neil Burgess. Evidence for grid cells in a
human memory network. Nature, 463(7281):657–661, 2010.
[214] Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timo-
thy P Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous
methods for deep reinforcement learning. In Proceedings of the 33rd International
Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19-24,
2016, pages 1928–1937, 2016.
[215] Serge Beucher. Use of watersheds in contour detection. In Proceedings of the Inter-
national Workshop on Image Processing. CCETT, 1979.
[216] Rebecca Knight, Caitlin E Piette, Hector Page, Daniel Walters, Elizabeth Marozzi,
Marko Nardini, Simon Stringer, and Kathryn J Jeffery. Weighted cue integration in
the rodent head direction system. Philosophical Transactions of the Royal Society of
London B: Biological Sciences, 369(1635):20120512, 2014.
[217] Michael M Yartsev, Menno P Witter, and Nachum Ulanovsky. Grid cells without
theta oscillations in the entorhinal cortex of bats. Nature, 479(7371):103, 2011.
[218] Caswell Barry and Neil Burgess. To be a grid cell: Shuffling procedures for deter-
mining gridness. bioRxiv, 2017.
[219] G. Schwarz. Estimating the dimension of a model. The Annals of Statistics, 6:461–
464, 1978.
[220] Lewis G Halsey, Douglas Curran-Everett, Sarah L Vowler, and Gordon B Drum-
mond. The fickle p value generates irreproducible results. Nature methods,
12(3):179, 2015.
[221] Stephen Olejnik and James Algina. Measures of effect size for comparative studies:
Applications, interpretations, and limitations. Contemporary educational psychol-
ogy, 25(3):241–286, 2000.
[222] L.V. Hedges and I. Olkin. Statistical Methods for Meta-analysis. Academic Press,
1985.
[223] Martin Stemmler, Alexander Mathis, and Andreas VM Herz. Connecting multi-
ple spatial scales to decode the population activity of grid cells. Science advances,
1(11):e1500816, 2015.
[224] Ingmar Kanitscheider and Ila Fiete. Training recurrent networks to generate hy-
potheses about how the brain solves hard navigation problems. arXiv preprint
arXiv:1609.09059, 2016.
[225] Michael J Milford and Gordon F Wyeth. Mapping a suburb with a single camera us-
ing a biologically inspired SLAM system. IEEE Transactions on Robotics, 24(5):1038–
1053, 2008.
[226] Alexander Mathis, Andreas VM Herz, and Martin Stemmler. Optimal population
codes for space: grid cells outperform place cells. Neural Computation, 24(9):2280–
2317, 2012.
[227] Guifen Chen, John A King, Neil Burgess, and John O’Keefe. How vision and move-
ment combine in the hippocampal place code. Proceedings of the National Academy
of Sciences, 110(1):378–383, 2013.
[228] Ayelet Sarel, Arseny Finkelstein, Liora Las, and Nachum Ulanovsky. Vectorial rep-
resentation of spatial goals in the hippocampus of bats. Science, 355(6321):176–180,
2017.
[229] Martin J Chadwick, Amy EJ Jolly, Doran P Amos, Demis Hassabis, and Hugo J
Spiers. A goal direction signal in the human entorhinal/subicular region. Current
Biology, 25(1):87–92, 2015.
[230] David S Touretzky and A David Redish. Theory of rodent navigation based on
interacting representations of space. Hippocampus, 6(3):247–270, 1996.
[231] DJ Foster, RGM Morris, and Peter Dayan. A model of hippocampally dependent
navigation, using the temporal difference learning rule. Hippocampus, 10(1):1–16,
2000.
[232] John L Kubie and André A Fenton. Linear look-ahead in conjunctive cells: an en-
torhinal mechanism for vector-based navigation. Frontiers in neural circuits, 6:20,
2012.
[233] Nicholas J Gustafson and Nathaniel D Daw. Grid cells, place cells, and geodesic gen-
eralization for spatial reinforcement learning. PLoS computational biology, 7(10):e1002235,
2011.
[234] Kimberly L Stachenfeld, Matthew Botvinick, and Samuel J Gershman. Design prin-
ciples of the hippocampal cognitive map. In Advances in neural information pro-
cessing systems, pages 2528–2536, 2014.
[235] Yedidyah Dordek, Daniel Soudry, Ron Meir, and Dori Derdikman. Extracting grid
cell characteristics from place cell inputs using non-negative principal component
analysis. eLife, 5:e10094, 2016.
[236] John Widloski and Ila Fiete. How does the brain solve the computational problems
of spatial navigation? In Space, Time and Memory in the Hippocampal Formation,
pages 373–407. Springer, 2014.
[237] Yoshua Bengio, Dong-Hyun Lee, Jörg Bornschein, Thomas Mesnard, and Zhouhan
Lin. Towards biologically plausible deep learning. arXiv preprint arXiv:1502.04156,
2015.
[238] MWM Gamini Dissanayake, Paul Newman, Steve Clark, Hugh F. Durrant-Whyte,
and Michael Csorba. A solution to the simultaneous localization and map building
(SLAM) problem. IEEE Transactions on Robotics and Automation, 17(3):229–241,
2001.
[239] Piotr Mirowski, Razvan Pascanu, Fabio Viola, Hubert Soyer, Andy Ballard, An-
drea Banino, Misha Denil, Ross Goroshin, Laurent Sifre, Koray Kavukcuoglu, et al.
Learning to navigate in complex environments. International Conference on Learn-
ing Representations, 2017.
[240] Tejas D. Kulkarni, Ardavan Saeedi, Simanta Gautam, and Samuel J. Gershman. Deep
successor reinforcement learning. CoRR, abs/1606.02396, 2016.
[241] Branka Milivojevic and Christian F Doeller. Mnemonic networks in the hippocampal
formation: From spatial maps to temporal and conceptual codes. Journal of Experi-
mental Psychology: General, 142(4):1231, 2013.
[242] Howard Eichenbaum. The role of the hippocampus in navigation is memory. Journal
of neurophysiology, 117(4):1785–1796, 2017.
[244] Jacob LS Bellmund, Peter Gärdenfors, Edvard I Moser, and Christian F Doeller. Nav-
igating cognition: Spatial codes for human thinking. Science, 362(6415):eaat6766,
2018.
[245] Andrea Banino, Raphael Koster, Demis Hassabis, and Dharshan Kumaran. Retrieval-
based model accounts for striking profile of episodic memory and generalization.
Scientific reports, 6:31330, 2016.
[247] Andrea Banino, Adrià Puigdomènech Badia, Raphael Köster, Martin J Chadwick,
Vinicius Zambaldi, Demis Hassabis, Caswell Barry, Matthew Botvinick, Dharshan
Kumaran, and Charles Blundell. MEMO: A deep network for flexible combination of
episodic memories. arXiv preprint arXiv:2001.10913, 2020.
[248] Howard Eichenbaum and Neal J Cohen. Can we reconcile the declarative memory
and spatial navigation views on hippocampal function? Neuron, 83(4):764–770,
2014.
[250] Jeffery A Dusek and Howard Eichenbaum. The hippocampus and memory for
orderly stimulus relations. Proceedings of the National Academy of Sciences,
94(13):7109–7114, 1997.
[251] Dharshan Kumaran, Andrea Banino, Charles Blundell, Demis Hassabis, and Peter
Dayan. Computations underlying social hierarchy learning: distinct neural mecha-
nisms for updating and representing self-relevant information. Neuron, 92(5):1135–
1147, 2016.
[252] Howard Eichenbaum and Neal J Cohen. From conditioning to conscious recollec-
tion: Memory systems of the brain. Oxford University Press on Demand, 2004.
[253] Michael A Yassa and Craig EL Stark. Pattern separation in the hippocampus. Trends
in neurosciences, 34(10):515–525, 2011.
[254] Dharshan Kumaran and James L McClelland. Generalization through the recurrent
interaction of episodic memories: a model of the hippocampal system. Psychological
Review, 119(3):573, 2012.
[256] Raphael Koster, Martin J Chadwick, Yi Chen, David Berron, Andrea Banino, Emrah
Düzel, Demis Hassabis, and Dharshan Kumaran. Big-loop recurrence within the
hippocampal system supports integration of information across episodes. Neuron,
99(6):1342–1354, 2018.
[257] Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, and Łukasz
Kaiser. Universal transformers. arXiv preprint arXiv:1807.03819, 2018.
[258] Minjoon Seo, Sewon Min, Ali Farhadi, and Hannaneh Hajishirzi. Query-reduction
networks for question answering. arXiv preprint arXiv:1606.04582, 2016.
[259] Jason Weston, Antoine Bordes, Sumit Chopra, Alexander M Rush, Bart van Merriën-
boer, Armand Joulin, and Tomas Mikolov. Towards AI-complete question answering:
A set of prerequisite toy tasks. arXiv preprint arXiv:1502.05698, 2015.
[260] Alex Graves. Adaptive computation time for recurrent neural networks. arXiv
preprint arXiv:1603.08983, 2016.
[261] Tolga Bolukbasi, Joseph Wang, Ofer Dekel, and Venkatesh Saligrama. Adaptive
neural networks for efficient inference. In Proceedings of the 34th International
Conference on Machine Learning-Volume 70, pages 527–536. JMLR.org, 2017.
[263] Junyoung Chung, Sungjin Ahn, and Yoshua Bengio. Hierarchical multiscale recur-
rent neural networks. arXiv preprint arXiv:1609.01704, 2016.
[264] Adams Wei Yu, Hongrae Lee, and Quoc V Le. Learning to skim text. arXiv preprint
arXiv:1704.06877, 2017.
[265] Víctor Campos, Brendan Jou, Xavier Giró-i-Nieto, Jordi Torres, and Shih-Fu Chang.
Skip RNN: Learning to skip state updates in recurrent neural networks. In International
Conference on Learning Representations, 2018.
[266] Yelong Shen, Po-Sen Huang, Jianfeng Gao, and Weizhu Chen. ReasoNet: Learning to
stop reading in machine comprehension. In Proceedings of the 23rd ACM SIGKDD
International Conference on Knowledge Discovery and Data Mining, pages 1047–
1055. ACM, 2017.
[267] Christos Louizos, Max Welling, and Diederik P Kingma. Learning sparse neural
networks through L0 regularization. arXiv preprint arXiv:1712.01312, 2017.
[268] Junyoung Chung, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio. Gated
feedback recurrent neural networks. In International Conference on Machine Learn-
ing, pages 2067–2075, 2015.
[271] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization.
arXiv preprint arXiv:1412.6980, 2014.
[272] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A Large-
Scale Hierarchical Image Database. In CVPR09, 2009.
[273] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural
network. arXiv preprint arXiv:1503.02531, 2015.
[274] Jonathan Frankle and Michael Carbin. The lottery ticket hypothesis: Finding sparse,
trainable neural networks. arXiv preprint arXiv:1803.03635, 2018.
[275] Neal J Cohen and Howard Eichenbaum. Memory, amnesia, and the hippocampal
system. MIT Press, 1993.
[276] James Whittington, Timothy Muller, Shirley Mark, Caswell Barry, and Tim Behrens.
Generalisation of structural knowledge in the hippocampal-entorhinal system. In
Advances in neural information processing systems, pages 8484–8495, 2018.
[277] James CR Whittington, Timothy H Muller, Shirley Mark, Guifen Chen, Caswell
Barry, Neil Burgess, and Timothy EJ Behrens. The Tolman-Eichenbaum machine:
Unifying space and relational memory through generalisation in the hippocampal
formation. bioRxiv, page 770495, 2019.
[278] Carl Safina. Beyond words: What animals think and feel. Macmillan, 2015.
[279] Philip N Johnson-Laird. Mental models and human reasoning. Proceedings of the
National Academy of Sciences, 107(43):18243–18250, 2010.
[280] Keith James Holyoak and Robert G Morrison. The Cambridge handbook of thinking
and reasoning, volume 137. Cambridge University Press Cambridge, 2005.
[281] Brenden M Lake, Tomer D Ullman, Joshua B Tenenbaum, and Samuel J Gershman.
Building machines that learn and think like people. Behavioral and brain sciences,
40, 2017.
[282] Igor Mordatch. Concept learning with energy-based models. arXiv preprint
arXiv:1811.02486, 2018.
[283] Irina Higgins, Nicolas Sonnerat, Loic Matthey, Arka Pal, Christopher Burgess,
Matthew Botvinick, Demis Hassabis, and Alexander Lerchner. SCAN: Learning ab-
stract hierarchical compositional visual concepts. arXiv preprint arXiv:1707.03389, 2017.
[284] Peter Gärdenfors. Conceptual spaces: The geometry of thought. MIT Press, 2004.
[285] György Buzsáki and Edvard I Moser. Memory, navigation and theta rhythm in the
hippocampal-entorhinal system. Nature neuroscience, 16(2):130, 2013.
[286] Qiang Du, Vance Faber, and Max Gunzburger. Centroidal voronoi tessellations: Ap-
plications and algorithms. SIAM review, 41(4):637–676, 1999.
[287] Emilio Kropff and Alessandro Treves. The emergence of grid cells: Intelligent design
or just adaptation? Hippocampus, 18(12):1256–1269, 2008.
[288] Robert M Mok and Bradley C Love. A non-spatial account of place and grid cells
based on clustering models of concept learning. Nature communications, 10(1):1–9,
2019.
[289] Dmitriy Aronov, Rhino Nevers, and David W Tank. Mapping of a non-spatial di-
mension by the hippocampal–entorhinal circuit. Nature, 543(7647):719–722, 2017.
[290] Wilhelm von Humboldt. On Language: On the Diversity of Human Language
Construction and Its Influence on the Mental Development of the Human Species.
Cambridge University Press, 1999.
[292] Joseph R Manns and Howard Eichenbaum. Evolution of declarative memory. Hip-
pocampus, 16(9):795–808, 2006.
[293] David H Wolpert and William G Macready. No free lunch theorems for optimization.
IEEE transactions on evolutionary computation, 1(1):67–82, 1997.
[294] Uri Hasson, Samuel A Nastase, and Ariel Goldstein. Direct fit to nature: An evolu-
tionary perspective on biological and artificial neural networks. Neuron, 105(3):416–
434, 2020.
[295] Blake A Richards, Timothy P Lillicrap, Philippe Beaudoin, Yoshua Bengio, Rafal
Bogacz, Amelia Christensen, Claudia Clopath, Rui Ponte Costa, Archy de Berker,
Surya Ganguli, et al. A deep learning framework for neuroscience. Nature neuro-
science, 22(11):1761–1770, 2019.
[296] Eshin Jolly and Luke J Chang. The flatland fallacy: Moving beyond low-dimensional
thinking. Topics in cognitive science, 11(2):433–454, 2019.
[297] Michael McCloskey. Networks and theories: The place of connectionism in cognitive
science. Psychological science, 2(6):387–395, 1991.
[298] James J Gibson. The ecological approach to visual perception: classic edition. Psy-
chology Press, 1979.