Neural Cognitive Architectures for Never-Ending Learning

Author: Emmanouil Antonios Platanios (www.platanios.org, e.a.platanios@cs.cmu.edu)
Committee: Tom Mitchell†, Graham Neubig†, Eric Horvitz‡, Rich Caruana‡
†Carnegie Mellon University. ‡Microsoft Research.
Abstract
Allen Newell argued that the human mind functions as a single system and proposed the notion of a unified theory of cognition
(UTC). Most existing work on UTCs has focused on symbolic approaches, such as the Soar architecture (Laird, 2012) and the
ACT-R (Anderson et al., 2004) system. However, such approaches limit a system’s ability to perceive information of arbitrary
modalities, require a significant amount of human input, and are restrictive in terms of the learning mechanisms they support
(supervised learning, semi-supervised learning, reinforcement learning, etc.). For this reason, researchers in machine learning
have recently shifted their focus towards subsymbolic processing with methods such as deep learning. Deep learning systems
have become a standard for solving prediction problems in multiple application areas including computer vision, natural
language processing, and robotics. However, many real-world problems require integrating multiple, distinct modalities of
information (e.g., image, audio, language, etc.) in ways that machine learning models cannot currently handle well. Moreover,
most deep learning approaches are not able to utilize information learned from solving one problem to directly help in solving
another. They are also not capable of never-ending learning, failing on problems that are dynamic, ever-changing, and not
fixed a priori—as problems in the real world are, given the dynamic nature of the world. In this thesis, we aim to bridge the
gap between UTCs, deep learning, and never-ending learning. To that end, we propose a neural cognitive architecture (NCA)
that is inspired by human cognition and that can learn to continuously solve multiple problems that can grow in number over
time, across multiple distinct perception and action modalities, and from multiple noisy sources of supervision combined with
self-supervision. Furthermore, its experience from learning to solve past problems can be leveraged to learn to solve future
ones. The problems the proposed NCA is learning to solve are ever-evolving and can also be automatically generated by the
system itself. In our NCA, reasoning is performed recursively in a subsymbolic latent space that is shared across all problems
and modalities. The goal of this architecture is to take us a step closer towards general learning and intelligence. We have
also designed, implemented, and plan to extend an artificial simulated world that allows us to test for all the aforementioned
properties of the proposed architecture, in a controllable manner. We propose to perform multiple case studies—within this
simulated world and with real-world applications—that will allow us to evaluate our architecture.

1 Introduction

Cognitive architectures were first introduced by Newell (1990), who argued that the human mind functions as a single system, and proposed the notion of a unified theory of cognition (UTC). They often consist of constructs that reflect assumptions about human cognition and that are based on facts derived from psychology experiments (e.g., problem solving, decision making, routine action, memory, learning, skill, perception, motor behavior, language, motivation, emotion, imagination, and dreaming). In fact, Newell believed that cognitive architectures are the way to answer one of the ultimate scientific questions: "How can the human mind occur in the physical universe?". Most existing work on UTCs has focused on symbolic approaches, such as the Soar architecture (Laird, 2012) and the ACT-R (Anderson et al., 2004) system. However, such approaches limit a system's ability to perceive information of arbitrary modalities, require a significant amount of human input, and are restrictive in terms of the learning mechanisms they support (supervised learning, semi-supervised learning, reinforcement learning, etc.). For this reason, researchers in machine learning (ML) have shifted their focus towards methods like deep learning. Deep learning systems have become the de facto standard for solving prediction problems in a multitude of application areas including computer vision, natural language processing, and robotics. Driven by progress in deep learning, the machine learning community is now able to tackle increasingly more complex problems—ranging from multi-modal reasoning (Hu et al., 2017) to dexterous robotic manipulation (OpenAI et al., 2018)—many of which typically involve solving combinations of tasks. However, many real-world problems require integrating multiple, distinct modalities of information (e.g., image, audio, language) in ways that machine learning models cannot currently handle well. Furthermore, most of these approaches are also not able to utilize information learned from solving one problem to directly help in solving another—something at which human intelligence excels. There have been some limited attempts to train a single model to solve multiple problems jointly (e.g., Kaiser et al., 2017), but the resulting systems generally underperform those trained separately for each problem. Moreover, most of the existing approaches are also not capable of never-ending learning (NEL); namely, a machine learning paradigm in which an algorithm learns from examples continuously over time, in a largely self-supervised fashion, where its experience from past examples can be leveraged to learn future examples (Mitchell et al., 2018).
Current ML systems fail when the problems that need to be learned are not fixed a priori, but are rather dynamic and keep changing as part of the environment where the learning agents operate. For example, humans do not just learn to solve a fixed set of problems, but rather adapt and, by solving one problem, become better able to tackle new problems that they may even have been previously unaware of.¹ Furthermore, humans are capable of creating problems to learn on their own, something that current ML systems are not designed to achieve. Never-ending learning is thus also something at which human intelligence excels. To achieve true intelligence, a learning agent that interacts with the real world needs to be able to adapt in such a continuous fashion (i.e., due to the real world's dynamic nature). In fact, such an ability is crucial for never-ending learning, because learning forever only really makes sense if the learning objectives are ever-evolving.

¹For example, after humans managed to build heart monitoring devices, new unsolved problems became available, such as discovering the relationship between heart rate or blood pressure and specific health problems.

We aim to bridge the gap between UTCs, deep learning, and never-ending learning. To that end, we propose a neural cognitive architecture that allows for a tighter coupling between problems, as well as a higher level of abstraction over distinct modalities of information. We thus aim to test the following hypothesis in this thesis:

A computer system with an architecture inspired by human cognition can learn to continuously solve multiple problems that can grow in number over time, across multiple distinct perception and action modalities, and from multiple noisy sources of supervision combined with self-supervision. Furthermore, its experience from learning to solve past problems can be leveraged to learn to solve future ones.

Our main goals can be summarized as follows:
- Formalizing never-ending learning and the notion of a neural cognitive architecture. This includes defining the notion of an ever-evolving set of learning problems, whether the problems are provided externally or generated by the learning system itself, as well as ways to handle this setting.
- Designing a neural cognitive architecture that is inspired by the Hub-and-Spoke model of human cognition (Rogers et al., 2004; Ralph et al., 2017) and that also accounts for human goal-priming (Custers and Aarts, 2005; Aarts et al., 2008; Takarada and Nozaki, 2018). It is a novel modular architecture that contains perception and action spokes (i.e., modules), and a common reasoning hub for all problems that is independent of data modalities. The reasoning hub enables human-inspired capabilities such as associative memory (Fanselow and Poulos, 2005; Ranganath and Ritchey, 2012) and world simulation. It makes use of contextual parameter generation (Platanios et al., 2018) to emulate goal-priming.
- Evaluating the capabilities of the proposed architecture using multiple case studies over different learning settings. One such setting is the artificial Jelly Bean World that we have created, and where we can control the kinds of problems the agent needs to solve, and their interactions. We have designed this world in a way that renders never-ending learning necessary, and plan to extend it so that it allows us to test all parts of our hypothesis in a controllable manner. After testing our hypothesis in this artificial world, we also plan to perform experiments on real-world problems related to natural language processing, computer vision, and potentially healthcare. Healthcare applications are interesting because they present a real-world setting where such an architecture would be useful. This is due to the low amount of training data and large number of interconnected problems that underlie many healthcare applications.

This proposal is meant to describe our way of thinking about the design space for this problem as a whole. We are proposing to make progress towards confirming and exploring the aforementioned thesis statement, rather than being exhaustive. In the following section we discuss our main motivation for this thesis. Then, in Section 3 we describe the proposed approach along with background and related work for each of its components, and in Section 4 we describe our planned evaluation case studies. Finally, in Section 5 we present a tentative timeline for the proposed work.

2 Motivation

A long-standing goal in the fields of artificial intelligence and machine learning is to develop algorithms that can be applied across domains and that can efficiently handle multiple problems, just like the human mind does. Even though research in multi-task learning has a long history (Caruana, 1997), there has been a resurgence of interest in fundamental questions related to: (i) algorithmic frameworks for multi-task learning, such as learning-to-learn or meta-learning (Thrun and Pratt, 1998; Finn et al., 2017; Franceschi et al., 2018) and never-ending/lifelong learning (Mitchell et al., 2018), (ii) establishing best practices for building reliable systems that can handle multiple tasks at scale, such as federated learning for model personalization (Smith et al., 2017) or multi-agent coordination (Cao et al., 2013; Samarakoon et al., 2018), and (iii) learning deep representations (Bengio et al., 2013) that support multi-tasking and enable transfer learning in multiple domains, such as computer vision (Yosinski et al., 2014) or natural language processing (Collobert and Weston, 2008; Peters et al., 2018; Devlin et al., 2018).

Our interest in these questions started while working on the Never-Ending Language Learner (NELL) (Mitchell et al., 2018). NELL is a system that learns to read the web and extract knowledge from websites, in a never-ending fashion. One of the core mechanisms employed in NELL is co-training, which was originally proposed by Blum and Mitchell (1998). Co-training is a semi-supervised learning algorithm where multiple models are trained together and each model can use as training examples the most confident predictions made by the other models.
If any of the models produces wrong but confident predictions, these can propagate to the other models and eventually hinder learning. This motivated us to develop several algorithms for estimating accuracies of classifiers from unlabeled data (Platanios et al., 2014; 2016; 2017). The key idea behind all these methods is that agreement among multiple models implies that the agreed-upon prediction is more likely correct than wrong. However, we also observed that once we have multiple interacting tasks that are being learned jointly, we can perform accuracy estimation in a more robust manner by also accounting for inconsistencies between the tasks. For example, if one classifier predicts that Pittsburgh is a city and another one predicts that it is a person, and we know that something cannot be both a city and a person at the same time, then we can infer that at least one of these two classifiers must be wrong. Finally, this work pointed out an important pattern in how current machine learning systems are trained. Training data is often obtained by collecting multiple noisy labels for samples through crowdsourcing that are then aggregated to produce a single "denoised" label per sample. To this end, we adapted our accuracy estimation methods, resulting in a learning framework for general machine learning systems that allows them to be trained from multiple noisy labels directly—without requiring an explicit label aggregation step (Platanios et al., 2019). Through this and other experiences from working on NELL, we observed that: (i) learning multiple tasks jointly while also accounting for their interactions, and (ii) learning from multiple noisy sources of supervision, are both crucial to building successful NEL systems.
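To make the agreement intuition concrete, the following toy numpy simulation (our own illustration, not the estimators from Platanios et al.) shows that when two independent, better-than-chance classifiers agree on unlabeled data, the agreed-upon prediction is correct more often than either classifier is on its own:

```python
import numpy as np

# Toy simulation of the agreement intuition (illustration only): two
# independent classifiers with accuracies 0.8 and 0.7 label unlabeled data.
rng = np.random.default_rng(0)
truth = rng.integers(0, 2, size=100_000)  # hidden ground truth labels

def noisy_view(truth, accuracy, rng):
    """Simulate a classifier that is correct with probability `accuracy`."""
    flip = rng.random(truth.shape) > accuracy
    return np.where(flip, 1 - truth, truth)

a = noisy_view(truth, 0.8, rng)
b = noisy_view(truth, 0.7, rng)
agree = a == b

# Conditioning on agreement raises the probability of being correct:
# analytically, 0.8 * 0.7 / (0.8 * 0.7 + 0.2 * 0.3) ~= 0.90.
print((a == truth).mean())                # ~0.80, marginal accuracy of `a`
print((a[agree] == truth[agree]).mean())  # ~0.90, accuracy when they agree
```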
3 Approach

We structure the proposed work in four main parts:
1. Learning from Multiple Noisy Labels: Mechanisms for learning from multiple noisy sources (e.g., obtained using a crowdsourcing platform), including self-supervision.
2. Contextual Parameter Generation: Methods that enhance the model capacity of neural networks, allowing them to learn functions that are conditioned on some of their inputs (i.e., the context), thus enabling more effective multi-task learning architectures.
3. Self-Reflection: Mechanisms that allow a system to self-evaluate and improve without external supervision. This is an important property for never-ending learning systems, as the extent of external supervision is often limited, but the system needs to keep learning.
4. Unified Architecture: A unified neural cognitive architecture that puts together all aforementioned components, along with several new ones, and is able to perform large-scale multi-modal and multi-task learning.

3.1 Learning from Multiple Noisy Labels

Machine learning systems often rely on large amounts of annotated examples to be trained. This is especially true for never-ending learning systems. Perhaps the most common way to collect such training examples is using noisy crowdsourcing platforms like Amazon Mechanical Turk (AMT). Practitioners typically adopt the following process: (i) collect multiple annotations per example in order to reduce the amount of noise, (ii) aggregate these annotations into a single label per example that represents an estimate of the ground truth (e.g., using majority voting, as sketched below), and (iii) train machine learning systems using the resulting labeled examples. This results in both redundant annotations and potentially noisy ground truth labels. We propose a novel approach that enables us to merge the steps of aggregating noisy annotations and training machine learning systems, by allowing a system to be trained directly from multiple noisy annotations. Our approach also learns models of the difficulty of each example and the competence of each annotator in a generalizable manner (i.e., these models can make predictions for previously unseen examples and annotators). This enables us to more optimally assign annotators to examples, thus driving the cost of crowdsourcing down while improving the quality of the resulting datasets. Our approach can also be used to perform ensemble learning and to estimate the accuracies of classifiers from unlabeled data. The latter has become especially relevant with recent advances in weak supervision and self-supervision (e.g., Ratner et al., 2017).
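For reference, here is a minimal sketch of step (ii) of this conventional pipeline; the function name and data layout are our own, and real platforms involve many more details:

```python
import numpy as np

# Step (ii) of the conventional pipeline: collapse the noisy annotations of
# each example into a single "denoised" label via majority voting. Note how
# this discards who annotated what, which the proposed approach retains.
def majority_vote(annotations, num_classes):
    """annotations: one {annotator_id: label} dict per example."""
    aggregated = []
    for example in annotations:
        counts = np.bincount(list(example.values()), minlength=num_classes)
        aggregated.append(int(counts.argmax()))  # ties go to the lower label
    return aggregated

annotations = [
    {0: 1, 1: 1, 2: 0},  # two annotators say 1, one says 0 -> 1
    {0: 0, 2: 0},        # annotations may be partial and overlapping
]
labels = majority_vote(annotations, num_classes=2)  # [1, 0]
# Step (iii) would then train a model on the (x_i, labels[i]) pairs as usual.
```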
The problems of ensemble learning, aggregating and denoising crowdsourced data, and estimating accuracy from unlabeled data all share the same underlying core problem: learning from multiple noisy labels. More specifically, there is a common setting among all these problems where: (i) there exists an underlying ground truth, (ii) we only get to observe multiple, possibly overlapping, noisy views of that truth, and (iii) we want to be able to estimate that truth. The noisy views can have arbitrary form, such as: (i) human annotators in a crowdsourcing platform, who may make mistakes (e.g., Zhou et al., 2015), or (ii) classifiers that have already been trained (e.g., Platanios et al., 2014; 2016; 2017). To give a concrete example, consider the problem of medical pathology diagnostics, where learning-based models are becoming increasingly popular (e.g., Gulshan et al., 2016). Training models by imitating expert decisions is not as straightforward in such a scenario: the true diagnosis is unknown a priori, while the diagnostic concordance between experts is often far from perfect (Elmore et al., 2015). If we assume that the expert decisions are the ground truth, the model may overfit to their mistakes. Therefore, this practical setup requires a principled learning framework that takes into account potential discrepancies or disagreements in the observations.

3.1.1 Related Work

Learning binary classifiers from examples with noisy labels was first introduced and theoretically characterized by Angluin and Laird (1988). In that work, the noise model was based on independent random flips of the labels with some probability η < 0.5. Kearns (1998) later characterized a class of robust learning algorithms for such types of label noise. Nettleton et al. (2010) studied empirically the behavior of (at the time) popular learning algorithms under different magnitudes of noise. Natarajan et al. (2013) proposed to modify surrogate loss functions to obtain unbiased estimators and obtained performance bounds for empirical risk minimization in the presence of noisy labels.
More recently, Frénay and Verleysen (2014) surveyed several notable methods and variations of this problem. This whole line of work differs from our setting in that each example only gets a single noisy label. On the contrary, we assume that each example is labeled multiple times using independent labeling processes, which we refer to as predictors, each of which is characterized by an unknown confusion matrix.

This problem has also been previously framed as estimating accuracy from unlabeled data, or as aggregating worker predictions in the context of crowdsourcing. Similar settings were previously explored by Collins and Singer (1999), Dasgupta et al. (2001), Bengio and Chapados (2003), Madani et al. (2004), Schuurmans et al. (2006), Balcan et al. (2013), and Parisi et al. (2014), among others. However, none of the previous approaches considers explicitly modeling the ground truth; they rather assume some form of independence or knowledge of the true label distribution. Collins and Huynh (2014) review many methods that were proposed for estimating the accuracy of medical tests in the absence of a gold standard. Previously, we proposed formulating the problem as an optimization problem that uses agreement rates between multiple noisy labelers over unlabeled data (Platanios et al., 2014). Dawid and Skene (1979), Moreno et al. (2015), and we (Platanios et al., 2016) have also previously formulated the problem in terms of probabilistic graphical models. Tian and Zhu (2015) proposed a max-margin majority voting scheme applied to crowdsourcing. More recently, we introduced a method that is able to use information provided in the form of logical constraints between the noisy labels (Platanios et al., 2017), and Khetan et al. (2017) proposed using a parametric function to model the ground truth. However, previous approaches were outperformed by Zhou et al. (2015), who formulated the problem as a form of regularized minimax conditional entropy and used their method in crowdsourcing.

Our approach is a generalization of the approaches proposed by Zhou et al. (2015), Platanios et al. (2016), and Khetan et al. (2017). Similar to our prior work (Platanios et al., 2016), we define a generative process for our observations. However, our approach is also able to handle categorical labels, as opposed to just binary labels. Also, similar to Zhou et al. (2015), we define the confusion matrix for each instance-predictor pair as a function of instance difficulty and predictor competence. However, in our approach we explicitly learn the difficulty and competence functions, allowing us to generalize to previously unseen instances and predictors. Interestingly, the inference algorithm for our generative probabilistic model has a similar form to that of Zhou et al. (2015) (except for the explicit learning of a ground truth function, as well as of difficulty and competence functions). In fact, the algorithm of Zhou et al. (2015) can be derived as an Expectation-Maximization (EM) inference algorithm for a generative model that is a simplified version of the one that we are proposing. Finally, similar to Khetan et al. (2017), we propose to use a parametric function to model the ground truth, but we go a step further and also propose to use parametric functions to model the instance difficulties and predictor competences. Thus, our approach allows us to predict which predictors are likely to perform better for specific instances, enabling us to allocate predictors more optimally and reduce costs.

3.1.2 Proposed Method

Let us denote the observed data by $\mathcal{D} = \{x_i, \hat{Y}_i\}_{i=1}^N$, where $\hat{Y}_i = \{\mathcal{M}_i, \{\hat{y}_{ij}\}_{j \in \mathcal{M}_i}\}$, $\mathcal{M}_i$ is the set of predictors that made predictions for instance $x_i$, and $\hat{y}_{ij}$ is the output of predictor $\hat{f}_j$ for instance $x_i$. Our goal is to learn functions representing the underlying ground truth and predictor qualities, given our observations $\mathcal{D}$.

Ground Truth. We define the ground truth as a function $h_\theta(x_i)$ that is parameterized by $\theta$ and that approximates the true distribution of the label given $x_i$. In our setting, $h_\theta(x_i) \in \mathbb{R}_{\geq 0}^C$ and $\sum_j [h_\theta(x_i)]_j = 1$, where $C$ is the number of values the label can take (i.e., assuming categorical labels). More specifically, $[h_\theta(x_i)]_k \triangleq P(y_i = k \mid x_i)$, where we use square brackets and subscripts to denote indexing of vectors, matrices, and tensors. For example, $h_\theta$ could be a deep neural network that would normally be trained in isolation using the cross-entropy loss function. In our method the network is trained using the Expectation-Maximization algorithm, as described in the next section.

Predictor Qualities. We define the predictor qualities as the confusion matrices $Q_{ij} \in \mathbb{R}_{\geq 0}^{C \times C}$, for each instance $x_i$ and predictor $\hat{f}_j$, where $\sum_l [Q_{ij}]_{kl} = 1$ for all $k \in \{1, \ldots, C\}$. $[Q_{ij}]_{kl}$ represents the probability that predictor $\hat{f}_j$ outputs label $l$ given that the true label of instance $x_i$ is $k$. We define these confusion matrices in a way that generalizes the successful approach of Zhou et al. (2015)²:

$$Q_{ij} = D_i \bullet_3 C_j, \quad (1)$$

where $\bullet_i$ represents an inner product along the $i$-th dimension of the two tensors, and:
- $D_i = d_\phi(x_i)$ represents the difficulty tensor for instance $x_i$, where $d$ is a function parameterized by $\phi$, $D_i \in \mathbb{R}^{C \times C \times L}$, and $L$ is a latent dimension (a hyperparameter of our model). $[D_i]_{kl-}$ is an $L$-dimensional embedding representing the likelihood of confusing $x_i$ as having label $l$ instead of $k$, when $k$ is its true label.
- $C_j = c_\psi(r_j)$ represents the competence tensor for predictor $\hat{f}_j$, where $c$ is a function parameterized by $\psi$, $r_j$ is some representation of $\hat{f}_j$ (e.g., it could be a one-hot encoding of the predictor, in the simplest case), and $C_j \in \mathbb{R}^{C \times C \times L}$. $[C_j]_{kl-}$ is an $L$-dimensional embedding representing the likelihood that predictor $\hat{f}_j$ confuses label $k$ for $l$, when $k$ is the true label.

²We also perform a normalization step such that all elements of $Q_{ij}$ are non-negative and such that each row sums to 1 (thus making each row a valid probability distribution).

Using $L > 1$ allows the instance difficulties and predictor competences to encode more information. An intuitive way to think about this is that we are embedding difficulties and competencies in a common latent space, which can be thought of as jointly clustering them. This is in fact very similar to how matrix factorization methods are used for collaborative filtering in recommender systems.
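A minimal numpy sketch of Equation (1), under our reading of the notation, with fixed random tensors standing in for the learned functions $d_\phi$ and $c_\psi$:

```python
import numpy as np

rng = np.random.default_rng(0)
C, L = 4, 3  # number of label values, latent dimension

# Stand-ins for D_i = d_phi(x_i) and C_j = c_psi(r_j): in the full model
# these are learned networks; here they are fixed random tensors.
D_i = rng.random((C, C, L))  # instance difficulty tensor
C_j = rng.random((C, C, L))  # predictor competence tensor

# Equation (1): inner product along the third (latent) dimension,
# [Q_ij]_{kl} = sum_m [D_i]_{klm} [C_j]_{klm}.
Q_ij = np.einsum('klm,klm->kl', D_i, C_j)

# Footnote 2: keep entries non-negative and normalize each row, so that row
# k is a distribution over the predictor's output given true label k.
Q_ij = np.maximum(Q_ij, 0.0)
Q_ij /= Q_ij.sum(axis=1, keepdims=True)
print(Q_ij.sum(axis=1))  # each row sums to 1
```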

Our goal is to learn functions $h_\theta$, $d_\phi$, and $c_\psi$, given observations $\mathcal{D}$. To do that, we propose a generative process for our observations. For $i = 1, \ldots, N$, we first sample the true label for $x_i$, $y_i \sim \mathrm{Categorical}(h_\theta(x_i))$. Then, for $j \in \mathcal{M}_i$, we sample the predictor output $\hat{y}_{ij} \sim \mathrm{Categorical}([Q_{ij}]_{y_i-})$, where $[Q_{ij}]_{y_i-}$ represents the $y_i$-th row of $Q_{ij}$. We derive an EM algorithm for performing inference that is presented in (Platanios et al., 2019). The resulting approach can be thought of as introducing a new loss function for training the model $h_\theta$ using multiple noisy labels per training instance, each coming from a distinct noisy predictor. This new loss function introduces latent variables representing the ground truth labels, as well as a couple of auxiliary models that are learned, and which represent the instance difficulties and predictor competences. Perhaps most interestingly, a key difference between this approach and previous work is that we are able to explicitly learn functions that output the likelihood that a predictor will label a specific instance correctly. This enables using this method to perform crowdsourcing actively, by assigning annotators to instances they are likely to label correctly, thus reducing redundancy and driving costs down.
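The following toy sketch simulates this generative process (our own illustration; the actual EM derivation is in Platanios et al. (2019)). A fixed softmax network stands in for $h_\theta$, and, for simplicity, each predictor uses a single instance-independent confusion matrix rather than the instance-dependent $Q_{ij}$ of Equation (1):

```python
import numpy as np

rng = np.random.default_rng(1)
N, C, d = 5, 3, 8            # instances, label values, feature dimension
W = rng.normal(size=(d, C))  # toy parameters theta of the ground truth net

def h_theta(x):
    """Toy stand-in for h_theta: a softmax over the C label values."""
    z = x @ W
    e = np.exp(z - z.max())
    return e / e.sum()

X = rng.normal(size=(N, d))
predictors = [0, 1]  # here, M_i contains every predictor for every instance
# One fixed confusion matrix per predictor; in the full model, Q_ij also
# depends on the instance x_i through Equation (1).
Q = [np.full((C, C), 0.1) + 0.7 * np.eye(C) for _ in predictors]

for i, x in enumerate(X):
    y_i = rng.choice(C, p=h_theta(x))       # y_i ~ Categorical(h_theta(x_i))
    for j in predictors:
        y_hat = rng.choice(C, p=Q[j][y_i])  # y_hat_ij ~ Categorical([Q_ij]_{y_i,-})
        print(f"instance {i}, predictor {j}: observed {y_hat} (true {y_i})")
```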
3.2 Contextual Parameter Generation

In order to present the second major component of the proposed work, we need to first provide some background. We refer to parameterized functions as networks. We denote a network by a lowercase English letter with a lowercase Greek letter subscript (e.g., $f_\theta$), where the Greek letter refers to the network parameters. Therefore, given some input $x$, the output of the network is simply defined as:

$$y = f_\theta(x). \quad (2)$$

Most deep learning models can be seen as networks. For example, we can have a convolutional neural network (CNN) that takes images as input, transforms them using convolutional filters (i.e., parameters), and produces distributions over labels (e.g., cat or dog). Research in deep learning has resulted in multiple network architectures that can successfully learn to solve various problems, and each makes different assumptions about its input space. For example, CNNs assume that there is some periodical structure in the input space, whereas recurrent neural networks (RNNs) assume that each part of an input sequence can be processed using the same network parameters.

Never-ending learning requires a system to be able to perform multiple tasks; perhaps even previously unseen tasks that can be formulated in terms of other previously learned tasks. This means that traditional multi-task neural network architectures that use a different output layer for each task (e.g., Caruana, 1997) cannot be used in this context. That is because the set of tasks the system is learning to perform is not known a priori, when the neural network architecture is chosen. This motivates us to treat tasks as separate inputs.

We argue that, for most existing neural network architectures, it is hard or even impossible to encode assumptions about the contexts (e.g., tasks) in which they are used, to share information across these contexts, and to "personalize" them for each context. As we discuss at the end of this section, this limitation could be attributed to the fact that most existing architectures are only able to represent additive interactions between their inputs. Previously, there has been some success in encoding this kind of assumption using probabilistic graphical models (PGMs). When working with PGMs, researchers typically first define a prior probabilistic model over how the data observations are generated and then perform inference to obtain a posterior distribution over the model parameters and possibly also latent variables. These generative models are often hierarchical, meaning that the parameters of the distribution from which the observations are sampled are often themselves sampled from a higher-level distribution. This results in an interesting type of information sharing across all the different distributions, and has been behind many successful models, such as latent Dirichlet allocation (Blei et al., 2003) and hierarchical Dirichlet processes (Teh et al., 2005). There have been efforts to combine such approaches with neural networks (e.g., Tran et al., 2018), but they are often expensive and impractical for large-scale problems. Furthermore, in order to make probabilistic inference tractable, they often limit model expressivity.

This motivated us to develop a method called contextual parameter generation (CPG) (Platanios et al., 2018). The core idea behind this method is that, given a network $f_\theta$, instead of learning $\theta$ directly while training, we define it as:

$$\theta = g_\phi(c), \quad (3)$$

where $c$ is a description of the context in which we are applying the model (for example, if we are encoding text written in English as part of a multilingual machine translation model, the context could simply be a one-hot encoding of the English language). The parameters we learn during training are just those of $g_\phi$, which we refer to as the parameter generation network. This allows us to share information across instances of $f_\theta$ used in different contexts. While we previously had to learn and use different parameters for each context in which $f_\theta$ is used, they are now all generated as a function of the context. For example, instead of using different encoders for text written in English and text written in German, we can now use one encoder and simply generate its parameters as a function of a language representation. Note that we can simply define $g_\phi$ as a lookup table over different contexts, and this would reduce to the previous setting in which there is no information sharing. However, the CPG formulation allows us to impose arbitrary information sharing structures by manipulating the functional form of the parameter generation network $g_\phi$. For example, we could learn embeddings for all language families and have all Romance language embeddings be defined as linear transforms of the corresponding Romance family embedding. When performing multi-task learning, we can think of each task as a context in which a network processes its inputs. Given a representation of this context, we can generate the parameters of a single universal network that is used for all tasks. The way contexts are defined and processed to generate parameters can thus allow for controlled information sharing across multiple tasks. We refer to networks that employ CPG as contextualized networks, and we let them optionally have some of their parameters be generated by a CPG component and some be directly learned (e.g., we may not want to generate the parameters of a batch normalization layer using CPG). Note that contextualized networks also have better generalization properties than plain networks, because they can be used with previously unseen contexts, as long as the new contexts can be composed out of previously seen contexts (refer to Section 3 for more details).
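As a concrete illustration, here is a minimal numpy sketch of Equation (3) for a single linear layer, with learned context embeddings (e.g., one per language, as in the machine translation application discussed below); all names and shapes are ours:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, d_ctx = 16, 8, 4  # layer sizes and context embedding size

# Parameter generation network g_phi: here, a single linear map from a
# context embedding to the flattened parameters of f. Only phi (and the
# context embeddings) would be trained; theta is generated on the fly.
phi = 0.1 * rng.normal(size=(d_ctx, d_in * d_out + d_out))

def g_phi(c):
    """Equation (3): theta = g_phi(c)."""
    theta = c @ phi
    return theta[:d_in * d_out].reshape(d_in, d_out), theta[d_in * d_out:]

def f(x, theta):
    """The contextualized network f_theta: a single linear layer."""
    W, b = theta
    return x @ W + b

# Learned context embeddings, e.g., one per language. Sharing phi couples
# all contexts; replacing g_phi with a per-context lookup table would
# recover the baseline with no information sharing.
english, german = rng.normal(size=d_ctx), rng.normal(size=d_ctx)

x = rng.normal(size=d_in)
y_en = f(x, g_phi(english))  # same network, context-specific parameters
y_de = f(x, g_phi(german))
```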
Previous Uses of CPG. We first introduced CPG in a multi-task setting, as a means to tackle the multilingual machine translation (MT) problem (Platanios et al., 2018). Multilingual MT is challenging due to the low-resource nature of many languages (i.e., it is hard and often impossible to collect the massive amounts of parallel sentences required for training a neural MT system). It is therefore crucial to share information across languages, rather than train multiple pairwise translation models in isolation. In neural MT, when translating a sentence from English to German, we first encode the source sentence to some intermediate representation, and then we decode it into some target language. Applying CPG to this problem, we used the source language (i.e., English) as the context of the encoder and the target language (i.e., German) as the context of the decoder. We learned language embeddings and used linear transforms to obtain the encoder/decoder parameters from these embeddings. We were able to show significant performance gains over the same networks without CPG; especially so in the low-resource setting. Furthermore, we showed that CPG allows us to perform zero-shot translation (meaning to translate between pairs of languages that were not observed in the training data), thus indicating that it can be used to generate network parameters for new, previously unseen tasks. We also applied CPG to the problem of question answering over graphs (Platanios* et al., 2019), also known as link prediction. In this case, we are given a source entity (e.g., Pittsburgh) and a relation (e.g., CityInCountry), and are asked to predict the target entity (e.g., USA). Here, different questions correspond to different problems and can be used as the context in which to generate the parameters of a universal question-answering model. We were able to show that CPG outperforms all existing methods and thus establishes a new state-of-the-art for this problem. More recently, we have also started applying CPG to develop better image and video compression methods, and also as a means to handle task composition.

Feeding Context as an Additional Input. Why not just feed the context as another network input?
1. Structured Information Sharing: Similar to the motivation for PGMs described earlier in this section, CPG provides a structured way to share information across contexts.
2. Additive Interactions: Without loss of generality, let us split the input $x$ to a neural network in two parts, $x_0$ and $x_1$. For example, assuming $x$ is a vector, then $x_0$ and $x_1$ are vectors such that, when concatenated, they form $x$. Most neural network architectures currently in use only allow for interactions of the following form:
$$y = f_\theta(h^0_{\phi_0}(x_0) + h^1_{\phi_1}(x_1)), \quad (4)$$
where $f$, $h^0$, and $h^1$ are arbitrary functions, and $y$ is the output of the neural network. This form is very restrictive. For example, it cannot be used to represent simple if-then-else rules such as "if $x_0 = 2$, then $2x_1$, else $5x_1$". This is especially important for multi-task learning, because we can think of $x_0$ as the description of some task and we might want to condition on that task while processing the rest of the model inputs. This would be the case for a mixture-of-experts model, for example. In order to represent this kind of interaction, we have to explicitly encode it in the neural network architecture. Ideally, we want the model to be able to learn these interactions on its own, if they are necessary, instead of having them be hardcoded as part of the model architecture. CPG in fact allows for multiplicative or even polynomial interactions between $h^0_{\phi_0}(x_0)$ and $h^1_{\phi_1}(x_1)$, which allows us to represent if-then-else rules (see the sketch after this list).
3. Deployment: CPG allows us to generate a context-specific model and later use it without involving the parameter generation network. This can be beneficial for deployment. For example, if we only care about English-to-German translation due to an upcoming vacation trip, we could have Google Translate generate a translation model for this language pair and then store that model on a mobile device for offline use.
4. Modeling Assumptions: Different neural network architectures make different assumptions. For example, CNNs assume spatial invariance for the inputs (meaning that they contain repeating patterns in different locations). This assumption is unlikely to hold for an arbitrary context vector, and so it is unreasonable to just concatenate an arbitrary context vector with an image and feed the result to a CNN. CPG avoids this problem because the architecture modification it entails does not affect the assumptions made by a network about the input data.
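A toy illustration of the contrast in item 2 (our own example): with a purely additive combination, the effect of $x_1$ on the output cannot depend on $x_0$, whereas generating the parameter applied to $x_1$ from $x_0$ makes the if-then-else rule from item 2 directly representable:

```python
# Additive form (Equation 4) with linear f: the contribution of x1 to the
# output is w1 * x1 no matter what x0 is, so the slope cannot switch.
def additive(x0, x1, w0=1.0, w1=1.0):
    return w0 * x0 + w1 * x1

# CPG-style form: x0 is used to *generate* the parameter applied to x1,
# a multiplicative interaction that represents the rule directly.
def cpg_style(x0, x1):
    w = 2.0 if x0 == 2 else 5.0  # plays the role of g_phi(x0)
    return w * x1                # plays the role of f_theta(x1), theta = w

print(cpg_style(2.0, 10.0))  # 20.0: the slope w.r.t. x1 depends on x0
print(cpg_style(3.0, 10.0))  # 50.0
```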
3.2.1 Related Work

Ha et al. (2018) are probably the first to introduce a similar idea, that of having one network (called a hypernetwork) generate the parameters of another. However, in that work, the inputs to the hypernetwork are structural features of the original network (e.g., layer size and index). Al-Shedivat et al. (2017) also propose a related method where a neural network generates the parameters of a linear model. Their focus is mostly on interpretability (i.e., knowing which features the network considers important). Dumoulin et al. (2018) provide a comprehensive review of some more related work from other fields, such as computer vision, that was published concurrently to our machine translation work. Furthermore, CPG is generic enough that many existing methods can be formulated as CPG variants.
One such example is model-agnostic meta-learning by Finn et al. (2017), where a model is pre-trained over a large number of tasks and then fine-tuned on new tasks drawn from the same task distribution. In this case, the parameter generation network consists of taking a gradient descent step, using only the new task's data.

[Figure 1: The original Hub-and-Spoke model (Rogers et al., 2004). Modality-specific processing units (spokes) for speech, sound, vision, valence, and praxis connect to a modality-invariant function, the hub in the ATL.]

3.2.2 Proposed Work

We propose to divide our proposed work on CPG in two parts: (i) understand why CPG works and what the fundamental limitation of neural networks is that CPG tackles, and (ii) develop novel methodology that will allow models to learn what to condition on, rather than learn to condition on a specific pre-specified context.

3.3 Self-Reflection

In order to achieve never-ending learning, a system needs to be able to learn in a largely unsupervised fashion. This requires self-reflective behavior. We propose to introduce a novel mechanism that allows a system to self-evaluate when there is no external supervision and to self-improve by maximizing its own self-evaluation metric. Some of our prior work discussed in Section 3.1 can be used to self-evaluate, but it is not directly clear how to add a self-improvement mechanism. To this end, we plan to introduce a differentiable intrinsic reward function that can be used for both self-evaluation and self-improvement. This is a parametric function that is updated whenever a supervision signal is provided and that is otherwise used directly to perform model updates whenever there is no supervision signal. This is mostly relevant to reward shaping in reinforcement learning and represents a more long-term goal for this thesis. In Section 3.4.7 we propose how to integrate such a mechanism in a unified neural cognitive architecture.

3.4 Unified Architecture

The final part of our work consists of putting all the previously presented pieces together in a single neural cognitive architecture. In order to present that, we first provide some background on cognitive architectures.

3.4.1 Cognitive Architectures

Cognitive architectures can be broadly divided into symbolic, subsymbolic, and hybrid architectures. Symbolic systems rely on sets of rules and reason over discrete spaces (e.g., using first-order logic). Subsymbolic systems specify no such rules a priori and rely instead on emergent properties of several distinct processing units (e.g., neural networks). Hybrid approaches are a combination of the symbolic and subsymbolic approaches. Most past work on UTCs has focused on symbolic systems. In the following paragraphs, we describe two such successful systems.

Soar. Laird (2012) designed Soar, a general cognitive architecture for developing systems that exhibit intelligent behavior, which has been in use since 1983. The design of Soar can be seen as an investigation of an approximation to complete rationality, which would imply the ability to use all available knowledge for every task that the system encounters. The primary principle at the base of Soar's design is that "all decisions are made through the combination of relevant knowledge at runtime. In Soar, every decision is based on the current interpretation of sensory data, the contents of working memory created by prior problem solving, and any relevant knowledge retrieved from long-term memory." Soar relies on multiple learning mechanisms (chunking, and reinforcement, episodic, and semantic learning), and on many representations of long-term knowledge (procedural knowledge productions, semantic memory, and episodic memory).

ACT-R. Anderson et al. (2004) propose an alternative cognitive architecture, ACT-R, aimed at simulating and understanding human cognition. ACT-R consists of constructs that reflect assumptions about human cognition and that are based on facts derived from psychology experiments. An important feature of ACT-R that distinguishes it from other UTCs is that it directly allows researchers to compare the system's performance to that of human participants.

In this thesis, we propose a novel cognitive architecture that also reflects assumptions about human cognition—inspired by the high-level design of the aforementioned systems—but that is subsymbolic and enables the use of neural networks and end-to-end training. It contains components that correspond to perception, action, reasoning, memory, world simulation, and learning.

3.4.2 The Hub-and-Spoke Theory
Rogers et al. (2004) proposed the Hub-and-Spoke theory of human cognition, which assimilates two important ideas: (i) multi-modal experiences provide the main "ingredients" for constructing concepts and they are encoded in modality-specific cortices, or spokes, that are distributed across the brain, and (ii) cross-modal interactions between the modality-specific spokes are mediated by a single trans-modal hub that is located bilaterally in the anterior temporal lobes (ATLs) of the human brain. A visualization is shown in Figure 1. This model of the human brain serves as one of the main inspirations for the high-level design of the proposed architecture.

3.4.3 Proposed Architecture

We propose a novel neural cognitive architecture (NCA) for general learning and intelligence. The proposed architecture is inspired by the Hub-and-Spoke model of human cognition (Rogers et al., 2004; Ralph et al., 2017), as well as human goal priming (Custers and Aarts, 2005; Aarts et al., 2008; Papies, 2016; Takarada and Nozaki, 2018). It consists of the following parts (an overview is shown in Figure 2):
- Perception and Action Spokes: Sensing input data consists of converting it to a common reasoning space that is independent of the data modality. Much of the complexity of models like BERT³ (Devlin et al., 2018) lies in perception, rather than reasoning. In fact, for BERT, reasoning often consists of a single linear layer, while perception consists of a Transformer (Vaswani et al., 2017). Similarly, taking an action consists of converting a common reasoning representation to some output data. This can include taking actions in some environment, or generating data of some structure (e.g., a probability distribution over labels).
- Reasoning Hub: Reasoning is performed in a latent space that is independent of the data modalities and the problem being solved. We argue that this is necessary for general learning and intelligence, as it allows for flexible sharing of information across different modalities and problems. Moreover, memory and simulations of the external world are all defined over the same latent space, abstracting away details about the perceived data that are not relevant to reasoning. Reasoning is described in detail in Section 3.4.5.
- Goal Contextualization: The problems that the system is learning to solve are processed such that they can contextualize any part of the neural cognitive architecture. This allows the behavior of the system to vary across different problems, while still sharing information between them, similar to how it was done for machine translation, as described in Section 3.2. It further allows the system to generate its own target problems that it learns to solve. This is perhaps the most novel aspect of the proposed architecture and, as shown in the following paragraphs, derives its inspiration from human goal priming in psychology. It is described in more detail in Section 3.4.6.

³BERT is the current state-of-the-art model for a multitude of natural language processing tasks.

This design is inspired by work in multiple areas:

Deep Learning. Deep neural networks are very effective at learning abstract representations for arbitrary data modalities, which can then be used to perform multiple diverse tasks (e.g., Simonyan and Zisserman, 2015; He et al., 2016; Peters et al., 2018; Devlin et al., 2018). The typical deep learning workflow is that, for each problem, researchers build large deep neural models that pool together information from different sources and that are trained independently of each other. An alternative approach is to pre-train large models in a problem-independent manner and then fine-tune them for each problem (e.g., Peters et al., 2018; Devlin et al., 2018; Finn et al., 2017). However, most of these approaches do not allow information learned from solving one problem to directly help solve another—something at which human intelligence excels. For example, BERT is pre-trained as a language model and is then fine-tuned separately for problems such as question answering and textual entailment. Therefore, learning to answer questions well does not affect how well BERT reasons about textual entailment. This motivates us to find ways to couple the learning of multiple problems in a way that results in constructive interference between the different problems, meaning that learning to solve one well helps the system learn to solve others faster. It further motivates us to treat perception (i.e., learning informative representations of the input data) and reasoning (i.e., learning to solve each task in the latent space of learned representations) separately, as most deep neural networks that are trained end-to-end to solve multiple tasks effectively do that, and most of their complexity is often related to perception (e.g., in BERT, problem-specific reasoning is performed by a linear layer).

Kernel Methods. Before deep learning was popular, some of the most successful machine learning methods made use of kernels (Hofmann et al., 2008), by formulating learning problems in a reproducing kernel Hilbert space (RKHS) of functions defined on the data domain, expanded in terms of a kernel. These kernels effectively project the data to a space where reasoning is modeled as a linear problem. Such a projection can be thought of as a perception module, in terms of our formulation. Given the success of kernel methods, this further motivates separating the treatment of perception and reasoning.

Neuroscience. Neuroscientists have also observed that information processing in the human brain goes from low-level (i.e., sensory input processing) to high-level (i.e., reasoning). There is ample evidence to support this for both auditory (Kaas and Hackett, 1998; Rauschecker, 1998; Romanski et al., 1999; Wessinger et al., 2001; Warren and Griffiths, 2003; Zatorre and Belin, 2001; Zatorre et al., 2004) and visual information (Mishkin et al., 1983; Felleman and Van Essen, 1991). Furthermore, there has been evidence that the development of primary visual cortical networks is more rapid than the development of primary motor networks in humans (Gervan et al., 2011). This motivates the idea that perception is a low-level functionality that is not necessarily problem-specific and that can be learned before learning to reason and take actions. In addition to this, there is evidence that the brain relies on a set of canonical neural computations that are reused for different problems (Carandini and Heeger, 2012). For example, normalization of neural responses is one such operation that is thought to underlie multiple other operations, such as the representation of odours, the modulatory effects of visual attention, the encoding of value, and the integration of multi-sensory information.
This also supports the idea of abstracting over reasoning, by making the operations used to perform various tasks common across all tasks and finding other ways to specialize them.

[Figure 2: Overview of the proposed Neural Cognitive Architecture (NCA), its main building blocks, and a simple example showing an instance of this architecture for a classification problem. The building blocks are networks, contextualized networks (networks modified a priori to use contextual parameter generation), and context compilers (which compile a problem specification in a predefined language into context vectors usable by CPG). The architecture comprises contextualization (a problem compiler that transforms a goal/problem specification into contexts while handling problem dependencies such as composition), perception (each perception modality's sensor network maps input data into the latent reasoning space), reasoning (a reasoning unit that performs a single step of reasoning and can be applied recursively for arbitrarily many steps, together with a memory and a world simulator that both operate in the modality/problem-independent reasoning space), action (each action modality's effector network maps the reasoner result to output data), and a learning mechanism that updates the parameters of the architecture components based on the supervision signal. The example instance uses a Pair[Text,Image] perception modality, a BinaryLabelProbability action modality, and the problem specification Classify[City∧¬Person]. In it, the noun "Washington" alone, without the provided image, is ambiguous and would probably not result in a high probability of referring to a city and not a person at the same time.]

Psychology. There has been significant evidence that priming is characteristic of human behavior (Tulving et al., 1982; Bargh and Chartrand, 2014; Weingarten et al., 2016). Priming is a technique where exposure to one stimulus influences the brain's response to a subsequent stimulus. For example, the word "dog" is recognized more quickly after having seen the word "animal". Priming can be perceptual, semantic, and conceptual. Perhaps most important for this thesis is goal priming (Custers and Aarts, 2005; Aarts et al., 2008; Papies, 2016; Takarada and Nozaki, 2018). Goal primes are cues that trigger goal-directed cognition and behaviour.
Here, a goal refers to a state or behaviour that has reward value and therefore motivates a person to pursue it. For example, priming the concept of drinking can increase soda consumption (Veltkamp et al., 2008), and priming the goal of impression formation leads to better memory organization and recall than a mere memorization goal (Chartrand and Bargh, 1996). Goal contextualization in our architecture is the computational equivalent of goal priming, in that having specific goals changes the way in which the different architecture parts function.

Benefits of Modularity. An important outcome of the Hub-and-Spoke architecture design is reducing the per-problem sample complexity, meaning the amount of training data required to learn to solve each problem. This is because, for many existing machine learning models, most of the model complexity lies in perception (e.g., BERT). This becomes even more prevalent in reinforcement learning systems playing video games, which receive as input the raw pixel values of video frames as they are being rendered while playing, and are tasked with learning to extract information from these raw values (e.g., Bellemare et al., 2013; Bhonker et al., 2016; Vinyals et al., 2019). Such systems require massive amounts of training data to learn, and we argue that this is mostly due to their perception components. If these components were shared across multiple problems, then their effective per-task sample complexity would be reduced significantly. In fact, Parisotto et al. (2015) show that pre-training agents on some arcade games oftentimes helps them learn faster when deployed to play other, new arcade games. Thus, assuming we can share the perception component across different problems, we only need problem-specific training data for the reasoning component. Moreover, due to the shared reasoning hub, the per-problem sample complexity can be further reduced, because the same reasoning component is used for solving all problems. An interesting setting is one where the perception component can be trained using supervised tasks with differentiable loss functions and, at the same time, be shared with reinforcement learning (RL) tasks where the reward function is unknown and certainly not differentiable. We believe that this would significantly reduce the sample complexity of the RL tasks. In Section 4, we propose a case study for testing this hypothesis.

The proposed architecture components reflect assumptions about human cognition that are based on facts derived from psychology experiments, thus rendering the proposed architecture a cognitive architecture. In the following sections we describe the different architecture components in more detail. Finally, in Sections 3.4.7 and 3.4.8, we describe how learning is performed. Note that not all architectural components that we describe in the following sections are necessary for all problems. Therefore, for some problems, some of the components may be ignored (e.g., a world simulator may not be relevant for a text classification task).

3.4.4 PERCEPTION AND ACTION SPOKES

We define perception and action spokes using two kinds of data modalities: (i) perception modalities that represent data types that a model can receive as input, and (ii) action modalities that represent data types that a model can produce as output. Each kind of modality has a different specification:

Perception Modalities: Input space modalities are defined as tuples (DataType, SensorNetwork), where DataType is the type of data supported by this modality (e.g., String), SensorNetwork is a contextualized network that takes inputs of type DataType and produces vectors of size Ls, and Ls is the reasoning input representation size. Given some data of type DataType (e.g., a string of characters with type String) and, optionally, a context (described in the next section), the SensorNetwork produces a vector of size Ls that the reasoning module can understand.

Action Modalities: Output space modalities are defined as tuples (DataType, EffectorNetwork), where DataType is the type of data supported by this modality (e.g., a scalar number in the interval [0, 1]), EffectorNetwork is a contextualized network that takes as input vectors of size Le and produces outputs of type DataType (e.g., a linear transformation followed by a sigmoid activation function), and Le is the reasoning output representation size.

Note that a modality can act as both a perception and an action modality, as long as both a sensor and an effector network are provided. In this case, we also allow the sensor and the effector networks to optionally share some or all of their parameters. Examples of various modalities are shown in Table 1. Modalities are defined such that, for any given input (or output) data type, there is a single matching perception (or action) modality that will be used.

Due to their generic definition, modalities can be composed. For example, given perception modalities P1 and P2, we can construct a pair modality Pair[P1, P2], whose data type is a pair of P1.DataType and P2.DataType, and whose sensor network is a function of the two modalities' sensor networks. For example:

x ↦ Pool(P1.SensorNetwork(x[0]), P2.SensorNetwork(x[1])).

Compositionality gives the proposed NCA high expressive power with respect to the kinds of data it can handle. Compositionality, more generally (e.g., also at the problem space), is a core aspect of the proposed architecture and is discussed in more detail in Section 3.4.6.
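To make the modality abstraction concrete, here is a minimal Python sketch of perception modalities and their composition. It is illustrative only: the class names, the choice of Pool as an element-wise mean, and the PyTorch-based sensor networks are our own assumptions, not a prescribed implementation.

```python
import torch
import torch.nn as nn

L_S = 128  # reasoning input representation size (illustrative)

class PerceptionModality(nn.Module):
    """A perception modality: a data type plus a sensor network mapping
    raw inputs to vectors of size L_S in the reasoning input space."""

    def __init__(self, sensor_network: nn.Module):
        super().__init__()
        self.sensor_network = sensor_network

    def forward(self, x, context=None):
        # A contextualized sensor could condition on `context` (e.g., via
        # contextual parameter generation); this sketch ignores it.
        return self.sensor_network(x)

class PairModality(nn.Module):
    """Pair[P1, P2]: composes two perception modalities by pooling their
    sensor outputs into a single L_S-sized vector."""

    def __init__(self, p1: PerceptionModality, p2: PerceptionModality):
        super().__init__()
        self.p1, self.p2 = p1, p2

    def forward(self, x, context=None):
        # x is a pair; Pool is chosen here as an element-wise mean.
        return 0.5 * (self.p1(x[0], context) + self.p2(x[1], context))

# Example: a toy modality over flattened 32x32 inputs.
image_modality = PerceptionModality(nn.Sequential(nn.Linear(32 * 32, L_S), nn.ReLU()))
```

Because every modality exposes the same interface (raw data in, an Ls-sized vector out), composite modalities such as Pair[Text,Image] from Figure 2 can be built without touching the reasoning hub.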


Modality Examples

Data Type    | Sensor Network | Effector Network        | Description
-------------|----------------|-------------------------|--------------------------
String       | BERT Encoder   | RNN Decoder             | Text
Image        | CNN            | Deep Convolutional GAN  | Image
Scalar[0,1]  | –              | MLP → Sigmoid           | Binary Distribution
Vector[0,1]  | –              | MLP → Softmax           | Categorical Distribution

Table 1: Example modalities. RNN stands for Recurrent Neural Network, CNN for Convolutional Neural Network, GAN for Generative Adversarial Network, and MLP for Multi-Layer Perceptron. Scalar[0,1] denotes a single number in the interval [0, 1], and Vector[0,1] a vector containing numbers in the interval [0, 1].

Communication and Language. An interesting direction that we wish to explore in the long term is to add support for a modality that corresponds to communication with other agents (i.e., an artificial learned language). This modality would act as both a perception and an action modality, and we could define its data type as a fixed-size vector containing numeric values, for example. We can test for the ability of agents to learn a language and communicate effectively by conducting experiments in a multi-agent setting where solving certain problems requires coordination and collaboration. This is related to the work of Sukhbaatar et al. (2016) and Andreas et al. (2017).

3.4.5 REASONING HUB

The reasoning component of the proposed architecture consists of a few parts. At the core lies the reasoning unit. This unit transforms the perception component output to an input for the action component, and is represented as a contextualized network. It is generally accepted that not all problems require the same amount of reasoning (Kahneman and Egan, 2011). For example, solving an algebra problem requires more thinking than recalling your own name. Therefore, we argue that the ability to reason for arbitrary amounts of time, depending on the problem being solved, is an important aspect of general learning and intelligence. Most existing machine learning approaches do not allow for a variable amount of reasoning, as the amount of computation is predefined and fixed as part of the network architecture. The few attempts that do allow for this have been limited to very specific problems and have only shown small gains over preexisting fixed computation time approaches (Graves, 2016; Dehghani et al., 2019). In order to enable this capability in the proposed neural cognitive architecture, we decided to make the reasoning unit recursive, meaning that its output can optionally be fed back as input again, to recurse over the reasoning transformation. Each application of the reasoning transformation can be thought of as a reasoning step. The reasoning unit also outputs a decision on whether or not to stop, so that it can stop reasoning and produce an output at some point. The recursive nature of this unit introduces several challenges with respect to how it should be trained. Our initial plan is to incur a pondering cost, which is proportional to the number of reasoning steps used, and add that cost to the loss function used to train the reasoning unit.

Recursion. More formally, at each time step t, the reasoning unit performs the following transformation (note that this is not equivalent to simply using a recurrent neural network (RNN), because the number of recursion steps is not predetermined):

[a_{t+1}, s_{t+1}, STOP_{t+1}] = R(p_t, a_t, s_t),   (5)

where R represents the reasoning unit transformation, a_t represents the reasoning unit output at time t, s_t represents the internal state of the unit at time t, p_t represents the reasoning unit input at time t which comes from the perception component (note that if the system operates in a real-time environment, this may be different across different reasoning steps), and STOP_t is a boolean flag representing the decision of the reasoning unit about whether or not to stop reasoning at time t. Finally, a_T is fed to the action component, where T is such that STOP_T = True. Enhancing the reasoning unit with a state significantly increases its modeling capacity; it can now even perform a search with backtracking support (e.g., dynamic programming). This initial approach is inspired by the work of Graves (2016).
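Below is a minimal PyTorch sketch of this recursion with a halting decision and a pondering cost, loosely in the spirit of adaptive computation time (Graves, 2016). The specific gating (a sigmoid halting probability with a hard 0.5 threshold) and the cost weighting are our own illustrative choices, not part of the proposal.

```python
import torch
import torch.nn as nn

class ReasoningUnit(nn.Module):
    """One reasoning step (Equation 5): maps (p_t, a_t, s_t) to
    (a_{t+1}, s_{t+1}, STOP_{t+1})."""

    def __init__(self, p_dim, a_dim, s_dim):
        super().__init__()
        self.core = nn.Sequential(nn.Linear(p_dim + a_dim + s_dim, s_dim), nn.Tanh())
        self.to_action = nn.Linear(s_dim, a_dim)
        self.to_stop = nn.Linear(s_dim, 1)

    def forward(self, p, a, s):
        s_next = self.core(torch.cat([p, a, s], dim=-1))
        a_next = self.to_action(s_next)
        stop_prob = torch.sigmoid(self.to_stop(s_next))
        return a_next, s_next, stop_prob

def reason(unit, p, a, s, max_steps=16, ponder_weight=0.01):
    """Recurse until the unit decides to stop; return the final action output
    and a pondering cost proportional to the number of steps taken."""
    for step in range(1, max_steps + 1):
        a, s, stop_prob = unit(p, a, s)
        if stop_prob.item() > 0.5:  # STOP_t = True
            break
    return a, ponder_weight * step  # pondering cost to be added to the loss
```

Note that the hard threshold above is not differentiable; a trainable version would use a soft halting scheme such as the one in adaptive computation time.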
Memory. In designing general learning architectures, we need to allow for an explicit way for learning systems to remember experiences. This can happen implicitly, through the learned model parameters (assuming high capacity networks), but it can also be modeled explicitly by equipping the agent with a memory component. Cognitive architectures often use some form of memory that is symbolic, such as a knowledge-base (KB) that contains learned facts. We propose to add a memory component to our architecture, where all memories are represented in the latent reasoning space, rather than being grounded in the perception or action modality data types. This allows the memory to abstract away details about the data that are not relevant to the reasoning process. The way memory is added to our architecture is through the reasoning unit, which is enhanced such that it can read and write to memory while performing its transformation. More formally, we define the memory component as consisting of two functions, M_READ : K ↦ V and M_WRITE : (K, V) ↦ (), where K and V correspond to the memory key and value types, respectively, and the "↦" notation is used to denote the function input and output types (we use "()" to represent the "void" type, meaning that the function returns no values and is only used for its side effects). Possible design choices for the memory include memory networks (Sukhbaatar et al., 2015), or even KBs defined over the latent reasoning space. We propose to start with a simple, yet novel, attention-based memory mechanism (novel because we are not aware of prior work that learns a memory indexing mechanism). In this case, the memory is defined as a pair of matrices, M_k ∈ R^{M×D_k} and M_v ∈ R^{M×D_v}, where M is the memory size, D_k is the dimensionality of the keys, and D_v is the dimensionality of the values stored in the memory. M_k contains the memory keys and M_v contains the corresponding memory values. Let us refer to the K-valued input of M_READ and M_WRITE as the query.


Queries are defined as vectors of size D_k. When a component wants to access a value stored in memory, it needs to provide a query "describing" that value (note that the querying mechanisms are also learned, similar to the indexing mechanism). We also define an indexing function, I : K ↦ Δ_M, where Δ_M denotes the M-simplex, which contains all vectors of size M whose elements are in [0, 1] and sum to 1. Intuitively, the indexing function maps from a query to a distribution over memory locations. The indexing function that we plan to use initially is the scaled dot-product attention by Vaswani et al. (2017):

I(q) = Softmax(q M_k^T / √D_k),   (6)

which effectively measures the similarity between the query and all the memory keys. Then, the memory read function is defined as (in pseudocode):

M_READ(q) : return I(q) M_v,   (7)

which returns a convex combination of all stored values, based on the computed index. The memory write function is similarly defined as:

M_WRITE(q, v) : M_v := λ I(q) v + (1 − λ I(q)) M_v,   (8)

where := is used to denote assignment, and λ is an M-sized vector with values in [0, 1] that denotes the strength of the write operation. If λ is closer to 1, then old values are forgotten faster. λ can be set adaptively, based on how often each value is being read. For example, it can be set closer to 1 for values that are rarely read. The learnable parameters of this memory mechanism consist of the parameters of I and the memory keys, M_k. We can initialize M_v with zeros.

Allowing the memory indexing mechanism to be learnable, by using separate keys and values (as opposed to indexing by comparing queries to values, as done in memory networks), enables associative learning and memories, which have been shown to be important aspects of human cognition (Fanselow and Poulos, 2005; Ranganath and Ritchey, 2012). In psychology, associative memory is defined as the ability to learn and remember the relationship between unrelated items (e.g., remembering the name of someone or the aroma of a particular perfume). This is enabled by our indexing mechanism because it allows two unrelated values to have similar keys. This is mainly because we learn keys separately from the values they correspond to. Note that our proposed memory mechanism also allows for a natural way of forgetting, where the keys of unused values change while learning, to the point where they may be used for storing other unrelated values instead.
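The following is a minimal NumPy sketch of this attention-based memory, under our own simplifying assumptions (a fixed per-slot write strength λ rather than an adaptive one, and randomly initialized keys). It is meant to illustrate Equations 6–8, not to fix an implementation.

```python
import numpy as np

class AttentionMemory:
    """Attention-based memory with learnable keys M_k and stored values M_v
    (Equations 6-8). Slots are indexed with scaled dot-product attention."""

    def __init__(self, size, key_dim, value_dim, write_strength=0.5):
        self.M_k = np.random.randn(size, key_dim) / np.sqrt(key_dim)  # keys (learnable)
        self.M_v = np.zeros((size, value_dim))                        # values (start at zero)
        self.lam = np.full(size, write_strength)                      # per-slot write strength

    def index(self, q):
        # Equation 6: I(q) = Softmax(q M_k^T / sqrt(D_k)), a distribution over slots.
        logits = self.M_k @ q / np.sqrt(self.M_k.shape[1])
        logits -= logits.max()  # for numerical stability
        p = np.exp(logits)
        return p / p.sum()

    def read(self, q):
        # Equation 7: a convex combination of all stored values.
        return self.index(q) @ self.M_v

    def write(self, q, v):
        # Equation 8: M_v := lam*I(q) v + (1 - lam*I(q)) M_v, applied per slot.
        w = (self.lam * self.index(q))[:, None]
        self.M_v = w * v[None, :] + (1.0 - w) * self.M_v

memory = AttentionMemory(size=64, key_dim=16, value_dim=32)
key = np.random.randn(16)
memory.write(key, np.ones(32))
recalled = memory.read(key)  # a soft recall of the written value
```

Because keys are learned separately from the values they index, two unrelated values can end up with similar keys, which is exactly what enables the associative behavior described above.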
We also allow the sensor and effector networks to optionally read from this memory. This can be important in cases where perception depends on past experiences. Tulving et al. (1982) provide some evidence supporting that this is true of human perception (this is known as priming in the psychology literature).

World Simulator. An important aspect of human reasoning is simulating the external world. Jay Wright Forrester, the father of system dynamics, described a mental model as: "The image of the world around us, which we carry in our head, is just a model. Nobody in his head imagines all the world, government or country. He has only selected concepts, and relationships between them, and uses those to represent the real system." (Forrester, 1971). There is significant evidence of the importance of simulation in neuroscience (Singer et al., 2018). For example, Nijhawan (1994) shows that to strike a cricket ball one must estimate its future location, rather than where it is now. Bialek et al. (2001) show that prediction has the fundamental theoretical advantage that a system which parsimoniously predicts future inputs from their past, and that generalizes well to new inputs, is likely to contain representations that reflect their underlying causes. Furthermore, they show that much of sensory processing involves discarding irrelevant information, such as that which is not predictive of the future, to arrive at a representation of what is important in the environment for guiding action. Another related line of work concerns the importance of auditory feedback (i.e., when we hear ourselves speaking). The study of neural mechanisms underlying audio-vocal integration has shown that auditory feedback may be used for updating internal representations of mappings between voice feedback and speech motor control. One of the earliest demonstrations of the role of auditory feedback in voice control is the Lombard effect, where people raise their voice amplitude to overcome environmental noise (Lombard, 1911; Lane and Tranel, 1971). A related phenomenon is side-tone amplification, in which people increase their voice loudness when their self-perceived loudness is too quiet to achieve a communication goal, and vice versa (Lane and Tranel, 1971). Given this strong evidence from neuroscience, we argue that in an interactive setting, where the learning agent keeps interacting with an outside world (which may also include other agents), being able to simulate that world can be very important. For example, this ability could enable a search over the potential implications its decisions will have on that outside world.

We thus propose to add a world simulator component to our neural cognitive architecture. Formally, the simulator S performs the following prediction:

p̂_{t+1} = S(p_t, a_{t+1}),   (9)

where p̂_{t+1} is a prediction estimate of p_{t+1}. Furthermore, we allow the world simulator to read from memory (as defined in the previous paragraph), but not write to it. Intuitively, the world simulator is trying to predict the next perception input, given the current perception input and action output, while operating only in the latent reasoning space. Similar to the memory component, this allows the simulator to abstract away information that is not relevant to the problems the system is learning to solve. This type of world simulation in a latent reasoning space is also supported by neuroscientific evidence (e.g., Keller et al., 2012).
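As a concrete illustration, here is a minimal PyTorch sketch of such a latent-space simulator and its training signal. The MLP architecture and the squared-error objective are our own illustrative assumptions; the proposal only fixes the interface of Equation 9.

```python
import torch
import torch.nn as nn

class WorldSimulator(nn.Module):
    """Predicts the next latent percept p_{t+1} from (p_t, a_{t+1}) (Equation 9),
    operating entirely in the latent reasoning space."""

    def __init__(self, percept_dim, action_dim, hidden_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(percept_dim + action_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, percept_dim))

    def forward(self, p_t, a_next):
        return self.net(torch.cat([p_t, a_next], dim=-1))

# Training signal: compare the prediction against the latent percept actually
# observed at the next step (self-supervision, no labels needed).
sim = WorldSimulator(percept_dim=128, action_dim=32)
p_t, a_next, p_next = torch.randn(8, 128), torch.randn(8, 32), torch.randn(8, 128)
loss = ((sim(p_t, a_next) - p_next) ** 2).mean()
loss.backward()
```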


Recently, Ha and Schmidhuber (2018) proposed using an RNN-based world simulator for playing games in an RL setting. They use a variational auto-encoder (VAE) to compress the input images to a smaller vector representation, and then learn a model that simulates the environment in this vector space. This differs from our proposal in that we are simulating the world in the latent reasoning space that our system learns. This should help us obtain a representation with higher information content that is relevant for the reasoner.

3.4.6 GOAL CONTEXTUALIZATION

Even though deep learning methods are very effective at learning representations for arbitrary data modalities, they are often treated as black-box methods offering little control over how information is shared across different tasks, and over what exactly the networks are learning. For example, we can rarely guarantee that a network will generalize well to new tasks, and we often also have to keep training the network with new problem-specific data in order for it to generalize better. Furthermore, deep learning approaches often render generalizing to new tasks, for which we might have no data at all, impossible. However, most real-world problems can be defined in terms of simpler problems (e.g., translating sentences relies on first being able to translate single words). Therefore, we argue that the ability to represent problems in a way such that they can be transformed and composed out of other problems is an important aspect of general learning and intelligence. As discussed in Section 3.2, this motivated our recent work in contextual parameter generation (CPG) for machine translation (Platanios et al., 2018) and question answering (Platanios* et al., 2019), and it forms the basis of contextualization. In the proposed neural cognitive architecture, contextualization plays the important role of emulating the goal priming mechanism that is inherent in human intelligence and learning. We now describe how this is achieved, in three parts: (i) we first describe how problems (or goals) are specified through some language, (ii) we then define an architectural component that compiles the problem specification to a representation that can be used to contextualize other parts of the NCA by using CPG, and (iii) we describe how this allows the learning system to generate its own target problems (or goals) that it aims to learn.

As shown in Figure 2, we also allow the sensor and effector networks to be contextualized, because perception and action are often not independent of the problem being solved. This is motivated by the fact that priming in humans can be perceptual, semantic, and conceptual (Bargh and Chartrand, 2014). From a machine learning perspective, we have also shown the usefulness of contextualizing equivalents of perception and action modules, when we proposed using CPG for universal neural machine translation (Platanios et al., 2018).

Problem Specification. We first need to define a representation for problems. We propose to use a fixed language for this representation, which could take multiple forms:

Fixed-Size Vector: Problems could be represented as continuous-valued, fixed-size vectors (e.g., Snell et al., 2017; Wang et al., 2017b; Grover et al., 2018). For example, given a fixed number of pre-specified problems, the system may learn vector embeddings to represent them. The main disadvantage of this approach is that the vector representations of learning problems may not be interpretable.

Natural Language: This could be a problem description that is provided as input to the system (e.g., "Identify human faces in the input image."). This is the approach taken, for example, by McCann et al. (2018).

Structured Language: This could be first-order logic (e.g., "Collect[JellyBean]∧¬Collect[Onion]"), or something more general (e.g., "If[JellyBean]Then[Collect]Else[Avoid]", or even a Python program).

Problem Compilation. Given a problem specification, we need to define a compiler that takes it as input and produces a composition of learnable functions that, when evaluated, results in a single structured representation for the problem (e.g., a set of vectors). This representation can then be used to contextualize different parts of the proposed architecture (e.g., sensor or effector networks, or parts of the reasoner that are discussed in the next section). Given that the representation can potentially be a set of vectors, we could use different parts of that structure to contextualize different parts of the architecture. For example, text sensor networks could be contextualized using an embedding of the language in which the text is written. Note that contextualizing networks is optional, as it is sometimes not necessary (e.g., the effector network shown at the bottom of Figure 2 is not contextualized). The choice of the problem compiler is important. For fixed-size vector and natural language specifications, the compiler could be as simple as a neural network (e.g., a multi-layer perceptron, a recurrent neural network, or a Transformer network). However, for other structured languages the compiler would be something more similar to a programming language compiler. Some examples of specifications and their corresponding compiled forms are shown in Table 2. Following the previous section's examples, given a problem specification that is written as a Python program, we could also compile it into a composition of learnable functions.

This definition of problem specifications and problem compilers allows us to make the contextualization mechanism very flexible and extensible, by introducing operators that compose compiled forms in arbitrary ways. For example, we could have two problem specifications, each with their own compiler, and a separate operator that allows us to merge the two compiled forms, resulting in a single final context vector.

Problem Generation. An important aspect of human learning is that, even though nature provides us with some reward signals for our actions (e.g., eating resolves hunger), we often "invent" new problems that we learn to solve. We could argue that this is a way of structuring much larger overarching problems into multiple subproblems. This aspect of human behavior is very interesting and, at the same time, not really tackled at all by current machine learning systems. Therefore, we propose to let our learning system "invent" problems on its own. For this section, we will use a reinforcement learning setting where a learning agent can perceive certain things about the environment in which it "lives" and take actions. Oftentimes, the agent receives a reward, but it may not know why. Thus, in such a setting, it would make sense for the agent to try and "invent" problems to solve that would result in higher collected rewards. We propose to introduce one additional action modality that allows the agent to generate problem specifications, which are directly fed into the problem compiler (in this case, we assume that no problem specification is provided to the agent as input), and can contextualize multiple parts of the architecture.


Problem Compiler Examples

Specification             | Compiled Form                          | Explanation
--------------------------|----------------------------------------|-------------------------------------------------------
Classify[City]            | g_Classify(c_City)                     | Predict if the input (e.g., "Washington") is a city.
Classify[City∧¬Person]    | g_Classify(g_∧(c_City, g_¬(c_Person))) | Predict if the input is a city and not a person.
Caption[Image,English]    | g_Caption(c_English)                   | Generate a short English sentence describing the input, e.g., generate captions for images.
Translate[English,German] | g_Translate(c_English, c_German)       | Translate the input from English to German. This is interesting because the modularity of our architecture means that this problem specification could even be used to translate images containing text, for example.

Table 2: Example uses of the problem compiler. We use c with different subscripts to denote context vectors representing primitives in the problem specification language, and g with different subscripts to denote transformation functions for context vectors (which could be defined as learnable neural networks, for example).
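To illustrate how such a compiler might turn a specification into a composition of learnable functions, here is a small Python sketch for the Classify[City∧¬Person] example from Table 2. The recursive-descent structure, the nested-tuple encoding of specifications, and the use of learned embeddings for primitives are our own illustrative assumptions; an actual compiler for a richer language would look closer to a programming language compiler.

```python
import torch
import torch.nn as nn

DIM = 64  # context vector dimensionality (illustrative)

class ProblemCompiler(nn.Module):
    """Compiles specifications such as g_Classify(g_and(c_City, g_not(c_Person)))
    into a single context vector, using learnable primitives and operators."""

    def __init__(self, primitives):
        super().__init__()
        # c_<primitive>: one learnable embedding per primitive (e.g., City, Person).
        self.c = nn.ParameterDict({p: nn.Parameter(torch.randn(DIM)) for p in primitives})
        # g_<operator>: one learnable transformation per operator.
        self.g_not = nn.Linear(DIM, DIM)
        self.g_and = nn.Linear(2 * DIM, DIM)
        self.g_classify = nn.Linear(DIM, DIM)

    def compile(self, spec):
        # spec is a nested tuple, e.g., ("classify", ("and", "City", ("not", "Person"))).
        if isinstance(spec, str):
            return self.c[spec]
        op, *args = spec
        if op == "not":
            return self.g_not(self.compile(args[0]))
        if op == "and":
            return self.g_and(torch.cat([self.compile(a) for a in args]))
        if op == "classify":
            return self.g_classify(self.compile(args[0]))
        raise ValueError(f"unknown operator: {op}")

compiler = ProblemCompiler(["City", "Person"])
context = compiler.compile(("classify", ("and", "City", ("not", "Person"))))
# `context` can now drive contextual parameter generation in other modules.
```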

For the fixed-size vector specification format, problem generation could be implemented by having the effector network output a vector representing the problem. Perhaps more interestingly though, we could define a structured language that only depends on the agent's perception and action modalities. This would allow the agent to generate arbitrary problem specifications that only depend on what it is able to perceive and how it can act. For example, given a perception modality that identifies the types of items in the environment, and an action modality that can collect items, we could define the problem specification language to be:

(¬)Collect[<Item>](∧(¬)Collect[<Item>])*,

where ¬ denotes the logical NOT operation, ∧ the logical AND operation, parentheses denote optional parts, <Item> denotes any item type that can be sensed by the item identification perception modality, and * denotes that the term in the parentheses preceding it can be repeated zero or more times. Note that Collect[·] acts as a logic predicate that can be applied to any item type. An example specification in this language is Collect[JellyBean]∧¬Collect[Onion].
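As a small illustration, assuming specifications are rendered as strings, this grammar can be checked (and sampled from) with a few lines of Python. The regular expression below is our own encoding of the grammar, and the item list is hypothetical; neither is part of the proposal.

```python
import random
import re

ITEMS = ["JellyBean", "Onion", "Banana"]  # item types the agent can sense (illustrative)

# Grammar: (¬)Collect[<Item>](∧(¬)Collect[<Item>])*
TERM = r"¬?Collect\[(?:" + "|".join(ITEMS) + r")\]"
SPEC = re.compile(TERM + r"(?:∧" + TERM + r")*$")

def is_valid(spec: str) -> bool:
    """Check whether a string is a well-formed problem specification."""
    return SPEC.match(spec) is not None

def generate(num_terms: int) -> str:
    """Sample a random specification, as a problem-generating effector might."""
    terms = [random.choice(["", "¬"]) + f"Collect[{random.choice(ITEMS)}]"
             for _ in range(num_terms)]
    return "∧".join(terms)

assert is_valid("Collect[JellyBean]∧¬Collect[Onion]")
print(generate(2))  # e.g., "¬Collect[Banana]∧Collect[JellyBean]"
```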
We propose to formalize this problem generation mechanism and allow learning systems to decide on the problems they are learning to solve.

3.4.7 LEARNING MECHANISMS

The architecture components presented so far depend on parameters that need to be learned (e.g., the weights of the neural network layers used). Learning consists of setting the values of these parameters so that the system as a whole can solve the target problems. We assume that all components are formulated as functions that are differentiable with respect to their parameters (this is a very general assumption that holds for most deep learning models, and for a lot of machine learning models more generally). Under this assumption, we define our learning mechanism as follows:

1. Each action modality can optionally provide a feedback mechanism. Let us denote the output of the modality's effector network as a function, f_θ(x), where x represents all inputs that it depends on. In this case, f represents the composition of all architecture modules that participated in producing this output (i.e., this includes the reasoning module, the goal contextualization module, and the relevant perception modalities). Then, we define the feedback mechanism as a function, h, of f_θ(x) and the external environment. For example, if f_θ(x) is producing a distribution over classes (for a multi-class classification problem), h could be defined as:

   h(f_θ(x), y) = f_θ(x) − y,   (10)

   where y represents a one-hot representation of the true class assignment provided by the environment. The main constraint on h is that it should produce an output that can be multiplied with ∇_θ f_θ(x).

2. Whenever an action modality produces an output and a corresponding feedback signal is returned from the environment, a gradient-based parameter update is performed along the following direction:

   D_θ ≜ h(f_θ(x), E) ∇_θ f_θ(x),   (11)

   where E represents the external environment. Note that the first part, h(f_θ(x), E), is provided by the external environment, whereas the second part, ∇_θ f_θ(x), can be computed internally by the learning system itself. This separation is interesting from a human cognition perspective because, intuitively: (i) a human would know how to tweak their brain to move their hand further forward (internal update), while (ii) the external environment could tell them that to achieve a particular goal they would need to move their hand forward (external update). The model update could be a stochastic gradient descent step:

   θ_{t+1} = θ_t + λ_t D_{θ_t},   (12)


where λ_t represents the learning rate, or it could be a more elaborate update, such as when using Adam (Kingma and Ba, 2014) or AMSGrad (Reddi et al., 2018).

Equation 11 is interesting because it can be used to unify multiple different learning paradigms, such as supervised, semi-supervised, unsupervised, and reinforcement learning, under one formulation. For example (a sketch of this unification follows the list):

Supervised Learning: In this case, the gradient-based updates are computed by differentiating a loss function, L(f_θ(x), E). This fits in our formulation by defining the feedback mechanism using the chain rule of differentiation:

h(f_θ(x), E) ≜ ∂L(f_θ(x), E) / ∂f_θ(x).   (13)

For example, for the L2 loss we have h(f_θ(x), y) ≜ f_θ(x) − y, and for the cross-entropy classification loss we have h(f_θ(x), y) ≜ y/f_θ(x).

Semi-Supervised Learning: This can often also be formulated in terms of minimizing a differentiable loss function, and thus Equation 13 also applies here.

Unsupervised Learning: In this case, h(f_θ(x), E) does not depend on E at all and could be defined internally as well. More specifically, h could be used to perform some sort of self-reflection. This is a direction we wish to explore more in the future, but it may be outside the scope of this thesis and is described in a bit more detail in the last section.

Reinforcement Learning: In the case of Q-learning (Watkins and Dayan, 1992), we can have an action modality that predicts the Q-function value (Mnih et al., 2013), and then the learning mechanism can use a supervised learning feedback function, h, to learn it using the rewards provided by the environment. In the case of policy gradient methods (Sutton et al., 2000), h can be defined as the advantage function being used, or even some function of the advantage for more complex methods (Mnih et al., 2016; Wang et al., 2017a; Schulman et al., 2017). More interestingly, if we want to use experience replay, as done by Mnih et al. (2013), we could develop a variant where: (i) the perception and action modality parameters are fixed and we are training only the problem compiler and the reasoning modules, and (ii) the stored experiences that are replayed are not represented in the original data space, but rather in the more abstract and compact reasoning space. This has the significant advantage of being able to store a lot more experiences, as memory is typically the bottleneck when using experience replay. Furthermore, we would only be storing information that is relevant to reasoning.
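To make this unification concrete, here is a minimal NumPy sketch of the update direction of Equation 11 for a one-parameter model, with two interchangeable feedback mechanisms: a supervised L2-style h and a reward-weighted one standing in for the reinforcement learning case. The specific functions and the sign convention of the update are our own illustrative choices.

```python
import numpy as np

theta = np.array([0.5])  # a one-parameter "model" f_theta(x) = theta * x

def f(theta, x):
    return theta * x

def grad_f(theta, x):
    # d f_theta(x) / d theta for this toy model.
    return np.array([x])

# Feedback mechanism for a supervised L2 loss: h = f_theta(x) - y (Equation 13).
def h_supervised(output, y):
    return output - y

# Feedback mechanism standing in for policy-gradient-style learning: the
# environment returns a scalar reward/advantage instead of a target output.
def h_reinforcement(output, advantage):
    return -advantage  # push outputs up when the advantage is positive

def update(theta, h_value, x, lr=0.1):
    # Equation 11: D_theta = h(f_theta(x), E) * grad_theta f_theta(x),
    # followed by the step of Equation 12 (here, plain gradient descent).
    return theta - lr * h_value * grad_f(theta, x)

# Mixed-paradigm learning: intertwine updates from both feedback mechanisms.
x, y = 1.0, 2.0
theta = update(theta, h_supervised(f(theta, x), y), x)        # supervised step
theta = update(theta, h_reinforcement(f(theta, x), 1.0), x)   # RL-style step
print(theta)
```

The point of the sketch is that both paradigms reduce to the same internal quantity, ∇_θ f_θ(x), combined with a paradigm-specific external signal h.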
Our learning mechanism manages all feedback mechanisms and determines how to apply the corresponding updates and what learning rate to use for each one. Initially, we plan to use the same learning rate for all parameters and feedback mechanisms, with exponential decay over time. However, our definition allows us to use potentially different learning rates for each parameter and for each learning goal (defined by corresponding feedback mechanisms). Next, we plan to integrate the ideas presented in Sections 3.1 and 3.3 into this learning mechanism. In the long term, we would like to explore other interesting directions, such as staged learning.

Staged Learning. The aforementioned reinforcement learning example on experience replay demonstrates the idea of staged learning. In staged learning, we freeze the learning of the perception and action modalities early on during training (e.g., by significantly lowering the corresponding learning rate), and then focus more on training the reasoning module. As discussed in the beginning of Section 3, this would be more similar to how human learning works. Assuming that the perception and action modality networks have already been trained using a diverse set of learning goals, freezing them should allow the reasoning module to tackle new learning goals in a fixed latent space, determined by these pretrained networks. We believe that this will result in significantly faster training times.

Mixed-Paradigm Learning. As shown earlier, our learning mechanism is a generalization of multiple existing learning paradigms, thus allowing us to mix them together by simply intertwining their gradient-based updates. For example, we can take a gradient descent step towards minimizing a supervised cross-entropy classification loss, and then take a gradient descent step that improves the current Q-function estimate, in a reinforcement learning setting. This introduces multiple challenges that we will have to overcome, including, but not limited to: How do we properly balance the gradient contribution from each learning problem? How do we set the per-learning-goal and per-parameter learning rates? How do we make the learning mechanism scale? How do we properly batch the training data? Other learning paradigms, such as active learning and curriculum learning, can also be supported by designing appropriate perception and action modalities.

3.4.8 NEVER-ENDING LEARNING

A never-ending learning system must be highly modular and allow for the addition and removal of modules without requiring a complete retraining from scratch. For this reason, we plan to implement the proposed architecture in a highly modular manner, with each module being completely independent of the rest and having a fixed, well-defined, and generic interface. This will allow for adding and removing perception and action modalities, and for extending the problem specification language, without requiring a complete retraining from scratch every time such a modification is made. Furthermore, each module will be solely responsible for persisting its state, so that we can keep extending the architecture while avoiding training restarts as much as possible. Our goal by the end of this project is for the proposed architecture to have been training for the duration of this thesis, with some modules having been trained for a year and some newer ones only for a few days. This will allow us to provide convincing evidence for its never-ending learning capabilities. Moreover, unlike NELL (Mitchell et al., 2018), we aim for this system to fully avoid complete training restarts throughout its lifetime. Finally, we want to explore directions where the latent reasoning representation is also extensible without requiring complete training restarts. This is a long-term goal that goes


beyond the scope of this thesis.

4 Evaluation

In order to test our hypothesis from Section 1, we propose to perform multiple case studies. Some of these case studies are performed over a simulated world that we have designed and built, called the Jelly-Bean World (JBW), available at https://github.com/eaplatanios/nel_framework. We designed this world specifically to enable us to test the properties of never-ending learning systems, which are otherwise hard and very expensive to evaluate using real-world datasets. In the following section we describe this simulated world, and then we provide a list of the proposed case studies.

4.1 Jelly Bean World (JBW)

The JBW offers a controlled environment where a learning agent "lives", and which defines the problems that the agent can solve and the reward it obtains for solving each problem. The JBW is a procedurally generated two-dimensional grid world, where items of various types can be placed on each grid cell. An example illustration is shown in Figure 3.

[Figure 3: Jelly-Bean World example. The legend marks the agent, jelly beans, bananas, onions, walls, the agent's visual field, and scent diffusion.]

In this world, time is discrete and measured in terms of simulation steps. Each item has a color and a scent, each represented by a fixed-size continuous-valued vector. The learning agents have a visual field range within which they can see the colors of the items in each cell. They can also smell the scent of their current cell. The scent of each cell is computed by simulating the diffusion of scent across all items in the world (i.e., the strength of an item's scent decays with distance from the item). Items are also allowed to have other properties. For example, walls can block agent movement, and onions may tend to cluster together. The JBW has the following desirable properties:

Multiple Problems: We can define multiple learning problems. For example, we can have the agent learn to collect jelly beans and avoid onions. This results in a reinforcement learning setting where the agent receives sparse rewards. We can then have the same agent learn to predict the color and scent of each item, and also classify which item a specific color or scent corresponds to. These problems can be learned in a supervised fashion, using a differentiable loss function defined over the items the agent has "stepped over" so far. When combined with the original RL problem, they may help reduce its sample complexity.

Continual Learning: In the JBW, the learning agent never "dies". It instead learns continually, in a never-ending fashion. The world is generated in a procedural manner, and thus, no matter how far the agent decides to explore, the JBW imposes no limits. This allows us to observe how fast the agent learns to solve different problems, and to also observe how its learning rate changes as time progresses and the agent learns to solve other problems. In fact, these conditions are also closer to human learning, in that it is also never-ending and of a non-episodic nature.

Multiple Modalities: There exist two perception modalities, scent and vision, which have very different characteristics. Vision has a limited range, but very high precision, meaning that the agent knows exactly what color each grid cell has within its visual field. On the other hand, scent has infinite range (through diffusion it can propagate to very long distances), but very low precision (it can be very hard to decompose the scent at the current grid cell into all the items that contribute to it, and their distances from the agent). At the same time, knowing the scent of an item and computing the difference in the agent's perceived scent between one cell and another can provide a lot of information about which direction an item's scent is coming from.

Ever-Changing Learning Problems: The mechanics of the JBW allow us to construct conditions for ever-evolving learning problems. For example, let us assume that some of the items are notes that contain learning problem descriptions, such as Collect[JellyBean] → 10 ∧ Collect[Onion] → −10. This describes an item collection and avoidance problem, along with associated rewards. Such notes can be spread around the world and, until the agent finds them, it cannot tackle the learning problems they describe. Some of these notes may contain recipes for building new items that give the agent unique abilities (e.g., binoculars to extend its visual range). Furthermore, some items may be invisible to the agent (both in terms of color and scent) when it starts learning. For example, some of the color vector dimensions may only become unmasked if the agent manages to obtain a specific item, such as an X-ray machine. This creates conditions similar to human learning, where learning problems exist abundantly in the world (and may also be generated in a procedural manner), but are not available until certain other problems are solved. In the JBW, we can also control the relationship between learning problem difficulty and reward. This can create interesting situations where solving more difficult problems does not necessarily imply collecting a higher reward. The JBW can thus get arbitrarily complex and difficult, while remaining controllable. It therefore allows us to test several aspects of our hypothesis, as well as test whether the various parts of the proposed neural cognitive architecture are beneficial to learning or not (e.g., we can design problems where memory is necessary, such as ones that require counting items). We propose to use simple metrics to measure learning performance in the JBW, such as the performance for each


problem and across simulation steps. Performance could be measured in terms of a metric computed over a validation set, or simply in terms of the cumulative reward collected and its rate of change. Furthermore, our JBW simulator already supports multiple agents interacting with the same grid-world and with each other, and thus also allows us to conduct multi-agent experiments (e.g., to test for agent communication).
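As an aside on the JBW's scent mechanics described above, here is a tiny NumPy sketch of what a diffusion-style scent percept could look like, assuming an exponential decay of scent strength with distance. The actual simulator's diffusion model may differ, so treat this purely as an illustration.

```python
import numpy as np

def perceived_scent(agent_pos, items, decay=0.5):
    """Sum of item scent vectors, each attenuated by distance to the agent.

    `items` is a list of (position, scent_vector) pairs. The exponential
    decay is an illustrative stand-in for the JBW's diffusion simulation.
    """
    agent_pos = np.asarray(agent_pos, dtype=float)
    total = None
    for pos, scent in items:
        dist = np.linalg.norm(agent_pos - np.asarray(pos, dtype=float))
        contribution = np.asarray(scent) * np.exp(-decay * dist)
        total = contribution if total is None else total + contribution
    return total

# Two items with 3-dimensional scent vectors; the nearer item dominates.
items = [((0, 0), [1.0, 0.0, 0.0]), ((5, 5), [0.0, 1.0, 0.0])]
print(perceived_scent((1, 0), items))
```

This also makes the low precision of scent apparent: the perceived vector is a blend of all nearby items, and recovering the individual contributions is an ill-posed inverse problem.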
4.2 Case Studies

We propose to perform the following case studies:

JBW #1: The agent gets a positive reward for collecting some items and a negative reward for collecting some other items. We will test performance when only being provided a single problem specification (e.g., Collect[JellyBean]∧¬Collect[Onion]), and when also trying to classify or recognize items based on their color or scent. The latter case should help us test whether the mixed learning paradigm scenario results in better learning performance for our architecture.

JBW #2: Same as JBW #1, except that we let the agent generate the problem specification, rather than having it be provided as input from the environment. This should help us test the problem generation and goal contextualization capabilities of the proposed architecture.

JBW #3: Design some tasks in the JBW that require the use of memory (e.g., counting items) and world simulation, so that we can test the relevant parts of the reasoner.

Atari Games: Learn to play multiple Atari games using a single learning system. This should help us test whether modularizing and sharing perception and action modalities across games can help reduce the sample complexity of learning to play a new game, after having learned to play some others. In this case, the problem specification language will consist simply of an Atari game identifier.

NLP: Tackle multiple natural language processing (NLP) problems using a single NCA learning system. An example would be to try and outperform BERT in the problems Devlin et al. (2018) tackle, or to compete in the decaNLP challenge (McCann et al., 2018). It will also be interesting to explore multi-modal NLP problems, such as visual question answering and problems involving knowledge graphs. This will allow us to test for multi-modal learning aspects.

5 Proposed Timeline

We propose to structure the proposed thesis work in four main chapters, as discussed in Section 3:

1. Learning from Multiple Noisy Labels [DONE]: Published in (Platanios et al., 2014; 2016; 2017; 2019).

2. Contextual Parameter Generation [01/18-09/19]: We have already performed extensive empirical evaluations of the core idea behind contextual parameter generation (e.g., Platanios et al., 2018; Platanios* et al., 2019). In the next couple of months we aim to obtain a theoretical understanding of when and why contextual parameter generation works, and of the limitations it addresses.

3. Self-Reflection [07/19-12/19]: We plan to develop and evaluate the differentiable intrinsic reward mechanism that was briefly discussed in Section 3.3.

4. Unified Architecture [09/19-05/20]: We plan to work towards developing a unified architecture as presented in Section 3.4, in the following order:

i. Goal Contextualization: Can we use contextual parameter generation to achieve the equivalent of goal priming in a machine learning system? We have already shown that we can do that in a couple of specific applications (Platanios et al., 2018; Platanios* et al., 2019). However, it will be challenging to extend that to a multi-problem setting with a problem specification language handling structured information sharing across the different problems.

ii. Module Sharing: Can we effectively share the same perception, reasoning, and action modules across multiple problems? There has been some early work in this direction (e.g., Kaiser et al., 2017), but performance generally drops for problems for which we have a lot of data available. It will be challenging to overcome this issue, but succeeding would pave the way for more general learning systems.

iii. Unified Learning Paradigm: Can we design a unified learning paradigm that encompasses supervised, semi-supervised, unsupervised, and reinforcement learning, and that can be successfully used to learn in mixed-paradigm settings? We proposed a first step in this direction in Section 3.4.7, but it may be challenging to successfully deploy such a system in real-world applications. If successful, this could potentially lead to a merge between some seemingly distinct ideas about machine learning.

iv. Goal Generation: Can we design and implement a system that can successfully generate its own learning goals and let them guide its learning process? How do we evaluate such a capability?

v. Self-Reflection: Can the proposed neural cognitive architecture achieve self-reflection capabilities? Self-reflection capabilities could form a basis for unsupervised learning, and could potentially be achieved by designing appropriate self-sensors and self-effectors (i.e., designing appropriate perception and action modalities). In fact, we hope that we may be able to model the learning mechanism itself as the result of the interaction between certain self-sensors and self-effectors. Given our framing of the learning mechanism using Equation 11, we believe we may be able to model self-reflection by using the internal component of the feedback direction, and having the feedback mechanism be provided by self-sensors, rather than by the external environment.

Points (iii), (iv), and (v) above are long-term goals that may not be finished during the indicated time frame. However, we hope to make some first steps before defending this thesis in May 2020.


References

Aarts, H., Custers, R., and Veltkamp, M. (2008). Goal Priming and the Affective-Motivational Route to Nonconscious Goal Pursuit. Social Cognition, 26(5):555–577.

Al-Shedivat, M., Dubey, A., and Xing, E. P. (2017). Contextual Explanation Networks. CoRR, abs/1705.10301.

Anderson, J. R., Bothell, D., Byrne, M. D., Douglass, S., Lebiere, C., and Qin, Y. (2004). An Integrated Theory of the Mind. Psychological Review, 111(4):1036.

Andreas, J., Dragan, A., and Klein, D. (2017). Translating Neuralese. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 232–242.

Andreas, J., Rohrbach, M., Darrell, T., and Klein, D. (2016a). Learning to Compose Neural Networks for Question Answering. arXiv preprint arXiv:1601.01705.

Andreas, J., Rohrbach, M., Darrell, T., and Klein, D. (2016b). Neural Module Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 39–48.

Angluin, D. and Laird, P. (1988). Learning from Noisy Examples. Machine Learning, 2(4):343–370.

Balcan, M.-F., Blum, A., and Mansour, Y. (2013). Exploiting Ontology Structures and Unlabeled Data for Learning. International Conference on Machine Learning, pages 1112–1120.

Bargh, J. A. and Chartrand, T. L. (2014). The Mind in the Middle: A Practical Guide to Priming and Automaticity Research.

Bellemare, M. G., Naddaf, Y., Veness, J., and Bowling, M. (2013). The Arcade Learning Environment: An Evaluation Platform for General Agents. Journal of Artificial Intelligence Research, 47:253–279.

Bengio, Y. and Chapados, N. (2003). Extensions to Metric-Based Model Selection. Journal of Machine Learning Research, 3:1209–1227.

Bengio, Y., Courville, A., and Vincent, P. (2013). Representation Learning: A Review and New Perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1798–1828.

Bhonker, N., Rozenberg, S., and Hubara, I. (2016). Playing SNES in the Retro Learning Environment. arXiv preprint arXiv:1611.02205.

Bialek, W., Nemenman, I., and Tishby, N. (2001). Predictability, Complexity, and Learning. Neural Computation, 13(11):2409–2463.

Blei, D. M., Ng, A. Y., and Jordan, M. I. (2003). Latent Dirichlet Allocation. Journal of Machine Learning Research, 3(Jan):993–1022.

Blum, A. and Mitchell, T. (1998). Combining Labeled and Unlabeled Data with Co-training. In Conference on Computational Learning Theory (COLT), pages 92–100.

Cao, Y., Yu, W., Ren, W., and Chen, G. (2013). An Overview of Recent Progress in the Study of Distributed Multi-Agent Coordination. IEEE Transactions on Industrial Informatics, 9(1):427–438.

Carandini, M. and Heeger, D. J. (2012). Normalization as a Canonical Neural Computation. Nature Reviews Neuroscience, 13(1):51.

Caruana, R. (1997). Multitask Learning. Machine Learning, 28(1):41–75.

Chartrand, T. L. and Bargh, J. A. (1996). Automatic Activation of Impression Formation and Memorization Goals: Nonconscious Goal Priming Reproduces Effects of Explicit Task Instructions. Journal of Personality and Social Psychology, 71(3):464.

Collins, J. and Huynh, M. (2014). Estimation of Diagnostic Test Accuracy Without Full Verification: A Review of Latent Class Methods. Statistics in Medicine, 33(24):4141–4169.

Collins, M. and Singer, Y. (1999). Unsupervised Models for Named Entity Classification. In Joint Conference on Empirical Methods in Natural Language Processing and Very Large Corpora.

Collobert, R. and Weston, J. (2008). A Unified Architecture for Natural Language Processing: Deep Neural Networks with Multitask Learning. In Proceedings of the 25th International Conference on Machine Learning, pages 160–167. ACM.

Custers, R. and Aarts, H. (2005). Positive Affect as Implicit Motivator: On the Nonconscious Operation of Behavioral Goals. Journal of Personality and Social Psychology, 89(2):129.

Dasgupta, S., Littman, M. L., and McAllester, D. (2001). PAC Generalization Bounds for Co-training. In Neural Information Processing Systems, pages 375–382.

Dawid, A. P. and Skene, A. M. (1979). Maximum Likelihood Estimation of Observer Error-Rates Using the EM Algorithm. Journal of the Royal Statistical Society, Series C (Applied Statistics), 28(1):20–28.

Dehghani, M., Gouws, S., Vinyals, O., Uszkoreit, J., and Kaiser, Ł. (2019). Universal Transformers. In International Conference on Learning Representations.

Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2018). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv preprint arXiv:1810.04805.

Dumoulin, V., Perez, E., Schucher, N., Strub, F., de Vries, H., Courville, A., and Bengio, Y. (2018). Feature-wise Transformations. Distill. https://distill.pub/2018/feature-wise-transformations.

Elmore, J. G., Longton, G. M., Carney, P. A., Geller, B. M., Onega, T., Tosteson, A. N., Nelson, H. D., Pepe, M. S., Allison, K. H., Schnitt, S. J., et al. (2015). Diagnostic Concordance among Pathologists Interpreting Breast Biopsy Specimens. Journal of the American Medical Association, 313(11):1122–1132.

Fanselow, M. S. and Poulos, A. M. (2005). The Neuroscience of Mammalian Associative Learning. Annual Review of Psychology, 56:207–234.

Felleman, D. J. and Van Essen, D. C. (1991). Distributed Hierarchical Processing in the Primate Cerebral Cortex. Cerebral Cortex, 1(1):1–47.

Finn, C., Abbeel, P., and Levine, S. (2017). Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks. arXiv preprint arXiv:1703.03400.

Forrester, J. W. (1971). Counterintuitive Behavior of Social Systems. Technological Forecasting and Social Change, 3:1–22.

Franceschi, L., Frasconi, P., Salzo, S., and Pontil, M. (2018). Bilevel Programming for Hyperparameter Optimization and Meta-Learning. arXiv preprint arXiv:1806.04910.


Frénay, B. and Verleysen, M. (2014). Classification in the Presence of Label Noise: A Survey. IEEE Transactions on Neural Networks and Learning Systems, 25(5):845–869.

Gervan, P., Berencsi, A., and Kovacs, I. (2011). Vision First? The Development of Primary Visual Cortical Networks is More Rapid than the Development of Primary Motor Networks in Humans. PLoS ONE, 6(9).

Graves, A. (2016). Adaptive Computation Time for Recurrent Neural Networks. arXiv preprint arXiv:1603.08983.

Grover, A., Al-Shedivat, M., Gupta, J. K., Burda, Y., and Edwards, H. (2018). Learning Policy Representations in Multiagent Systems. arXiv preprint arXiv:1806.06464.

Gulshan, V., Peng, L., Coram, M., Stumpe, M. C., Wu, D., Narayanaswamy, A., Venugopalan, S., Widner, K., Madams, T., Cuadros, J., et al. (2016). Development and Validation of a Deep Learning Algorithm for Detection of Diabetic Retinopathy in Retinal Fundus Photographs. Journal of the American Medical Association, 316(22):2402–2410.

Ha, D., Dai, A., and Le, Q. V. (2018). HyperNetworks. In International Conference on Learning Representations.

Ha, D. and Schmidhuber, J. (2018). World Models. arXiv preprint arXiv:1803.10122.

Haarnoja, T., Pong, V., Zhou, A., Dalal, M., Abbeel, P., and Levine, S. (2018). Composable Deep Reinforcement Learning for Robotic Manipulation. arXiv preprint arXiv:1803.06773.

He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778.

Hofmann, T., Schölkopf, B., and Smola, A. J. (2008). Kernel Methods in Machine Learning. The Annals of Statistics, pages 1171–1220.

Hu, R., Andreas, J., Rohrbach, M., Darrell, T., and Saenko, K. (2017). Learning to Reason: End-to-End Module Networks for Visual Question Answering. CoRR, abs/1704.05526.

Kaas, J. H. and Hackett, T. A. (1998). Subdivisions of Auditory Cortex and Levels of Processing in Primates. Audiology and Neurotology, 3(2-3):73–85.

Kahneman, D. and Egan, P. (2011). Thinking, Fast and Slow, volume 1. Farrar, Straus and Giroux, New York.

Kaiser, L., Gomez, A. N., Shazeer, N., Vaswani, A., Parmar, N., Jones, L., and Uszkoreit, J. (2017). One Model to Learn them All. arXiv preprint arXiv:1706.05137.

Kearns, M. (1998). Efficient Noise-Tolerant Learning from Statistical Queries. Journal of the ACM (JACM), 45(6):983–1006.

Keller, G. B., Bonhoeffer, T., and Hübener, M. (2012). Sensorimotor Mismatch Signals in Primary Visual Cortex of the Behaving Mouse. Neuron, 74(5):809–815.

Khetan, A., Lipton, Z. C., and Anandkumar, A. (2017). Learning from Noisy Singly-Labeled Data. arXiv preprint arXiv:1712.04577.

Kingma, D. P. and Ba, J. (2014). Adam: A Method for Stochastic Optimization. arXiv preprint arXiv:1412.6980.

Laird, J. E. (2012). The Soar Cognitive Architecture. MIT Press.

Lane, H. and Tranel, B. (1971). The Lombard Sign and the Role of Hearing in Speech. Journal of Speech and Hearing Research, 14(4):677–709.

Lombard, E. (1911). Le signe de l'élévation de la voix. Annales des Maladies de l'Oreille et du Larynx, pages 101–119.

Madani, O., Pennock, D., and Flake, G. (2004). Co-Validation: Using Model Disagreement on Unlabeled Data to Validate Classification Algorithms. In Neural Information Processing Systems.

McCann, B., Keskar, N. S., Xiong, C., and Socher, R. (2018). The Natural Language Decathlon: Multitask Learning as Question Answering. arXiv preprint arXiv:1806.08730.

Mishkin, M., Ungerleider, L. G., and Macko, K. A. (1983). Object Vision and Spatial Vision: Two Cortical Pathways. Trends in Neurosciences, 6:414–417.

Mitchell, T. M., Cohen, W. W., Hruschka Jr., E. R., Pratim Talukdar, P., Betteridge, J., Carlson, A., Dalvi, B., Gardner, M., Kisiel, B., Krishnamurthy, J., Lao, N., Mazaitis, K., Mohamed, T. P., Nakashole, N., Platanios, E. A., Ritter, A., Samadi, M., Settles, B., Wang, R. C., Wijaya, D., Gupta, A., Chen, X., Saparov, A., Greaves, M., and Welling, J. (2018). Never-Ending Learning. Communications of the ACM, 61(5):103–115.

Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., Silver, D., and Kavukcuoglu, K. (2016). Asynchronous Methods for Deep Reinforcement Learning. In International Conference on Machine Learning, pages 1928–1937.

Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., and Riedmiller, M. (2013). Playing Atari with Deep Reinforcement Learning. arXiv preprint arXiv:1312.5602.

Moreno, P. G., Artés-Rodríguez, A., Teh, Y. W., and Perez-Cruz, F. (2015). Bayesian Nonparametric Crowdsourcing. Journal of Machine Learning Research, 16.

Natarajan, N., Dhillon, I. S., Ravikumar, P. K., and Tewari, A. (2013). Learning with Noisy Labels. In Advances in Neural Information Processing Systems, pages 1196–1204.

Nettleton, D. F., Orriols-Puig, A., and Fornells, A. (2010). A Study of the Effect of Different Types of Noise on the Precision of Supervised Learning Techniques. Artificial Intelligence Review, 33(4):275–306.

Newell, A. (1990). Unified Theories of Cognition. Harvard University Press, Cambridge, MA, USA.

Nijhawan, R. (1994). Motion Extrapolation in Catching. Nature.

OpenAI et al. (2018). Learning Dexterous In-Hand Manipulation. arXiv preprint arXiv:1808.00177.

Papies, E. K. (2016). Health Goal Priming as a Situated Intervention Tool: How to Benefit from Nonconscious Motivational Routes to Health Behaviour. Health Psychology Review, 10(4):408–424.

Parisi, F., Strino, F., Nadler, B., and Kluger, Y. (2014). Ranking and Combining Multiple Predictors without Labeled Data. Proceedings of the National Academy of Sciences.

Parisotto, E., Ba, J. L., and Salakhutdinov, R. (2015). Actor-Mimic: Deep Multitask and Transfer Reinforcement Learning. arXiv preprint arXiv:1511.06342.

Peters, M. E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., and Zettlemoyer, L. (2018). Deep Contextualized Word Representations. arXiv preprint arXiv:1802.05365.

Platanios, E. A., Al-Shedivat, M., Xing, E., and Mitchell, T. M. (2019). Learning from Multiple Noisy Labels. In Review for Advances in Neural Information Processing Systems.

Platanios, E. A., Blum, A., and Mitchell, T. M. (2014). Estimating Accuracy from Unlabeled Data. In Conference on Uncertainty in Artificial Intelligence, pages 1–10.

Platanios, E. A., Dubey, A., and Mitchell, T. M. (2016). Estimating Accuracy from Unlabeled Data: A Bayesian Approach. In International Conference on Machine Learning, pages 1416–1425.

Platanios, E. A., Poon, H., Horvitz, E., and Mitchell, T. M. (2017). Estimating Accuracy from Unlabeled Data: A Probabilistic Logic Approach. In Advances in Neural Information Processing Systems.

Platanios, E. A., Sachan, M., Neubig, G., and Mitchell, T. (2018). Contextual Parameter Generation for Universal Neural Machine Translation. In Conference on Empirical Methods in Natural Language Processing (EMNLP), Brussels, Belgium.

Platanios*, E. A., Stretcu*, O., Stoica*, G., Poczos, B., and Mitchell, T. (2019). Contextual Parameter Generation for Question Answering. In Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL).

Rajpurkar, P., Jia, R., and Liang, P. (2018). Know What You Don't Know: Unanswerable Questions for SQuAD. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 784–789. Association for Computational Linguistics.

Ralph, M. A. L., Jefferies, E., Patterson, K., and Rogers, T. T. (2017). The Neural and Computational Bases of Semantic Cognition. Nature Reviews Neuroscience, 18(1):42.

Ranganath, C. and Ritchey, M. (2012). Two Cortical Systems for Memory-Guided Behaviour. Nature Reviews Neuroscience, 13(10):713.

Ratner, A., Bach, S. H., Ehrenberg, H., Fries, J., Wu, S., and Ré, C. (2017). Snorkel: Rapid Training Data Creation with Weak Supervision. Proceedings of the VLDB Endowment, 11(3):269–282.

Rauschecker, J. P. (1998). Cortical Processing of Complex Sounds. Current Opinion in Neurobiology, 8(4):516–521.

Reddi, S. J., Kale, S., and Kumar, S. (2018). On the Convergence of Adam and Beyond. In International Conference on Learning Representations.

Rogers, T. T., Lambon Ralph, M. A., Garrard, P., Bozeat, S., McClelland, J. L., Hodges, J. R., and Patterson, K. (2004). Structure and Deterioration of Semantic Memory: A Neuropsychological and Computational Investigation. Psychological Review, 111(1):205.

Romanski, L. M., Bates, J. F., and Goldman-Rakic, P. S. (1999). Auditory Belt and Parabelt Projections to the Prefrontal Cortex in the Rhesus Monkey. Journal of Comparative Neurology, 403(2):141–157.

Samarakoon, S., Bennis, M., Saad, W., and Debbah, M. (2018). Distributed Federated Learning for Ultra-Reliable Low-Latency Vehicular Communications. arXiv preprint arXiv:1807.08127.

Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. (2017). Proximal Policy Optimization Algorithms. arXiv preprint arXiv:1707.06347.

Schuurmans, D., Southey, F., Wilkinson, D., and Guo, Y. (2006). Metric-Based Approaches for Semi-Supervised Regression and Classification. In Semi-Supervised Learning.

Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M., Bolton, A., et al. (2017). Mastering the Game of Go without Human Knowledge. Nature, 550(7676):354.

Simonyan, K. and Zisserman, A. (2015). Very Deep Convolutional Networks for Large-Scale Image Recognition. In International Conference on Learning Representations.

Singer, Y., Teramoto, Y., Willmore, B. D., Schnupp, J. W., King, A. J., and Harper, N. S. (2018). Sensory Cortex is Optimized for Prediction of Future Input. eLife, 7:e31557.

Smith, V., Chiang, C.-K., Sanjabi, M., and Talwalkar, A. S. (2017). Federated Multi-Task Learning. In Advances in Neural Information Processing Systems, pages 4424–4434.

Snell, J., Swersky, K., and Zemel, R. (2017). Prototypical Networks for Few-Shot Learning. In Advances in Neural Information Processing Systems, pages 4077–4087.

Sukhbaatar, S., Fergus, R., et al. (2016). Learning Multiagent Communication with Backpropagation. In Advances in Neural Information Processing Systems, pages 2244–2252.

Sukhbaatar, S., Weston, J., Fergus, R., et al. (2015). End-to-End Memory Networks. In Advances in Neural Information Processing Systems, pages 2440–2448.

Sutton, R. S., McAllester, D. A., Singh, S. P., and Mansour, Y. (2000). Policy Gradient Methods for Reinforcement Learning with Function Approximation. In Advances in Neural Information Processing Systems, pages 1057–1063.

Takarada, Y. and Nozaki, D. (2018). Motivational Goal-Priming With or Without Awareness Produces Faster and Stronger Force Exertion. Scientific Reports, 8(1):10135.

Teh, Y. W., Jordan, M. I., Beal, M. J., and Blei, D. M. (2005). Sharing Clusters Among Related Groups: Hierarchical Dirichlet Processes. In Advances in Neural Information Processing Systems, pages 1385–1392.

Thrun, S. and Pratt, L. (1998). Learning to Learn. Springer.

Tian, T. and Zhu, J. (2015). Max-Margin Majority Voting for Learning from Crowds. In Neural Information Processing Systems.

Tran, D., Dusenberry, M., van der Wilk, M., and Hafner, D. (2018). Bayesian Layers: A Module for Neural Network Uncertainty. arXiv preprint arXiv:1812.03973.

Tulving, E., Schacter, D. L., and Stark, H. A. (1982). Priming Effects in Word-Fragment Completion are Independent of Recognition Memory. Journal of Experimental Psychology: Learning, Memory, and Cognition, 8(4):336.

Van Hasselt, H., Guez, A., and Silver, D. (2016). Deep Reinforcement Learning with Double Q-Learning. In Thirtieth AAAI Conference on Artificial Intelligence.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. (2017). Attention is All You Need. In Advances in Neural Information Processing Systems, pages 5998–6008.

Veltkamp, M., Aarts, H., and Custers, R. (2008). On the Emergence of Deprivation-Reducing Behaviors: Subliminal Priming of Behavior Representations Turns Deprivation into Motivation. Journal of Experimental Social Psychology, 44(3):866–873.
Vinyals, O., Babuschkin, I., Chung, J., Mathieu, M., Jaderberg, M., Czarnecki, W. M., Dudzik, A., Huang, A., Georgiev, P., Powell, R., Ewalds, T., Horgan, D., Kroiss, M., Danihelka, I., Agapiou, J., Oh, J., Dalibard, V., Choi, D., Sifre, L., Sulsky, Y., Vezhnevets, S., Molloy, J., Cai, T., Budden, D., Paine, T., Gulcehre, C., Wang, Z., Pfaff, T., Pohlen, T., Wu, Y., Yogatama, D., Cohen, J., McKinney, K., Smith, O., Schaul, T., Lillicrap, T., Apps, C., Kavukcuoglu, K., Hassabis, D., and Silver, D. (2019). AlphaStar: Mastering the Real-Time Strategy Game StarCraft II. https://deepmind.com/blog/alphastar-mastering-real-time-strategy-game-starcraft-ii/.

Wang, Z., Bapst, V., Heess, N., Mnih, V., Munos, R., Kavukcuoglu, K., and de Freitas, N. (2017a). Sample Efficient Actor-Critic with Experience Replay. In International Conference on Learning Representations.

Wang, Z., Merel, J. S., Reed, S. E., de Freitas, N., Wayne, G., and Heess, N. (2017b). Robust Imitation of Diverse Behaviors. In Advances in Neural Information Processing Systems, pages 5320–5329.

Warren, J. D. and Griffiths, T. D. (2003). Distinct Mechanisms for Processing Spatial Sequences and Pitch Sequences in the Human Auditory Brain. Journal of Neuroscience, 23(13):5799–5804.

Watkins, C. J. and Dayan, P. (1992). Q-learning. Machine Learning, 8(3-4):279–292.

Weingarten, E., Chen, Q., McAdams, M., Yi, J., Hepler, J., and Albarracín, D. (2016). From Primed Concepts to Action: A Meta-Analysis of the Behavioral Effects of Incidentally Presented Words. Psychological Bulletin, 142(5):472.

Wessinger, C., VanMeter, J., Tian, B., Van Lare, J., Pekar, J., and Rauschecker, J. P. (2001). Hierarchical Organization of the Human Auditory Cortex Revealed by Functional Magnetic Resonance Imaging. Journal of Cognitive Neuroscience, 13(1):1–7.

Yosinski, J., Clune, J., Bengio, Y., and Lipson, H. (2014). How Transferable are Features in Deep Neural Networks? In Advances in Neural Information Processing Systems, pages 3320–3328.

Zatorre, R. J. and Belin, P. (2001). Spectral and Temporal Processing in Human Auditory Cortex. Cerebral Cortex, 11(10):946–953.

Zatorre, R. J., Bouffard, M., and Belin, P. (2004). Sensitivity to Auditory Object Features in Human Temporal Neocortex. Journal of Neuroscience, 24(14):3637–3642.

Zhou, D., Liu, Q., Platt, J. C., Meek, C., and Shah, N. B. (2015). Regularized Minimax Conditional Entropy for Crowdsourcing. CoRR, abs/1503.07240.