Neural Cognitive Architectures for Never-Ending Learning

Author: Emmanouil Antonios Platanios (www.platanios.org, e.a.platanios@cs.cmu.edu)
Committee: Tom Mitchell†, Eric Horvitz‡, Rich Caruana‡, Graham Neubig†
Abstract
Allen Newell argued that the human mind functions as a single system and proposed the notion of a unified theory of cognition
(UTC). Most existing work on UTCs has focused on symbolic approaches, such as the Soar architecture (Laird, 2012) and the
ACT-R (Anderson et al., 2004) system. However, such approaches limit a system’s ability to perceive information of arbitrary
modalities, require a significant amount of human input, and are restrictive in terms of the learning mechanisms they support
(supervised learning, semi-supervised learning, reinforcement learning, etc.). For this reason, researchers in machine learning
have recently shifted their focus towards subsymbolic processing with methods such as deep learning. Deep learning systems
have become a standard for solving prediction problems in multiple application areas including computer vision, natural
language processing, and robotics. However, many real-world problems require integrating multiple, distinct modalities of
information (e.g., image, audio, language) in ways that machine learning models cannot currently handle well. Moreover,
most deep learning approaches are not able to utilize information learned from solving one problem to directly help in solving
another. They are also not capable of never-ending learning: they fail on problems that are dynamic, ever-changing, and not
fixed a priori, as real-world problems typically are. In this thesis, we aim to bridge the
gap between UTCs, deep learning, and never-ending learning. To that end, we propose a neural cognitive architecture (NCA)
that is inspired by human cognition and that can learn to continuously solve multiple problems that can grow in number over
time, across multiple distinct perception and action modalities, and from multiple noisy sources of supervision combined with
self-supervision. Furthermore, its experience from learning to solve past problems can be leveraged to learn to solve future
ones. The problems the proposed NCA is learning to solve are ever-evolving and can also be automatically generated by the
system itself. In our NCA, reasoning is performed recursively in a subsymbolic latent space that is shared across all problems
and modalities. The goal of this architecture is to take us a step closer towards general learning and intelligence. We have
also designed and implemented, and plan to extend, an artificial simulated world that allows us to test for all the aforementioned
properties of the proposed architecture in a controllable manner. We propose to perform multiple case studies, both within this
simulated world and with real-world applications, that will allow us to evaluate our architecture.
1 Introduction

Cognitive architectures were first introduced by Newell (1990) who argued that the human mind functions as a single system, and proposed the notion of a unified theory of cognition (UTC). They often consist of constructs that reflect assumptions about human cognition and that are based on facts derived from psychology experiments (e.g., problem solving, decision making, routine action, memory, learning, skill, perception, motor behavior, language, motivation, emotion, imagination, and dreaming). In fact, Newell believed that cognitive architectures are the way to answer one of the ultimate scientific questions: “How can the human mind occur in the physical universe?”. Most existing work on UTCs has focused on symbolic approaches, such as the Soar architecture (Laird, 2012) and the ACT-R (Anderson et al., 2004) system. However, such approaches limit a system’s ability to perceive information of arbitrary modalities, require a significant amount of human input, and are restrictive in terms of the learning mechanisms they support (supervised learning, semi-supervised learning, reinforcement learning, etc.). For this reason, researchers in machine learning (ML) have shifted their focus towards methods like deep learning. Deep learning systems have become the de facto standard for solving prediction problems in a multitude of application areas including computer vision, natural language processing, and robotics. Driven by progress in deep learning, the machine learning community is now able to tackle increasingly complex problems, ranging from multi-modal reasoning (Hu et al., 2017) to dexterous robotic manipulation (OpenAI et al., 2018), many of which typically involve solving combinations of tasks. However, many real-world problems require integrating multiple, distinct modalities of information (e.g., image, audio, language) in ways that machine learning models cannot currently handle well. Furthermore, most of these approaches are also not able to utilize information learned from solving one problem to directly help in solving another, something at which human intelligence excels. There have been some limited attempts to train a single model to solve multiple problems jointly (e.g., Kaiser et al., 2017), but the resulting systems generally underperform those trained separately for each problem. Moreover, most of the existing approaches are also not capable of never-ending learning (NEL), namely a machine learning paradigm in which an algorithm learns from examples continuously over time, in a largely self-supervised fashion, where its experience from past examples can be leveraged to learn future examples (Mitchell et al., 2018). Current ML systems fail when the problems that need to be learned are not fixed a priori, but are rather dynamic and keep changing as part of the environment where the learning agents operate. For example, humans do not just learn to solve a fixed set of problems; rather, they adapt, and by solving one problem they become better able to tackle new problems that they may even have been previously unaware of.¹ Furthermore, humans are capable of creating problems to learn, on their own, something that current ML systems are not designed to achieve. Never-ending learning is thus also something at which human intelligence excels. To achieve true intelligence, a learning agent that interacts with the real world needs to be able to adapt in such a continuous fashion (i.e., due to the real world’s dynamic nature). In fact, such an ability is crucial for never-ending learning, because learning forever only really makes sense if the learning objectives are ever-evolving.

We aim to bridge the gap between UTCs, deep learning, and never-ending learning. To that end, we propose a neural cognitive architecture that allows for a tighter coupling between problems, as well as a higher level of abstraction over distinct modalities of information. We thus aim to test the following hypothesis in this thesis:

A computer system with an architecture inspired by human cognition can learn to continuously solve multiple problems that can grow in number over time, across multiple distinct perception and action modalities, and from multiple noisy sources of supervision combined with self-supervision. Furthermore, its experience from learning to solve past problems can be leveraged to learn to solve future ones.

Our main goals can be summarized as follows:

– Formalizing never-ending learning and the notion of a neural cognitive architecture. This includes defining the notion of an ever-evolving set of learning problems, whether the problems are provided externally or generated by the learning system itself, as well as ways to handle this setting.
– Designing a neural cognitive architecture that is inspired by the Hub-and-Spoke model of human cognition (Rogers et al., 2004; Ralph et al., 2017) and that also accounts for human goal-priming (Custers and Aarts, 2005; Aarts et al., 2008; Takarada and Nozaki, 2018). It is a novel modular architecture that contains perception and action spokes (i.e., modules), and a common reasoning hub for all problems, that is independent of data modalities. The reasoning hub enables human-inspired capabilities such as associative memory (Fanselow and Poulos, 2005; Ranganath and Ritchey, 2012) and world simulation. It makes use of contextual parameter generation (Platanios et al., 2018) to emulate goal-priming.
– Evaluating the capabilities of the proposed architecture using multiple case studies over different learning settings. One such setting is the artificial Jelly Bean World that we have created, and where we can control the kinds of problems the agent needs to solve, and their interactions. We have designed this world in a way that renders never-ending learning necessary, and plan to extend it so that it allows us to test all parts of our hypothesis, in a controllable manner. After testing our hypothesis in this artificial world, we also plan to perform experiments on real world problems, related to natural language processing, computer vision, and potentially healthcare. Healthcare applications are interesting because they present a real-world setting where such an architecture would be useful. This is due to the low amount of training data and large number of interconnected problems that underlie many healthcare applications.

This proposal is meant to describe our way of thinking about the design space for this problem as a whole. We are proposing to make progress towards confirming and exploring the aforementioned thesis statement, rather than being exhaustive. In the following section we discuss our main motivation for this thesis. Then, in Section 3 we describe the proposed approach along with background and related work for each of its components, and in Section 4 we describe our planned evaluation case studies. Finally, in Section 5 we present a tentative timeline for the proposed work.

† Carnegie Mellon University.
‡ Microsoft Research.
¹ For example, after humans managed to build heart monitoring devices, new unsolved problems became available, such as discovering the relationship between heart rate or blood pressure and specific health problems.

2 Motivation

A long-standing goal in the fields of artificial intelligence and machine learning is to develop algorithms that can be applied across domains and that can efficiently handle multiple problems, just like the human mind does. Even though research in multi-task learning has a long history (Caruana, 1997), there has been a resurgence of interest in fundamental questions related to: (i) algorithmic frameworks for multi-task learning, such as learning-to-learn or meta-learning (Thrun and Pratt, 1998; Finn et al., 2017; Franceschi et al., 2018) and never-ending/lifelong learning (Mitchell et al., 2018), (ii) establishing best practices for building reliable systems that can handle multiple tasks at scale, such as federated learning for model personalization (Smith et al., 2017) or multi-agent coordination (Cao et al., 2013; Samarakoon et al., 2018), and (iii) learning deep representations (Bengio et al., 2013) that support multi-tasking and enable transfer learning in multiple domains, such as computer vision (Yosinski et al., 2014) or natural language processing (Collobert and Weston, 2008; Peters et al., 2018; Devlin et al., 2018).

Our interest in these questions started while working on the Never-Ending Language Learner (NELL) (Mitchell et al., 2018). NELL is a system that learns to read the web and extract knowledge from websites, in a never-ending fashion. One of the core mechanisms employed in NELL is co-training, which was originally proposed by Blum and Mitchell (1998). Co-training is a semi-supervised learning algorithm where multiple models are being trained together and each model can use as training examples the most confident predictions made by the other models. If any of the
models produces wrong but confident predictions, these can propagate to the other models and eventually hinder learning. This motivated us to develop several algorithms for estimating accuracies of classifiers from unlabeled data (Platanios et al., 2014; 2016; 2017). The key idea behind all these methods is that agreement among multiple models implies that the agreed upon prediction is more likely correct than wrong. However, we also observed that once we have multiple interacting tasks that are being learned jointly, we can perform accuracy estimation in a more robust manner by also accounting for inconsistencies between the tasks. For example, if one classifier predicts that Pittsburgh is a city and another one predicts that it is a person, and we know that something cannot be both a city and a person at the same time, then we can infer that at least one of these two classifiers must be wrong. Finally, this work pointed out an important pattern in how current machine learning systems are trained. Training data is often obtained by collecting multiple noisy labels for samples through crowdsourcing that are then aggregated to produce a single “denoised” label per sample. To this end, we adapted our accuracy estimation methods resulting in a learning framework for general machine learning systems that allows them to be trained from multiple noisy labels directly, without requiring an explicit label aggregation step (Platanios et al., 2019). Through this and other experiences from working in NELL, we observed that: (i) learning multiple tasks jointly while also accounting for their interactions, and (ii) learning from multiple noisy sources of supervision, are both crucial to building successful NEL systems.

3 Approach

We structure the proposed work in four main parts:
1. Learning from Multiple Noisy Labels: Mechanisms for learning from multiple noisy sources (e.g., obtained using a crowdsourcing platform), including self-supervision.
2. Contextual Parameter Generation: Methods that enhance the model capacity of neural networks allowing them to learn functions that are conditioned on some of their inputs (i.e., the context), thus enabling more effective multi-task learning architectures.
3. Self-Reflection: Mechanisms that allow a system to self-evaluate and improve without external supervision. This is an important property for never-ending learning systems as the extent of external supervision is often limited, but the system needs to keep learning.
4. Unified Architecture: A unified neural cognitive architecture that puts together all aforementioned components, along with several new ones, and is able to perform large scale multi-modal and multi-task learning.

3.1 Learning from Multiple Noisy Labels

Machine learning systems often rely on large amounts of annotated examples to be trained. This is especially true for never-ending learning systems. Perhaps the most common way to collect such training examples is using noisy crowdsourcing platforms like Amazon Mechanical Turk (AMT). Practitioners typically adopt the following process: (i) collect multiple annotations per example in order to reduce the amount of noise, (ii) aggregate these annotations into a single label per example that represents an estimate of the ground truth (e.g., using majority voting), and (iii) train machine learning systems using the resulting labeled examples. This results in both redundant annotations and potentially noisy ground truth labels. We propose a novel approach that enables us to merge the steps of aggregating noisy annotations and training machine learning systems, by allowing a system to be trained directly from multiple noisy annotations. Our approach also learns models of the difficulty of each example and the competence of each annotator in a generalizable manner (i.e., these models can make predictions for previously unseen examples and annotators). This enables us to more optimally assign annotators to examples, thus driving the cost of crowdsourcing down, while improving the quality of the resulting datasets. Our approach can also be used to perform ensemble learning and to estimate the accuracies of classifiers from unlabeled data. The latter has become especially relevant with recent advances in weak supervision and self-supervision (e.g., Ratner et al., 2017).

The problems of ensemble learning, aggregating and denoising crowdsourced data, and estimating accuracy from unlabeled data, all share the same underlying core problem: learning from multiple noisy labels. More specifically, there is a common setting among all these problems where: (i) there exists an underlying ground truth, (ii) we only get to observe multiple, possibly overlapping, noisy views of that truth, and (iii) we want to be able to estimate that truth. The noisy views can have arbitrary form, such as: (i) human annotators in a crowdsourcing platform, that may make mistakes (e.g., Zhou et al., 2015), or (ii) classifiers that have already been trained (e.g., Platanios et al., 2014; 2016; 2017). To give a concrete example, consider the problem of medical pathology diagnostics, where learning-based models are becoming increasingly popular (e.g., Gulshan et al., 2016). Training models by imitating expert decisions is not as straightforward in such a scenario: the true diagnosis is unknown a priori, while the diagnostic concordance between experts is often far from perfect (Elmore et al., 2015). If we assume that the expert decisions are the ground truth, the model may overfit to their mistakes. Therefore, this practical setup requires a principled learning framework that takes into account potential discrepancies or disagreements in the observations.

3.1.1 Related Work

Learning binary classifiers from examples with noisy labels was first introduced and theoretically characterized by Angluin and Laird (1988). In that work, the noise model was based on independent random flips of the labels with some probability $\eta < 0.5$. Kearns (1998) later characterized a class of robust learning algorithms for such types of label noise. Nettleton et al. (2010) studied empirically the behavior of (at the time) popular learning algorithms under different magnitudes of noise. Natarajan et al. (2013) proposed to modify surrogate loss functions to obtain unbiased estimators and obtained performance bounds for empirical risk minimization
in the presence of noisy labels. More recently, Frénay and Verleysen (2014) surveyed several notable methods and variations of this problem. This whole line of work differs from our setting in that each example only gets a single noisy label. On the contrary, we assume that each example is labeled multiple times using independent labeling processes, which we refer to as predictors, each of which is characterized by an unknown confusion matrix.

This problem has also been previously framed as estimating accuracy from unlabeled data, or as aggregating worker predictions in the context of crowdsourcing. Similar settings were previously explored by Collins and Singer (1999), Dasgupta et al. (2001), Bengio and Chapados (2003), Madani et al. (2004), Schuurmans et al. (2006), Balcan et al. (2013), and Parisi et al. (2014), among others. However, none of the previous approaches considers explicitly modeling the ground truth; they rather assume some form of independence or knowledge of the true label distribution. Collins and Huynh (2014) review many methods that were proposed for estimating the accuracy of medical tests in the absence of a gold standard. Previously we proposed formulating the problem as an optimization problem that uses agreement rates between multiple noisy labelers over unlabeled data (Platanios et al., 2014). Dawid and Skene (1979), Moreno et al. (2015), and us (Platanios et al., 2016) have also previously formulated the problem in terms of probabilistic graphical models. Tian and Zhu (2015) proposed a max-margin majority voting scheme applied to crowdsourcing. More recently, we introduced a method that is able to use information provided in the form of logical constraints between the noisy labels (Platanios et al., 2017), and Khetan et al. (2017) proposed using a parametric function to model the ground truth. However, previous approaches were outperformed by Zhou et al. (2015) who formulated the problem as a form of regularized minimax conditional entropy and used their method in crowdsourcing.

Our approach is a generalization of the approaches proposed by Zhou et al. (2015), Platanios et al. (2016), and Khetan et al. (2017). Similar to our prior work (Platanios et al., 2016) we define a generative process for our observations. However, our approach is also able to handle categorical labels, as opposed to just binary labels. Also, similar to Zhou et al. (2015) we define the confusion matrix for each instance-predictor pair as a function of instance difficulty and predictor competence. However, in our approach we explicitly learn the difficulty and competence functions, allowing us to generalize to previously unseen instances and predictors. Interestingly, the inference algorithm for our generative probabilistic model has a similar form to that of Zhou et al. (2015) (except for the explicit learning of a ground truth function, as well as of difficulty and competence functions). In fact, the algorithm of Zhou et al. (2015) can be derived as an Expectation-Maximization (EM) inference algorithm for a generative model, that is a simplified version of the one that we are proposing. Finally, similar to Khetan et al. (2017) we propose to use a parametric function to model the ground truth, but we go a step further and also propose to use parametric functions to model the instance difficulties and predictor competences. Thus, our approach allows us to predict which predictors are likely to perform better for specific instances, enabling us to allocate predictors more optimally and reduce costs.

3.1.2 Proposed Method

Let us denote the observed data by $\mathcal{D} = \{x_i, \hat{Y}_i\}_{i=1}^{N}$, where $\hat{Y}_i = \{M_i, \{\hat{y}_{ij}\}_{j \in M_i}\}$, $M_i$ is the set of predictors that made predictions for instance $x_i$, and $\hat{y}_{ij}$ is the output of predictor $\hat{f}_j$ for instance $x_i$. Our goal is to learn functions representing the underlying ground truth and predictor qualities, given our observations $\mathcal{D}$.

Ground Truth. We define the ground truth as a function $h_\theta(x_i)$ that is parameterized by $\theta$ and that approximates the true distribution of the label given $x_i$. In our setting, $h_\theta(x_i) \in \mathbb{R}^{C}_{\geq 0}$ and $\sum_j [h_\theta(x_i)]_j = 1$, where $C$ is the number of values the label can take (i.e., assuming categorical labels). More specifically, $[h_\theta(x_i)]_k \triangleq P(y_i = k \mid x_i)$, where we use square brackets and subscripts to denote indexing of vectors, matrices, and tensors. For example, $h_\theta$ could be a deep neural network that would normally be trained in isolation using the cross-entropy loss function. In our method the network is trained using the Expectation-Maximization algorithm, as described in the next section.

Predictor Qualities. We define the predictor qualities as the confusion matrices $Q_{ij} \in \mathbb{R}^{C \times C}_{\geq 0}$, for each instance $x_i$ and predictor $\hat{f}_j$, where $\sum_l [Q_{ij}]_{kl} = 1$, for all $k \in \{1, \dots, C\}$. $[Q_{ij}]_{kl}$ represents the probability that predictor $\hat{f}_j$ outputs label $l$ given that the true label of instance $x_i$ is $k$. We define these confusion matrices in a way that generalizes the successful approach of Zhou et al. (2015):²

$$Q_{ij} = D_i \bullet_3 C_j, \tag{1}$$

where $\bullet_i$ represents an inner product along the $i$-th dimension of the two tensors, and:
– $D_i = d_\phi(x_i)$ represents the difficulty tensor for instance $x_i$, where $d$ is a function parameterized by $\phi$, $D_i \in \mathbb{R}^{C \times C \times L}$, and $L$ is a latent dimension (it is a hyperparameter of our model). $[D_i]_{kl-}$ is an $L$-dimensional embedding representing the likelihood of confusing $x_i$ as having label $l$ instead of $k$, when $k$ is its true label.
– $C_j = c_\psi(r_j)$ represents the competence tensor for predictor $\hat{f}_j$, where $c$ is a function parameterized by $\psi$, $r_j$ is some representation of $\hat{f}_j$ (e.g., could be a one-hot encoding of the predictor, in the simplest case), and $C_j \in \mathbb{R}^{C \times C \times L}$. $[C_j]_{kl-}$ is an $L$-dimensional embedding representing the likelihood that predictor $\hat{f}_j$ confuses label $k$ for $l$, when $k$ is the true label.

² We also perform a normalization step such that all elements of $Q_{ij}$ are non-negative and such that each row sums to 1 (thus making each row a valid probability distribution).

Using $L > 1$ allows the instance difficulties and predictor competences to encode more information. An intuitive way to think about this is that we are embedding difficulties and
competences in a common latent space, which can be thought of as jointly clustering them. This is in fact very similar to how matrix factorization methods are used for collaborative filtering in recommender systems.

Our goal is to learn functions $h_\theta$, $d_\phi$, and $c_\psi$, given observations $\mathcal{D}$. To do that, we propose a generative process for our observations. For $i = 1, \dots, N$, we first sample the true label for $x_i$, $y_i \sim \mathrm{Categorical}(h_\theta(x_i))$. Then, for $j \in M_i$, we sample the predictor output $\hat{y}_{ij} \sim \mathrm{Categorical}([Q_{ij}]_{y_i -})$, where $[Q_{ij}]_{y_i -}$ represents the $y_i$-th row of $Q_{ij}$. We derive an EM algorithm for performing inference that is presented in Platanios et al. (2019). The resulting approach can be thought of as introducing a new loss function for training the model $h_\theta$ using multiple noisy labels per training instance, each coming from a distinct noisy predictor. This new loss function introduces latent variables representing the ground truth labels, as well as a couple of auxiliary models that are learned, and which represent the instance difficulties and predictor competences. Perhaps most interestingly, a key difference between this approach and previous work is that we are able to explicitly learn functions that output the likelihood that a predictor will label a specific instance correctly. This enables using this method to perform crowdsourcing actively by assigning annotators to instances they are likely to label correctly, thus reducing redundancy and driving costs down.

3.2 Contextual Parameter Generation

In order to present the second major component of the proposed work we need to first provide some background. We refer to parameterized functions as networks. We denote a network by a lowercase English letter with a lowercase Greek letter subscript (e.g., $f_\theta$), where the Greek letter refers to the network parameters. Therefore, given some input, $x$, the output of the network is simply defined as:

$$y = f_\theta(x). \tag{2}$$

Most deep learning models can be seen as networks. For example, we can have a convolutional neural network (CNN) that takes images as input, transforms them using convolutional filters (i.e., parameters), and produces distributions over labels (e.g., cat or dog). Research in deep learning has resulted in multiple network architectures that can successfully learn to solve various problems, and that each makes different assumptions about its input space. For example, CNNs assume that there is some periodical structure in the input space, whereas recurrent neural networks (RNNs) assume that each part of an input sequence can be processed using the same network parameters.

Never-ending learning requires a system to be able to perform multiple tasks; perhaps even previously unseen tasks that can be formulated in terms of other previously learned tasks. This means that traditional multi-task neural network architectures that use a different output layer for each task (e.g., Caruana, 1997) cannot be used in this context. That is because the set of tasks the system is learning to perform is not known a priori, when the neural network architecture is chosen. This motivates us to treat tasks as separate inputs.

We argue that, for most existing neural network architectures, it is hard or even impossible to encode assumptions about the contexts (e.g., tasks) in which they are used, to share information across these contexts, and to “personalize” them for each context. As we discuss at the end of this section, this limitation could be attributed to the fact that most existing architectures are only able to represent additive interactions between their inputs. Previously, there has been some success in encoding these kinds of assumptions using probabilistic graphical models (PGMs). When working with PGMs, researchers typically first define a prior probabilistic model over how the data observations are generated and then perform inference to obtain a posterior distribution over the model parameters and possibly also latent variables. These generative models are often hierarchical, meaning that the parameters of the distribution from which the observations are sampled, are also often sampled themselves from a higher-level distribution. This results in an interesting type of information sharing across all the different distributions, and has been behind many successful models, such as latent Dirichlet allocation (Blei et al., 2003) and hierarchical Dirichlet processes (Teh et al., 2005). There have been efforts to combine such approaches with neural networks (e.g., Tran et al., 2018), but they are often expensive and impractical for large scale problems. Furthermore, in order to make probabilistic inference tractable they often limit model expressivity.

This motivated us to develop a method called contextual parameter generation (CPG) (Platanios et al., 2018). The core idea behind this method is that, given a network, $f_\theta$, instead of learning $\theta$ directly while training, we define it as:

$$\theta = g_\phi(c), \tag{3}$$

where $c$ is a description of the context in which we are applying the model (for example, if we are encoding text written in English as part of a multilingual machine translation model, the context could simply be a one-hot encoding of the English language). The parameters we learn during training are just those of $g_\phi$, which we refer to as the parameter generation network. This allows us to share information across instances of $f_\theta$ used in different contexts. While we previously had to learn and use different parameters for each context in which $f_\theta$ is used, they are now all generated as a function of the context. For example, instead of using different encoders for text written in English and text written in German, we can now use one encoder and simply generate its parameters as a function of a language representation. Note that we can simply define $g_\phi$ as a lookup table over different contexts, and this would reduce to the previous setting in which there is no information sharing. However, the CPG formulation allows us to impose arbitrary information sharing structures by manipulating the functional form of the parameter generation network, $g_\phi$. For example, we could learn embeddings for all language families and have all Romance language embeddings be defined as linear transforms of the corresponding Romance family embedding. When performing multi-task learning, we can think of each task as a context in which a network processes its inputs. Given a representation of this
Neural Cognitive Architectures for Never-Ending Learning
context, we can generate the parameters of a single universal network that is used for all tasks. The way contexts are defined and processed to generate parameters can thus allow for controlled information sharing across multiple tasks. We refer to networks that employ CPG as contextualized networks, and we let them optionally have some of their parameters be generated by a CPG component, and some be directly learned (e.g., we may not want to generate the parameters of a batch normalization layer using CPG). Note that contextualized networks also have better generalization properties than plain networks because they can be used with previously unseen contexts, as long as the new contexts can be composed out of previously seen contexts (refer to Section 3 for more details).

Previous Uses of CPG. We first introduced CPG in a multi-task setting, as a means to tackle the multilingual machine translation (MT) problem (Platanios et al., 2018). Multilingual MT is challenging due to the low-resource nature of many languages (i.e., it is hard and often impossible to collect the massive amounts of parallel sentences required for training a neural MT system). It is therefore crucial to share information across languages, rather than train multiple pairwise translation models in isolation. In neural MT, when translating a sentence from English to German, we first encode the source sentence to some intermediate representation, and then we decode it into some target language. Applying CPG to this problem, we used the source language (i.e., English) as the context of the encoder and the target language (i.e., German) as the context of the decoder. We learned language embeddings and used linear transforms to obtain the encoder/decoder parameters from these embeddings. We were able to show significant performance gains over the same networks without CPG, especially in the low-resource setting. Furthermore, we showed that CPG allows us to perform zero-shot translation (i.e., translation between pairs of languages that were not observed in the training data), thus indicating that it can be used to generate network parameters for new, previously unseen tasks. We also applied CPG to the problem of question answering over graphs (Platanios* et al., 2019), also known as link prediction. In this case, we are given a source entity (e.g., Pittsburgh) and a relation (e.g., CityInCountry), and are asked to predict the target entity (e.g., USA). Here, different questions correspond to different problems and can be used as the context in which to generate the parameters of a universal question-answering model. We were able to show that CPG outperforms all existing methods and thus establishes a new state of the art for this problem. More recently, we have also started applying CPG to develop better image and video compression methods, and also as a means to handle task composition.

Feeding Context as an Additional Input. Why not just feed the context as another network input?

1. Structured Information Sharing: Similar to the motivation for PGMs described earlier in this section, CPG provides a structured way to share information across contexts.
2. Additive Interactions: Without loss of generality, let us split the input $x$ to a neural network in two parts, $x_0$ and $x_1$. For example, assuming $x$ is a vector, then $x_0$ and $x_1$ are vectors such that when concatenated, they form $x$. Most neural network architectures currently in use only allow for interactions of the following form:

   $$y = f_\theta\big(h^0_{\phi_0}(x_0) + h^1_{\phi_1}(x_1)\big), \tag{4}$$

   where $f$, $h^0$, and $h^1$ are arbitrary functions, and $y$ is the output of the neural network. This form is very restrictive. For example, it cannot be used to represent simple if-then-else rules such as "if $x_0 = 2$, then $2x_1$ else $5x_1$". This is especially important for multi-task learning because we can think of $x_0$ as the description of some task and we might want to condition on that task while processing the rest of the model inputs. This would be the case for a mixture of experts model, for example. In order to represent this kind of interaction we have to explicitly encode it in the neural network architecture. Ideally, we want the model to be able to learn these interactions on its own, if they are necessary, instead of having them be hardcoded as part of the model architecture. CPG in fact allows for multiplicative or even polynomial interactions between $h^0_{\phi_0}(x_0)$ and $h^1_{\phi_1}(x_1)$, which would allow us to represent if-then-else rules.
3. Deployment: CPG allows us to generate a context-specific model and later use it without involving the parameter generation network. This can be beneficial for deployment. For example, if we only care about English-to-German translation due to an upcoming vacation trip, we could have Google Translate generate a translation model for this language pair and then store that model on a mobile device for offline use.
4. Modeling Assumptions: Different neural network architectures make different assumptions. For example, CNNs assume spatial invariance for the inputs (meaning that they contain repeating patterns in different locations). This assumption is unlikely to hold for an arbitrary context vector and so it is unreasonable to just concatenate an arbitrary context vector with an image and feed the result to a CNN. CPG avoids this problem because the architecture modification it entails does not affect the assumptions made by a network about the input data.

3.2.1 Related Work

Ha et al. (2018) are probably the first to introduce an idea similar to that of having one network (called a hypernetwork) generate the parameters of another. However, in that work, the inputs to the hypernetwork are structural features of the original network (e.g., layer size and index). Al-Shedivat et al. (2017) also propose a related method where a neural network generates the parameters of a linear model. Their focus is mostly on interpretability (i.e., knowing which features the network considers important). Dumoulin et al. (2018) provide a comprehensive review of some more related work from other fields, such as computer vision, that was published concurrently to our machine translation work. Furthermore, CPG is generic enough so that many existing methods can
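
To make the contrast between the additive form of Eq. (4) and CPG concrete, the following is a minimal NumPy sketch and not the implementation used in our experiments. It assumes a one-hot context, a linear parameter generator $g_\phi$ whose matrix `phi` is set by hand rather than learned, and a "universal" network with a single weight, $y = w \cdot x_1$. The names `generate_params`, `contextualized_net`, `one_hot`, and `rule` are hypothetical and only serve this illustration.

```python
import numpy as np

# Hypothetical contexts: index 0 stands for "x0 == 2", index 1 for "x0 != 2".
NUM_CONTEXTS = 2

# Parameter generator g_phi: a linear map from a context embedding to the
# parameters of the universal network. The universal network has a single
# weight w, and phi is fixed by hand here (assumption: in practice it is learned).
phi = np.array([[2.0],   # weight generated for context 0
                [5.0]])  # weight generated for context 1


def generate_params(context: np.ndarray) -> np.ndarray:
    """g_phi(context): linear transform of the context embedding."""
    return context @ phi  # shape (1,)


def contextualized_net(context: np.ndarray, x1: float) -> float:
    """Universal network y = w * x1, where w is generated from the context."""
    (w,) = generate_params(context)
    return float(w * x1)


def one_hot(index: int) -> np.ndarray:
    """One-hot context embedding for the given context index."""
    e = np.zeros(NUM_CONTEXTS)
    e[index] = 1.0
    return e


def rule(x0: int, x1: float) -> float:
    """The rule 'if x0 = 2 then 2 * x1 else 5 * x1', evaluated through CPG."""
    context = one_hot(0 if x0 == 2 else 1)
    return contextualized_net(context, x1)
```

The output is a product of a function of the context and $x_1$, which is the kind of multiplicative interaction that Eq. (4) does not provide. With one-hot contexts, the linear generator in this sketch is exactly a lookup table, i.e., the special case with no information sharing; sharing would come from giving the context embeddings more structure. For deployment, `generate_params` can be called once for a fixed context and the resulting weight stored and reused without the generator.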
brain, and (ii) cross-modal interactions between the modality-specific spokes are mediated by a single trans-modal hub that is located bilaterally in the anterior temporal lobes (ATLs) of the human brain. A visualization is shown in Figure 1. This model of the human brain serves as one of the main inspirations for the high-level design of the proposed architecture.

3.4.3 Proposed Architecture

We propose a novel neural cognitive architecture (NCA) for general learning and intelligence. The proposed architecture is inspired by the Hub-and-Spoke model for human cognition (Rogers et al., 2004; Ralph et al., 2017), as well as human goal priming (Custers and Aarts, 2005; Aarts et al., 2008; Papies, 2016; Takarada and Nozaki, 2018). It consists of the following parts (an overview is shown in Figure 2):

Perception and Action Spokes: Sensing input data consists of converting them to a common reasoning space that is independent of the data modality. Much of the complexity of models like BERT³ (Devlin et al., 2018) lies in perception, rather than reasoning. In fact, for BERT, reasoning often consists of a single linear layer, while perception consists of a Transformer (Vaswani et al., 2017). Similarly, taking an action consists of converting a common reasoning representation to some output data. This can include taking actions in some environment, or generating data of some structure (e.g., a probability distribution over labels).

Reasoning Hub: Reasoning is performed in a latent space that is independent of the data modalities and the problem being solved. We argue that this is necessary for general learning and intelligence, as it allows for flexible sharing of information across different modalities and problems. Moreover, memory and simulations of the external world are all defined over the same latent space, abstracting away details about the perceived data that are not relevant to reasoning. Reasoning is described in detail in Section 3.4.5.

Goal Contextualization: The problems that the system is learning to solve are processed such that they can contextualize any part of the neural cognitive architecture. This allows for the behavior of the system to vary across different problems, while still sharing information between them, similar to how it was done for machine translation, as described in Section 3.2. It further allows the system to generate its own target problems that it learns to solve. This is perhaps the most novel aspect of the proposed architecture and, as shown in the following paragraphs, derives its inspiration from human goal priming in psychology. It is described in more detail in Section 3.4.6.

³ BERT is the current state-of-the-art model for a multitude of natural language processing tasks.

This is inspired by work in multiple areas:

Deep Learning. Deep neural networks are very effective at learning abstract representations for arbitrary data modalities, which can then be used to perform multiple diverse tasks (e.g., Simonyan and Zisserman, 2015; He et al., 2016; Peters et al., 2018; Devlin et al., 2018). The typical deep learning workflow is that for each problem researchers build large deep neural models that pool together information from different sources and that are trained independently of each other. An alternative approach is to pre-train large models in a problem-independent manner and then fine-tune them for each problem (e.g., Peters et al., 2018; Devlin et al., 2018; Finn et al., 2017). However, most of these approaches do not allow for information learned from solving one problem to directly help solve another, something at which human intelligence excels. For example, BERT is pre-trained as a language model and is then fine-tuned separately for problems such as question answering and textual entailment. Therefore, learning to answer questions well does not affect how well BERT reasons about textual entailment. This motivates us to find ways to couple the learning of multiple problems in a way that results in constructive interference between the different problems, meaning that learning to solve one well helps the system learn to solve others faster. It further motivates us to treat perception (i.e., learning informative representations of the input data) and reasoning (i.e., learning to solve each task in the latent space of learned representations) separately, as most deep neural networks that are trained end-to-end to solve multiple tasks effectively do that, and most of their complexity is often related to perception (e.g., in BERT problem-specific reasoning is performed by a linear layer).

Kernel Methods. Before deep learning was popular, some of the most successful machine learning methods were making use of kernels (Hofmann et al., 2008), by formulating learning problems in a reproducing kernel Hilbert space (RKHS) of functions defined on the data domain, expanded in terms of a kernel. These kernels effectively project the data to a space where reasoning is modeled as a linear problem. Such a projection can be thought of as a perception module, in terms of our formulation. Given the success of kernel methods, this further motivates separating the treatment of perception and reasoning.

Neuroscience. Neuroscientists have also observed that information processing in the human brain goes from low-level (i.e., sensory input processing) to high-level (i.e., reasoning). There is ample evidence to support this both for auditory (Kaas and Hackett, 1998; Rauschecker, 1998; Romanski et al., 1999; Wessinger et al., 2001; Warren and Griffiths, 2003; Zatorre and Belin, 2001; Zatorre et al., 2004) and visual information (Mishkin et al., 1983; Felleman and Van Essen, 1991). Furthermore, there has been evidence that the development of primary visual cortical networks is more rapid than the development of primary motor networks in humans (Gervan et al., 2011). This motivates the idea that perception is a low-level functionality that is not necessarily problem-specific and that can be learned before learning to reason and take actions. In addition to this, there is evidence that the brain relies on a set of canonical neural computations that are reused for different problems (Carandini and Heeger, 2012). For example, normalization of neural responses is one such operation that is thought to underlie multiple other operations such as the representation of odours, the modulatory effects of visual attention, the encoding of value, and the integration
[Figure 2 building blocks. Network: represents any machine learning network (e.g., recurrent neural network), drawn as a layer with parameters between an input and an output. Contextualized Network: represents a network that has been modified a priori to use contextual parameter generation (CPG), with a parameter generator that produces the layer parameters from a context. Context Compiler: represents a compiler that takes a context specification in a predefined language and produces contexts that can be used by CPG; it compiles a specification to a function (potentially a composition) that outputs a single context vector.]
Figure 2: Overview of the proposed Neural Cognitive Architecture (NCA), its main building blocks, and a simple example showing an instance of this architecture for a classification problem. In the example, the noun "Washington" alone, without the provided image, is ambiguous and would probably not result in a high probability of referring to a city and not a person at the same time.
of multi-sensory information. This also supports the idea of abstracting over reasoning, by making the operations used to perform various tasks common across all tasks and finding other ways to specialize them.

Psychology. There has been significant evidence that priming is characteristic of human behavior (Tulving et al., 1982; Bargh and Chartrand, 2014; Weingarten et al., 2016). Priming is a technique where exposure to one stimulus influences the brain's response to a subsequent stimulus. For example, the word "dog" is recognized more quickly after having seen the word "animal". Priming can be perceptual, semantic, or conceptual. Perhaps most important for this thesis is goal priming (Custers and Aarts, 2005; Aarts et al., 2008; Papies, 2016; Takarada and Nozaki, 2018). Goal primes are cues that trigger goal-directed cognition and behaviour.
Here, a goal refers to a state or behaviour that has reward value and therefore motivates a person to pursue it. For example, priming the concept of drinking can increase soda consumption (Veltkamp et al., 2008), and priming the goal of impression formation leads to better memory organization and recall compared to a mere memorization goal (Chartrand and Bargh, 1996). Goal contextualization in our architecture is the computational equivalent of goal priming, in that having specific goals changes the way in which the different architecture parts function.

Benefits of Modularity. An important outcome of the Hub-and-Spoke architecture design is reducing the per-problem sample complexity. This means reducing the amount of training data required to learn to solve each problem. This is because, for multiple existing machine learning models, most of the model complexity lies in perception (e.g., BERT). This becomes more pronounced in reinforcement learning systems playing video games, where they receive as input the raw pixel values of video frames as they are being rendered while playing, and they are tasked with learning to extract information from these raw values (e.g., Bellemare et al., 2013; Bhonker et al., 2016; Vinyals et al., 2019). Such systems require massive amounts of training data to learn, and we argue that this is mostly due to their perception components. If these components were shared across multiple problems then their effective per-task sample complexity would be reduced significantly. In fact, Parisotto et al. (2015) show that pre-training agents on some arcade games often helps them learn faster when deployed to play other, new, arcade games. Thus, assuming we can share the perception component across different problems, we only need problem-specific training data for the reasoning component. Moreover, due to the shared reasoning hub, the per-problem sample complexity can be further reduced, because the same reasoning component is used for solving all problems. An interesting setting is one where the perception component can be trained using supervised tasks with differentiable loss functions, and, at the same time, be shared with reinforcement learning (RL) tasks where the reward function is unknown and certainly not differentiable. We believe that this would significantly reduce the sample complexity of the RL tasks. In Section 4, we propose a case study for testing this hypothesis.

The proposed architecture components reflect assumptions about human cognition that are based on facts derived from psychology experiments, thus rendering the proposed architecture a cognitive architecture. In the following sections we describe the different architecture components in more detail. Finally, in Sections 3.4.7 and 3.4.8, we describe how learning is performed. Note that not all architectural components that we describe in the following sections are necessary for all problems. Therefore, for some problems, some of the components may be ignored (e.g., a world simulator may not be relevant for a text classification task).

3.4.4 Perception and Action Spokes

We define perception and action spokes using two kinds of data modalities: (i) perception modalities that represent data types that a model can receive as input, and (ii) action modalities that represent data types that a model can produce as output. Each kind of modality has a different specification:

Perception Modalities: Input space modalities are defined as tuples (DataType, SensorNetwork), where DataType is the type of data supported by this modality (e.g., String), SensorNetwork is a contextualized network that takes inputs of type DataType and produces vectors of size $L_s$, and $L_s$ is the reasoning input representation size. Given some data of type DataType (e.g., a string of characters with type String), and optionally, a context (described in the next section), the SensorNetwork produces a vector of size $L_s$ that the reasoning module can understand.

Action Modalities: Output space modalities are defined as tuples (DataType, EffectorNetwork), where DataType is the type of data supported by this modality (e.g., scalar number in the interval [0, 1]), EffectorNetwork is a contextualized network that takes as input vectors of size $L_e$ and produces outputs of type DataType (e.g., a linear transformation followed by a sigmoid activation function), and $L_e$ is the reasoning output representation size.

Note that a modality can act as both a perception and an action modality, as long as both a sensor and an effector network are provided. In this case, we also allow the sensor and the effector networks to optionally share some or all of their parameters. Examples of various modalities are shown in Table 1. Modalities are defined such that, for any given input (or output) data type, there is a single matching perception (or action) modality that will be used.

Due to their generic definition, modalities can be composed. For example, given perception modalities $P_1$ and $P_2$, we can construct a pair modality Pair[$P_1$, $P_2$], whose data type is a pair of $P_1$.DataType and $P_2$.DataType, and whose sensor network is a function of the two modalities' sensor networks. For example:

   $$x \mapsto \mathrm{Pool}\big(P_1.\mathrm{SensorNetwork}(x[0]),\; P_2.\mathrm{SensorNetwork}(x[1])\big).$$

Compositionality gives the proposed NCA high expressive power with respect to the kinds of data it can handle. Compositionality, more generally (e.g., also at the problem space), is a core aspect of the proposed architecture and it is discussed in more detail in Section 3.4.6.

Communication and Language. An interesting direction that we wish to explore in the long term is to add support for a modality that corresponds to communication with other agents (i.e., an artificial learned language). This modality would act as both a perception and an action modality and we could define its data type as a fixed-size vector containing numeric values, for example. We can test for the ability of
Modality Examples
Data Type    | Sensor Network | Effector Network       | Description
-------------+----------------+------------------------+-------------------------
String       | BERT Encoder   | RNN Decoder            | Text
Image        | CNN            | Deep Convolutional GAN | Image
Scalar[0,1]  | –              | MLP→Sigmoid            | Binary Distribution
Vector[0,1]  | –              | MLP→Softmax            | Categorical Distribution

Table 1: Example modalities. RNN stands for Recurrent Neural Network, CNN for Convolutional Neural Network, GAN for Generative Adversarial Network, MLP for Multi-Layer Perceptron, Scalar[0,1] for a single number in the interval [0, 1], and Vector[0,1] for a vector containing numbers in the interval [0, 1].
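
The modality definitions of Section 3.4.4 and the pair composition can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class `Modality`, the helpers `pool` and `pair`, the toy sensor functions, and the choice of an element-wise mean for Pool are all hypothetical, the optional context input is omitted, and the toy sensors merely stand in for real networks such as the BERT encoder or CNN of Table 1.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Optional

Vector = List[float]
L_S = 4  # reasoning input representation size (arbitrary value for this sketch)


@dataclass(frozen=True)
class Modality:
    """A modality as a (DataType, SensorNetwork, EffectorNetwork) record."""
    data_type: str
    sensor_network: Optional[Callable[[Any], Vector]] = None    # data -> vector of size L_S
    effector_network: Optional[Callable[[Vector], Any]] = None  # vector -> data


def pool(u: Vector, v: Vector) -> Vector:
    """Element-wise mean; one possible choice for Pool (assumption)."""
    return [(a + b) / 2.0 for a, b in zip(u, v)]


def pair(p1: Modality, p2: Modality) -> Modality:
    """Pair[P1, P2]: x -> Pool(P1.SensorNetwork(x[0]), P2.SensorNetwork(x[1]))."""
    return Modality(
        data_type=f"Pair[{p1.data_type}, {p2.data_type}]",
        sensor_network=lambda x: pool(p1.sensor_network(x[0]),
                                      p2.sensor_network(x[1])),
    )


def toy_string_sensor(s: str) -> Vector:
    """Toy stand-in for a text encoder: repeats the string length."""
    return [float(len(s))] * L_S


def toy_image_sensor(pixels: List[float]) -> Vector:
    """Toy stand-in for an image encoder: repeats the mean pixel value."""
    return [sum(pixels) / len(pixels)] * L_S


STRING = Modality("String", sensor_network=toy_string_sensor)
IMAGE = Modality("Image", sensor_network=toy_image_sensor)
STRING_IMAGE = pair(STRING, IMAGE)
```

Because `pair` returns another `Modality`, composed modalities can themselves be composed, which is the property the text relies on when it describes compositionality over data types.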
agents to learn a language and communicate effectively by ing unit input at time t which comes from the perception
conducting experiments in a multi-agent setting where solv- component (note that if the system operates in a real-time
ing certain problems requires coordination and collaboration. environment, this may be different across different reasoning
This is related to the work of Sukhbaatar et al. (2016) and steps), and STOPt is a boolean flag representing the decision
Andreas et al. (2017). of the reasoning unit about whether or not to stop reason-
ing at time t. Finally, aT is fed to the action component,
3.4.5 R EASONING H UB where T is such that STOPT = True. Enhancing the rea-
soning unit with a state significantly increases its modeling
The reasoning component of the proposed architecture con- capacity; it can now even perform a search with backtracking
sists of a few parts. At the core lies the reasoning unit. This support (e.g., dynamic programming). This initial approach
unit transforms the perception component output to an input is inspired by the work of Graves (2016).
for the action component, and is represented as a contextual-
ized network. It is generally accepted that not all problems Memory. In designing general learning architectures, we
require the same amount of reasoning (Kahneman and Egan, need allow for an explicit way for learning systems to re-
2011). For example, solving an algebra problem requires member experiences. This can happen implicitly, through
more thinking than recalling your own name. Therefore, the learned model parameters (assuming high capacity net-
we argue that the ability to reason for arbitrary amounts of works), but it can also be modeled explicitly by equipping
time, depending on the problem being solved, is an important the agent with a memory component. Cognitive architectures
aspect of general learning and intelligence. Most existing ma- often use some form of memory that is symbolic, such as a
chine learning approaches do not allow for a variable amount knowledge-base (KB) that contains learned facts. We propose
of reasoning, as the amount of computation is predefined and to add a memory component to our architecture, where all
fixed, as part of the network architecture. The few attempts memories are represented in the latent reasoning space, rather
that do allow for this have been limited to very specific prob- than being grounded in the perception or action modality
lems and have only shown small gains over preexisting fixed data types. This allows the memory to abstract away details
computation time approaches (Graves, 2016; Dehghani et al., about the data that are not relevant to the reasoning process.
2019). In order to enable this capability in the proposed neu- The way memory is added to our architecture is through the
ral cognitive architecture, we decided to make the reasoning reasoning unit, which is enhanced such that it can read and
unit recursive, meaning that its output can optionally be fed write to memory, while performing its transformation. More
back as input again, to recurse over the reasoning transforma- formally, we define the memory component as consisting of
tion. Each application of the reasoning transformation can two functions, MREAD : K 7→ V and MWRITE : (K, V ) 7→ (),
be thought of as a reasoning step. The reasoning unit also where K and V correspond to the memory key and value
outputs a decision on whether or not to stop, so that it can types, respectively, and the “7→” notation is used to denote
stop reasoning and produce an output at some point. The the function input and output types5 . Possible design choices
recursive nature of this unit introduces several challenges for the memory include memory networks (Sukhbaatar et al.,
with respect to how it should be trained. Our initial plan is to 2015), or even KBs defined over the latent reasoning space.
incur a pondering cost, which is proportional to the number We propose to start with a simple, yet novel6 , attention-based
of reasoning steps used, and add that cost to the loss function memory mechanism. In this case, the memory is defined
used to train the reasoning unit. as a pair of matrices, Mk ∈ RM ×Dk , and Mv ∈ RM ×Dv ,
Recursion. More formally, at each time step t, the reasoning where M is the memory size, Dk is the dimensionality of
unit performs the following transformation4 : the keys, and Dv is the dimensionality of the values stored in
the memory. Mk contains the memory keys and Mv contains
[at+1 , st+1 , STOPt+1 ] = R(pt , at , st ), (5)
the corresponding memory values. Let us refer to the K-
where R represents the reasoning unit transformation, at rep- valued input of MREAD and MWRITE as the query. Queries are
resents the reasoning unit output at time t, st represents the
5
internal state of the unit at time t, pt represents the reason- We use “()” to represent the “void” type, meaning that the
function returns no values, and is only used for its side effects.
4 6
This is not equivalent to simply using a recurrent neural network Novel because we are not aware of prior work that learns a
(RNN), because the number of recursion steps is not predetermined. memory indexing mechanism.
11
Neural Cognitive Architectures for Never-Ending Learning
defined as vectors of size Dk . When a component wants to World Simulator. An important aspect of human reasoning
access a value stored in memory, it needs to provide a query is simulating the external world. Jay Wright Forrester, the fa-
“describing” that value7 . We also define an indexing function, ther of system dynamics, described a mental model as: “The
I : K 7→ ∆M , where ∆M denotes the M -simplex, which image of the world around us, which we carry in our head,
contains all vectors of size M whose elements are in [0, 1] is just a model. Nobody in his head imagines all the world,
and sum to 1. Intuitively, the indexing function maps from a government or country. He has only selected concepts, and
query to a distribution over memory locations The indexing relationships between them, and uses those to represent the
function that we plan to use initially is the scaled dot-product real system.” (Forrester, 1971). There is significant evidence
attention by Vaswani et al. (2017): of the importance of simulation in neuroscience (Singer et al.,
qM T
2018). For example, Nijhawan (1994) shows that to strike a
I(q) = Softmax √ k , (6) cricket ball one must estimate its future location, rather than
Dk where it is now. Bialek et al. (2001) show that prediction has
which effectively measures the similarity between the query the fundamental theoretical advantage that a system which
and all the memory keys. Then, the memory read function is parsimoniously predicts future inputs from their past, and that
defined as (in pseudocode): generalizes well to new inputs, is likely to contain representa-
MREAD (q) : return I(q)Mv , (7) tions that reflect their underlying causes. Furthermore, they
show that much of sensory processing involves discarding
which returns a convex combination of all stored values, irrelevant information, such as that which is not predictive of
based on the computed index. The memory write function is the future, to arrive at a representation of what is important
similarly defined as: in the environment for guiding action. Another related line of
MWRITE (q, v) : Mv := λI(q)v + (1 − λI(q))Mv , (8) work is in the importance of auditory feedback (i.e., when we
where := is used to denote assignment, and λ is an M -sized hear ourselves speaking). The study of neural mechanisms
vector with values in [0, 1] that denotes the strength of the underlying audio-vocal integration has shown that auditory
write operation. If λ is closer to 1, then old values are forgot- feedback may be used for updating internal representations
ten faster. λ can be set adaptively, based on how often each of mappings between voice feedback and speech motor con-
value is being read. For example, it can be set closer to 1 for trol. One of the earliest demonstrations of the role of auditory
values that are rarely read. The learnable parameters of this feedback in voice control is the Lombard effect, where people
learning mechanism consist of the parameters of I, and the raise their voice amplitude to overcome environmental noise
memory keys, Mk . We can initialize Mv with zeros. (Lombard, 1911; Lane and Tranel, 1971). A related phe-
nomenon is side-tone amplification, in which people increase
Allowing the memory indexing mechanism to be learnable, their voice loudness when their self-perceived loudness is
by using separate keys and values8 , enables associative learn- too quiet to achieve a communication goal, and vice versa
ing and memories, which have been shown to be important (Lane and Tranel, 1971). Given this strong evidence from
aspects of human cognition (Fanselow and Poulos, 2005; neuroscience, we argue that in an interactive setting, where
Ranganath and Ritchey, 2012). In psychology, associative the learning agent keeps interacting with an outside world—
memory is defined as the ability to learn and remember the relationship between unrelated items (e.g., remembering the name of someone or the aroma of a particular perfume). This is enabled by our indexing mechanism because it allows two unrelated values to have similar keys, mainly because we learn keys separately from the values they correspond to. Note that our proposed memory mechanism also allows for a natural way of forgetting, where the keys of unused values change during learning to the point where they may be used for storing other unrelated values instead.

We also allow the sensor and effector networks to optionally read from this memory. This can be important in cases where perception depends on past experiences. Tulving et al. (1982) provide some evidence that this is true of human perception (this is known as priming in the psychology literature).

⁷ Note that the querying mechanisms are also learned, similar to the indexing mechanism.
⁸ As opposed to indexing by comparing queries to values, as done in memory networks.

which may also include other agents, being able to simulate that world can be very important. For example, this ability could enable a search over the potential implications its decisions will have on that outside world.

We thus propose to add a world simulator component to our neural cognitive architecture. Formally, the simulator $S$ performs the following prediction:

    \hat{p}_{t+1} = S(p_t, a_{t+1}),    (9)

where $\hat{p}_{t+1}$ is an estimate of $p_{t+1}$. Furthermore, we allow the world simulator to read from memory (as defined in the previous paragraph), but not write to it. Intuitively, the world simulator tries to predict the next perception input, given the current perception input and action output, while operating only in the latent reasoning space. Similar to the memory component, this allows the simulator to abstract away information that is not relevant to the problems the system is learning to solve. This type of world simulation in a latent reasoning space is also supported by neuroscientific evidence (e.g., Keller et al., 2012).
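The prediction in Equation 9 can be illustrated with a deliberately small sketch. It assumes a linear simulator trained by stochastic gradient descent on the squared prediction error, with made-up two-dimensional latent dynamics standing in for the world. The names `simulate`, `sgd_step`, `toy_world`, and `train`, as well as the dimensions and the learning rate, are hypothetical, and memory reads and the sensor networks are left out. It is not the simulator of the proposed architecture, only one simple instance of learning $S$ from $(p_t, a_{t+1}, p_{t+1})$ triples.

```python
import random


def simulate(W, p_t, a_next):
    """Hypothetical linear simulator S(p_t, a_{t+1}): returns the predicted p_{t+1}."""
    x = p_t + a_next  # concatenate latent perception and action
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def sgd_step(W, p_t, a_next, p_next, lr):
    """One gradient step on 0.5 * ||S(p_t, a_{t+1}) - p_{t+1}||^2; returns that loss."""
    x = p_t + a_next
    prediction = simulate(W, p_t, a_next)
    err = [pred - target for pred, target in zip(prediction, p_next)]
    for i, e in enumerate(err):
        for j, xj in enumerate(x):
            # The gradient of the loss with respect to W[i][j] is err[i] * x[j].
            W[i][j] -= lr * e * xj
    return 0.5 * sum(e * e for e in err)


def toy_world(p_t, a_next):
    """Made-up ground-truth latent dynamics, used only to generate training data."""
    return [0.9 * p_t[0] + 0.5 * a_next[0], 0.8 * p_t[1] - 0.3 * a_next[0]]


def train(steps=2000, lr=0.1, seed=0):
    """Fit the simulator on randomly sampled (p_t, a_{t+1}, p_{t+1}) triples."""
    rng = random.Random(seed)
    W = [[0.0] * 3 for _ in range(2)]  # 2 latent dimensions, 1 action dimension
    losses = []
    for _ in range(steps):
        p_t = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
        a_next = [rng.uniform(-1, 1)]
        losses.append(sgd_step(W, p_t, a_next, toy_world(p_t, a_next), lr))
    return W, losses


W, losses = train()
first = sum(losses[:100]) / 100
last = sum(losses[-100:]) / 100
print(f"mean loss, first 100 steps: {first:.4f}; last 100 steps: {last:.6f}")
```

Because the toy dynamics are themselves linear, the learned weights approach the coefficients used in `toy_world`. A nonlinear simulator, or one that also reads from memory, would replace `simulate` while keeping the same training signal.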
Recently, Ha and Schmidhuber (2018) proposed using an RNN-based world simulator for playing games in an RL setting. They use a variational auto-encoder (VAE) to compress the input images to a smaller vector representation and then learn a model that simulates the environment in this vector space. Our proposal differs in that we simulate the world in the latent reasoning space that our system learns. This should help us obtain a representation whose information content is more relevant to the reasoner.

3.4.6 Goal Contextualization

Even though deep learning methods are very effective at learning representations for arbitrary data modalities, they are often treated as black-box methods that offer little control over how information is shared across different tasks and over what exactly the networks are learning. For example, we can rarely guarantee that a network will generalize well to new tasks, and we often have to keep training the network with new problem-specific data in order for it to generalize better. Furthermore, deep learning approaches often make it impossible to generalize to new tasks for which we might have no data at all. However, most real-world problems can be defined in terms of simpler problems (e.g., translating sentences relies on first being able to translate single words). Therefore, we argue that the ability to represent problems in a way such that they can be transformed and composed out of other problems is an important aspect of general learning and intelligence. As discussed in Section 3.2, this motivated our recent work in contextual parameter generation (CPG) for machine translation (Platanios et al., 2018) and question answering (Platanios* et al., 2019), and forms the basis of contextualization. In the proposed neural cognitive architecture, contextualization plays the important role of emulating the goal priming mechanism that is inherent in human intelligence and learning. We now describe how this is achieved, in three parts: (i) we first describe how problems (or goals) are specified through some language, (ii) we then define an architectural component that compiles the problem specification to a representation that can be used to contextualize other parts of the NCA by using CPG, and (iii) we describe how this allows the learning system to generate its own target problems (or goals) that it aims to learn.

As shown in Figure 2, we also allow the sensor and effector networks to be contextualized because perception and action are often not independent of the problem being solved. This is motivated by the fact that priming in humans can be perceptual, semantic, and conceptual (Bargh and Chartrand, 2014). From a machine learning perspective, we have also shown the usefulness of contextualizing equivalents of perception and action modules, when we proposed using CPG for universal neural machine translation (Platanios et al., 2018).

Problem Specification. We first need to define a representation for problems. We propose to use a fixed language for this representation, which could take multiple forms:
• Fixed-Size Vector: Problems could be represented as continuous-valued, fixed-size vectors (e.g., Snell et al., 2017; Wang et al., 2017b; Grover et al., 2018). For example, given a fixed number of pre-specified problems, the system may learn vector embeddings to represent them. The main disadvantage of this approach is that the vector representations of learning problems may not be interpretable.
• Natural Language: This could be a problem description that is provided as input to the system (e.g., “Identify human faces in the input image.”). This is the approach taken, for example, by McCann et al. (2018).
• Structured Language: This could be first-order logic (e.g., “Collect[JellyBean]∧¬Collect[Onion]”), or more general (e.g., “If[JellyBean]Then[Collect] Else[Avoid]”, or even a Python program).

Problem Compilation. Given a problem specification, we need to define a compiler that takes it as input and produces a composition of learnable functions that, when evaluated, results in a single structured representation for the problem (e.g., a set of vectors). This representation can then be used to contextualize different parts of the proposed architecture (e.g., sensor or effector networks, or parts of the reasoner that are discussed in the next section). Given that the representation can potentially be a set of vectors, we could use different parts of that structure to contextualize different parts of the architecture. For example, text sensor networks could be contextualized using an embedding of the language in which the text is written. Note that contextualizing networks is optional, as it is sometimes not necessary (e.g., the effector network used in the bottom of Figure 2 is not contextualized).

The choice of the problem compiler is important. For fixed-size vectors and natural language specifications, the compiler could be as simple as a neural network (e.g., a multi-layer perceptron, a recurrent neural network, or a Transformer network). However, for other structured languages the compiler would be more similar to programming language compilers. Some examples of representations and their corresponding compiled forms are shown in Table 2. Following the examples above, given a problem specification that is written as a Python program, we could also compile it into a composition of learnable functions.

This definition of problem specifications and problem compilers allows us to make the contextualization mechanism very flexible and extensible by introducing operators that compose compiled forms in arbitrary ways. For example, we could have two problem specifications, each with their own compiler, and a separate operator that allows us to merge the two compiled forms, resulting in a single final context vector.

Problem Generation. An important aspect of human learning is that, even though nature provides us some reward signals for our actions (e.g., eating resolves hunger), we often “invent” new problems that we learn to solve. We could argue that this is a way of structuring much larger overarching problems into multiple subproblems. This aspect of human behavior is very interesting and, at the same time, not really tackled at all by current machine learning systems. Therefore, we propose to let our learning system “invent” problems on its own. For this section, we will use a reinforcement learning setting where a learning agent can perceive certain things about the
environment in which it “lives” and take actions. Oftentimes, the agent receives a reward, but it may not know why. Thus, in such a setting, it would make sense for the agent to try to “invent” problems to solve that would result in higher collected rewards. We propose to introduce one additional action modality that allows the agent to generate problem specifications, which are fed directly into the problem compiler⁹ and can contextualize multiple parts of the architecture.

⁹ In this case, we assume that no problem specification is provided to the agent as input.

Table 2: Example uses of the problem compiler. We use c with different subscripts to denote context vectors representing primitives in the problem specification language, and g with different subscripts to denote transformation functions for context vectors (which could be defined as learnable neural networks, for example).

For the fixed-size vector specification format, this could be implemented by having the effector network output a vector representing the problem. Perhaps more interestingly though, we could define a structured language that only depends on the agent’s perception and action modalities. This would allow the agent to generate arbitrary problem specifications that only depend on what it is able to perceive and how it can act. For example, given a perception modality that identifies the types of items in the environment, and an action modality that can collect items, we could define the problem specification language to be:

    (¬)Collect[<Item>](∧(¬)Collect[<Item>])*,

where ¬ denotes the logical NOT operation, ∧ the logical AND operation, parentheses denote optional parts, <Item> denotes any item type that can be sensed by the item identification perception modality, and * denotes that the term in parentheses preceding it can be repeated zero or more times. Note that Collect[·] acts as a logic predicate that can be applied to any item type. An example specification in this language is Collect[JellyBean]∧¬Collect[Onion].

We propose to formalize this problem generation mechanism and allow learning systems to decide on the problems they are learning to solve.

3.4.7 Learning Mechanisms

The architecture components presented so far depend on parameters that need to be learned (e.g., the weights of the neural network layers used). Learning consists of setting the values of these parameters so that the system as a whole can solve the target problems. We assume that all components are formulated as functions that are differentiable with respect to their parameters.¹⁰ Under this assumption, we define our learning mechanism as follows:

¹⁰ Note that this is a very general assumption that holds for most deep learning models and, more generally, for many machine learning models.

1. Each action modality can optionally provide a feedback mechanism. Let us denote the output of the modality’s effector network as a function, $f_\theta(x)$, where $x$ represents all inputs that it depends on. In this case, $f$ represents the composition of all architecture modules that participated in producing this output (i.e., this includes the reasoning module, the goal contextualization module, and the relevant perception modalities). Then, we define the feedback mechanism as a function, $h$, of $f_\theta(x)$ and the external environment. For example, if $f_\theta(x)$ produces a distribution over classes (for a multi-class classification problem), $h$ could be defined as:

       h(f_\theta(x), y) = f_\theta(x) - y,    (10)

   where $y$ is a one-hot representation of the true class assignment provided by the environment. The main constraint on $h$ is that it should produce an output that can be multiplied with $\nabla_\theta f_\theta(x)$.

2. Whenever an action modality produces an output and a corresponding feedback signal is returned from the environment, a gradient-based parameter update is performed along the following direction:

       D_\theta \triangleq \underbrace{h(f_\theta(x), E)}_{\text{External}} \, \underbrace{\nabla_\theta f_\theta(x)}_{\text{Internal}},    (11)

   where $E$ represents the external environment. Note that the first factor, labeled External, is provided by the external environment, whereas the second factor, labeled Internal, can be computed internally by the learning system itself. This separation is interesting from a human cognition perspective because, intuitively: (i) a human would know how to tweak their brain to move their hand further forward (internal update), while (ii) the external environment could tell them that to achieve a particular goal they would need to move their hand forward (external update). The model update could be a stochastic gradient descent step:

       \theta_{t+1} = \theta_t + \lambda_t D_{\theta_t},    (12)
   where $\lambda_t$ represents the learning rate, or it could be a more elaborate update such as Adam (Kingma and Ba, 2014) or AMSGrad (Reddi et al., 2018).

Equation 11 is interesting because it can be used to unify multiple different learning paradigms, such as supervised, semi-supervised, unsupervised, and reinforcement learning, under one formulation. For example:

• Supervised Learning: In this case, the gradient-based updates are computed by differentiating a loss function, $L(f_\theta(x), E)$. This fits in our formulation by defining the feedback mechanism using the chain rule of differentiation:

      h(f_\theta(x), E) \triangleq \frac{\partial L(f_\theta(x), E)}{\partial f_\theta(x)}.    (13)

  For example, for the L2 loss we have $h(f_\theta(x), y) \triangleq f_\theta(x) - y$, and for the cross-entropy classification loss we have $h(f_\theta(x), y) \triangleq -y / f_\theta(x)$, with the division taken element-wise.
• Semi-Supervised Learning: This can often also be formulated in terms of minimizing a differentiable loss function, and thus Equation 13 also applies here.
• Unsupervised Learning: In this case, $h(f_\theta(x), E)$ does not depend on $E$ at all and could be defined internally as well. More specifically, $h$ could be used to perform some sort of self-reflection. This is a direction we wish to explore more in the future, but it may be outside the scope of this thesis; it is described in a bit more detail in the last section.
• Reinforcement Learning: In the case of Q-learning (Watkins and Dayan, 1992), we can have an action modality that predicts the Q-function value (Mnih et al., 2013), and the learning mechanism can then use a supervised learning feedback function, $h$, to learn it using the rewards provided by the environment. In the case of policy gradient methods (Sutton et al., 2000), $h$ can be defined as the advantage function being used, or even some function of the advantage for more complex methods (Mnih et al., 2016; Wang et al., 2017a; Schulman et al., 2017). More interestingly, if we want to use experience replay, as done by Mnih et al. (2013), we could develop a variant where: (i) the perception and action modality parameters are fixed and we train only the problem compiler and the reasoning modules, and (ii) the stored experiences that are replayed are not represented in the original data space, but rather in the more abstract and compact reasoning space. This has the significant advantage of being able to store many more experiences, as memory is typically the bottleneck when using experience replay. Furthermore, we would only be storing information that is relevant to reasoning.

Our learning mechanism manages all feedback mechanisms and determines how to apply the corresponding updates and what learning rate to use for each one. Initially, we plan to use the same learning rate for all parameters and feedback mechanisms, with exponential decay over time. However, our definition allows us to use potentially different learning rates for each parameter and for each learning goal (defined by corresponding feedback mechanisms). Next, we plan to integrate the ideas presented in Sections 3.1 and 3.3 into this learning mechanism. In the long term, we would like to explore other interesting directions such as staged learning.

Staged Learning. The aforementioned reinforcement learning example on experience replay demonstrates the idea of staged learning. In staged learning, we freeze the learning of the perception and action modalities early on during training (e.g., by significantly lowering the corresponding learning rate), and then focus more on training the reasoning module. As discussed in the beginning of Section 3, this would be more similar to how human learning works. Assuming that the perception and action modality networks have already been trained using a diverse set of learning goals, freezing them should allow the reasoning module to tackle new learning goals in a fixed latent space, determined by these pretrained networks. We believe that this will result in significantly faster training times.

Mixed-Paradigm Learning. As shown earlier, our learning mechanism is a generalization of multiple existing learning paradigms, which allows us to mix them together by simply intertwining their gradient-based updates. For example, we can take a gradient descent step towards minimizing a supervised cross-entropy classification loss, and then take a gradient descent step that improves the current Q-function estimate, in a reinforcement learning setting. This introduces multiple challenges that we will have to overcome, including, but not limited to: How do we properly balance the gradient contribution from each learning problem? How do we set the per-learning-goal and per-parameter learning rates? How do we make the learning mechanism scale? How do we properly batch the training data? Other learning paradigms, such as active learning and curriculum learning, can also be supported by designing appropriate perception and action modalities.

3.4.8 Never-Ending Learning

A never-ending learning system must be highly modular and allow for the addition and removal of modules without requiring a complete retraining from scratch. For this reason, we plan to implement the proposed architecture in a highly modular manner, with each module being completely independent of the rest and having a fixed, well-defined, and generic interface. This will allow for adding and removing perception and action modalities and for extending the problem specification language, without requiring a complete retraining from scratch every time such a modification is made. Furthermore, each module will be solely responsible for persisting its state, so that we can keep extending the architecture and avoid training restarts as much as possible. Our goal by the end of this project is for the proposed architecture to have been training for the duration of this thesis, with some modules having been trained for a year and some newer ones only for a few days. This will allow us to provide convincing evidence for its never-ending learning capabilities. Moreover, unlike NELL (Mitchell et al., 2018), we aim for this system to fully avoid complete training restarts throughout its lifetime. Finally, we want to explore directions where the latent reasoning representation is also extensible without requiring complete training restarts. This is a long-term goal that goes
problem and across simulation steps. Performance could be measured in terms of a metric computed over a validation set, or simply in terms of cumulative reward collected, and its rate of change. Furthermore, our JBW simulator already supports multiple agents interacting with the same grid-world and with each other, and thus also allows us to conduct multi-agent experiments (e.g., test for agent communication).

4.2 Case Studies

We propose to perform the following case studies:
• JBW #1: The agent gets a positive reward for collecting some items and a negative reward for collecting some other items. We will test performance when only being provided a single problem specification (e.g., Collect[JellyBean]∧¬Collect[Onion]), and when also trying to classify or recognize items based on their color or scent. The latter case should help us test whether the mixed learning paradigm scenario results in better learning performance for our architecture.
• JBW #2: Same as JBW #1, except that we let the agent generate the problem specification, rather than having it be provided as input from the environment. This should help us test the problem generation and goal contextualization capabilities of the proposed architecture.
• JBW #3: Design some tasks in the JBW that require the use of memory (e.g., counting items) and world simulation, so that we can test the relevant parts of the reasoner.
• Atari Games: Learn to play multiple Atari games using a single learning system. This should help us test whether modularizing and sharing perception and action modalities across games can help reduce the sample complexity of learning to play a new game, after having learned to play some others. In this case, the problem specification language will consist simply of an Atari game identifier.
• NLP: Tackle multiple natural language processing (NLP) problems using a single NCA learning system. An example would be to try to outperform BERT in the problems Devlin et al. (2018) tackle, or to compete in the decaNLP challenge (McCann et al., 2018). It will also be interesting to explore multi-modal NLP problems such as visual question answering and problems involving knowledge graphs. This will allow us to test for multi-modal learning aspects.

5 Proposed Timeline

We propose to structure the proposed thesis work in four main chapters, as discussed in Section 3:
1. Learning from Multiple Noisy Labels [DONE]: Published in (Platanios et al., 2014; 2016; 2017; 2019).
2. Contextual Parameter Generation [01/18-09/19]: We have already performed extensive empirical evaluations of the core idea behind contextual parameter generation (e.g., Platanios et al., 2018; Platanios* et al., 2019). In the next couple of months we aim to obtain a theoretical understanding of when and why contextual parameter generation works and of the limitations it addresses.
3. Self-Reflection [07/19-12/19]: We plan to develop and evaluate the differentiable intrinsic reward mechanism that was briefly discussed in Section 3.3.
4. Unified Architecture [09/19-05/20]: We plan to work towards developing a unified architecture as presented in Section 3.4, in the following order:
   i. Goal Contextualization: Can we use contextual parameter generation to achieve the equivalent of goal priming in a machine learning system? We have already shown that we can do that in a couple of specific applications (Platanios et al., 2018; Platanios* et al., 2019). However, it will be challenging to extend that to a multi-problem setting with a problem specification language handling structured information sharing across the different problems.
   ii. Module Sharing: Can we effectively share the same perception, reasoning, and action modules across multiple problems? There has been some early work in this direction (e.g., Kaiser et al., 2017), but performance generally drops for problems for which we have a lot of data available. It will be challenging to overcome this issue, but succeeding would pave the way for more general learning systems.
   iii. Unified Learning Paradigm: Can we design a unified learning paradigm that encompasses supervised, semi-supervised, unsupervised, and reinforcement learning, and can be successfully used to learn in mixed-paradigm settings? We proposed a first step in this direction in Section 3.4.7, but it may be challenging to successfully deploy such a system in real-world applications. If successful, this could potentially lead to a merge between some seemingly distinct ideas about machine learning.
   iv. Goal Generation: Can we design and implement a system that can successfully generate its own learning goals and let them guide its learning process? How do we evaluate such a capability?
   v. Self-Reflection: Can the proposed neural cognitive architecture achieve self-reflection capabilities? Self-reflection capabilities could form a basis for unsupervised learning and could potentially be achieved by designing appropriate self-sensors and self-effectors (i.e., designing appropriate perception and action modalities). In fact, we hope that we may be able to model the learning mechanism itself as the result of the interaction between certain self-sensors and self-effectors. Given our framing of the learning mechanism using Equation 11, we believe we may be able to model self-reflection by using the internal component of the feedback direction, and having the feedback mechanism be provided by self-sensors, rather than by the external environment.
   Points (iii), (iv), and (v) above are long-term goals that may not be finished during the indicated time frame. However, we hope to make some first steps before defending this thesis in May 2020.
Cao, Y., Yu, W., Ren, W., and Chen, G. (2013). An overview of recent progress in the study of distributed multi-agent coordination. IEEE Transactions on Industrial Informatics, 9(1):427–438.
Carandini, M. and Heeger, D. J. (2012). Normalization as a Canonical Neural Computation. Nature Reviews Neuroscience, 13(1):51.
Forrester, J. W. (1971). Counterintuitive Behavior of Social Systems. Technological Forecasting and Social Change, 3:1–22.
Franceschi, L., Frasconi, P., Salzo, S., and Pontil, M. (2018). Bilevel programming for hyperparameter optimization and meta-learning. arXiv preprint arXiv:1806.04910.
Frénay, B. and Verleysen, M. (2014). Classification in the Presence of Label Noise: A Survey. IEEE Transactions on Neural Networks and Learning Systems, 25(5):845–869.
Gervan, P., Berencsi, A., and Kovacs, I. (2011). Vision First? The Development of Primary Visual Cortical Networks is More Rapid than the Development of Primary Motor Networks in Humans. PLoS ONE, 6(9).
Graves, A. (2016). Adaptive Computation Time for Recurrent Neural Networks. arXiv preprint arXiv:1603.08983.
Grover, A., Al-Shedivat, M., Gupta, J. K., Burda, Y., and Edwards, H. (2018). Learning Policy Representations in Multiagent Systems. arXiv preprint arXiv:1806.06464.
Gulshan, V., Peng, L., Coram, M., Stumpe, M. C., Wu, D., Narayanaswamy, A., Venugopalan, S., Widner, K., Madams, T., Cuadros, J., et al. (2016). Development and Validation of a Deep Learning Algorithm for Detection of Diabetic Retinopathy in Retinal Fundus Photographs. Journal of the American Medical Association, 316(22):2402–2410.
Ha, D., Dai, A., and Le, Q. V. (2018). HyperNetworks. In International Conference on Learning Representations.
Ha, D. and Schmidhuber, J. (2018). World Models. arXiv preprint arXiv:1803.10122.
Haarnoja, T., Pong, V., Zhou, A., Dalal, M., Abbeel, P., and Levine, S. (2018). Composable deep reinforcement learning for robotic manipulation. arXiv preprint arXiv:1803.06773.
He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778.
Hofmann, T., Schölkopf, B., and Smola, A. J. (2008). Kernel Methods in Machine Learning. The Annals of Statistics, pages 1171–1220.
Hu, R., Andreas, J., Rohrbach, M., Darrell, T., and Saenko, K. (2017). Learning to reason: End-to-end module networks for visual question answering. CoRR, abs/1704.05526, 3.
Kaas, J. H. and Hackett, T. A. (1998). Subdivisions of Auditory Cortex and Levels of Processing in Primates. Audiology and Neurotology, 3(2-3):73–85.
Kahneman, D. and Egan, P. (2011). Thinking, Fast and Slow, volume 1. Farrar, Straus and Giroux New York.
Kaiser, L., Gomez, A. N., Shazeer, N., Vaswani, A., Parmar, N., Jones, L., and Uszkoreit, J. (2017). One Model to Learn them All. arXiv preprint arXiv:1706.05137.
Kearns, M. (1998). Efficient Noise-tolerant Learning from Statistical Queries. Journal of the ACM (JACM), 45(6):983–1006.
Keller, G. B., Bonhoeffer, T., and Hübener, M. (2012). Sensorimotor Mismatch Signals in Primary Visual Cortex of the Behaving Mouse. Neuron, 74(5):809–815.
Khetan, A., Lipton, Z. C., and Anandkumar, A. (2017). Learning from Noisy Singly-Labeled Data. arXiv preprint arXiv:1712.04577.
Kingma, D. P. and Ba, J. (2014). Adam: A Method for Stochastic Optimization. arXiv preprint arXiv:1412.6980.
Laird, J. E. (2012). The Soar Cognitive Architecture. MIT Press.
Lane, H. and Tranel, B. (1971). The Lombard Sign and the Role of Hearing in Speech. Journal of Speech and Hearing Research, 14(4):677–709.
Lombard, E. (1911). Le signe de l’elevation de la voix. Ann. Mal. de L’Oreille et du Larynx, pages 101–119.
Madani, O., Pennock, D., and Flake, G. (2004). Co-Validation: Using Model Disagreement on Unlabeled Data to Validate Classification Algorithms. In Neural Information Processing Systems.
McCann, B., Keskar, N. S., Xiong, C., and Socher, R. (2018). The Natural Language Decathlon: Multitask Learning as Question Answering. arXiv preprint arXiv:1806.08730.
Mishkin, M., Ungerleider, L. G., and Macko, K. A. (1983). Object Vision and Spatial Vision: Two Cortical Pathways. Trends in Neurosciences, 6:414–417.
Mitchell, T. M., Cohen, W. W., Hruschka Jr, E. R., Pratim Talukdar, P., Betteridge, J., Carlson, A., Dalvi, B., Gardner, M., Kisiel, B., Krishnamurthy, J., Lao, N., Mazaitis, K., Mohamed, T. P., Nakashole, N., Platanios, E. A., Ritter, A., Samadi, M., Settles, B., Wang, R. C., Wijaya, D., Gupta, A., Chen, X., Saparov, A., Greaves, M., and Welling, J. (2018). Never-Ending Learning. Communications of the ACM, 61(5):103–115.
Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., Silver, D., and Kavukcuoglu, K. (2016). Asynchronous Methods for Deep Reinforcement Learning. In International Conference on Machine Learning, pages 1928–1937.
Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., and Riedmiller, M. (2013). Playing Atari with Deep Reinforcement Learning. arXiv preprint arXiv:1312.5602.
Moreno, P. G., Artés-Rodríguez, A., Teh, Y. W., and Perez-Cruz, F. (2015). Bayesian Nonparametric Crowdsourcing. Journal of Machine Learning Research, 16.
Natarajan, N., Dhillon, I. S., Ravikumar, P. K., and Tewari, A. (2013). Learning with Noisy Labels. In Advances in Neural Information Processing Systems, pages 1196–1204.
Nettleton, D. F., Orriols-Puig, A., and Fornells, A. (2010). A Study of the Effect of Different Types of Noise on the Precision of Supervised Learning Techniques. Artificial Intelligence Review, 33(4):275–306.
Newell, A. (1990). Unified Theories of Cognition. Harvard University Press, Cambridge, MA, USA.
Nijhawan, R. (1994). Motion Extrapolation in Catching. Nature.
OpenAI et al. (2018). Learning dexterous in-hand manipulation. arXiv preprint arXiv:1808.00177.
Papies, E. K. (2016). Health Goal Priming as a Situated Intervention Tool: How to Benefit from Nonconscious Motivational Routes to Health Behaviour. Health Psychology Review, 10(4):408–424.
Parisi, F., Strino, F., Nadler, B., and Kluger, Y. (2014). Ranking and combining multiple predictors without labeled data. Proceedings of the National Academy of Sciences.
Parisotto, E., Ba, J. L., and Salakhutdinov, R. (2015). Actor-Mimic: Deep Multitask and Transfer Reinforcement Learning. arXiv preprint arXiv:1511.06342.
Peters, M. E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., and Zettlemoyer, L. (2018). Deep contextualized word representations. arXiv preprint arXiv:1802.05365.
Platanios, E. A., Al-Shedivat, M., Xing, E., and Mitchell, T. M. (2019). Learning from Multiple Noisy Labels. In Review for Advances in Neural Information Processing Systems.
Platanios, E. A., Blum, A., and Mitchell, T. M. (2014). Estimating Accuracy from Unlabeled Data. In Conference on Uncertainty in Artificial Intelligence, pages 1–10.
Platanios, E. A., Dubey, A., and Mitchell, T. M. (2016). Estimating Accuracy from Unlabeled Data: A Bayesian Approach. In International Conference in Machine Learning, pages 1416–1425.
Platanios, E. A., Poon, H., Horvitz, E., and Mitchell, T. M. (2017). Estimating Accuracy from Unlabeled Data: A Probabilistic Logic Approach. In Advances in Neural Information Processing Systems.
Platanios, E. A., Sachan, M., Neubig, G., and Mitchell, T. (2018). Contextual Parameter Generation for Universal Neural Machine Translation. In Conference on Empirical Methods in Natural Language Processing (EMNLP), Brussels, Belgium.
Platanios*, E. A., Stretcu*, O., Stoica*, G., Poczos, B., and Mitchell, T. (2019). Contextual Parameter Generation for Question Answering. In Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL).
Rajpurkar, P., Jia, R., and Liang, P. (2018). Know What You Don’t Know: Unanswerable Questions for SQuAD. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 784–789. Association for Computational Linguistics.
Ralph, M. A. L., Jefferies, E., Patterson, K., and Rogers, T. T. (2017). The Neural and Computational Bases of Semantic Cognition. Nature Reviews Neuroscience, 18(1):42.
Ranganath, C. and Ritchey, M. (2012). Two Cortical Systems for Memory-Guided Behaviour. Nature Reviews Neuroscience, 13(10):713.
Schuurmans, D., Southey, F., Wilkinson, D., and Guo, Y. (2006). Metric-Based Approaches for Semi-Supervised Regression and Classification. In Semi-Supervised Learning.
Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M., Bolton, A., et al. (2017). Mastering the Game of Go without Human Knowledge. Nature, 550(7676):354.
Simonyan, K. and Zisserman, A. (2015). Very Deep Convolutional Networks for Large-Scale Image Recognition. In International Conference on Learning Representations.
Singer, Y., Teramoto, Y., Willmore, B. D., Schnupp, J. W., King, A. J., and Harper, N. S. (2018). Sensory Cortex is Optimized for Prediction of Future Input. eLife, 7:e31557.
Smith, V., Chiang, C.-K., Sanjabi, M., and Talwalkar, A. S. (2017). Federated multi-task learning. In Advances in Neural Information Processing Systems, pages 4424–4434.
Snell, J., Swersky, K., and Zemel, R. (2017). Prototypical Networks for Few-Shot Learning. In Advances in Neural Information Processing Systems, pages 4077–4087.
Sukhbaatar, S., Fergus, R., et al. (2016). Learning Multiagent Communication with Backpropagation. In Advances in Neural Information Processing Systems, pages 2244–2252.
Sukhbaatar, S., Weston, J., Fergus, R., et al. (2015). End-to-End Memory Networks. In Advances in Neural Information Processing Systems, pages 2440–2448.
Sutton, R. S., McAllester, D. A., Singh, S. P., and Mansour, Y. (2000). Policy Gradient Methods for Reinforcement Learning with Function Approximation. In Advances in Neural Information Processing Systems, pages 1057–1063.
Takarada, Y. and Nozaki, D. (2018). Motivational Goal-Priming With or Without Awareness Produces Faster and Stronger Force Exertion. Scientific Reports, 8(1):10135.
Teh, Y. W., Jordan, M. I., Beal, M. J., and Blei, D. M. (2005).
Ratner, A., Bach, S. H., Ehrenberg, H., Fries, J., Wu, S., and Ré, Sharing Clusters Among Related Groups: Hierarchical Dirich-
C. (2017). Snorkel: Rapid Training Data Creation with Weak let Processes. In Advances in Neural Information Processing
Supervision. Proceedings of the VLDB Endowment, 11(3):269– Systems, pages 1385–1392.
282.
Thrun, S. and Pratt, L. (1998). Learning to learn. Springer.
Rauschecker, J. P. (1998). Cortical Processing of Complex Sounds.
Current opinion in neurobiology, 8(4):516–521. Tian, T. and Zhu, J. (2015). Max-Margin Majority Voting for Learn-
ing from Crowds. In Neural Information Processing Systems.
Reddi, S. J., Kale, S., and Kumar, S. (2018). On the Convergence
of Adam and Beyond. In International Conference on Learning Tran, D., Mike, D., van der Wilk, M., and Hafner, D. (2018).
Representations. Bayesian Layers: A Module for Neural Network Uncertainty.
arXiv preprint arXiv:1812.03973.
Rogers, T. T., Ralph, L., Matthew, A., Garrard, P., Bozeat, S., Mc-
Tulving, E., Schacter, D. L., and Stark, H. A. (1982). Priming Effects
Clelland, J. L., Hodges, J. R., and Patterson, K. (2004). Structure
in Word-Fragment Completion are Independent of Recognition
and Deterioration of Semantic Memory: A Neuropsycholog-
Memory. Journal of experimental psychology: learning, memory,
ical and Computational Investigation. Psychological Review,
and cognition, 8(4):336.
111(1):205.
Van Hasselt, H., Guez, A., and Silver, D. (2016). Deep Rein-
Romanski, L. M., Bates, J. F., and Goldman-Rakic, P. S. (1999). forcement Learning with Double Q-Learning. In Thirtieth AAAI
Auditory Belt and Parabelt Projections to the Prefrontal Cor- Conference on Artificial Intelligence.
tex in the Rhesus Monkey. Journal of Comparative Neurology,
403(2):141–157. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L.,
Gomez, A. N., Kaiser, Ł., and Polosukhin, I. (2017). Attention
Samarakoon, S., Bennis, M., Saady, W., and Debbah, M. (2018). is All You Need. In Advances in Neural Information Processing
Distributed federated learning for ultra-reliable low-latency ve- Systems, pages 5998–6008.
hicular communications. arXiv preprint arXiv:1807.08127.
Veltkamp, M., Aarts, H., and Custers, R. (2008). On the Emer-
Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. gence of Deprivation-Reducing Behaviors: Subliminal Priming
(2017). Proximal Policy Optimization Algorithms. arXiv preprint of Behavior Representations turns Deprivation into Motivation.
arXiv:1707.06347. Journal of Experimental Social Psychology, 44(3):866–873.
Vinyals, O., Babuschkin, I., Chung, J., Mathieu, M., Jaderberg, M.,
Czarnecki, W. M., Dudzik, A., Huang, A., Georgiev, P., Powell,
R., Ewalds, T., Horgan, D., Kroiss, M., Danihelka, I., Agapiou,
J., Oh, J., Dalibard, V., Choi, D., Sifre, L., Sulsky, Y., Vezhnevets,
S., Molloy, J., Cai, T., Budden, D., Paine, T., Gulcehre, C.,
Wang, Z., Pfaff, T., Pohlen, T., Wu, Y., Yogatama, D., Cohen,
J., McKinney, K., Smith, O., Schaul, T., Lillicrap, T., Apps, C.,
Kavukcuoglu, K., Hassabis, D., and Silver, D. (2019). AlphaStar:
Mastering the Real-Time Strategy Game StarCraft II.
https://deepmind.com/blog/alphastar-mastering-real-time-strategy-game-starcraft-ii/.
Wang, Z., Bapst, V., Heess, N., Mnih, V., Munos, R., Kavukcuoglu,
K., and de Freitas, N. (2017a). Sample Efficient Actor-Critic with
Experience Replay. In International Conference on Learning
Representations.
Wang, Z., Merel, J. S., Reed, S. E., de Freitas, N., Wayne, G.,
and Heess, N. (2017b). Robust Imitation of Diverse Behaviors.
In Advances in Neural Information Processing Systems, pages
5320–5329.
Warren, J. D. and Griffiths, T. D. (2003). Distinct Mechanisms for
Processing Spatial Sequences and Pitch Sequences in the Human
Auditory Brain. Journal of Neuroscience, 23(13):5799–5804.
Watkins, C. J. and Dayan, P. (1992). Q-learning. Machine learning,
8(3-4):279–292.
Weingarten, E., Chen, Q., McAdams, M., Yi, J., Hepler, J., and
Albarracín, D. (2016). From Primed Concepts to Action: A
Meta-Analysis of the Behavioral Effects of Incidentally Presented
Words. Psychological Bulletin, 142(5):472.
Wessinger, C., VanMeter, J., Tian, B., Van Lare, J., Pekar, J., and
Rauschecker, J. P. (2001). Hierarchical Organization of the Human
Auditory Cortex Revealed by Functional Magnetic Resonance
Imaging. Journal of cognitive neuroscience, 13(1):1–7.
Yosinski, J., Clune, J., Bengio, Y., and Lipson, H. (2014). How
transferable are features in deep neural networks? In Advances
in neural information processing systems, pages 3320–3328.
Zatorre, R. J. and Belin, P. (2001). Spectral and Temporal Processing
in Human Auditory Cortex. Cerebral Cortex, 11(10):946–953.
Zatorre, R. J., Bouffard, M., and Belin, P. (2004). Sensitivity to
Auditory Object Features in Human Temporal Neocortex. Journal
of Neuroscience, 24(14):3637–3642.
Zhou, D., Liu, Q., Platt, J. C., Meek, C., and Shah, N. B. (2015).
Regularized Minimax Conditional Entropy for Crowdsourcing.
CoRR, abs/1503.07240.