Continual Learning of Context-Dependent Processing in Neural Networks




Guanxiong Zeng 1,2,∗, Yang Chen 1,∗, Bo Cui 1,2 and Shan Yu 1,2,3,†
1 Brainnetome Center and National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, 100190 Beijing, China.
2 University of Chinese Academy of Sciences, 100049 Beijing, China.
3 Center for Excellence in Brain Science and Intelligence Technology, Chinese Academy of Sciences, 100190 Beijing, China.


* These authors contributed equally to this work.
† Correspondence [email protected]
arXiv:1810.01256v3 [cs.LG] 27 Jun 2021

ABSTRACT
Deep neural networks (DNNs) are powerful tools for learning sophisticated but fixed mapping
rules between inputs and outputs, which limits their application in more complex and dynamic
situations in which the mapping rules do not stay the same but change according to context.
To lift such limits, we developed a novel approach involving a learning algorithm, called
orthogonal weights modification (OWM), together with a context-dependent processing (CDP)
module. We demonstrate that, with OWM to overcome the problem of catastrophic forgetting,
and the CDP module to learn how to reuse a feature representation and a classifier across
different contexts, a single network can acquire numerous context-dependent mapping rules in
an online and continual manner, with as few as ∼10 samples needed to learn each. This should
enable highly compact systems to gradually learn myriad regularities of the real world and
eventually behave appropriately within it.

INTRODUCTION
One of the hallmarks of high-level intelligence is flexibility [1]. Humans and non-human primates can
respond differently to the same stimulus under different contexts, e.g., different goals, environments,
and internal states [2–5]. Such an ability, named cognitive control, enables us to dynamically map
sensory inputs to different actions in a context-dependent way [6–8], thereby allowing primates to behave
appropriately in an unlimited number of situations with a limited behavioral repertoire [9, 10]. However,
this flexible, context-dependent processing is quite different from that found in current artificial deep neural
networks (DNNs). DNNs are very powerful in extracting high-level features from raw sensory data
and learning sophisticated mapping rules for pattern detection, recognition, and classification [11]. In
most networks, however, the outputs are largely dictated by the sensory inputs, exhibiting stereotyped
input-output mappings that are usually fixed once training is complete. Therefore, current DNNs lack
sufficient flexibility to work in complex situations in which 1) the mapping rules change according to
context and 2) these rules need to be learned sequentially, from a small number of trials, as they are
encountered. This constitutes a significant gap in ability between current DNNs and primate brains.
In the present study, we propose an approach, including an orthogonal weight modification (OWM)
algorithm and a context-dependent processing (CDP) module, that enables a neural network to
progressively learn various mapping rules in a context-dependent way. We demonstrate that with OWM to
protect previously acquired knowledge, a network can sequentially learn up to thousands of different
mapping rules without interference, requiring as few as ∼10 samples to learn each. In addition, by
using the CDP module to enable contextual information to modulate the representation of sensory features,
a network can learn different, context-specific mappings for even identical stimuli. Taken together, our
proposed approach can teach a single network numerous context-dependent mapping rules in an online
and continual manner.

1 ORTHOGONAL WEIGHTS MODIFICATION (OWM)


The first step towards flexible context-dependent processing is to incorporate efficient and scalable
continual learning, i.e., learning different mappings sequentially, one at a time. Such an ability is
crucial to humans as well as artificial intelligence agents for two reasons: 1) there are too many possible
contexts to learn concurrently, and 2) useful mappings cannot be pre-determined but must be learned
when corresponding contexts are encountered. The main obstacle to achieving continual learning is that
conventional neural network models suffer from catastrophic forgetting, i.e., training a model on new
tasks interferes with previously learned knowledge and leads to significant decreases in the performance
on previously learned tasks [12–15]. To avoid catastrophic forgetting, we developed the OWM method.
Specifically, when training a network for new tasks, its weights can only be modified in the direction
orthogonal to the subspace spanned by all previously learned inputs (termed the input space hereafter)
(Fig. 1a and Supplementary Fig. 1). This ensures that new learning processes do not interfere with
previously learned tasks, as weight changes in the network as a whole do not interact with old inputs.
Consequently, combined with a gradient descent-based search, the OWM helps the network to find a
weight configuration that can accomplish new tasks while ensuring the performance of learned tasks
remains unchanged (Fig. 1b). This is achieved by first constructing a projector that finds the direction
orthogonal to the input space: P = I − A(AᵀA + αI)⁻¹Aᵀ, where the matrix A = [x₁, · · · , xₙ] contains all
previously trained input vectors as its columns, I is the unit matrix, and α is a relatively small
regularization constant. The learning-induced modification of weights is then determined by
∆W = κP∆W_BP, where κ is the learning rate and ∆W_BP is the weight adjustment calculated according
to standard backpropagation. To calculate P, an iterative method can be used (see Methods). Thus, the algorithm
does not need to store all previous inputs A. Instead, only the current inputs and projector for the last
task are needed. This iterative method is related to the Recursive Least Square (RLS) algorithm [16, 17]
(see Supplementary Information for the discussion), which can be used to train feedforward and recurrent
neural networks to achieve fast convergence [18, 19], tame chaotic activities [20] and avoid interference
between consecutively loaded patterns or tasks [21, 22].
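The recursive construction of P and the projected weight update can be sketched as follows (a minimal sketch; the function names, the value of α, and the column/row convention for W are our own illustrative choices, not the paper's exact implementation):

```python
import numpy as np

def owm_update_projector(P, x, alpha=1e-3):
    """RLS-style recursive update of the orthogonal projector after one input x.

    P starts as the identity matrix; each call removes (up to the regularizer
    alpha) the direction of x from the subspace in which weights may change.
    """
    x = x.reshape(-1, 1)
    k = P @ x / (alpha + x.T @ P @ x)  # gain vector, as in Recursive Least Squares
    return P - k @ (x.T @ P)

def owm_weight_update(W, dW_bp, P, kappa=0.1):
    """Apply dW = kappa * P * dW_bp: keep only the component of the BP update
    orthogonal to all previously learned inputs (P is symmetric, so projecting
    the input dimension of a (out, in)-shaped gradient is dW_bp @ P)."""
    return W - kappa * dW_bp @ P
```

With a small α, repeatedly updating P with the inputs of a finished task drives P x toward zero for those inputs, so subsequent projected weight updates leave the network's responses to them essentially unchanged, while directions orthogonal to them pass through P nearly untouched.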
We first tested the performance of the OWM on several benchmark tasks of continual learning. Shuffled
and disjoint MNIST experiments, in which different tasks involving recognition of handwritten digits
need to be learned sequentially (see Methods and Supplementary Information for details regarding the
datasets used in this study), were conducted on feedforward networks with rectified linear units
(ReLU) [23]. The OWM was used to train the entire multi-layer network. For the 3- and 10-task shuffled
and 2-task disjoint experiments, OWM achieved performance superior or equal to that of other continual
learning methods, without storing previous task samples or dynamically adding new nodes to the
network [22, 24–26] (Tables 1, 2). In the more challenging 10-task disjoint and 100-task shuffled
experiments, OWM exhibited significant performance improvements over other methods (Fig. 2 and Table
1). Interestingly, for these more difficult continual learning tasks, we found that the order of tasks
mattered, as the performance for specific classes could be significantly influenced by the classes learned
previously (Fig. 2 inset). This suggests that curriculum learning is a potentially important factor to
consider in continual learning.


Fig. 1. Schematic diagram of OWM. a, When training on a new task, the original weight modification
calculated by standard backpropagation (BP), ∆W_BP, is projected onto the subspace (dark green
surface) in which good performance on learned tasks has been achieved. As a result, the actually
implemented weight modification is ∆W_OWM. This process ensures that the weight configuration after
learning the new task still lies within the same subspace. b, With the OWM, the training process searches
for configurations that can accomplish Task 2 (pale red area) within the subspace that enables the network
to accomplish Task 1 (blue area). A successful search necessarily stops at a position inside the overlapping
subspace (light green area). In comparison, a solution obtained by stochastic gradient descent (SGD) is
more likely to end up outside this overlapping area.

To examine whether the OWM is scalable, i.e., whether it can be applied to learn more sophisticated
tasks, regarding both number of different mappings and complexity of inputs, we tested the network’s
ability in learning to classify thousands of hand-written Chinese characters (CASIA-HWDB1.1) and
natural images (ImageNet). The Chinese character recognition task included a total of 3,755 characters
forming the level I vocabulary, which constitutes more than 99% of the usage frequency in written Chinese
literature [27] (see Fig. 3a for exemplars of characters). In this task, a feature extractor was pre-trained
to analyze the raw images. The feature vectors were fed into an OWM-trained classifier to learn the
mapping between combinations of features and the labels of individual classes. We found that a classifier
trained with the OWM could learn to recognize all 3,755 characters sequentially, with a final accuracy
of ∼92%, closely approaching human performance in recognizing handwritten Chinese characters
(∼96%) [28]. Considering that humans learn these characters over years and that the learning necessarily
involves revision, these results suggest that our method endows neural networks with a strong capability
to continually learn new mappings between sensory features and class labels. Similar results were
obtained with the ImageNet dataset, where the classifier trained by the OWM, combined with a
pre-trained feature extractor, was able to learn 1,000 classes of natural images sequentially (Supplementary
Table 1), with the final accuracy approaching the results obtained by training the system to classify all
categories concurrently. These results suggest that, by using the OWM, the classification performance of
the system approached the limit set by the front-end feature extractor, with the impairment to the
classifier caused by sequential learning itself effectively mitigated.
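As a concrete illustration of this setup, the sketch below trains a single linear softmax classifier on tasks presented one after another, applying the OWM projection to each gradient. This is a minimal sketch under our own assumptions (function names, hyperparameters, and the toy training loop are illustrative, not the paper's implementation):

```python
import numpy as np

def update_P(P, x, alpha=1e-3):
    """Recursively shrink the modifiable subspace after learning input x."""
    x = x.reshape(-1, 1)
    return P - (P @ x) @ (x.T @ P) / (alpha + x.T @ P @ x)

def train_owm_classifier(tasks, n_features, n_classes, kappa=0.5, epochs=20):
    """Sequentially fit one softmax classifier over several tasks.

    Each task is an (X, y) pair seen only once; gradients are projected with P
    so that responses to earlier tasks' inputs are preserved."""
    W = np.zeros((n_classes, n_features))
    P = np.eye(n_features)
    for X, y in tasks:
        for _ in range(epochs):
            logits = X @ W.T
            p = np.exp(logits - logits.max(axis=1, keepdims=True))
            p /= p.sum(axis=1, keepdims=True)
            dW = (p - np.eye(n_classes)[y]).T @ X / len(X)  # standard BP gradient
            W -= kappa * dW @ P                             # OWM-projected step
        for x in X:                                         # consolidate the task
            P = update_P(P, x)
    return W
```

On toy tasks whose inputs occupy different feature directions, a classifier trained this way keeps its accuracy on the first task after learning the second, because the second task's projected updates cannot disturb the first task's input directions.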

Shuffled MNIST Experiment

3 tasks         Accuracy (%)       10 tasks      Accuracy (%)     100 tasks    Accuracy (%)
SGD# [14]       71.32 ± 1.54∗      EWC# [24]     ∼ 97.0           EWC‡ [29]    ∼ 70.8
IMM# [25]       98.30 ± 0.08 n.s   OWM#          97.52 ± 0.03     SI‡ [29]     ∼ 82.3
EWC# [24]       ∼ 98.2             EWC† [22]     ∼ 89.0           OWM‡         ∼ 85.4
OWM#            98.34 ± 0.02       CAB† [22]     ∼ 95.2
SI‡ [26, 29]    ∼ 97.0             OWM†          95.15 ± 0.08
OWM‡            97.64 ± 0.03

Table 1. Comparison of the performance of different methods on the shuffled MNIST task. Network
sizes: †, 3-layer networks with [784-100-10] neurons; #, 4-layer networks with [784-800-800-10] neurons;
‡, 4-layer networks with [784-2000-2000-10] neurons. Results for other methods were adopted from the
corresponding publications. Results for OWM are reported as mean ± s.d. ∗, p < 0.01; n.s., not
significant. EWC: Elastic Weight Consolidation; IMM: Incremental Moment Matching; SI: Synaptic
Intelligence.

Disjoint MNIST Experiment

Methods       Accuracy (%)
EWC# [25]     52.72 ± 1.36∗
IMM# [25]     94.12 ± 0.27∗
OWM#          96.59 ± 0.06
SGD†          53.85 ± 0.14∗
CAB† [22]     94.91 ± 0.30∗
OWM†          96.30 ± 0.03

Table 2. Comparison of the performance of different methods on the disjoint MNIST task. Network
sizes: †, 3-layer networks with [784-800-10] neurons; #, 4-layer networks with [784-800-800-10] neurons.
Performance results for other methods were adopted from previous studies. ∗, p < 0.01.

Disjoint CIFAR10 Experiment


Methods      Accuracy (%)
EWC [30]     31.09
IMM [30]     32.36
MA [30]      40.47
OWM          52.83
Table 3. Comparison of performance of different methods in disjoint CIFAR-10 task. See Methods
for details. MA: Model Adaptation

In the results mentioned above, feature extractors pre-trained by the complete training sets in
corresponding tasks were used to provide the feature vectors for the OWM-trained classifier. We next
examined whether the classifier can learn categories on which the feature extractor has not been trained.
Results were in the affirmative, as shown in Fig. 3b. For example, the feature extractor trained with
500 randomly selected Chinese characters (out of 3,755, less than 15% of categories) could already
support the classifier to sequentially learn the remaining 3,255 characters with near 80% accuracy (chance
level of 1/3,255), demonstrating that the network could sequentially learn new categories not previously
encountered. However, we note that a higher degree of pre-training was associated with better performance
(Fig. 3b and c), indicating the importance of training the feature extractor on as wide a variety of classes
as possible.

Fig. 2. Performance of OWM, CAB, and SGD in the 10-task disjoint MNIST experiment. Test accuracy
is plotted as a function of the number of classes learned. Results are presented as mean ± s.d. For the
OWM-trained network, the sequence of learned digits influenced recognition accuracy for specific classes.
Inset: performance in recognizing digit “9” was significantly higher after learning digits “7” and “4”
(p < 0.01); a two-sided t-test was applied to assess statistical significance.
Another important question is how quickly the OWM-trained classifier can learn. As shown in Fig.
3c, it only needed a small sample size to learn new mappings. For Chinese characters, < 10 samples
per class were sufficient to gain satisfactory performance. Comparison with other methods in the same
task further confirmed the advantage of the OWM in achieving better performance with fewer training
samples (see Supplementary Fig. 2). We note that the better performance achieved by the OWM with fewer
samples is rooted in the well-known fact that the RLS algorithm, from which we derived the OWM, can
converge more quickly than the least mean square (LMS) algorithm, which is equivalent to the standard
backpropagation [16, 19].
With the dataset of Chinese characters, we also analyzed network capacity in OWM-based continual
learning. We tested two conditions: reducing the size of the network for a given task, and increasing
the number of tasks for a given network. We observed that network performance remained stable until
the network size was reduced, or the number of tasks increased, beyond a certain value, after which
performance declined, indicating an approach to network capacity (Fig. 3d, e). Importantly, the changes in
network performance were highly correlated with decreases in the rank of the orthogonal projector, which
is consistent with our theoretical analysis regarding network capacity in OWM-based continual learning
(see Methods for details).
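The connection between capacity and the projector's rank can be made concrete with a short numerical experiment (a sketch under our own assumptions; the update rule is the RLS-style recursion described earlier, and the layer width and tolerance are arbitrary illustrative values):

```python
import numpy as np

def update_P(P, x, alpha=1e-3):
    """RLS-style recursive update of the orthogonal projector."""
    x = x.reshape(-1, 1)
    return P - (P @ x) @ (x.T @ P) / (alpha + x.T @ P @ x)

n = 50                                   # width of a hypothetical hidden layer
P = np.eye(n)
rng = np.random.default_rng(0)
ranks = []
for _ in range(60):
    P = update_P(P, rng.standard_normal(n))
    ranks.append(np.linalg.matrix_rank(P, tol=1e-2))
# Each generic input removes roughly one dimension from the modifiable
# subspace; the rank bottoms out once inputs outnumber the layer width n,
# mirroring the capacity limit discussed in the text.
```

Once the rank of P is exhausted, there is no direction left in which weights can change without disturbing earlier tasks, which is why performance degrades as the number of tasks approaches the layer width.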
Fig. 3. Continual learning with small sample size achieved by OWM in recognizing Chinese
characters. a, Examples showing seven characters with five samples each. b, Classification accuracy
is plotted as a function of the number of classes used for pre-training the feature extractor. Performance
was assessed based on classifying all characters (blue) or only characters not included in pre-training
(orange). Variance of test accuracy across classes in each case is reported in Supplementary Table 2.
c, Classification accuracy is plotted as a function of the sample size used for sequential training, obtained
with feature extractors having different degrees of pre-training (color-coded). Performance differed
significantly (paired t-test, p < 0.001) across different degrees of pre-training (see Supplementary Table 3
for variance in performance across all classes). d, e, Relationship between network capacity for continual
learning and the rank of the orthogonal projector. In d, the task was to learn 100 classes of Chinese
characters sequentially. Average accuracy achieved by the network (blue) and the corresponding value
of (rank(βI) − rank(P)) (red) are plotted with respect to the number of neurons in the hidden layer. In
e, the same neural network with 50 neurons in the hidden layer was trained to recognize an increasing
number of Chinese characters. Average accuracy achieved by the network (blue) and the corresponding
value of ranktot (red) (see Methods) are plotted with respect to the number of tasks/characters.

In the experiments with Chinese characters, it is possible that although a class was never seen by the
network, it shared features with other classes used in feature extractor pre-training. Thus, to further test
the ability of the OWM in continual learning without a pre-trained feature extractor, we examined its
performance in the disjoint CIFAR-10 task. In this task, the network was trained to recognize two classes
each time; thus, in a total of five consecutive tasks it learned to recognize all 10 classes. Importantly, the
whole network, including both the feature extractor and the classifier, was trained continually in an
end-to-end manner. In this task, the OWM outperformed other recently proposed continual learning
methods by a large margin (Table 3), exhibiting great potential to improve a network’s ability to learn
new classes “on the go”, with the feature extractor free of pre-training. Although the performance in the
end-to-end training setting was still inferior to that with a pre-trained feature extractor, this is an important
step towards removing the usual distinction between the training and application phases of DNNs, thus
allowing efficient online learning. We note that a continual learning method for classifiers with a
pre-trained feature extractor may also be useful. While the number of features in a given domain (e.g., human
faces) is usually limited, the possible ways of combining different features to form a new object (e.g.,
individual faces) are almost infinite. Thus, given feature extractors pre-trained on sufficiently diverse
sample sets, a classifier could greatly benefit from continual learning to recognize countless new classes.


Fig. 4. Achieving context-dependent sequential learning via the OWM algorithm and the CDP
module. a, Schematic diagram of the network architecture. The CDP module dynamically modulates the
mapping from sensory inputs to network outputs according to the contextual information. The main
figure and the inset illustrate the detailed internal structure of the module and the overall architecture,
respectively. b, Schematic diagram showing the role of the CDP module in rotating inputs in feature
space (see Methods for details). c, Performance of sequentially learning to classify faces by 40 different
attributes, each associated with a unique contextual signal, compared with results obtained by multi-task
training. Tasks were sorted by test accuracy. d, Schematic diagrams showing network architecture for
multi-task (left) and sequential (right) training. CL, classifier. To achieve context-dependent processing,
in multi-task training a switch module and n classifiers are needed, where n is the number of different
attributes. e, Classification accuracies for a relatively easy task (gender; blue curve) and five more difficult,
sequentially learned tasks (e.g., attractiveness; orange curve; mean results across all five tasks are shown)
are plotted as a function of training sample size. Tasks and corresponding performance obtained by
training on the full dataset are marked with arrows in c.

2 CONTEXT DEPENDENT PROCESSING MODULE


Although a system that can learn many different mapping rules in an online and sequential manner is
highly desirable, such a system cannot accomplish context-dependent learning by itself. To achieve that,

contextual information needs to interact with sensory information properly. Here we adopted a solution
inspired by the primate prefrontal cortex (PFC). The PFC receives sensory inputs as well as contextual
information, which enables it to choose the sensory features most relevant to the present task to guide
action [4, 5, 31]. To mimic this architecture, we added the context-dependent processing (CDP) module
before the OWM-trained classifier; the module was fed with both sensory feature vectors and contextual
information (Fig. 4a). The CDP module consists of an encoder sub-module, which transforms contextual
information into proper controlling signals, and a “rotator” sub-module, which uses the controlling signals
to manipulate the processing of sensory inputs. The encoder sub-module is trainable and learns in a
continual way with the OWM. Mathematically, the context-dependent manipulation operates by rotating
the sensory input space according to the contextual information (Fig. 4b, see Methods), thereby changing
the representation of sensory information without interfering with its content. The rotation of the input
space allows the OWM to be
applied for identical sensory inputs in different contexts. To demonstrate the effectiveness of this CDP
module, we trained the system to classify a set of faces according to 40 different attributes [32], i.e., to
learn 40 different mappings sequentially with the same sensory inputs. The contextual information was the
embedding vectors [33] of the corresponding task names, which were projected to control the rotation of
the sensory inputs. As shown in Fig. 4c, the system sequentially learned all 40 different, context-specific
mapping rules with a single classifier. The accuracy was very close to that achieved by multi-task training,
in which the network was trained to classify all 40 attributes using 40 separate classifiers (Fig. 4d). In
addition, similar to the results obtained in learning Chinese characters, the network was able to learn
context-dependent processing quickly. Here, ∼20 faces were enough to reach the learning plateau for
both simple (e.g., male vs. female) and difficult (e.g., attractive vs. unattractive) tasks (Fig. 4e). In the
experiment, our approach achieved better performance with fewer samples in comparison with other
methods for continual learning, indicating its potential to enable a system to adapt quickly in highly
dynamic environments with regularities changing with contexts (Supplementary Fig. 2b). Interestingly,
we found that the CDP module was able to identify the meaningful signal from noisy contextual inputs
(Supplementary Fig. 3 and Supplementary Table 4; see Methods for task details) and to learn
how to use the contextual information effectively (Supplementary Fig. 4). These results indicate that
our approach allows the system to infer the correct context signal from experience and use it properly.
Importantly, such an ability would open the door for an intelligent agent to explore environments and
gradually learn their regularities in an autonomous way.
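A minimal sketch of this idea shows how the same features can yield different outputs under different contexts. Note that the exact encoder and rotator used here are specified in Methods; the element-wise modulation below stands in for the rotation described in the text, and all names and shapes are illustrative assumptions:

```python
import numpy as np

def cdp_forward(features, context, W_enc, W_out):
    """Toy CDP module: an encoder maps the context vector to a control signal,
    and a 'rotator' uses it to re-represent the same sensory features before a
    shared classifier reads them out."""
    g = np.tanh(W_enc @ context)       # encoder: context -> controlling signal
    rotated = (1.0 + g) * features     # rotator: context-dependent modulation
    return W_out @ rotated             # shared, OWM-trainable classifier

rng = np.random.default_rng(0)
W_enc = rng.standard_normal((8, 4))
W_out = rng.standard_normal((2, 8))
features = rng.standard_normal(8)           # identical sensory input ...
ctx_a, ctx_b = rng.standard_normal((2, 4))  # ... under two different contexts
y_a = cdp_forward(features, ctx_a, W_enc, W_out)
y_b = cdp_forward(features, ctx_b, W_enc, W_out)
# the two contexts yield different readouts of the very same input
```

Because only the representation of the features changes with context, the single downstream classifier can be trained with the OWM on each context in turn, which is what allows one network to hold many context-specific mappings for identical stimuli.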

DISCUSSION
If we view traditional DNNs as powerful sensory processing modules, the current approach could be
understood as adding a flexible cognitive module to the system. This architecture was inspired by the
primate brain. For example, the primate visual pathway is dedicated to analyzing raw visual images and
eventually representing ∼ 100 features in higher visual areas such as the inferotemporal cortex [34]. The
outputs of this “feature extractor” are then sent to the prefrontal cortex (PFC) for object identification
and categorization [35–37]. The training of the feature extractor is difficult and time-consuming. In
humans, it takes years or even decades for higher visual cortices to become fully developed and reach
peak performance [38]. However, with sufficiently developed visual cortices, humans can quickly learn
new visual object categories, often by seeing just a few positive examples [39]. By adding a cognitive
module supporting continual learning to DNN-based feature extractors, we found a qualitatively similar
behavior in neural networks. That is, although the training of the feature extractor is computationally
difficult and requires a large number of samples, with a well-trained feature extractor, the learning of new
categories can be achieved quickly. This suggests that the mechanisms underlying fast concept formation

in humans may be understood, at least in part, from a connectionist perspective. In addition to the role
of supporting the fast learning of new concepts, another function of the primate PFC is to represent
contextual information [9] and use it to select those sensory features most relevant for the current task
[4]. This gives rise to the flexibility exhibited in primates’ behavior, and here we demonstrated that a
similar architecture can do the same in artificial neural networks. Interestingly, we found that in the CDP module,
the neuronal responses showed mixed selectivity to sensory features, contexts, and their combinations
(Supplementary Fig. 5), similar to that found for real PFC neurons [40]. Thus, it would be informative
to see whether the rotation of input space adopted in our CDP module captures the operation carried out
in the real PFC. For tasks similar to the face classification tested above, one possible solution to achieve
context-dependent processing is to add additional classifier outputs for each new task/context. However,
this approach only works if there is no hidden layer between the feature extractor and final output layer.
Otherwise the shared weights between different classifier outputs will suffer from catastrophic forgetting
during continual learning, especially if the inputs are the same for all contexts. More importantly, adding
additional classifier outputs (and all related weights) for each new task/context would lead to increasingly
complex and bulky systems (cf. Fig. 4d, left). As the total number of possible contexts can be arbitrarily
large, such a solution is clearly not scalable. Finally, for artificial intelligence systems, the importance of
the CDP module would depend on the application. In scenarios in which a compact system needs to learn
numerous contexts “on the go”, similar to what human individuals need to do within their lifetimes, the
ability of the OWM-empowered CDP module to reuse classifiers is of paramount importance.

As demonstrated in the present results, an efficient and scalable algorithm of continual learning
is not only crucial for achieving flexible context-dependent processing, but also important to ensure,
more generally, that the added cognitive module is able to learn new tasks when encountered. In
continual learning, preserving previously acquired knowledge while maintaining plasticity for subsequent
learning is the key [15]. In the brain, the separation of synapses utilized for different tasks is essential
for sequential learning [41], which inspired the development of algorithms to protect the important
weights involved in previously learned tasks while training the network for new ones [24, 26]. However,
these “frozen” weights necessarily reduce the degrees of freedom of the system, i.e., they decrease
the volume of parameter space to search for a configuration that can satisfy both old and new tasks.
Here, by allowing the “frozen” weights to be adjustable again without erasing acquired knowledge, the
OWM exhibited clear advantages in performance. However, further studies are required to investigate
whether algorithms similar to the OWM are implemented in the brain. Recently, it has been suggested
that a variant of backpropagation algorithm, i.e., the “conceptor-aided back-prop” (CAB) can be used
for continual learning by shielding gradients against degradation of previously learned tasks [22]. By
providing more effective shielding of gradients through constructing an orthogonal projector, the OWM
achieved much better protection of previously acquired knowledge, yielding highly competitive results
in empirical tests compared with the CAB (see Tables 1, 2, Fig. 2 and Supplementary Information for
details). The OWM and continual learning methods mentioned above are regularization approaches [15].
Similar to other methods within this category, the OWM exhibits a tradeoff between the performance on
old and new tasks, due to the limited resources available to consolidate knowledge of previous tasks. In
contrast to regularization approaches, other types of continual learning methods involve dynamically
introducing extra neurons or layers along the learning process [42], which may help mitigate the tradeoff
described above [15]. However, regularization approaches require no extra resources to accommodate
newly acquired knowledge during training and, therefore, are capable of producing compact yet versatile
systems.

We note that a solution for continual learning based on context-dependent processing was suggested
recently [29]. In this work, a context-dependent gating mechanism was used to separate subnetworks
for processing individual tasks during continual learning. However, for this approach to work, the
same contextual information needs to be present during both the training and testing phases. As such
information is rarely available in practical situations, this seriously limits the applicability of the method.
Different from this approach, the CDP module in our work enables the network to modulate its processing
according to the contextual information so that the same inputs can be treated differently in different
contexts. This role is not related to continual learning, as the CDP module is needed in the same task as
shown in Fig. 4, even if the system were trained concurrently. Importantly, as contextual information (e.g.,
environmental cues, the task at hand, etc.) is always available for any input that needs context-dependent
processing, the limitation of using context-dependent gating for continual learning is not a problem for
our CDP module.
Other biologically inspired approaches for continual learning are based on complementary learning
systems (CLS) theory [43, 44]. Such systems involve interplay between two sub-systems similar to
the mammalian hippocampus and neocortex, i.e., a task-solving network (neocortex) accompanied by
a generative network (hippocampus) to maintain the memories of previous tasks [45]. With the aid
of the Learning without Forgetting (LwF) method [46], data for old tasks sampled by the generative
module are interleaved with those for the current task to train the neural network to avoid catastrophic
forgetting. Although here we used a completely different approach for continual learning, the CLS
framework may also be instrumental for further development of our approach. Currently, the encoder
of the CDP module has the ability to infer contextual information from the environment and also to learn
how to use it effectively. Conceivably it could be further developed to recognize and classify complex
contexts. Such a flexible module for recognizing proper contextual signals may be analogous to the
hippocampus in the brain, which is related to the classification of different environmental cues via pattern
separation and completion [44]. Thus, it would be informative for future studies to investigate whether
the current approach can be combined with the CLS framework to achieve more flexible and sophisticated
context-dependent processing.
Taken together, our study demonstrated that it is possible to teach a highly compact network many
context-dependent mappings sequentially. Although we demonstrated its effectiveness here with the
supervised learning paradigm, the OWM has the potential to be applied to other training frameworks.
Another regularization approach for overcoming catastrophic forgetting, the EWC, has been
successfully applied in reinforcement learning [24]. As the EWC can be viewed as a special case
of the OWM in some circumstances (see Supplementary Information for details), similar procedures
could be extended to the OWM and CDP module in unsupervised conditions, thereby enabling networks
to learn different mapping rules for different contexts through reinforcement learning. We expect
that such an approach, combined with effective methods of knowledge transfer,
e.g., [47–50], may eventually lead to systems with sufficient flexibility to work in complex and dynamic
situations.

METHODS
The OWM algorithm. Consider a feed-forward network of L + 1 layers, indexed by l = 0, 1, ..., L, with
l = 0 and l = L being the input and output layer, respectively. All hidden layers share the same activation
function g(·). W_l ∈ R^{s×m} represents the connections between the (l − 1)th and lth layers. x_l and y_l
denote the output and input of the lth layer, respectively, where x_l = g(y_l) and y_l = W_l^T x_{l−1}, with
x_{l−1} ∈ R^s and y_l ∈ R^m.
In the OWM, the orthogonal projector P_l, defined in the input space of layer l for learned tasks, is the key
to overcoming catastrophic interference in sequential learning. In practice, P_l can be recursively updated
for each task, in a way similar to calculating the correlation-inverse matrix

    P_RLS = ( Σ_{i=1}^{n} x(i) x(i)^T + αI )^{−1}

in the RLS algorithm [16, 18, 19] (see the discussion of the relationship between OWM and RLS in the
Supplementary Information). This method allows P_l to be determined from the current inputs and the P_l
of the last task. It also avoids the matrix-inverse operation in the original definition of P_l.
Below we provide the detailed procedure for implementing the OWM method.

a. Initialization of parameters: randomly initialize W_l(0) and set P_l(0) = I_l/β for l = 1, ..., L.

b. Forward propagate the inputs of the ith batch in the jth task, then back propagate the errors and
calculate the weight modifications ΔW_l^BP(i, j) for W_l(i − 1, j) by the standard BP method.

c. Update the weight matrix in each layer by

    W_l(i, j) = W_l(i − 1, j) + κ(i, j) ΔW_l^BP(i, j),                 if j = 1
    W_l(i, j) = W_l(i − 1, j) + κ(i, j) P_l(j − 1) ΔW_l^BP(i, j),      if j = 2, 3, ...       (1)

where κ(i, j) is the predefined learning rate.

d. Repeat steps (b) to (c) for the next batch.

e. If the jth task is accomplished, forward propagate the mean of the inputs of each batch (i =
1, ..., n_j) in the jth task successively. Update P_l for W_l as P_l(j) = P_l(n_j, j), where P_l(n_j, j)
can be calculated iteratively according to

    k_l(i, j) = P_l(i − 1, j) x̄_{l−1}(i, j) / [α + x̄_{l−1}(i, j)^T P_l(i − 1, j) x̄_{l−1}(i, j)]
    P_l(i, j) = P_l(i − 1, j) − k_l(i, j) x̄_{l−1}(i, j)^T P_l(i − 1, j)                        (2)

in which x̄_{l−1}(i, j) is the output of the (l − 1)th layer in response to the mean of the inputs in the ith
batch of the jth task, and P_l(0, j) = P_l(j − 1).

f. Repeat steps (b) to (e) for the next task.

We note that the algorithm achieves the same performance if the orthogonal projector P_l is updated
for each batch according to Eq. 2, with α decaying as α_{i,j} = α₀ λ^{i/n_j} for the ith batch of data in the
jth task. This method can be understood as treating each batch as a different task. It avoids the extra
storage space as well as the data reloading in step (e) and therefore significantly accelerates processing.
In this case, if the learning rate is set to κ(i) = 1/[1 + x̄_{l−1}(i)^T P_l(i − 1) x̄_{l−1}(i)] and α_{i,j} is fixed
at 1, the procedure essentially uses RLS to train the neural network, a scheme known as Enhanced Back
Propagation (EBP), which was proposed to increase the speed of convergence in training [19]. Therefore,
our algorithm has the same computational complexity as EBP, i.e., O(N_n N_w²), where N_n is the total
number of neurons and N_w is the number of input weights per neuron [19].
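The per-layer procedure in steps (b)-(f) can be sketched in NumPy as follows. This is an illustrative sketch, not the authors' exact implementation: the class name, the hyperparameter defaults (α, β), and the plain SGD step are our assumptions.

```python
import numpy as np

class OWMLayer:
    """Sketch of one fully connected layer trained with orthogonal
    weights modification. Names and defaults are illustrative."""

    def __init__(self, n_in, n_out, alpha=1e-3, beta=1.0, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.standard_normal((n_in, n_out)) / np.sqrt(n_in)
        self.P = np.eye(n_in) / beta   # P_l(0) = I / beta
        self.alpha = alpha
        self.tasks_done = 0

    def apply_gradient(self, grad_W, lr):
        # Eq. 1: for tasks j >= 2 the BP gradient is projected so the
        # update is orthogonal to the input subspace of earlier tasks.
        if self.tasks_done == 0:
            self.W -= lr * grad_W
        else:
            self.W -= lr * (self.P @ grad_W)

    def finish_task(self, batch_means):
        # Eq. 2: recursive (RLS-like) projector update over the layer's
        # responses to the mean input of each batch of the finished task.
        for x in batch_means:
            x = x.reshape(-1, 1)
            k = self.P @ x / (self.alpha + x.T @ self.P @ x)
            self.P -= k @ (x.T @ self.P)
        self.tasks_done += 1
```

After `finish_task`, `P @ x` is close to zero for any stored input direction `x`, so subsequent projected gradient steps barely change the layer's responses to inputs of old tasks.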

In addition, we analyzed the capacity of the OWM, i.e., how many different tasks can be learned
using this method. The capacity of one network layer can be measured by the rank of P_i, which is
defined as the orthogonal projector calculated after task i, with ΔP_{i+1} then defined as the update in the
next task, satisfying P_{i+1} = P_i − ΔP_{i+1}. As range(P_{i+1}) ∩ range(ΔP_{i+1}) = ∅, rank(P_{i+1}) =
rank(P_i) − rank(ΔP_{i+1}). In the ideal case, where each task consumes the capacity effectively, the rank
of P_l approaches 0 as learning continues, indicating that this particular layer no longer has the capacity
to learn new tasks. The capacity of the whole network can be approximated by the sum of the capacities
of each layer:

    rank_tot = Σ_{l=1}^{L} rank(P_l) / rank(βI)

where βI is the initial value of the matrix P. The rank is normalized to balance the contribution of each
layer. We conducted two experiments (Fig. 3d,e) on the CASIA-HWDB1.1 dataset to verify the above
analysis. In these experiments, to avoid the influence of the tolerance value in the calculation of matrix
rank, the rank was estimated as

    rank(P) = Σ_i s_i(P) / β

where s_i(·) denotes the ith singular value of the matrix. If the capacity limit of the entire network is
eventually approached, two solutions can be considered: 1) introducing a larger α, or the forgetting factor
used in RLS [16] and online EWC [50]; and 2) adding more layer(s), e.g., the CDP module (see below for
details), to provide more space to preserve previously learned knowledge.
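The tolerance-free rank estimate above can be computed directly from the singular values. The helper below is a sketch (the function names are ours), following the text's normalization in which the initial projector is written as βI so that an untouched layer contributes its full dimension:

```python
import numpy as np

def layer_capacity(P, beta):
    """Tolerance-free rank estimate rank(P) ~ sum_i s_i(P) / beta,
    using the text's convention that P starts from beta * I."""
    return np.linalg.svd(P, compute_uv=False).sum() / beta

def network_capacity(projectors, beta):
    """Total normalized capacity: sum_l rank(P_l) / rank(beta * I),
    i.e., each layer's estimate divided by its input dimension."""
    return sum(layer_capacity(P, beta) / P.shape[0] for P in projectors)
```

For a fresh layer this returns 1 per layer; as tasks consume input directions, the singular values of P shrink and the estimate decays toward 0.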

The CDP module. In context-dependent learning, to change the representation of sensory inputs without
distorting information content in different contexts, we added one layer of neurons after the output layer
of the feature extractor (cf. Fig. 4a). Below we describe, from a mathematical point of view, how this CDP
layer works, using the face classification task as an example.

In this task, the rotator sub-module was fed with feature vectors for different faces, F =
[f_1, f_2, ..., f_k]^T ∈ R^k, and modulated by non-negative controlling signals, C = [c_1, c_2, ..., c_m]^T ∈ R^m.
The controlling signals C were derived from the contextual information (the word vector of the
corresponding task name) by the encoder sub-module. The CDP module then output Y^out =
[y_1, y_2, ..., y_m]^T ∈ R^m, with y_i = c_i g((w_i^in)^T F), to a classifier for further processing. The input
weights W^in = [w_1^in, w_2^in, ..., w_m^in] ∈ R^{k×m} of the CDP module were randomly initialized and fixed
across all contexts. The remaining weights in the CDP module, including the output weights W^out and
the weights in the encoder, were trained by the OWM method. The function of the CDP module can then
be summarized as

    Y^out = g( (W^in)^T F ⊙ C )
          = g( [w_1^in, w_2^in, ..., w_m^in]^T F ⊙ C )
          = g( [c_1 ‖F‖ ‖w_1^in‖ cos θ_1, c_2 ‖F‖ ‖w_2^in‖ cos θ_2, ..., c_m ‖F‖ ‖w_m^in‖ cos θ_m]^T )
          = g( [c_1 ‖w_1^in‖ cos θ_1, c_2 ‖w_2^in‖ cos θ_2, ..., c_m ‖w_m^in‖ cos θ_m]^T ) ‖F‖       (3)

where ⊙ represents element-wise multiplication and θ_i is the angle between w_i^in and F. Note that for any
υ ≥ 0, g(υx) = max(0, υx) = υ max(0, x) = υ g(x). The ReLU function was used for g(·) in the current
study, but this is not necessary: g(·) can also be chosen as a hyperbolic tangent or logistic function. As
W^in was initialized by the Xavier method [51], in most cases (w_i^in)^T F was located in the linear range.
Thus, Eq. 3 approximately holds even for activation functions other than ReLU. We confirmed that the
average accuracies for the same tasks as in Fig. 4c with the hyperbolic tangent function (90.93%) and the
logistic function (90.05%) were close to that with ReLU (90.38%).

For individual faces, given the same feature vector F and fixed W^in, cos θ_i is constant. Thus, the output
Y^out is affected by the controlling signal C, which differs across tasks. If we normalize C by

    √( Σ_{i=1}^{m} c_i² ‖w_i^in‖² g²(cos θ_i) )

it is apparent from Eq. 3 that the CDP layer "rotates" the input vector in feature space, as illustrated in
Fig. 4b. This explains why this added layer can change the representation of sensory inputs while keeping
their information content unchanged. Importantly, it also enables the system to sequentially learn different
tasks with the OWM for identical inputs.
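A minimal sketch of the rotator's forward pass (Eq. 3) illustrates both the context gating and the rotation property; the function name and layer sizes are illustrative assumptions:

```python
import numpy as np

def cdp_rotator(F, C, W_in):
    """y_i = c_i * g((w_i^in)^T F) with g = ReLU. W_in is fixed and
    random; C is the non-negative context signal from the encoder."""
    return C * np.maximum(W_in.T @ F, 0.0)

rng = np.random.default_rng(0)
k, m = 64, 256
W_in = rng.standard_normal((k, m)) / np.sqrt(k)  # fixed across contexts
F = rng.standard_normal(k)                        # feature vector
C = rng.random(m)                                 # context-dependent gains
y = cdp_rotator(F, C, W_in)

# Because ReLU is positively homogeneous, rescaling C rescales y linearly;
# normalizing C so that ||y|| = ||F|| turns the gating into a pure rotation.
C_rot = C * np.linalg.norm(F) / np.linalg.norm(y)
y_rot = cdp_rotator(F, C_rot, W_in)
```

With the normalized context signal, the layer maps F to a vector of the same length whose direction depends on C, i.e., a context-dependent rotation of the input.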
To examine whether the CDP module can infer the correct context despite distracting noise in the
environment, four face recognition tasks were conducted sequentially with the OWM as above, except
that the explicit context signal was not presented alone; instead, the context signal and distracting noises
were fed simultaneously to the CDP module. The noise was sampled from a Gaussian distribution with
the same mean and variance as the context signal and varied on a trial-by-trial basis. In the training phase
for different tasks, the positions of the context signal and the noises could be swapped (Supplementary
Fig. 3). During the testing phase, either the corresponding context plus noises or only noises were
presented.
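The noisy-context input can be constructed as in the following sketch; the slot layout and the helper name are our assumptions, not the exact implementation:

```python
import numpy as np

def noisy_context_input(ctx, n_distractors, rng):
    """Concatenate the true context vector with Gaussian distractors that
    share its mean and variance; the slot order is shuffled so the true
    context can appear at any position (cf. Supplementary Fig. 3)."""
    noise = rng.normal(ctx.mean(), ctx.std(),
                       size=(n_distractors, ctx.size))
    slots = np.vstack([ctx[None, :], noise])
    order = rng.permutation(n_distractors + 1)
    true_pos = int(np.argwhere(order == 0)[0, 0])  # where the context landed
    return slots[order].ravel(), true_pos
```

During training the returned position varies from trial to trial, so the encoder must learn to pick out the context signal rather than rely on a fixed slot.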
Shuffled MNIST experiment. The shuffled MNIST experiment [14, 22, 24–26] usually consists of a
number of sequential tasks, all involving classifying handwritten digits from 0 to 9. For each new task,
however, the pixels of the images are randomly shuffled, with the same permutation applied to all digits
within a task and different permutations across tasks. For this experiment, we trained 3- or 4-layer
feed-forward networks with [784-800-10] (3-layer) or [784-800/2000-800/2000-10] (4-layer) neurons
(see Table 1 for details) to minimize the cross-entropy loss by the OWM method. The ReLU activation
function [52] was used in the hidden layers. During training, the L2 regularization coefficient was 0.001
and dropout was applied with a drop rate of 0.2.
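Generating the task sequence amounts to drawing one fixed pixel permutation per task; a sketch (array shapes and the seeding scheme are illustrative):

```python
import numpy as np

def make_shuffled_tasks(images, n_tasks, seed=0):
    """images: array of shape (n_samples, 784). Returns one shuffled
    copy of the dataset per task; the permutation is fixed within a
    task and differs across tasks."""
    rng = np.random.default_rng(seed)
    perms = [rng.permutation(images.shape[1]) for _ in range(n_tasks)]
    return [images[:, p] for p in perms]
```

Each task therefore contains exactly the same pixel values per image, only reordered, which is what makes the input statistics of the tasks mutually interfering for a naively trained network.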
Table 1 shows the performance of the OWM method on the shuffled MNIST tasks in comparison with
other continual learning algorithms. The accuracy of the OWM method was measured by repeating the
experiments 10 times; the results of the other algorithms were adopted from the corresponding
publications. The size of the network, in terms of the number of layers and the number of neurons in
each layer, was kept the same as in previous publications for a fair comparison.
Two-sided t-tests were used to compare performance between the OWM and other continual learning
methods for both the shuffled and disjoint (see below) MNIST experiments. The t values were calculated
according to the means and standard deviations across 10 experiments. Significance was considered at
p < 0.01 with results shown in Table 1.
Disjoint MNIST experiment. In the 2-disjoint MNIST experiment [53], the original MNIST dataset was
divided into two parts: the first contained the digits 0 to 4 and the second the digits 5 to 9.
Correspondingly, the first task was to recognize digits among 0–4 and the second among 5–9. In the
10-disjoint MNIST task, the 10 digits, 0 to 9, were learned sequentially. Again, to facilitate comparison,
the network size and architecture were the same as in previous work [53]. During training, momentum
optimization was applied, and the learning rate for all
layers remained the same during training. Performance was calculated based on 10 repeated experiments
and is shown in Table 2.
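The disjoint protocol simply partitions the data by label; a sketch (the helper name and the group tuples are ours):

```python
import numpy as np

def disjoint_tasks(labels, groups=((0, 1, 2, 3, 4), (5, 6, 7, 8, 9))):
    """Boolean masks selecting, for each task, only the samples whose
    label belongs to that task's digit group."""
    labels = np.asarray(labels)
    return [np.isin(labels, group) for group in groups]
```

For the 10-disjoint variant, `groups` would instead be ten single-digit tuples, one per sequentially learned class.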
Sequential learning of classification tasks with Chinese characters and ImageNet. Classification tasks
with ImageNet and Chinese handwritten characters are more challenging due to the complex structure of
each image and the larger number of classes to "memorize" in a sequential learning task. In sequential
learning, the training of a new task started only when the neural network had accomplished the current
task well enough, defined here as a < 1% accuracy gap between two successive training epochs. For
these two tasks, we first

trained a DNN as the feature extractor on the whole or partial dataset to extract features of each image.
The extracted feature vectors were then fed into a 3-layer classifier with [1024-4000-3755] neurons for the
Chinese characters task and [2048-4000-1000] neurons for the ImageNet task. The classifier was trained to
recognize each of the classes sequentially using the OWM method, with results shown in Supplementary
Table 1. We note that in these experiments, as in other tests mentioned above, no negative samples were
used for training the network to recognize a new class. In other words, only positive samples of a particular
class were presented to the network during training.
Disjoint CIFAR-10 experiment. In contrast to the pre-training of the feature extractors in the Chinese
characters and ImageNet tasks, here the feature extractor was trained together with the classifier in an
end-to-end way using the OWM. The CIFAR-10 dataset was divided into 5 groups, each including 2
classes of samples used to train the whole network in one task. The feature extractor consisted of three
convolutional layers and the classifier of three fully connected layers. The three convolutional layers had
64, 128, and 256 filters, respectively, and the size of the convolution kernel was 2 × 2. A max-pooling
layer of size 2 × 2 was attached to each convolutional layer, and dropout was applied to each max-pooling
layer with a drop probability of 0.2. The extracted features were flattened and then fed to the classifier of
[1000-1000-10] neurons. The activation function for all layers was the ReLU function, and the initial
weights for all layers followed the Xavier initialization method proposed by Glorot and Bengio [51].
Cross-entropy loss was used for training. Table 3 compares the performance of the OWM with other
methods using the same network structure and task.
Context-dependent face recognition with CelebA. In this experiment, we first trained a feature extractor
using the architecture of ResNet50 [54] on the whole training dataset and with the conventional multi-task
training procedure. The outputs of the feature extractor were then fed into the CDP module, which also
received contextual information (cf. Fig. 4a in the main text). The rotator layer contained 5000 neurons.
The size of the encoder layer was [200-5000], with ReLU applied as the activation function. For the
face classification task in the present study, rotated feature vectors were fed directly into the classifier
through the weights W^out. Before training, all weights and biases were randomly initialized. W^out and the weights in
the encoder were modified by the OWM method. Detailed results of classifying individual attributes are
listed in Supplementary Table 5.
Network parameters. Weights in the hidden layers of the classifiers, except for the disjoint CIFAR-10
task, were initialized according to a previously suggested method [55]. The output layers were all
initialized to zero, and the biases of each layer were randomly initialized from a uniform distribution
within (0, 0.1). ReLU neurons were used in every hidden layer in all experiments, and the momentum
in all optimization algorithms was 0.9. The details of the hyperparameters used for the feature extractors
are shown in Supplementary Table 6. Early stopping was used for training both the feature extractors and
the classifiers. The hyperparameters for the OWM method are shown in Supplementary Table 7. For the
tasks with MNIST and CelebA, the classifier was trained to minimize the cross-entropy loss, whereas for
the tasks with ImageNet and Chinese characters, it was trained to minimize the mean squared loss. Note
that the cross-entropy loss was also suitable for the latter datasets; however, the mean squared loss is
easier to compute and less time-consuming when many tasks are involved.
Mixed selectivity analysis. To examine whether neurons in the CDP module exhibit mixed selectivity
similar to that of real PFC neurons, we analyzed their responses during the classification of different
facial attributes. To this end, we chose two attributes with low correlation, i.e., Attractiveness (Task 1)
and Smile (Task 2), both of which have about 50% positive and 50% negative samples in the whole
dataset. The responses of each neuron in the CDP module to different inputs as well as contextual
signals were analyzed with the weights in the encoder

sub-module fixed for both contexts. There were 19962 test pictures, with 90% correctly classified after
training for both tasks. The threshold of excitation for each neuron was chosen as the average activity level
across all neurons during the processing of all correctly-classified pictures. Supplementary Fig. 5 shows
the selectivity of three exemplar neurons. According to the criteria usually used in electrophysiological
experiments, these three neurons belonged to different categories, including task-sensitive (Neuron 1) and
attribute-sensitive (Neuron 2). Importantly, Neuron 3 exhibited complex selectivity towards combinations
of task and sensory attributes, as well as combinations of different attributes. This mixed selectivity is
commonly reported for real PFC neurons [56].

DATA AND CODE AVAILABILITY

All data used in this paper are publicly available and can be accessed at: http://yann.lecun.com/exdb/mni
for the MNIST dataset; https://www.cs.toronto.edu/~kriz/cifar.html for the CIFAR dataset;
http://image-net.org/index for the ILSVR2012 dataset; http://www.nlpr.ia.ac.cn/da for the
CASIA-HWDB dataset; and http://mmlab.ie.cuhk.edu.hk/projects/CelebA.html for the CelebA dataset.
For more details of the datasets, please refer to the references cited in the Dataset section of the
Supplementary Information.
The source code can be accessed at https://github.com/beijixiong3510/OWM.

REFERENCES
[1]Newell, A. Unified theories of cognition (Harvard University Press, 1994).
[2]Miller, G. A., Heise, G. A. & Lichten, W. The intelligibility of speech as a function of the context of
the test materials. Journal of Experimental Psychology 41, 329–335 (1951).
[3]Desimone, R. & Duncan, J. Neural mechanisms of selective visual-attention. Annual Review of
Neuroscience 18, 193–222 (1995).
[4]Mante, V., Sussillo, D., Shenoy, K. V. & Newsome, W. T. Context-dependent computation by
recurrent dynamics in prefrontal cortex. Nature 503, 78–84 (2013).
[5]Siegel, M., Buschman, T. J. & Miller, E. K. Cortical information flow during flexible sensorimotor
decisions. Science 348, 1352–1355 (2015).
[6]Miller, E. K. The prefrontal cortex: Complex neural properties for complex behavior. Neuron 22,
15–17 (1999).
[7]Wise, S. P., Murray, E. A. & Gerfen, C. R. The frontal cortex basal ganglia system in primates.
Critical Reviews in Neurobiology 10, 317–356 (1996).
[8]Passingham, R. The Frontal Lobes and Voluntary Action. Oxford Psychology Series (Oxford
University Press, 1993).
[9]Miller, E. K. & Cohen, J. D. An integrative theory of prefrontal cortex function. Annual Review of
Neuroscience 24, 167–202 (2001).
[10]Miller, E. K. The prefrontal cortex and cognitive control. Nature Reviews Neuroscience 1, 59–65
(2000).
[11]LeCun, Y., Bengio, Y. & Hinton, G. Deep learning. Nature 521, 436–444 (2015).
[12]McCloskey, M. & Cohen, N. J. Catastrophic interference in connectionist networks: The sequential
learning problem, vol. 24, 109–165 (Elsevier, 1989).
[13]Ratcliff, R. Connectionist models of recognition memory - constraints imposed by learning and
forgetting functions. Psychological Review 97, 285–308 (1990).
[14]Goodfellow, I. J., Mirza, M., Xiao, D., Courville, A. & Bengio, Y. An empirical investigation of
catastrophic forgetting in gradient-based neural networks. arXiv preprint arXiv:1312.6211 (2013).
[15]Parisi, G. I., Kemker, R., Part, J. L., Kanan, C. & Wermter, S. Continual lifelong learning with
neural networks: A review. arXiv preprint (2018).

[16]Haykin, S. S. Adaptive filter theory (Pearson Education India, 2008).
[17]Golub, G. H. & Van Loan, C. F. Matrix computations, vol. 3 (JHU Press, 2012).
[18]Singhal, S. & Wu, L. Training feed-forward networks with the extended kalman algorithm. In
Acoustics, Speech, and Signal Processing, 1989. ICASSP-89., 1989 International Conference on,
1187–1190 (IEEE, 1989).
[19]Shah, S., Palmieri, F. & Datum, M. Optimal filtering algorithms for fast learning in feedforward
neural networks. Neural Networks 5, 779–787 (1992).
[20]Sussillo, D. & Abbott, L. F. Generating coherent patterns of activity from chaotic neural networks.
Neuron 63, 544–557 (2009).
[21]Jaeger, H. Controlling recurrent neural networks by conceptors. arXiv preprint (2014).
[22]He, X. & Jaeger, H. Overcoming catastrophic interference using conceptor-aided backpropagation.
In International Conference on Learning Representations (2018).
[23]Nair, V. & Hinton, G. E. Rectified linear units improve restricted boltzmann machines. In
International Conference on International Conference on Machine Learning (2010).
[24]Kirkpatrick, J. et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the
National Academy of Sciences of the United States of America 114, 3521–3526 (2017).
[25]Lee, S.-W., Kim, J.-H., Jun, J., Ha, J.-W. & Zhang, B.-T. Overcoming catastrophic forgetting by
incremental moment matching. In Advances in Neural Information Processing Systems, 4652–4662
(2017).
[26]Zenke, F., Poole, B. & Ganguli, S. Continual learning through synaptic intelligence. In International
Conference on Machine Learning (2017).
[27]Liu, C.-L., Yin, F., Wang, D.-H. & Wang, Q.-F. Chinese handwriting recognition contest 2010. In
Pattern Recognition (CCPR), 2010 Chinese Conference on, 1–5 (IEEE, 2010).
[28]Yin, F., Wang, Q.-F., Zhang, X.-Y. & Liu, C.-L. Icdar 2013 chinese handwriting recognition
competition. In Document Analysis and Recognition (ICDAR), 2013 12th International Conference
on, 1464–1470 (IEEE, 2013).
[29]Masse, N. Y., Grant, G. D. & Freedman, D. J. Alleviating catastrophic forgetting using context-
dependent gating and synaptic stabilization. Proceedings of the National Academy of Sciences 115,
E10467–E10475 (2018).
[30]Hu, W. et al. Overcoming catastrophic forgetting via model adaptation. In International Conference
on Learning Representations (2019).
[31]Fuster, J. The prefrontal cortex (Academic Press, 2015).
[32]Liu, Z., Luo, P., Wang, X. & Tang, X. Deep learning face attributes in the wild. In IEEE International
Conference on Computer Vision, 3730–3738 (2015).
[33]Řehůřek, R. & Sojka, P. Software framework for topic modelling with large corpora. In Proceedings
of the LREC 2010 Workshop on New Challenges for NLP Frameworks, 45–50 (ELRA, Valletta, Malta,
2010).
[34]Lehky, S. R., Kiani, R., Esteky, H. & Tanaka, K. Dimensionality of object representations in monkey
inferotemporal cortex. Neural Computation 26, 2135–2162 (2014).
[35]Freedman, D. J., Riesenhuber, M., Poggio, T. & Miller, E. K. Categorical representation of visual
stimuli in the primate prefrontal cortex. Science 291, 312–316 (2001).
[36]Hung, C. P., Kreiman, G., Poggio, T. & DiCarlo, J. J. Fast readout of object identity from macaque
inferior temporal cortex. Science 310, 863–866 (2005).
[37]Kravitz, D. J., Saleem, K. S., Baker, C. I., Ungerleider, L. G. & Mishkin, M. The ventral visual
pathway: an expanded neural framework for the processing of object quality. Trends in Cognitive
Sciences 17, 26–49 (2013).

[38]Gomez, J. et al. Microstructural proliferation in human cortex is coupled with the development of
face processing. Science 355, 68–71 (2017).
[39]Xu, F. & Tenenbaum, J. B. Word learning as bayesian inference. Psychological Review 114, 245–272
(2007).
[40]Rigotti, M. et al. The importance of mixed selectivity in complex cognitive tasks. Nature 497,
585–590 (2013).
[41]Cichon, J. & Gan, W.-B. Branch-specific dendritic Ca2+ spikes cause persistent synaptic plasticity.
Nature 520, 180–185 (2015).
[42]Rusu, A. A. et al. Progressive neural networks. arXiv preprint (2016).
[43]McClelland, J. L., McNaughton, B. L. & O'Reilly, R. C. Why there are complementary learning
systems in the hippocampus and neocortex: insights from the successes and failures of connectionist
models of learning and memory. Psychological Review 102, 419–457 (1995).
[44]Kumaran, D., Hassabis, D. & McClelland, J. L. What learning systems do intelligent agents need?
Complementary learning systems theory updated. Trends in Cognitive Sciences 20, 512–534 (2016).
[45]Shin, H., Lee, J. K., Kim, J. & Kim, J. Continual learning with deep generative replay. In Advances
in Neural Information Processing Systems, 2990–2999 (2017).
[46]Li, Z. & Hoiem, D. Learning without forgetting. IEEE Transactions on Pattern Analysis and Machine
Intelligence (2017).
[47]Rohrbach, M., Stark, M., Szarvas, G., Gurevych, I. & Schiele, B. What helps where and why?
Semantic relatedness for knowledge transfer. In IEEE Conference on Computer Vision and Pattern
Recognition (2010).
[48]Yosinski, J., Clune, J., Bengio, Y. & Lipson, H. How transferable are features in deep neural networks?
In Advances in neural information processing systems, 3320–3328 (2014).
[49]Hinton, G., Vinyals, O. & Dean, J. Distilling the knowledge in a neural network. arXiv preprint
(2015).
[50]Schwarz, J. et al. Progress & compress: A scalable framework for continual learning. arXiv preprint
arXiv:1805.06370 (2018).
[51]Glorot, X. & Bengio, Y. Understanding the difficulty of training deep feedforward neural networks.
In Proceedings of the thirteenth international conference on artificial intelligence and statistics, 249–
256 (2010).
[52]Nair, V. & Hinton, G. E. Rectified linear units improve restricted boltzmann machines. In Proceedings
of the 27th international conference on machine learning (ICML-10), 807–814 (2010).
[53]Srivastava, R. K., Masci, J., Kazerounian, S., Gomez, F. & Schmidhuber, J. Compete to compute. In
Advances in neural information processing systems, 2310–2318 (2013).
[54]He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition. In IEEE
Conference on Computer Vision and Pattern Recognition, 770–778 (2016).
[55]He, K., Zhang, X., Ren, S. & Sun, J. Delving deep into rectifiers: Surpassing human-level
performance on imagenet classification. In Proceedings of the IEEE international conference on
computer vision, 1026–1034 (2015).
[56]Ramirez-Cardenas, A. & Viswanathan, P. The role of prefrontal mixed selectivity in cognitive control.
Journal of Neuroscience 36, 9013–9015 (2016).

ACKNOWLEDGEMENTS
The authors thank Dr. Danko Nikolić for helpful discussions. This work was supported by the National
Key Research and Development Program of China (2017YFA0105203), Natural Science Foundation of
China (81471368), the Strategic Priority Research Program of the Chinese Academy of Sciences (CAS)
(XDB32040200), and the Hundred-Talent Program of CAS (for S.Y.).

CONTRIBUTIONS
S.Y., Y.C. and G.Z conceived the study and designed the experiments. G.Z. and Y.C. conducted
computational experiments and theoretical analyses. C.B. assisted with some experiments and analyses.
S.Y., Y.C. and G.Z. wrote the paper.

COMPETING INTERESTS
The Institute of Automation, Chinese Academy of Sciences has submitted the patent applications on the
OWM algorithm (application No. PCT/CN2019/083355; invented by Chen Yang, Guanxiong Zeng and
Shan Yu; pending) and the CDP module (application No. PCT/CN2019/083356; invented by Guanxiong
Zeng, Chen Yang and Shan Yu; pending).

