Multitasking Capability Versus Learning Efficiency in Neural Network Architectures
The network consists of two input layers, one of which represents the stimulus presented to the network and another that encodes the task that the network is instructed to perform on the stimulus. Stimulus input features can take any real value between 0 and 1 and can be grouped into stimulus dimensions that are relevant for a particular task. The network is instructed to perform a single task by clamping the corresponding task unit in the task layer to 1 while all other task units are set to 0. These stimulus and task input values are multiplied by a matrix of connection weights from the respective input layer to a shared associative layer, and then passed through a logistic function to determine the pattern of activity over the units in the associative layer. This pattern is then used (together with a set of direct projections from the task layer) to determine the pattern of activity over the output layer. The latter provides a response pattern that is evaluated by computing its mean squared error (MSE) with respect to the correct (task-determined) output pattern. Similar to stimulus features, output units can be grouped into response dimensions that are relevant for a particular task.
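As a rough sketch of this processing scheme (not the authors' implementation), the forward pass might look as follows in NumPy; the layer sizes, the logistic nonlinearity at the output, and the omission of bias terms are assumptions made for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)
N_STIM, N_TASK, N_HIDDEN, N_OUT = 15, 9, 100, 15  # illustrative layer sizes (assumed)

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

# Small random weights from both input layers to the shared associative layer,
# plus direct projections from the task layer to the output layer.
W_stim_hid = rng.normal(0.0, 0.1, (N_HIDDEN, N_STIM))
W_task_hid = rng.normal(0.0, 0.1, (N_HIDDEN, N_TASK))
W_hid_out  = rng.normal(0.0, 0.1, (N_OUT, N_HIDDEN))
W_task_out = rng.normal(0.0, 0.1, (N_OUT, N_TASK))

def forward(stimulus, task_id):
    """Clamp one task unit to 1 (all others 0) and propagate activity forward."""
    task = np.zeros(N_TASK)
    task[task_id] = 1.0
    hidden = logistic(W_stim_hid @ stimulus + W_task_hid @ task)
    output = logistic(W_hid_out @ hidden + W_task_out @ task)
    return output

stimulus = rng.uniform(0.0, 1.0, N_STIM)     # stimulus features in [0, 1]
response = forward(stimulus, task_id=0)
target = np.zeros(N_OUT)                     # correct (task-determined) output pattern
mse = np.mean((response - target) ** 2)      # error used to evaluate the response
```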
Task Environment

Each task is defined as a mapping from a subspace of five stimulus features (referred to as a task-relevant stimulus dimension) onto five output units of a task-specific response dimension, so that only one of the five relevant output units is permitted to be active (see Fig. 1). The value of each stimulus feature is drawn from a uniform distribution U[0, 1]. The rule by which the 5 relevant stimulus features of any task-relevant stimulus dimension are mapped onto one of the 5 output units of the task-relevant response dimension corresponds to a nonlinear function that was randomly generated² with a separate “teacher” network (cf. Seung, Sompolinsky, & Tishby, 1992), and is the same across tasks. However, tasks are considered to be independent in that they differ in which stimulus dimension is linked to which response dimension.

² Note that it is ensured that, for the uniform distribution U[0, 1] of stimulus unit activations in the task-relevant set of input units, every relevant output unit is equally likely to be required for execution.

The task environment across all simulations encompasses nine tasks. As illustrated in Fig. 2, groups of three tasks map onto the same response dimension. However, similarity between tasks could be varied by manipulating the overlap between their relevant stimulus dimensions. At the extremes, task environments can be generated such that tasks of different response dimensions relate to separate stimulus features (no feature overlap, Fig. 2a), or the same stimulus features (full feature overlap, e.g. tasks 1-3 in Fig. 2b).
Figure 2: Task environments. For each task, the network was trained to map a subset of 5 stimulus features onto a subset of 5 output units. At the extremes, tasks that were mapped onto different response dimensions (e.g. tasks 1-3) could either (a) rely on separate stimulus features or (b) completely overlap in terms of their relevant stimulus features.
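For illustration only, a task environment of this kind could be sketched as below; the teacher network's size, its tanh/argmax read-out, and the pairing of three stimulus dimensions with three response dimensions (corresponding to the full-overlap case in Fig. 2b) are assumptions, not the authors' generation procedure.

```python
import numpy as np

rng = np.random.default_rng(1)

# A single random nonlinear "teacher" shared by all tasks: it maps the 5
# task-relevant features onto the index of the one permitted output unit.
W1 = rng.normal(0.0, 1.0, (20, 5))
W2 = rng.normal(0.0, 1.0, (5, 20))

def teacher(features):
    return int(np.argmax(W2 @ np.tanh(W1 @ features)))

# Nine tasks: each pairs a stimulus dimension with a response dimension, and
# groups of three tasks map onto the same response dimension (cf. Fig. 2).
# Partial feature overlap would be introduced by letting stimulus dimensions
# of tasks with different response dimensions share only some features.
tasks = [(stim_dim, resp_dim) for resp_dim in range(3) for stim_dim in range(3)]

def target_pattern(stimulus, task):
    """Task-determined output pattern: one active unit in the relevant response dimension."""
    stim_dim, resp_dim = task
    features = stimulus[5 * stim_dim: 5 * (stim_dim + 1)]
    target = np.zeros(15)
    target[5 * resp_dim + teacher(features)] = 1.0
    return target
```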
Networks are initialized with a set of small random weights and then trained on all tasks using the backpropagation algorithm³ (Rumelhart & Hinton, 1986) to produce the task-specified response for each stimulus.

³ All reported results were obtained using gradient descent to minimize the MSE of each training pattern. However, we observed the same qualitative effects using the cross-entropy loss function.

Multitasking Limitations Due to Shared Structure in the Task Environment

A key feature of neural networks is their ability to discover latent structure in the task environment, exploiting similarity between stimulus features in the form of shared representations (Hinton, 1986; Saxe, McClelland, & Ganguli, 2013). In this section we explore how the emergence of shared representations as a function of structural similarities between tasks can impact the multitasking performance of a network.

Simulation Experiment 1: Shared Task Representations as a Function of Feature Overlap

In order to investigate the effect of structural similarities between tasks we generated task environments with varying overlap between task-relevant stimulus features. We define feature overlap as the number of relevant stimulus features that are shared between any pair of tasks linked to different response dimensions (see Fig. 3a). That is, two tasks involving two different response dimensions could either share no relevant stimulus features (cf. Fig. 2a), all five stimulus features (cf. Fig. 2b), or any whole number of features in between, resulting in 6 different task environments. We trained 100 networks in each of the environments. The networks were trained on all nine tasks with the same set of 50 stimulus samples until the network achieved an MSE of 0.01.

Figure 3: Effects of task similarity. (a) Networks were trained in task environments with varying degrees of feature overlap. Yellow and red shades highlight task-relevant stimulus features for two tasks involving different response dimensions. (b) Final multitasking accuracy of the network as a function of the learned similarity between tasks involving different response dimensions. Colors indicate the degree of feature overlap present in the task environment as illustrated in (a).

In order to assess the similarity of learned task representations we focus our analysis on the weights from the task units to the associative layer, insofar as these reflect the computations carried out by the network required to perform each task. For a given pair of tasks we compute the learned representational similarity between them as the Pearson correlation of their weight vectors to the associative layer.
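As a sketch (assuming the learned task-to-associative weights are stored with one column per task), this similarity measure reduces to a single call:

```python
import numpy as np

def learned_task_similarity(W_task_hid, task_a, task_b):
    """Pearson correlation between two tasks' weight vectors to the associative layer."""
    return np.corrcoef(W_task_hid[:, task_a], W_task_hid[:, task_b])[0, 1]
```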
We measured multitasking performance for pairs of tasks (of different stimulus and response dimensions) by activating two task units at the same time and evaluating the concurrent processing performance in the response dimensions relevant to the two tasks. The accuracy of a single task, A_single, can be computed as

$$A_{\mathrm{single}} = \frac{a_c}{\sum_{i=1}^{5} a_i} \qquad (1)$$

where a_i is the activation of the i-th output unit of the task-relevant response dimension and a_c is the activation of the correct output unit. The multitasking accuracy is simply the mean accuracy of both engaged single tasks.
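A direct transcription of Eq. (1) and of the pairwise multitasking score; the assumption that output units form contiguous blocks of five per response dimension is ours, made for illustration:

```python
import numpy as np

def single_task_accuracy(output, resp_dim, correct_unit):
    """Eq. (1): activation of the correct unit relative to the summed activation
    of the five units in the task-relevant response dimension."""
    a = output[5 * resp_dim: 5 * (resp_dim + 1)]
    return a[correct_unit] / np.sum(a)

def multitasking_accuracy(output, task_a, task_b):
    """Mean single-task accuracy over the two concurrently engaged tasks,
    each given as a (response dimension, correct unit) pair."""
    return np.mean([single_task_accuracy(output, d, c) for d, c in (task_a, task_b)])
```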
The simulation results confirm a well-known observation from explorations of neural networks (Hinton, 1986; McClelland & Rogers, 2003; Saxe et al., 2013): task similarities in the environment can translate into similarities between learned task representations. Critically, this extrinsic bias toward the learning of shared representations negatively affected multitasking performance (Fig. 3b). To illustrate this, consider the simultaneous execution of tasks 1 and 5 in an environment as depicted in Fig. 2b. If the network learns similar representations at the associative layer for tasks 1 and 2 (note that both tasks rely on the same stimulus features), then executing task 1 will implicitly engage the representation of task 2, which in turn causes interference via its link to the response dimension of task 5.

Multitasking Limitations due to Intrinsic Learning Biases

In addition to environmental biases that shape the learning of shared task representations, there may be factors intrinsic to the neural system that can regulate the degree to which such representations are exploited in learning. In this section we introduce a formal analysis of how such biases can affect the tradeoff between learning efficiency and multitasking performance. We then use weight initialization as a learning bias in simulations to establish a causal relationship between the use of shared representations on the one hand, and resulting effects on learning and multitasking on the other.
Figure 4: Gating model used for formal analysis. (a) Task information directly switches on or off task-relevant dimensions in the output and associative layers. This allows input-to-hidden weights to be shared across the M different tasks corresponding to different response dimensions, increasing learning speed by a factor √M. However, two tasks that rely on different input dimensions cannot be multitasked due to crosstalk at the output (convergent red and green arrows). (b) Multitasking ability can be improved by separating response dimensions into Q groups, each with a dedicated set of units in the associative layer. Gating now permits one task from each group to operate concurrently (red and green arrows no longer converge). However, weight sharing is limited to the group, yielding a learning speed of √(M/Q).

Efficiency and Multitasking Capability

To gain formal intuition into the tradeoff between multitasking ability and learning speed, we consider a stripped-down version of the introduced network model that is amenable to analysis. In the full model, nonlinear interactions between the task units and the stimulus units occur in the associative layer. Here we assume a gating model in which these nonlinear interactions are carried out through gating signals that can zero out parts of the activity in the associative and output layers, or pass it through unchanged. The choice of which parts of each layer are gated through on each input is left to the designer (not learned, as in the full model).

We study the scheme depicted in Fig. 4, consisting of M input and response dimensions with full feature overlap (cf. Fig. 2b). For the output layer, we assume that the gating variables automatically zero all but the task-relevant response dimensions. For the associative layer, we separate the hidden units into dimensions, one for each input dimension, and make the gating variables zero all representations except the one coming from the task-relevant input dimension (Fig. 4a). Crucially, when the gating structure is known on a specific example, the output of the network is a linear function of the neurons that are on. Given this setting, the learning dynamics can be solved exactly using methods developed by Saxe, McClelland, and Ganguli (2014). The key advantage afforded by the gating scheme is depicted in Fig. 4a: the input-to-hidden weights for one input dimension can be shared by all tasks that rely on that input dimension. This leads to a factor √M speedup in learning relative to learning a single task by itself (proof omitted due to space constraints).

However, with this gating system, multitasking is not possible: gating another task through to the output will lead to interference. To counteract this, the gating scheme must be changed: response dimensions can be divided into Q groups, each with a dedicated set of hidden units (Fig. 4b). This allows tasks that use response dimensions in different output groups to be performed simultaneously. Hence a maximum of Q tasks can be performed simultaneously, but weight sharing is reduced across tasks by a factor Q, slowing learning.

This analysis provides, at least in a simplified system, a quantitative expression of the fundamental tradeoff between learning speed and multitasking ability. Let t be the number of iterations required to learn all tasks, Q the maximum number of concurrently executable tasks, and M the number of input/response dimensions. Then

$$t^2 \propto Q/M \qquad (2)$$

where the proportionality constant is related to the statistical strength of the input-output association for one task, the learning rate, and the error cut-off used to decide when learning is complete (Saxe, Musslick, & Cohen, 2017).

Due to the tradeoff in Eqn. (2), gating schemes that share more structure will learn more quickly. Hence generic, randomly initialized nonlinear networks will tend to favor shared representations, as shown in Simulation Experiment 1.
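To make the tradeoff concrete, the following sketch tabulates it for an arbitrary illustrative value of M; the √(M/Q) sharing factor and the t ∝ √(Q/M) scaling come from the analysis above, while the specific numbers do not.

```python
import numpy as np

M = 9  # illustrative number of input/response dimensions (not a value from the paper)
for Q in (1, 3, 9):
    sharing_speedup = np.sqrt(M / Q)   # learning speedup from weight sharing within a group
    relative_time = np.sqrt(Q / M)     # t grows as sqrt(Q/M), cf. Eq. (2)
    print(f"Q={Q}: up to {Q} concurrent tasks, "
          f"speedup {sharing_speedup:.2f}x, relative learning time {relative_time:.2f}")
```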
Simulation Experiment 2: Effects of Learning Biases for Shared Representations

In Simulation 2 we focus on a bias intrinsic to the neural system, i.e. the initialization of the weights from the task layer. We use this factor to systematically examine how the use of shared representations facilitates the discovery of similarity structure while diminishing multitasking performance. To do so, we focus initially on a training environment in which tasks are maximally similar, as this is the condition in which there is most opportunity for exploiting shared representations. We then examine environments with 80% and 0% feature overlap between tasks, to test the generality of the observed effects.
To manipulate the bias towards shared task representations, we initialized the weights from the task units to the associative layer, varying the similarity among the weight vectors across tasks with the rationale that greater similarity should produce a greater bias toward the use of shared representations in the associative layer. Weight vectors for tasks relying on the same stimulus input dimensions were randomly initialized to yield a correlation coefficient of value r. The correlation value r was varied from 0 to 0.975 in steps of 0.025 and was used to constrain initial weight similarities for 100 simulated networks per initial condition. The weight vectors for tasks of non-overlapping stimulus dimensions were uncorrelated. Finally, all task weights to the associative layer were scaled by a factor of 5 to enhance the effects of different initial task similarities. The networks were trained using the same parameters as reported for Simulation Experiment 1.
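One simple way to initialize task weight vectors whose expected pairwise correlation is r is to mix a shared component with task-specific noise; the paper does not spell out its exact procedure, so this construction and the layer size are assumptions, while the final scaling by 5 follows the text.

```python
import numpy as np

rng = np.random.default_rng(2)

def correlated_task_weights(n_hidden, n_tasks, r, scale=5.0):
    """Task-to-associative weight vectors with expected pairwise Pearson correlation r."""
    shared = rng.normal(0.0, 1.0, n_hidden)
    W = np.empty((n_hidden, n_tasks))
    for t in range(n_tasks):
        private = rng.normal(0.0, 1.0, n_hidden)
        W[:, t] = np.sqrt(r) * shared + np.sqrt(1.0 - r) * private
    return scale * W

W_task_hid = correlated_task_weights(n_hidden=100, n_tasks=3, r=0.5)
```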
Simulation results indicate that networks with a higher similarity bias tend to develop more similar representations at the associative layer for those tasks (in terms of their final weight vector correlations), whereas a lower similarity bias leads to more distinct task representations at this layer. In environments with high feature overlap between tasks, stronger initial biases toward shared representations lead to increased learning speed (i.e. fewer iterations required to train the network), as similarities between tasks can be exploited (Fig. 5a). Critically, this comes at the cost of multitasking performance. Learning benefits gained from shared representations are less prevalent in environments with less feature overlap between tasks. However, effects of weight similarity biases on multitasking impairments remain (Fig. 5b).
[Figure 5: (a) 100% Feature Overlap; (b) Effects Across Task Environments. Panels plot learning speed and multitasking accuracy (%) against the initial task correlation for environments with 100%, 80%, and 0% feature overlap.]

General Discussion

The sources of this capacity constraint associated with control remain largely unexplored. Here, we build upon the observation that multitasking limitations can arise from shared representations between tasks (Feng et al., 2014; Musslick et al., 2016), and use a combination of formal analysis and neural network simulations to examine biases towards shared representations that incur such costs in multitasking.

In the first part of this study, we build upon early insights of connectionism that shared representations emerge as a function of task similarities in the environment and demonstrate the deleterious consequences for multitasking performance. It has been shown that networks are capable of extracting similarities from a hierarchically structured input space (Hinton, 1986). Recent analytic and empirical work in the domain of semantic cognition paints a similar picture: neural systems may gradually discover shared structure in the task environment, with a bias towards the initial formation of shared, low-dimensional representations (Saxe et al., 2013; McClelland & Rogers, 2003). Our simulation results are in line with these observations, showing that shared task representations emerge as a function of high stimulus feature overlap between tasks, and furthered the insight that such similarities in the task environment lead to multitasking limitations.

In the second part, we examined how intrinsic learning biases towards shared or separate representations (by means of weight initialization) can be used to expose a tradeoff between learning efficiency and multitasking performance. Early work in machine learning suggests that learning biases towards a particular representation can be understood as biases of the learner's hypothesis space (Baxter, 1995), that is, the set of all hypotheses a learner may use to acquire new tasks. We formalized this hypothesis space in terms of the gating schemes introduced in our formal analysis, which trade off learning efficiency and interference-free multitasking. Our neural network simulations confirmed these analytical predictions.