Meta-Learning With Temporal Convolutions
Abstract
Deep neural networks excel in regimes with large amounts of data, but tend to
struggle when data is scarce or when they need to adapt quickly to changes in the
task. Recent work in meta-learning seeks to overcome this shortcoming by training
a meta-learner on a distribution of similar tasks; the goal is for the meta-learner to
generalize to novel but related tasks by learning a high-level strategy that captures
the essence of the problem it is asked to solve. However, most recent approaches
to meta-learning are extensively hand-designed, either using architectures that are
specialized to a particular application, or hard-coding algorithmic components
that tell the meta-learner how to solve the task. We propose a class of simple and
generic meta-learner architectures, based on temporal convolutions, that is domain-
agnostic and has no particular strategy or algorithm encoded into it. We validate our
temporal-convolution-based meta-learner (TCML) through experiments pertaining
to both supervised and reinforcement learning, and demonstrate that it outperforms
state-of-the-art methods that are less general and more complex.
1 Introduction
The ability to learn quickly is a key characteristic that distinguishes human intelligence from its
artificial counterpart. Humans effectively utilize prior knowledge and experiences to learn new skills
quickly. However, artificial learners trained with traditional supervised-learning or reinforcement-
learning methods generally perform poorly when only a small amount of data is available or when
they need to adapt to a changing task.
Meta-learning attempts to resolve this deficiency by broadening the learner’s scope to a distribution of
related tasks. Rather than training the learner on a single task, with the goal of generalizing to unseen
samples from a similar data distribution, a meta-learner is trained on a distribution of similar tasks,
with the goal of learning a strategy that generalizes to related but unseen tasks from a similar task
distribution. A successful traditional learner discovers a rule that generalizes across data points, while
a successful meta-learner devises an algorithm that generalizes across tasks. Few-shot classification
is an example of a supervised setting that lends itself well to meta-learning: when there are very few
labeled examples per class, a meta-learner that learns a high-level strategy based on comparing data
points can significantly outperform a traditional learner that learns a specific mapping from data
points to classes.
Many recently-proposed methods for meta-learning have performed well at the expense of being
hand-designed at either the architectural or algorithmic level. Some have been engineered with a
particular application in mind, while others have aspects of a particular high-level strategy already
hard-coded into them. However, the optimal strategy for an arbitrary range of tasks may not be
obvious to the humans designing a meta-learner, in which case the meta-learner should have the
flexibility to learn the best way to solve the tasks it is presented with. Such a meta-learner would
need to have a versatile model architecture with sufficient expressive capacity in order to be able to
learn a range of strategies in a variety of domains.
∗ denotes equal contribution; authors are listed in alphabetical order.
Our primary contribution is a simple and generic class of model architectures for meta-learning that
are based on temporal convolutions. At its core, our temporal-convolution-based meta-learner, or
TCML, is nothing more than a deep stack of dilated convolutional layers, inspired by similar models
that have been successfully used for applications ranging from image generation to text-to-speech
synthesis to modeling stochastic dynamical systems [28, 27, 17]. We evaluate our TCML on few-shot
image classification and multi-armed bandit problems, demonstrating both its capacity to learn a
variety of algorithms on its own and its competitiveness with the state-of-the-art meta-learning
approaches. It outperforms a number of approaches that are less general and/or more complex.
A meta-learner is trained by optimizing this expected loss over tasks (or mini-batches of tasks)
sampled from T . During testing, the meta-learner is evaluated on unseen tasks from a different task
distribution Te that is similar to the training task distribution T .
In a supervised learning setting, such as regression or classification, each input xt that the meta-learner
receives is a training/test example. The corresponding output at is the meta-learner’s prediction for
the current example. The loss functions Li are regression/classification losses, such as the ℓ2-loss or
cross-entropy loss. The transition distributions are uniform, as supervised learning generally assumes
the examples to be i.i.d.
In reinforcement learning, where the task is often defined by a Markov decision process (MDP), the
inputs xt are the observations and rewards that the meta-learner receives from the environment, and
the output at is a distribution over actions (which can be discrete or continuous), from which the
action to take is sampled. The loss function Li is the negative of the MDP’s reward function, and
the transition distributions depend on the dynamics of the environment (in contrast to the supervised
setting, where current outputs do not affect future inputs).
van den Oord et al. [27] introduced a class of architectures that generate sequential data (in their
case, audio) by performing dilated 1D-convolutions over the temporal dimension. The convolutions
are causal, so that the generated values at the next timestep are only influenced by past timesteps
and not future ones. As evidenced by the use of similar architectures for image generation (van den
Oord et al. [28]) and modeling stochastic dynamical systems (Mishra et al. [17]), these models based
on temporal convolutions (TC) are not only expressive and versatile, but lend themselves well to
problems with an underlying sequential structure. Compared to traditional RNNs, the convolutional
structure offers more direct, high-bandwidth access to past information, allowing them to perform
more sophisticated computation on a fixed temporal segment.
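The causal, dilated convolution described above can be illustrated with a minimal sketch. It is not the implementation used in this work: it operates on scalar sequences in pure Python, and the function names, the zero left-padding, and the filter length of 2 in the usage note are assumptions made for illustration.

```python
from typing import List, Sequence


def causal_dilated_conv1d(
    x: Sequence[float], weights: Sequence[float], dilation: int
) -> List[float]:
    """Causal 1D convolution over a scalar sequence (illustrative sketch).

    weights[0] multiplies the current input x[t]; weights[j] multiplies
    x[t - j * dilation]. Positions before the start of the sequence are
    treated as zeros, so output t never depends on inputs after t.
    """
    out = []
    for t in range(len(x)):
        total = 0.0
        for j, w in enumerate(weights):
            idx = t - j * dilation
            if idx >= 0:  # implicit zero padding on the left
                total += w * x[idx]
        out.append(total)
    return out


def stacked_causal_convs(
    x: Sequence[float],
    layer_weights: Sequence[Sequence[float]],
    dilations: Sequence[int],
) -> List[float]:
    """Apply several causal convolutions in sequence, one per dilation rate."""
    h = list(x)
    for w, d in zip(layer_weights, dilations):
        h = causal_dilated_conv1d(h, w, d)
    return h
```

With three layers of all-ones filters of length 2 and dilations 1, 2, 4, each output is the sum of up to eight most recent inputs, which shows how doubling the dilation widens the temporal window while every output remains independent of future inputs.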
However, one limitation of TC-based architectures is that, to scale to long sequences, the dilation
rates generally increase exponentially, so that the number of layers required scales logarithmically
with the sequence length. This means the model has coarser access to inputs that are further back
in time. To ameliorate this shortcoming, we augment the TC layers with a lightweight attention
mechanism [32], allowing the TCML to attend over its own activations from previous timesteps. We
can think of temporal convolutions as aggregating information from past timesteps, and this attention
mechanism as allowing the model to pinpoint specific pieces of information. Appendix A provides a
detailed description of the attention mechanism.
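To make the idea of attending over activations from previous timesteps concrete, the following is a generic sketch of causal soft attention. It assumes dot-product scoring with a softmax over timesteps 0..t; this is an illustrative choice and should not be read as the exact mechanism of Appendix A. The function name and the list-of-lists representation are hypothetical.

```python
import math
from typing import List, Sequence


def causal_soft_attention(
    queries: Sequence[Sequence[float]],
    keys: Sequence[Sequence[float]],
    values: Sequence[Sequence[float]],
) -> List[List[float]]:
    """For each timestep t, return a weighted average of values[0..t].

    Assumption: weights are a softmax over dot products between the query at
    time t and the keys at times 0..t. Later timesteps are never visible.
    """
    outputs = []
    value_dim = len(values[0])
    for t, q in enumerate(queries):
        scores = [
            sum(qi * ki for qi, ki in zip(q, keys[s])) for s in range(t + 1)
        ]
        top = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(score - top) for score in scores]
        norm = sum(exps)
        weights = [e / norm for e in exps]
        outputs.append(
            [
                sum(w * values[s][d] for s, w in enumerate(weights))
                for d in range(value_dim)
            ]
        )
    return outputs
```

When one past key matches the current query much more strongly than the others, the output is dominated by the corresponding value, which is the "pinpointing" behavior described above.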
[Figure: temporal-convolution meta-learners applied to supervised learning (outputs: predicted labels ŷt) and reinforcement learning (outputs: actions at).]
two images belong to the same class. Vinyals et al. [30] learned an embedding function and used
cosine distance in an attention kernel to judge image similarity. Snell et al. [25] employed a similar
approach to Vinyals et al. [30], but used Euclidean distance with their embedding function. All three
methods work well within the context of classification, but are not readily applicable to other domains,
because a strategy based on comparing data points is built into them.
A number of methods consider a meta-learner that makes updates to the parameters of a traditional
learner [10, 4]. Both Andrychowicz et al. [1] and Li and Malik [16] investigated the setting of
learning to optimize, where the learner is an objective function to minimize, and the meta-learner uses
the gradients of the learner to perform the optimization. Their meta-learner was implemented by an
LSTM and the strategy that it learned can be interpreted as a gradient-based optimization algorithm.
Ravi and Larochelle [20] extended this idea, using a similar LSTM meta-learner in a few-shot
classification setting, where the traditional learner was a convolutional-network-based classifier. In
this setting, the whole meta-learning algorithm is decomposed into two parts: the traditional learner’s
initial parameters are trained to be suitable for fast gradient-based adaptation; the LSTM meta-learner
is trained to be an optimization algorithm adapted for meta-learning tasks. Finn et al. [6] explored
a special case where the meta-learner is constrained to use ordinary gradient descent to update the
learner and showed that this simplified model can achieve very competitive performance. Munkhdalai
and Yu [18] explored a more sophisticated weight update scheme that improved performance.
All of the methods discussed in the previous paragraph have the benefit of being domain independent,
but they explicitly encode a particular strategy for the meta-learner to follow (namely, adaptation via
gradient descent at test time). On a particular domain, there may exist better strategies, but these
methods will be unable to discover them. In contrast, TCML presents an alternative paradigm where
an expressive but generic architecture has the capacity to learn different algorithms on its own.
Graves et al. [8] investigated the use of recurrent neural networks (RNNs) to solve algorithmic tasks.
They experimented with a meta-learner implemented by an LSTM, but their results suggested that
LSTM architectures are ill-equipped for these kinds of tasks. They then designed a more sophisticated
RNN architecture, where a feedforward or LSTM controller was coupled to an external memory
bank from which it can read and write, and demonstrated that these memory-augmented neural
networks (MANNs) achieved substantially better performance than LSTMs. Santoro et al. [21]
evaluated both LSTM and MANN meta-learners on few-shot image classification, and confirmed
the inadequacy of the LSTM architecture. These approaches are generic, but MANNs feature a
complicated memory-addressing architecture that is difficult to train.
Out of all the prior methods we’ve discussed, TCML is closest in spirit to Santoro et al. [21]; however,
our experiments indicate that TCML outperforms such traditional RNN architectures. We can view
the TCML architecture as a flavor of RNN that can remember information through the activations of
the network rather than through an explicit memory module. Because of its convolutional structure,
the TCML better preserves the temporal structure of the inputs it receives, at the expense of only
being able to remember information for a fixed amount of time. However, by exponentially increasing
the dilation factors of the higher convolutional layers (as done by van den Oord et al. [27]), TCML
architectures can tractably store information for long periods of time.
A challenge unique to meta-learning in reinforcement learning (RL) settings is the exploration-
exploitation tradeoff. An RL agent must explore an environment, gathering information about what
the task is and what behaviors are possible, and then eventually exploit its knowledge for maximum
rewards, once confident that it has explored sufficiently. Duan et al. [5] and Wang et al. [31] both
investigated meta-RL and demonstrated that their meta-learners could explore and exploit in the
context of multi-arm bandit problems and finite MDPs. They both used traditional RNN architectures
(GRUs and LSTMs) to implement their meta-learners. In both of these works, the bottleneck to
scaling up meta-RL appears to be the availability of numerous complex and diverse environments for
training, rather than the capacity of the model used.
If the task distribution used in training is not complex enough, the resulting meta-RL-learner may
simply learn an optimal policy for each environment, and then learn to detect which previously-
seen environment it is in, limiting the range of environments that it generalizes to. Finn et al. [6]
experimented with fast adaptation of policies for reinforcement learning, where the meta-learner was
trained on a distribution of closely-related locomotion tasks. It successfully learned to adapt its gait
according to the current environment by learning how to perform system identification.
4 Experiments
• How does TCML’s generality affect its performance on a range of meta-learning tasks?
• How does its performance compare to existing approaches that are specialized to a particular
task domain, or have elements of high-level strategy already built-in?
In the few-shot classification setting, we wish to classify data points into N classes, when we
only have a small number (K) of labeled examples per class. A meta-learner is readily applicable,
because it learns how to compare input points, rather than memorize a specific mapping from points to
classes. Figure 2 illustrates how few-shot image classification fits into the meta-learning formalization
presented in Section 2.1 and our introduction of the TCML in Section 2.2.
[Figure 2 diagram: image-label pairs (i0, y0), (i1, y1), (i2, y2), (i3, --) pass through a learned
embedding network to give feature-label pairs (x0, y0), (x1, y1), (x2, y2), (x3, --), which the
TCML maps to predicted labels.]
Figure 2: An episode of few-shot image classification using a TCML. Its inputs are a sequence of
feature vectors x1 , . . . , xt (produced by an embedding network φ from images i1 , . . . , it ) and their
labels y1 , . . . , yt−1 . Qualitatively, to make the correct prediction at time t = 3, the TCML would
need to (i) determine that x3 , x0 look more similar than x3 , x1 or than x3 , x2 , and then (ii) remember
that the label for x0 was y0 , and (iii) accordingly output y0 as its prediction for ŷ3 . Thus, for the
TCML to succeed at few-shot image classification, it needs to learn how to compare images and
evaluate their similarity, but nowhere is such a comparison-based strategy built into the model.
Omniglot and mini-Imagenet are two datasets for few-shot image classification. Introduced by Lake
et al. [15], Omniglot consists of black-and-white images of handwritten characters gathered from
50 languages, for a total of 1632 different classes with 20 instances per class. Like prior works, we
downsampled the images to 28 × 28 and randomly selected 1200 classes for training and 432 for
testing. We performed the same data augmentation proposed by Santoro et al. [21], forming new
classes by rotating each member of an existing class by a multiple of 90 degrees. Mini-ImageNet
is a more difficult benchmark, consisting of 84 × 84 color images from 100 different classes with
600 instances per class. It comprises a subset of the well-known ImageNet dataset, providing the
complexity of ImageNet images without the need for substantial computational resources. We used
the split released by Ravi and Larochelle [20] with 64 classes for training, 16 for validation, and 20
for testing.
To evaluate a TCML on the N-way, K-shot problem, we sampled N classes and K examples of each
class, and fed the corresponding NK example-label pairs to the model in a random order, followed
by a new, unlabeled example from one of the N classes. We report the average accuracy on this last,
(NK + 1)-th timestep.
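The episode construction just described can be sketched as follows. This is an illustration under stated assumptions, not the data pipeline used in the experiments: the function name, the dictionary-of-lists dataset layout, the episode-local integer labels, and the choice to draw the query from examples not already shown are all hypothetical.

```python
import random
from typing import Dict, List, Optional, Tuple


def sample_episode(
    data: Dict[str, List[object]], n_way: int, k_shot: int, rng: random.Random
) -> Tuple[List[Tuple[object, Optional[int]]], int]:
    """Build one N-way, K-shot episode of length N*K + 1.

    Returns (sequence, target): the first N*K entries are (example, label)
    pairs in random order; the last entry is (query_example, None), and
    `target` is the label the model should predict for it.
    """
    classes = rng.sample(sorted(data), n_way)
    support: List[Tuple[object, Optional[int]]] = []
    candidates: List[Tuple[object, int]] = []
    for label, cls in enumerate(classes):  # labels are local to the episode
        picks = rng.sample(data[cls], k_shot + 1)
        support.extend((example, label) for example in picks[:k_shot])
        candidates.append((picks[k_shot], label))  # held out, never shown
    rng.shuffle(support)
    query_example, target = rng.choice(candidates)
    return support + [(query_example, None)], target
```

Accuracy is then measured only on the final element of the sequence, matching the (NK + 1)-th timestep in the text.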
We tested the TCML on 5-way Omniglot, 20-way Omniglot, and 5-way mini-Imagenet. For each
of these three splits, we trained the TCML on episodes where the number of shots K was chosen
uniformly at random from 1 to 5 (note that this is unlike prior works, which train separate models
for each shot). For a K-shot episode within an N-way problem, the loss was simply the average
cross-entropy between the predicted and true label on the (NK + 1)-th timestep. For a complete
description of our TCML architecture, we refer the reader to Appendix B.
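As a small worked example of the loss above, the sketch below computes the cross-entropy at the final timestep of a single episode from unnormalized class scores. The function name and the assumption that the model emits one list of N logits per timestep are hypothetical; averaging over episodes in a mini-batch is left to the caller.

```python
import math
from typing import Sequence


def final_step_cross_entropy(
    logits_per_step: Sequence[Sequence[float]], true_label: int
) -> float:
    """Cross-entropy between the softmax of the last timestep's logits and
    the true label; earlier timesteps do not contribute to the loss."""
    logits = logits_per_step[-1]
    top = max(logits)  # log-sum-exp with max subtraction for stability
    log_norm = top + math.log(sum(math.exp(v - top) for v in logits))
    return log_norm - logits[true_label]
```

For a 5-way episode with uniform logits the loss equals log 5, the value expected from a model that has learned nothing about the query.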
Table 1 displays our results on 5-way and 20-way Omniglot, and Table 2 has the results for 5-way
mini-Imagenet. We see that the TCML outperforms state-of-the-art methods that are extensively
hand-designed, and/or domain-specific. It significantly exceeds the performance of methods such as
Santoro et al. [21] that are similarly simple and generic.
Table 1: 5-way and 20-way, 1-shot and 5-shot classification accuracies on Omniglot, with 95%
confidence intervals where reported. For each task, the best-performing method is highlighted, along
with any others whose confidence intervals overlap.
Table 2: 5-way, 1-shot and 5-shot classification accuracies on mini-Imagenet, with
95% confidence intervals where reported. For each task, the best-performing method is highlighted,
along with any others whose confidence intervals overlap.
For Omniglot, our embedding network used a similar architecture to the one proposed in Vinyals
et al. [30], consisting of 4 blocks of {3 × 3 conv (64 filters), batch normalization, leaky ReLU
activation (leak 0.1), and 2 × 2 max-pooling}. The output was then flattened and passed through a
fully-connected layer to get a 64-dimensional feature vector.
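The spatial arithmetic of this embedding can be checked with a short sketch. It assumes that each 3 × 3 convolution uses "same" padding (so it preserves height and width) and that 2 × 2 max-pooling uses stride 2 and rounds down on odd sizes; neither detail is stated above, and the function name is hypothetical.

```python
from typing import Tuple


def embedding_output_shape(
    height: int, width: int, num_blocks: int = 4, filters: int = 64, pool: int = 2
) -> Tuple[int, int, int]:
    """Shape (height, width, channels) after the stacked conv blocks.

    Assumptions: the 3x3 conv keeps the spatial size ("same" padding), and
    max-pooling divides each spatial dimension by `pool`, rounding down.
    """
    for _ in range(num_blocks):
        height //= pool
        width //= pool
    return height, width, filters
```

Under these assumptions a 28 × 28 Omniglot image shrinks as 28, 14, 7, 3, 1, so the flattened output of the last block has 1 × 1 × 64 = 64 entries before the fully-connected layer.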
For mini-ImageNet, our embedding network took inspiration from architectures commonly used
for the full ImageNet dataset by He et al. [9] and Simonyan and Zisserman [24], as we found that
the shallower architectures used by previous works did not make full use of the TCML’s expressive
capacity. Our model architecture is described in detail in Appendix C.
In all cases, the TCML and embedding were trained jointly end-to-end using Adam [13].
In addition to comparisons with prior work, we also considered a number of ablations:
• Using nearest-neighbors on the features learned by the Omniglot embedding. With Euclidean
distances, this achieves 65.1% and 67.1% on 1-shot and 5-shot, respectively. With cosine
distances, it achieves 67.7% and 68.3%. This indicates that the TCML learned a more
sophisticated strategy for performing comparisons than a traditional distance metric.
• Replacing the TCML with a stacked LSTM. We experimented with a similar number
of parameters to the TCML, varying the number of layers and their sizes. The best
model achieved 78.1% on 1-shot Omniglot and 90.8% on 5-shot, demonstrating the low
information-bandwidth capacity of traditional RNN architectures.
• Using the method of Finn et al. [6] (which relies on the built-in strategy of adaptation via
gradient descent at test time) with our mini-Imagenet embedding architecture. They ran the
experiment for us using the original implementation of their method. The performance was
substantially worse than the numbers they reported using the shallower embedding (30% on
1-shot mini-Imagenet), indicating that the TCML’s performance cannot be solely attributed
to the deeper embedding.
• Using the attention mechanism without the TC layers. This is related to the method of
Vinyals et al. [30]; they modify the feature vectors before applying attention using an LSTM,
and explicitly use the attention operation to form their prediction as a linear combination
of the known labels. Accordingly, we found that this baseline’s performance was slightly
worse than the reported performance of Vinyals et al. [30].
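The nearest-neighbors ablation in the first bullet above can be sketched as follows. This is an illustrative version only: the function names are hypothetical, feature vectors are plain lists of floats, and the cosine distance assumes non-zero vectors.

```python
import math
from typing import Callable, List, Sequence, Tuple


def euclidean_distance(a: Sequence[float], b: Sequence[float]) -> float:
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """One minus cosine similarity; assumes neither vector is all zeros."""
    dot = sum(ai * bi for ai, bi in zip(a, b))
    norm_a = math.sqrt(sum(ai * ai for ai in a))
    norm_b = math.sqrt(sum(bi * bi for bi in b))
    return 1.0 - dot / (norm_a * norm_b)


def nearest_neighbor_label(
    query: Sequence[float],
    support: List[Tuple[Sequence[float], str]],
    distance: Callable[[Sequence[float], Sequence[float]], float],
) -> str:
    """Predict the label of the labeled feature vector closest to the query."""
    _, label = min(support, key=lambda pair: distance(query, pair[0]))
    return label
```

Such a classifier applies a fixed metric to the learned features, whereas the TCML is free to learn its own comparison rule from the same features.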
K one-hot vector indicating the arm selected at the previous timestep, and the corresponding reward.
It outputs a discrete probability distribution over the K arms; the arm to select is determined by
sampling from this distribution.
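The input and output interface described here can be sketched in a few lines. The function names are hypothetical, and two details are assumptions not stated above: at the first timestep, when no arm has been pulled yet, the one-hot part is all zeros, and the model's outputs are treated as unnormalized log-probabilities.

```python
import math
import random
from typing import List, Optional, Sequence


def encode_observation(
    prev_arm: Optional[int], prev_reward: float, num_arms: int
) -> List[float]:
    """Concatenate a one-hot vector for the previous arm with its reward.

    Assumption: before any arm has been pulled (prev_arm is None), the
    one-hot part is all zeros.
    """
    one_hot = [0.0] * num_arms
    if prev_arm is not None:
        one_hot[prev_arm] = 1.0
    return one_hot + [float(prev_reward)]


def sample_arm(log_probs: Sequence[float], rng: random.Random) -> int:
    """Sample an arm index from (possibly unnormalized) log-probabilities."""
    top = max(log_probs)
    weights = [math.exp(v - top) for v in log_probs]  # proportional to probs
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]
```

Sampling rather than taking the most probable arm lets the policy keep exploring early in an episode, when its distribution over arms is still broad.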
We compared our TCML to the GRU architecture used by Duan et al. [5], and several baselines
with varying optimality guarantees. The Gittins index [7] is the Bayes optimal solution in the
discounted, infinite-horizon setting, and serves as an oracle against which to benchmark performance.
We trained our TCML with trust region policy optimization (Schulman et al. [23]) using the same
hyperparameters as Duan et al. [5], and tested all combinations of N = 10, 100, 500 and K =
5, 10, 50. To test the scalability of our TCML as compared to traditional RNN architectures, we also
tested N = 1000, K = 50. For a detailed description of the other baselines and other experimental
details, we refer the reader to Duan et al. [5]’s paper. Appendix D describes our TCML architecture.
The results are summarized in Table 3. For all baseline methods, we report the values from Duan
et al. [5] (the last row is the only exception, since they did not test that scenario). The values reported
are the average reward per episode of length N , averaged over 1000 episodes, with 95% confidence
intervals. We observe that the gap between the Gittins index (oracle) and the meta-learners grows
with the size of the problem. However, there is a significant gap between TCML and Duan et al. [5]
in the N = 1000, K = 50 setting, which suggests that the GRU architecture they used is nearing its
capacity. Figure 3 displays a learning curve for this scenario, where the difference is quite pronounced;
the TCML learns both faster and better than the GRU.
Environment         Random  ε-greedy  UCB-1  Gittins  Duan et al. [5]  Ours
N = 10, K = 5       5.0     6.6       6.7    6.6      6.7              6.68 ± 0.14
N = 10, K = 10      5.0     6.6       6.7    6.6      6.7              6.73 ± 0.13
N = 10, K = 50      5.1     6.5       6.6    6.5      6.8              6.72 ± 0.14
N = 100, K = 5      49.9    75.4      78.0   78.3     78.7             79.11 ± 1.04
N = 100, K = 10     49.9    77.4      82.4   82.8     83.5             83.46 ± 0.80
N = 100, K = 50     49.8    78.3      84.3   85.2     84.9             85.15 ± 0.64
N = 500, K = 5      249.8   388.2     405.8  405.8    401.5            408.09 ± 4.92
N = 500, K = 10     249.0   408.0     438.9  437.8    432.5            432.42 ± 3.45
N = 500, K = 50     249.6   413.6     457.6  463.7    438.9            442.58 ± 2.49
N = 1000, K = 50    499.8   845.2     926.3  944.1    847.43 ± 6.86    889.79 ± 5.63
Table 3: Performance on a range of multi-arm bandit problems. For each, we highlighted the best
performing method, and any others whose performance is not statistically significantly different.
5 Conclusion and Future Work
We presented a simple and generic class of model architectures for meta-learning. By evaluating
them on few-shot image classification and multi-arm bandit problems, we demonstrated that these
temporal-convolution-based meta-learners, or TCMLs, are versatile enough to perform well in a variety
of domains and can surpass the performance of methods that are substantially more hand-engineered
or are tailored to a particular domain.
Meta-learning holds enormous potential as an avenue towards artificial agents learning general-
purpose algorithms and reusing prior knowledge for efficient learning of new skills. TCMLs have
shown themselves to be an encouraging meta-learning technique, despite their simplicity and general-
ity. Directions for future work include exploring their performance in more complex domains, and
we hope that they inspire many advances in meta-learning.
Acknowledgements
We thank Yan Duan and Prafulla Dhariwal for feedback on rough drafts of the paper, Chelsea Finn for
both feedback on the paper and help with an ablation experiment, and Hugo Larochelle for insightful
feedback on the first version of the paper, incorporated in this revision. This work was funded in part
by the Office of Naval Research (N000141612723).
References
[1] Marcin Andrychowicz, Misha Denil, Sergio Gomez, Matthew W Hoffman, David Pfau, Tom
Schaul, and Nando de Freitas. Learning to learn by gradient descent by gradient descent. In
Advances in Neural Information Processing Systems (NIPS), 2016.
[2] Jean-Yves Audibert and Rémi Munos. Introduction to bandits: Algorithms and theory. ICML
Tutorial on bandits, 2011.
[3] Peter Auer. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine
Learning Research, 2002.
[4] Samy Bengio, Yoshua Bengio, Jocelyn Cloutier, and Jan Gecsei. On the optimization of a
synaptic learning rule. In Optimality in Artificial and Biological Neural Networks, pages 6–8.
Univ. of Texas, 1992.
[5] Yan Duan, John Schulman, Xi Chen, Peter L. Bartlett, Ilya Sutskever, and Pieter Abbeel. RL$^2$:
Fast reinforcement learning via slow reinforcement learning. arXiv preprint arXiv:1611.02779,
2016.
[6] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta learning. arXiv preprint
arXiv:1703.03400, 2017.
[7] J.C. Gittins. Bandit processes and dynamic allocation indices. Journal of the Royal Statistical
Society. Series B (Methodological), 1979.
[8] Alex Graves, Greg Wayne, and Ivo Danihelka. Neural turing machines. arXiv preprint
arXiv:1410.5401, 2014.
[9] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image
recognition. In Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[10] Sepp Hochreiter, A Younger, and Peter Conwell. Learning to learn using gradient descent.
Artificial Neural Networks—ICANN 2001, pages 87–94, 2001.
[11] Gao Huang, Zhuang Liu, Kilian Q Weinberger, and Laurens van der Maaten. Densely connected
convolutional networks. arXiv preprint arXiv:1608.06993, 2016.
[12] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training
by reducing internal covariate shift. International Conference on Machine Learning (ICML),
2015.
[13] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International
Conference on Learning Representations (ICLR), 2015.
[14] Gregory Koch. Siamese neural networks for one-shot image recognition. PhD thesis, University
of Toronto, 2015.
[15] Brenden M Lake, Ruslan Salakhutdinov, Jason Gross, and Joshua B Tenenbaum. One shot
learning of simple visual concepts. In CogSci, 2011.
[16] Ke Li and Jitendra Malik. Learning to optimize. International Conference on Learning
Representations (ICLR), 2017.
[17] Nikhil Mishra, Pieter Abbeel, and Igor Mordatch. Prediction and control with temporal segment
models. arXiv preprint arXiv:1703.04070, 2016.
[18] Tsendsuren Munkhdalai and Hong Yu. Meta networks. arXiv preprint arXiv:1703.00837, 2017.
[19] Devang K Naik and RJ Mammone. Meta-neural networks that learn by learning. In Neural
Networks, 1992. IJCNN., International Joint Conference on, volume 1, pages 437–442. IEEE,
1992.
[20] Sachin Ravi and Hugo Larochelle. Optimization as a model for few-shot learning. In Interna-
tional Conference on Learning Representations (ICLR), 2017.
[21] Adam Santoro, Sergey Bartunov, Matthew Botvinick, Daan Wierstra, and Timothy Lillicrap.
Meta-learning with memory-augmented neural networks. In International Conference on
Machine Learning (ICML), 2016.
[22] Jurgen Schmidhuber. Evolutionary principles in self-referential learning. (On learning how to
learn: The meta-meta-... hook.) Diploma thesis, Institut f. Informatik, Tech. Univ. Munich, 1987.
[23] John Schulman, Sergey Levine, Philipp Moritz, Michael Jordan, and Pieter Abbeel. Trust region
policy optimization. International Conference on Machine Learning (ICML), 2015.
[24] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale
image recognition. International Conference on Learning Representations (ICLR), 2014.
[25] Jake Snell, Kevin Swersky, and Richard S Zemel. Prototypical networks for few-shot learning.
arXiv preprint arXiv:1703.05175, 2017.
[26] Sebastian Thrun and Lorien Pratt. Learning to learn: Introduction and overview. In Learning to
learn. Springer, 1998.
[27] Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves,
Nal Kalchbrenner, Andrew W. Senior, and Koray Kavukcuoglu. Wavenet: A generative model
for raw audio. CoRR, abs/1609.03499, 2016. URL http://arxiv.org/abs/1609.03499.
[28] Aaron van den Oord, Nal Kalchbrenner, Lasse Espeholt, Oriol Vinyals, Alex Graves, et al.
Conditional image generation with pixelcnn decoders. In Advances in Neural Information
Processing Systems (NIPS), 2016.
[29] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez,
Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. arXiv preprint arXiv:1706.03762,
2017.
[30] Oriol Vinyals, Charles Blundell, Tim Lillicrap, Daan Wierstra, et al. Matching networks for one
shot learning. In Advances in Neural Information Processing Systems (NIPS), 2016.
[31] Jane X Wang, Zeb Kurth-Nelson, Dhruva Tirumala, Hubert Soyer, Joel Z Leibo, Remi Munos,
Charles Blundell, Dharshan Kumaran, and Matt Botvinick. Learning to reinforcement learn.
arXiv preprint arXiv:1611.05763, 2016.
[32] Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov,
Rich Zemel, and Yoshua Bengio. Show, attend and tell: Neural image caption generation with
visual attention. In International Conference on Machine Learning, 2015.
Appendix
In our experiments, we found that setting q = st worked just as well as passing it through a feed-
forward network; therefore, we report results using the former scheme. We used this attention
mechanism after the final convolutional layers (although it could conceivably be used elsewhere
and/or at multiple layers in the network) and set d′ = 16 and d′ = 48 for 5-way and 20-way tasks
respectively.
[Figure 4 diagram. Panels: (a) Residual Block, dilation rate R; (b) Dense Block (dilation rate R);
(c) TCML Architecture for Few-Shot Image Classification. Labels in the diagram: causal conv with
dilation R and 128 filters; output shapes [B, T, D] and [B, T, D + 128]; input examples X0, X1, ..., XH;
input labels Y0, ..., YH-1; attention keys and values; 1x1 conv with 16 filters; predicted labels.]
Figure 4: The TCML architecture we used for few-shot image classification. Where input and
output shapes are indicated, the dimensions are (batch size, B) × (sequence length, T ) × (number of
channels, D). From left to right: (a) A residual block performs a causal convolution and adds the
output to the input. (b) The dense blocks in our model utilize a series of causal convolutions and
residual blocks like in (a), and then concatenate the output to the inputs. (c) The full model primarily
consists of a series of dense blocks. We used the same TCML architecture for both datasets. For
20-way Omniglot we added 2 extra dense blocks at the end with dilation 32 and 64 to account for
the increased sequence length.
C Embedding Network Architecture for mini-Imagenet
Our embedding network for mini-Imagenet also used residual connections. We used residual blocks
that contained a series of convolution layers, followed by a residual connection and then a 2 × 2
max-pooling operation. The structure of the residual blocks, as well as the entire network architecture,
is illustrated in Figure 5. We used dropout of 0.5 in the later layers (indicated in the figure) to help
prevent overfitting.
[Figure 5 diagram. Labels in the diagram: 84x84x3 image (input); batch norm; Add; 1x1 conv, 512 filters; dropout 0.5.]
Figure 5: Left (a), the residual blocks used in our mini-Imagenet embedding, and right (b) the entire
network architecture, which had 14 layers and about 3 million parameters. Its input is an 84 × 84
color image and it outputs a 512-dimensional feature vector.
D TCML Architecture for Multi-Armed Bandits
The TCML architecture we used for multi-armed bandit experiments was a smaller but closely-related
version of the one illustrated in Figure 4.
For the bandit problem with N timesteps and K arms, we chose the number of layers to be min{x :
2^x ≥ N}, to ensure that the network’s receptive field was at least as long as the length of the episode.
We did not use residual connections or attention, but used dense connections between every causal
layer. All causal convolutions had 64 channels, and the dilation factor doubled at each layer. We used
a 1 × 1 convolution with 128 filters before the first causal layer, and a 1 × 1 convolution with K filters
after the last causal layer (the latter outputs the log-probability of choosing each of the K arms).
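The layer-count rule above can be worked through numerically. The sketch below assumes causal filters of length 2 (a WaveNet-style choice that is not stated in this appendix), under which a stack with dilations 1, 2, 4, ... has a receptive field of exactly 2^x timesteps for x layers. The function names are hypothetical.

```python
from typing import List


def num_causal_layers(episode_length: int) -> int:
    """Smallest x such that 2**x >= episode_length."""
    x = 0
    while 2 ** x < episode_length:
        x += 1
    return x


def dilation_schedule(episode_length: int) -> List[int]:
    """Dilation factors that double at each layer: 1, 2, 4, ..."""
    return [2 ** i for i in range(num_causal_layers(episode_length))]


def receptive_field(dilations: List[int], kernel_size: int = 2) -> int:
    """Number of timesteps visible to the top layer of a causal stack.

    Assumption: every layer uses the same kernel size (2 by default).
    """
    return 1 + (kernel_size - 1) * sum(dilations)
```

Under this assumption the episode lengths used in the experiments give 4 layers for N = 10 (receptive field 16), 7 for N = 100 (128), 9 for N = 500 (512), and 10 for N = 1000 (1024), so the receptive field always covers the whole episode.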