Taskonomy: Disentangling Task Transfer Learning
Amir R. Zamir1,2 Alexander Sax1∗ William Shen1∗ Leonidas Guibas1 Jitendra Malik2 Silvio Savarese1
1 Stanford University    2 University of California, Berkeley
http://taskonomy.vision/
are still largely unknown. The relationships are non-trivial, and finding them is complicated by the fact that we have imperfect learning models and optimizers. In this paper, we attempt to shed light on this underlying structure and present a framework for mapping the space of visual tasks. Here what we mean by “structure” is a collection of computationally found relations specifying which tasks supply useful information to another, and by how much (see Fig. 1).

We employ a fully computational approach for this purpose, with neural networks as the adopted computational function class. In a feedforward network, each layer successively forms more abstract representations of the input containing the information needed for mapping the input to the output. These representations, however, can transmit statistics useful for solving other outputs (tasks), presumably if the tasks are related in some form [80, 17, 56, 44]. This is the basis of our approach: we compute an affinity matrix among tasks based on whether the solution for one task can be sufficiently easily read out of the representation trained for another task. Such transfers are exhaustively sampled, and a Binary Integer Programming formulation extracts a globally efficient transfer policy from them. We show this model leads to solving tasks with far less data than learning them independently, and the resulting structure holds on common datasets (ImageNet [75] and Places [101]).

Being fully computational and representation-based, the proposed approach avoids imposing prior (possibly incorrect) assumptions on the task space. This is crucial because the priors about task relations are often derived from either human intuition or analytical knowledge, while neural networks need not operate on the same principles [60, 31, 38, 43, 99, 85]. For instance, although we might expect depth to transfer to surface normals better (derivatives are easy), the opposite is found to be the better direction in a computational framework (i.e. it suited neural networks better).

An interactive taxonomy solver which uses our model to suggest data-efficient curricula, a live demo, dataset, and code are available at http://taskonomy.vision/.

2. Related Work

Assertions of the existence of a structure among tasks date back to the early years of modern computer science, e.g. with Turing arguing for using learning elements [92, 95] rather than the final outcome, or Jean Piaget’s works on developmental stages using previously learned stages as sources [71, 37, 36], and have extended to recent works [73, 70, 48, 16, 94, 58, 9, 63]. Here we make an attempt to actually find this structure. We acknowledge that this is related to a breadth of topics, e.g. compositional modeling [33, 8, 11, 21, 53, 89, 87], homomorphic cryptography [40], lifelong learning [90, 13, 82, 81], functional maps [68], certain aspects of Bayesian inference and Dirichlet processes [52, 88, 87, 86, 35, 37], few-shot learning [78, 23, 22, 67, 83], transfer learning [72, 81, 27, 61, 64, 57], and un/semi/self-supervised learning [20, 6, 15, 100, 17, 80], which are studied across various fields [70, 91, 10]. We review the topics most pertinent to vision within the constraints of space:

Self-supervised learning methods leverage the inherent relationships between tasks to learn a desired expensive one (e.g. object detection) via a cheap surrogate (e.g. colorization) [65, 69, 15, 100, 97, 66]. Specifically, they use a manually-entered local part of the structure in the task space (as the surrogate task is manually defined). In contrast, our approach models this large space of tasks in a computational manner and can discover obscure relationships.

Unsupervised learning is concerned with the redundancies in the input domain and leveraging them for forming compact representations, which are usually agnostic to the downstream task [6, 47, 18, 7, 30, 74]. Our approach is not unsupervised by definition as it is not agnostic to the tasks. Instead, it models the space the tasks belong to and in a way utilizes the functional redundancies among tasks.

Meta-learning generally seeks to perform learning at a level higher than where conventional learning occurs, e.g. as employed in reinforcement learning [19, 29, 26], optimization [2, 79, 46], or certain architectural mechanisms [25, 28, 84, 62]. The motivation behind meta-learning has similarities to ours, and our outcome can be seen as a computational meta-structure of the space of tasks.

Multi-task learning targets developing systems that can provide multiple outputs for an input in one run [48, 16]. Multi-task learning has experienced recent progress, and the reported advantages are another support for the existence of a useful structure among tasks [90, 97, 48, 73, 70, 16, 94, 58, 9, 63]. Unlike multi-task learning, we explicitly model the relations among tasks and extract a meta-structure. The large number of tasks we consider also makes developing one multi-task network for all infeasible.

Domain adaptation seeks to render a function that is developed on a certain domain applicable to another [42, 96, 5, 77, 50, 24, 34]. It often addresses a shift in the input domain, e.g. webcam images to D-SLR [45], while the task is kept the same. In contrast, our framework is concerned with the output (task) space, hence it can be viewed as task/output adaptation. We also perform the adaptation in a larger space, among many elements, rather than two or a few.

3. Method

We define the problem as follows: we want to maximize the collective performance on a set of tasks T = {t1, ..., tn}, subject to the constraint that we have a limited supervision budget γ (due to financial, computational, or time constraints). We define our supervision budget γ to be the maximum allowable number of tasks that we are willing to train from scratch (i.e. source tasks). The task dictionary is defined as V = T ∪ S, where T is the set of tasks which we want solved (target) and S is the set of tasks that can be trained (source). Therefore, T − T ∩ S are the tasks that
Figure 2: Computational modeling of task relations and creating the taxonomy. From left to right: I. Train task-specific networks. II. Train (first order and higher) transfer functions among tasks in a latent space. III. Get normalized transfer affinities using AHP (Analytic Hierarchy Process). IV. Find global transfer taxonomy using BIP (Binary Integer Program).
we want solved but cannot train (“target-only”), T ∩ S are the tasks that we want solved but could play as source too, and S − T ∩ S are the “source-only” tasks which may optionally be used if they increase the performance on T.

The task taxonomy (taskonomy) is a computationally found directed hypergraph that captures the notion of task transferability over any given task dictionary. An edge between a group of source tasks and a target task represents a feasible transfer case, and its weight is the prediction of its performance. We use these edges to estimate the globally optimal transfer policy to solve T. Taxonomy produces a family of such graphs, parameterized by the available supervision budget, chosen tasks, transfer orders, and transfer functions’ expressiveness.
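As a rough illustration of this structure, one way such a hypergraph could be stored is a mapping from (source group, target) hyperedges to predicted transfer performance; the task names and weights below are hypothetical, not values from the paper.

```python
from typing import Dict, FrozenSet, Tuple

# One hyperedge = (group of source tasks, target task); its value is the
# predicted performance of transferring from that group to the target.
TransferEdge = Tuple[FrozenSet[str], str]

taxonomy_edges: Dict[TransferEdge, float] = {
    (frozenset({"curvature"}), "normals"): 0.84,                     # 1st-order edge
    (frozenset({"occlusion_edges", "curvature"}), "normals"): 0.91,  # 2nd-order edge
}

# A transfer policy picks, for each target, one incoming hyperedge whose
# sources fit within the supervision budget.
```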
Taxonomy is built using a four-step process depicted in Fig. 2. In stage I, a task-specific network for each task in S is trained. In stage II, all feasible transfers between sources and targets are trained. We include higher-order transfers, which use multiple input tasks to transfer to one target. In stage III, the task affinities acquired from transfer function performances are normalized, and in stage IV, we synthesize a hypergraph which can predict the performance of any transfer policy and optimize for the optimal one.
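Stage IV is solved with a Binary Integer Program; as a sketch of what that optimization decides, the brute-force stand-in below picks which sources to train from scratch under the budget and assigns each target its best transfer, for a toy first-order-only setting (the affinity table and its values are hypothetical).

```python
from itertools import combinations

def best_policy(targets, sources, affinity, budget):
    """Exhaustive stand-in for the BIP: try every choice of `budget` source
    tasks to train from scratch, give each target its best available source,
    and keep the choice with the highest total normalized affinity."""
    best_score, best = float("-inf"), None
    for chosen in combinations(sources, budget):
        policy = {t: max(chosen, key=lambda s: affinity[t][s]) for t in targets}
        score = sum(affinity[t][s] for t, s in policy.items())
        if score > best_score:
            best_score, best = score, (set(chosen), policy)
    return best  # (source tasks to fully supervise, per-target transfer choice)
```

Exhaustive search is only viable for very small dictionaries; an integer-programming solver handles realistic sizes.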
A vision task is an abstraction read from a raw image. We denote a task t more formally as a function ft which maps image I to ft(I). Our dataset, D, contains for each task t a set of training pairs (I, ft(I)), e.g. (image, depth).

Task Dictionary: Our mapping of task space is done via 26 tasks included in the dictionary, chosen so that they cover common themes in computer vision (2D, 3D, semantics, etc.) to elucidate fine-grained structures of the task space. See Fig. 3 for some of the tasks, with a detailed definition of each task provided in the supplementary material.

Figure 3: Task Dictionary. Outputs of 24 (of 26) task-specific networks for a query (top left). See results of applying them frame-wise on a video here.

It is critical to note the task dictionary is meant to be a sampled set, not an exhaustive list, from a denser space of all conceivable visual tasks. This gives us a tractable way to sparsely model a dense space, and the hypothesis is that (subject to a proper sampling) the derived model should generalize to out-of-dictionary tasks. The more regular / better sampled the space, the better the generalization. We evaluate this in Sec. 4.2 with supportive results. For an evaluation of the robustness of results w.r.t. the choice of dictionary, see the supplementary material.

Dataset: We need a dataset that has annotations for every task on every image. Training all of our tasks on exactly the same pixels eliminates the possibility that the observed transferabilities are affected by different input data peculiarities rather than only task intrinsics. We created a dataset of 4 million images of indoor scenes from about 600 buildings; every image has an annotation for every task. The images are registered on and aligned with building-wide
meshes similar to [3, 98, 12], enabling us to programmatically compute the ground truth for many tasks without human labeling. For the tasks that still require labels (e.g. scene classes), we generate them using Knowledge Distillation [41] from known methods [101, 55, 54, 75]. See the supplementary material for full details of the process and a user study on the final quality of labels generated using Knowledge Distillation (showing < 7% error).

Figure 4: Transfer Function. A small readout function is trained to map representations of the source task’s frozen encoder to the target task’s labels. If order > 1, the transfer function receives representations from multiple sources.

Figure 5: Transfer results to surface normals and 2.5D segmentation from 5 different source tasks (panel columns: Input, Truth, Specific, Reshade, Layout, 2D Segm., Autoenc., Scratch). The spread in transferability among sources is apparent. “Scratch” was trained from scratch without transfer learning.
3.1. Step I: Task-Specific Modeling

We train a fully supervised task-specific network for each task in S. Task-specific networks have an encoder-decoder architecture homogeneous across all tasks, where the encoder is large enough to extract powerful representations, and the decoder is large enough to achieve a good performance but is much smaller than the encoder.
3.2. Step II: Transfer Modeling

Given a source task s and a target task t, where s ∈ S and t ∈ T, a transfer network learns a small readout function for t given a statistic computed for s (see Fig. 4). The statistic is the representation for image I from the encoder of s: Es(I). The readout function (Ds→t) is parameterized by θs→t minimizing the loss Lt:

D_{s→t} := arg min_θ E_{I∈D} [ Lt( Dθ(Es(I)), ft(I) ) ],    (1)

where ft(I) is the ground truth of t for image I. Es(I) may or may not be sufficient for solving t, depending on the relation between t and s (examples in Fig. 5). Thus, the performance of Ds→t is a useful metric of task affinity. We train transfer functions for all feasible source-target combinations.
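As a rough, hypothetical illustration of Eq. (1), the sketch below trains a low-capacity readout on top of a frozen source encoder; the layer widths, optimizer, and helper names are illustrative assumptions, not the paper’s exact specification.

```python
import torch
import torch.nn as nn

class TransferReadout(nn.Module):
    """Shallow 2-conv readout D_theta mapping frozen source features to the target."""
    def __init__(self, in_channels: int, out_channels: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 64, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(64, out_channels, kernel_size=3, padding=1),
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        return self.net(feats)

def train_transfer(source_encoder, readout, loader, loss_fn, steps=1000):
    """Minimize E_I[ L_t(D_theta(E_s(I)), f_t(I)) ] with the source encoder frozen."""
    source_encoder.eval()
    for p in source_encoder.parameters():
        p.requires_grad_(False)                        # E_s stays fixed
    opt = torch.optim.Adam(readout.parameters(), lr=1e-4)
    for step, (image, target_label) in zip(range(steps), loader):
        with torch.no_grad():
            feats = source_encoder(image)              # E_s(I)
        loss = loss_fn(readout(feats), target_label)   # L_t(D_theta(E_s(I)), f_t(I))
        opt.zero_grad()
        loss.backward()
        opt.step()
    return readout
```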
Accessibility: For a transfer to be successful, the latent representation of the source should both be inclusive of sufficient information for solving the target and have the information accessible, i.e. easily extractable (otherwise, the raw image or its compression-based representations would be optimal). Thus, it is crucial for us to adopt a low-capacity (small) architecture as the transfer function and train it with a small amount of data, in order to measure transferability conditioned on being highly accessible. We use a shallow fully convolutional network and train it with little data (8x to 120x less than the task-specific networks).
Higher-Order Transfers: Multiple source tasks can contain complementary information for solving a target task (see examples in Fig. 6). We include higher-order transfers, which are the same as first order but receive multiple representations in the input. Thus, our transfers are functions D : ℘(S) → T, where ℘ is the powerset operator.

As there is a combinatorial explosion in the number of feasible higher-order transfers (|T| × (|S| choose k) for kth order), we employ a sampling procedure with the goal of filtering out higher-order transfers that are less likely to yield good results, without training them. We use a beam search: for transfers of order k ≤ 5 to a target, we select its 5 best sources (according to 1st order performances) and include all of their order-k combinations. For k ≥ 5, we use a beam of size 1 and compute the transfer from the top k sources. We also tested transitive transfers (s → t1 → t2), which showed they do not improve the results and thus were not included in our model (results in supplementary material).
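A small sketch of this sampling rule, assuming a hypothetical first_order_perf[target][source] table of first-order transfer scores (higher is better):

```python
from itertools import combinations

def candidate_higher_order_transfers(first_order_perf, k):
    """Per target: all k-combinations of its 5 best first-order sources for
    k <= 5; a single beam (the top-k sources) for k >= 5."""
    candidates = {}
    for target, per_source in first_order_perf.items():
        ranked = sorted(per_source, key=per_source.get, reverse=True)
        if k <= 5:
            candidates[target] = list(combinations(ranked[:5], k))
        else:
            candidates[target] = [tuple(ranked[:k])]
        # each source tuple is then trained as a single higher-order transfer
    return candidates
```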
3.3. Step III: Ordinal Normalization using Analytic Hierarchy Process (AHP)

We want to have an affinity matrix of transferabilities across tasks. Aggregating the raw losses/evaluations Ls→t from transfer functions into a matrix is obviously problematic as they have vastly different scales and live in different spaces (see Fig. 7, left). Hence, a proper normalization is needed. A naive solution would be to linearly rescale each row of the matrix to the range [0, 1]. This approach fails when the actual output quality increases at different speeds w.r.t. the loss. As the loss-quality curve is generally unknown, such approaches to normalization are ineffective. Instead, we use an ordinal approach in which the output quality and loss are only assumed to change monotonically. For each t, we construct Wt, a pairwise tournament matrix between all feasible sources for transferring to t. The element at (i, j) is the percentage of images in a held-out test set, Dtest, on which si transferred to t better than sj did (i.e. Dsi→t(I) > Dsj→t(I)).

We clip this intermediate pairwise matrix Wt to be in [0.001, 0.999] as a form of Laplace smoothing. Then we divide element-wise, Wt′ = Wt / WtT, so that the matrix shows how many times better si is compared to sj. The final tournament ratio matrix Wt′ is positive reciprocal, with each element wi,j′ being this win ratio.
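A minimal NumPy sketch of the tournament construction above; the closing principal-eigenvector step follows standard AHP [76] and is an assumption here about how the per-target affinities are extracted, not a quotation of the paper’s exact procedure.

```python
import numpy as np

def ahp_normalize(wins: np.ndarray) -> np.ndarray:
    """wins[i, j]: fraction of held-out images on which source i transfers to
    the target better than source j does."""
    W = np.clip(wins, 0.001, 0.999)   # Laplace-style smoothing
    W_ratio = W / W.T                 # how many times better s_i is than s_j
    eigvals, eigvecs = np.linalg.eig(W_ratio)
    principal = np.abs(np.real(eigvecs[:, np.argmax(np.real(eigvals))]))
    return principal / principal.sum()   # normalized affinities of the sources for this target
```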
Figure 6: Higher-Order Transfers. Representations can contain complementary information. E.g. by transferring simultaneously from 3D Edges and Curvature, individual stairs were brought out. See our publicly available interactive transfer visualization page for more examples. (Panels: Image, GT (Normals), Fully Supervised, {Occlusion Edges + Curvature} = 2nd order transfer; Image, GT (Reshade), Fully Supervised, {3D Keypoints + Surface Normals} = 2nd order transfer.)

Figure 7: First-order task affinity matrix before (left) and after (right) Analytic Hierarchy Process (AHP) normalization. Lower means better transferred. For visualization, we use the standard affinity-distance method dist = e^(−β·P) (where β = 20 and e is element-wise matrix exponential). See the supplementary material for the full matrix with higher-order transfers.
(Figure 8 shows four columns for supervision budgets 2, 8, 15, and 26, two rows for transfer orders 1 and 4, and a zoomed view of the budget-8, order-4 taxonomy.)
Figure 8: Computed taxonomies for solving 22 tasks given various supervision budgets (x-axes) and maximum allowed transfer orders (y-axes). One is magnified for better visibility. Nodes with incoming edges are target tasks, and the number of their incoming edges is the order of their chosen transfer function. Still transferring to some targets when the budget is 26 (full budget) means certain transfers started performing better than their fully supervised task-specific counterpart. See the interactive solver website for color coding of the nodes by Gain and Quality metrics. Dimmed nodes are the source-only tasks, and thus only participate in the taxonomy if found worthwhile by the BIP optimization to be one of the sources.
4. Experiments

With 26 tasks in the dictionary (4 source-only tasks), our approach leads to training 26 fully supervised task-specific networks, 22 × 25 transfer networks in 1st order, and 22 × (25 choose k) for kth order, from which we sample according to the procedure in Sec. 3. The total number of transfer functions trained for the taxonomy was ∼3,000, which took 47,886 GPU hours on the cloud.

Out of 26 tasks, we usually use the following 4 as source-only tasks (described in Sec. 3) in the experiments: colorization, jigsaw puzzle, in-painting, random projection. However, the method is applicable to an arbitrary partitioning of the dictionary into T and S. The interactive solver website allows the user to specify any desired partition.

Network Architectures: We preserved the architectural and training details across tasks as homogeneously as possible to avoid injecting any bias. The encoder architecture is identical across all task-specific networks and is a fully convolutional ResNet-50 without pooling. All transfer functions include identical shallow networks with 2 conv layers (concatenated channel-wise if higher-order). The loss (Lt) and the decoder’s architecture, though, have to depend on the task as the output structures of different tasks vary; for all pixel-to-pixel tasks, e.g. normal estimation, the decoder is a 15-layer fully convolutional network; for low dimensional tasks, e.g. vanishing points, it consists of 2-3 FC layers. All networks are trained using the same hyperparameters regardless of task and on exactly the same input images. Tasks with more than one input, e.g. relative camera pose, share weights between the encoder towers. Transfer networks are all trained using the same hyperparameters as the task-specific networks, except that we anneal the learning rate earlier since they train much faster. Detailed definitions of architectures, training process, and experiments with different encoders can be found in the supplementary material.
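For reference, the architectural choices stated above, collected into a hypothetical configuration sketch (only what the text specifies; channel widths and other details are left out):

```python
ARCHITECTURE = {
    "encoder": "fully convolutional ResNet-50, no pooling (identical for all task-specific nets)",
    "decoder": {
        "pixel-to-pixel tasks (e.g. normals)": "15-layer fully convolutional network",
        "low-dimensional tasks (e.g. vanishing points)": "2-3 fully connected layers",
    },
    "transfer_function": "2 conv layers; inputs concatenated channel-wise for higher-order transfers",
    "training": "same hyperparameters and same input images for every task; "
                "multi-input tasks share weights across encoder towers",
}
```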
Data Splits: Our dataset includes 4 million images. We made publicly available the models trained on the full dataset, but for the experiments reported in the main paper, we used a subset of the dataset, as the extracted structure stabilized and did not change when using more data (explained in Sec. 5.2). The used subset is partitioned into training (120k), validation (16k), and test (17k) images, each from non-overlapping sets of buildings. Our task-specific networks are trained on the training set and the transfer networks are trained on a subset of the validation set, ranging from 1k images to 16k, in order to model the transfer patterns under different data regimes. In the main paper, we report all results under the 16k transfer supervision regime (∼10% of the split) and defer the additional sizes to the supplementary material and website (see Sec. 5.2). Transfer functions are evaluated on the test set.

How good are the trained task-specific networks? Win rate (%) is the proportion of test set images for which a baseline is beaten. Table 1 provides win rates of the task-specific networks vs. two baselines. Visual outputs for a random test sample are in Fig. 3. The high win rates in Table 1 and qualitative results show the networks are well trained and stable and can be relied upon for modeling the task space. See results of applying the networks on a YouTube video frame-by-frame here. A live demo for user-uploaded queries is available here.
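A minimal sketch of the win-rate statistic just defined, assuming per-image losses are available for both methods (lower is better); the same statistic underlies the Gain and Quality metrics of Sec. 4.1.

```python
import numpy as np

def win_rate(method_losses: np.ndarray, baseline_losses: np.ndarray) -> float:
    """Percentage of test images on which the method beats the baseline."""
    return 100.0 * float(np.mean(method_losses < baseline_losses))
```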
Task         avg    rand     Task         avg    rand     Task            avg    rand
Denoising    100    99.9     Layout       99.6   89.1     Scene Class.    97.0   93.4
Autoenc.     100    99.8     2D Edges     100    99.9     Occ. Edges      100    95.4
Reshading    94.9   95.2     Pose (fix)   76.3   79.5     Pose (nonfix)   60.2   61.9
Inpainting   99.9   -        2D Segm.     97.7   95.7     2.5D Segm.      94.2   89.4
Curvature    78.7   93.4     Matching     86.8   84.6     Egomotion       67.5   72.3
Normals      99.4   99.5     Vanishing    99.5   96.4     2D Keypnt.      99.8   99.4
Z-Depth      92.3   91.1     Distance     92.4   92.1     3D Keypnt.      96.0   96.9
Mean         92.4   90.9

Table 1: Task-Specific Networks’ Sanity: win rates (%) vs. a random (Gaussian) network representation readout (“rand”) and a statistically informed guess (“avg”).
Comparing our depth estimator vs. the released models of [51] led to outperforming [51] with a win rate of 88% and losses of 0.35 vs. 0.47 (further details in the supplementary material). In general, we found the task-specific networks to perform on par or better than state-of-the-art for many of the tasks, though we do not formally benchmark or claim this.

Figure 9: Evaluation of taxonomy computed for solving the full task dictionary. Gain (left) and Quality (right) values for each task using the policy suggested by the computed taxonomy, as the supervision budget increases (→). Shown for transfer orders 1 and 4.

4.1. Evaluation of Computed Taxonomies

Fig. 8 shows the computed taxonomies optimized to solve the full dictionary, i.e. all tasks are placed in T and S (except for the 4 source-only tasks, which are in S only). This was done for various supervision budgets (columns) and maximum allowed transfer orders (rows).

While Fig. 8 shows the structure and connectivity, Fig. 9 quantifies the results of taxonomy-recommended transfer policies by two metrics, Gain and Quality, defined as:
Gain: win rate (%) against a network trained from scratch using the same training data as the transfer networks’. That is, the best that could be done if transfer learning was not utilized. This quantifies the value gained by transferring.
Quality: win rate (%) against a fully supervised network trained with 120k images (gold standard).

Each column in Fig. 9 shows a supervision budget. As apparent, good results can be achieved even when the supervision budget is notably smaller than the number of solved tasks, and as the budget increases, results improve (as expected). Results are shown for 2 maximum allowed orders.

4.2. Generalization to Novel Tasks

The taxonomies in Sec. 4.1 were optimized for solving all tasks in the dictionary. In many situations, a practitioner is interested in a single task which may not even be in the dictionary. Here we evaluate how the taxonomy transfers to a novel out-of-dictionary task with little data.

This is done in an all-for-one scenario where we put one task in T and all others in S. The task in T is target-only and has no task-specific network. Its limited data (16k) is used to train small transfer networks to sources. This basically localizes where the target would be in the taxonomy.

Figure 10: Generalization to Novel Tasks. Each row shows a novel test task. Left: Gain and Quality values using the devised “all-for-one” transfer policies for novel tasks for orders 1-4. Right: Win rates (%) of the transfer policy over various self-supervised methods, ImageNet features, and scratch are shown in the colored rows. Note the large margin of win by taxonomy. The uncolored rows show corresponding loss values. (Novel tasks shown: Depth, Scene Cls., Sem. Segm., Object Cls., Curvature, Egomotion, Layout; compared against Zamir [97], Wang [93], Noroozi [65], Zhang [100], Agrawal [1], ImageNet features [49], scratch, and full supervision.)

Fig. 10 (left) shows the Gain and Quality of the transfer policy found by the BIP for each task. Fig. 10 (right) compares the taxonomy suggested policy against some of the best existing self-supervised methods [93, 100, 65, 97, 1], ImageNet FC7 features [49], training from scratch, and a fully supervised network (gold standard).

The results in Fig. 10 (right) are noteworthy. The large win margin for taxonomy shows that carefully selecting transfer policies depending on the target is superior to fixed transfers, such as the ones employed by self-supervised methods. ImageNet features, which are the most popular off-the-shelf features in vision, are also outperformed by those policies. Additionally, though the taxonomy transfer policies lose to fully supervised networks (gold standard) in most cases, the results often get close, with win rates in the 40% range. These observations suggest the space has a rather predictable and strong structure. For graph visualization of
(Figure panels: Taxonomy Significance Test; Transferring to ImageNet (Spearman’s correlation = 0.823); Transferring to MIT Places (Spearman’s correlation = 0.857); y-axes: Top-1 and Top-5 Accuracy.)
References
[1] P. Agrawal, J. Carreira, and J. Malik. Learning to see by moving. In Proceedings of the IEEE International Conference on Computer Vision, pages 37–45, 2015.
[2] M. Andrychowicz, M. Denil, S. Gomez, M. W. Hoffman, D. Pfau, T. Schaul, and N. de Freitas. Learning to learn by gradient descent by gradient descent. In Advances in Neural Information Processing Systems, pages 3981–3989, 2016.
[3] I. Armeni, S. Sax, A. R. Zamir, and S. Savarese. Joint 2d-3d-semantic data for indoor scene understanding. arXiv preprint arXiv:1702.01105, 2017.
[4] S. Arora, A. Bhaskara, R. Ge, and T. Ma. Provable bounds for learning some deep representations. In International Conference on Machine Learning, pages 584–592, 2014.
[5] Y. Aytar and A. Zisserman. Tabula rasa: Model transfer for object category detection. In Computer Vision (ICCV), 2011 IEEE International Conference on, pages 2252–2259. IEEE, 2011.
[6] Y. Bengio, A. Courville, and P. Vincent. Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1798–1828, 2013.
[7] P. Berkhin et al. A survey of clustering data mining techniques. Grouping Multidimensional Data, 25:71, 2006.
[8] E. Bienenstock, S. Geman, and D. Potter. Compositionality, MDL priors, and object recognition. In Advances in Neural Information Processing Systems, pages 838–844, 1997.
[9] H. Bilen and A. Vedaldi. Integrated perception with recurrent multi-task neural networks. In Advances in Neural Information Processing Systems, pages 235–243, 2016.
[10] J. Bingel and A. Søgaard. Identifying beneficial task relations for multi-task learning in deep neural networks. arXiv preprint arXiv:1702.08303, 2017.
[11] O. Boiman and M. Irani. Similarity by composition. In Advances in Neural Information Processing Systems, pages 177–184, 2007.
[12] A. Chang, A. Dai, T. Funkhouser, M. Halber, M. Nießner, M. Savva, S. Song, A. Zeng, and Y. Zhang. Matterport3D: Learning from RGB-D data in indoor environments. arXiv preprint arXiv:1709.06158, 2017.
[13] Z. Chen and B. Liu. Lifelong Machine Learning. Morgan & Claypool Publishers, 2016.
[14] I. I. CPLEX. V12.1: User's manual for CPLEX. International Business Machines Corporation, 46(53):157, 2009.
[15] C. Doersch, A. Gupta, and A. A. Efros. Unsupervised visual representation learning by context prediction. In Proceedings of the IEEE International Conference on Computer Vision, pages 1422–1430, 2015.
[16] C. Doersch and A. Zisserman. Multi-task self-supervised visual learning. arXiv preprint arXiv:1708.07860, 2017.
[17] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell. DeCAF: A deep convolutional activation feature for generic visual recognition. In International Conference on Machine Learning, pages 647–655, 2014.
[18] J. Donahue, P. Krähenbühl, and T. Darrell. Adversarial feature learning. arXiv preprint arXiv:1605.09782, 2016.
[19] Y. Duan, J. Schulman, X. Chen, P. L. Bartlett, I. Sutskever, and P. Abbeel. RL2: Fast reinforcement learning via slow reinforcement learning. arXiv preprint arXiv:1611.02779, 2016.
[20] D. Erhan, Y. Bengio, A. Courville, P.-A. Manzagol, P. Vincent, and S. Bengio. Why does unsupervised pre-training help deep learning? Journal of Machine Learning Research, 11(Feb):625–660, 2010.
[21] A. Faktor and M. Irani. Clustering by composition – unsupervised discovery of image categories. In European Conference on Computer Vision, pages 474–487. Springer, 2012.
[22] L. Fe-Fei et al. A Bayesian approach to unsupervised one-shot learning of object categories. In Computer Vision, 2003. Proceedings. Ninth IEEE International Conference on, pages 1134–1141. IEEE, 2003.
[23] L. Fei-Fei, R. Fergus, and P. Perona. One-shot learning of object categories. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(4):594–611, 2006.
[24] B. Fernando, A. Habrard, M. Sebban, and T. Tuytelaars. Unsupervised visual domain adaptation using subspace alignment. In Proceedings of the IEEE International Conference on Computer Vision, pages 2960–2967, 2013.
[25] C. Finn, P. Abbeel, and S. Levine. Model-agnostic meta-learning for fast adaptation of deep networks. arXiv preprint arXiv:1703.03400, 2017.
[26] C. Finn, S. Levine, and P. Abbeel. Guided cost learning: Deep inverse optimal control via policy optimization. CoRR, abs/1603.00448, 2016.
[27] C. Finn, X. Y. Tan, Y. Duan, T. Darrell, S. Levine, and P. Abbeel. Deep spatial autoencoders for visuomotor learning. In Robotics and Automation (ICRA), 2016 IEEE International Conference on, pages 512–519. IEEE, 2016.
[28] C. Finn, T. Yu, J. Fu, P. Abbeel, and S. Levine. Generalizing skills with semi-supervised reinforcement learning. CoRR, abs/1612.00429, 2016.
[29] C. Finn, T. Yu, T. Zhang, P. Abbeel, and S. Levine. One-shot visual imitation learning via meta-learning. CoRR, abs/1709.04905, 2017.
[30] I. K. Fodor. A survey of dimension reduction techniques. Technical report, Lawrence Livermore National Lab., CA (US), 2002.
[31] R. M. French. Catastrophic forgetting in connectionist networks: Causes, consequences and solutions. Trends in Cognitive Sciences, 3(4):128–135, 1999.
[32] R. Ge. Provable algorithms for machine learning problems. PhD thesis, Princeton University, 2013.
[33] S. Geman, D. F. Potter, and Z. Chi. Composition systems. Quarterly of Applied Mathematics, 60(4):707–736, 2002.
[34] R. Gopalan, R. Li, and R. Chellappa. Domain adaptation for object recognition: An unsupervised approach. In Computer Vision (ICCV), 2011 IEEE International Conference on, pages 999–1006. IEEE, 2011.
[35] A. Gopnik, C. Glymour, D. Sobel, L. Schulz, T. Kushnir, and D. Danks. A theory of causal learning in children: Causal maps and Bayes nets. 111:3–32, 2004.
[36] A. Gopnik, C. Glymour, D. M. Sobel, L. E. Schulz, T. Kushnir, and D. Danks. A theory of causal learning in children: Causal maps and Bayes nets. Psychological Review, 111(1):3, 2004.
[37] A. Gopnik, A. N. Meltzoff, and P. K. Kuhl. The Scientist in the Crib: Minds, Brains, and How Children Learn. William Morrow & Co, 1999.
[38] A. Graves, G. Wayne, and I. Danihelka. Neural turing machines. CoRR, abs/1410.5401, 2014.
[39] I. Gurobi Optimization. Gurobi optimizer reference manual, 2016.
[40] K. Henry. The theory and applications of homomorphic cryptography. Master's thesis, University of Waterloo, 2008.
[41] G. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
[42] J. Hoffman, T. Darrell, and K. Saenko. Continuous manifold based adaptation for evolving visual domains. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 867–874, 2014.
[43] Y. Hoshen and S. Peleg. Visual learning of arithmetic operations. CoRR, abs/1506.02264, 2015.
[44] F. Hu, G.-S. Xia, J. Hu, and L. Zhang. Transferring deep convolutional neural networks for the scene classification of high-resolution remote sensing imagery. Remote Sensing, 7(11):14680–14707, 2015.
[45] I.-H. Jhuo, D. Liu, D. Lee, and S.-F. Chang. Robust visual domain adaptation with low-rank reconstruction. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 2168–2175. IEEE, 2012.
[46] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014.
[47] D. P. Kingma and M. Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
[48] I. Kokkinos. UberNet: Training a universal convolutional neural network for low-, mid-, and high-level vision using diverse datasets and limited memory. arXiv preprint arXiv:1609.02132, 2016.
[49] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
[50] B. Kulis, K. Saenko, and T. Darrell. What you saw is not what you get: Domain adaptation using asymmetric kernel transforms. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 1785–1792. IEEE, 2011.
[51] I. Laina, C. Rupprecht, V. Belagiannis, F. Tombari, and N. Navab. Deeper depth prediction with fully convolutional residual networks. In 3D Vision (3DV), 2016 Fourth International Conference on, pages 239–248. IEEE, 2016.
[52] B. M. Lake, R. Salakhutdinov, and J. B. Tenenbaum. Human-level concept learning through probabilistic program induction. Science, 350(6266):1332–1338, 2015.
[53] B. M. Lake, T. D. Ullman, J. B. Tenenbaum, and S. J. Gershman. Building machines that learn and think like people. Behavioral and Brain Sciences, pages 1–101, 2016.
[54] Y. Li, H. Qi, J. Dai, X. Ji, and Y. Wei. Fully convolutional instance-aware semantic segmentation. arXiv preprint arXiv:1611.07709, 2016.
[55] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pages 740–755. Springer, 2014.
[56] F. Liu, G. Lin, and C. Shen. CRF learning with CNN features for image segmentation. CoRR, abs/1503.08263, 2015.
[57] Z. Luo, Y. Zou, J. Hoffman, and L. Fei-Fei. Label efficient learning of transferable representations across domains and tasks.
[58] J. Malik, P. Arbeláez, J. Carreira, K. Fragkiadaki, R. Girshick, G. Gkioxari, S. Gupta, B. Hariharan, A. Kar, and S. Tulsiani. The three Rs of computer vision: Recognition, reconstruction and reorganization. Pattern Recognition Letters, 72:4–14, 2016.
[59] N. Masuda, M. A. Porter, and R. Lambiotte. Random walks and diffusion on networks. Physics Reports, 716-717:1–58, 2017.
[60] M. McCloskey and N. J. Cohen. Catastrophic interference in connectionist networks: The sequential learning problem. The Psychology of Learning and Motivation, 24:104–169, 1989.
[61] L. Mihalkova, T. Huynh, and R. J. Mooney. Mapping and revising Markov logic networks for transfer learning. In AAAI, volume 7, pages 608–614, 2007.
[62] T. Mikolov, Q. V. Le, and I. Sutskever. Exploiting similarities among languages for machine translation. CoRR, abs/1309.4168, 2013.
[63] I. Misra, A. Shrivastava, A. Gupta, and M. Hebert. Cross-stitch networks for multi-task learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3994–4003, 2016.
[64] A. Niculescu-Mizil and R. Caruana. Inductive transfer for Bayesian network structure learning. In Artificial Intelligence and Statistics, pages 339–346, 2007.
[65] M. Noroozi and P. Favaro. Unsupervised learning of visual representations by solving jigsaw puzzles. In European Conference on Computer Vision, pages 69–84. Springer, 2016.
[66] M. Noroozi, H. Pirsiavash, and P. Favaro. Representation learning by learning to count. arXiv preprint arXiv:1708.06734, 2017.
[67] M. Norouzi, T. Mikolov, S. Bengio, Y. Singer, J. Shlens, A. Frome, G. S. Corrado, and J. Dean. Zero-shot learning by convex combination of semantic embeddings. arXiv preprint arXiv:1312.5650, 2013.
[68] M. Ovsjanikov, M. Ben-Chen, J. Solomon, A. Butscher, and L. Guibas. Functional maps: A flexible representation of maps between shapes. ACM Transactions on Graphics (TOG), 31(4):30, 2012.
[69] D. Pathak, P. Krahenbuhl, J. Donahue, T. Darrell, and A. A. Efros. Context encoders: Feature learning by inpainting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2536–2544, 2016.
[70] A. Pentina and C. H. Lampert. Multi-task learning with labeled and unlabeled tasks. stat, 1050:1, 2017.
[71] J. Piaget and M. Cook. The Origins of Intelligence in Children, volume 8. International Universities Press, New York, 1952.
[72] L. Y. Pratt. Discriminability-based transfer between neural networks. In Advances in Neural Information Processing Systems, pages 204–211, 1993.
[73] S. R. Richter, Z. Hayder, and V. Koltun. Playing for benchmarks. In International Conference on Computer Vision (ICCV), 2017.
[74] S. T. Roweis and L. K. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500):2323–2326, 2000.
[75] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
[76] R. W. Saaty. The analytic hierarchy process – what it is and how it is used. Mathematical Modelling, 9(3-5):161–176, 1987.
[77] K. Saenko, B. Kulis, M. Fritz, and T. Darrell. Adapting visual category models to new domains. Computer Vision – ECCV 2010, pages 213–226, 2010.
[78] R. Salakhutdinov, J. Tenenbaum, and A. Torralba. One-shot learning with a hierarchical nonparametric Bayesian model. In Proceedings of ICML Workshop on Unsupervised and Transfer Learning, pages 195–206, 2012.
[79] J. Schulman, S. Levine, P. Moritz, M. I. Jordan, and P. Abbeel. Trust region policy optimization. CoRR, abs/1502.05477, 2015.
[80] A. Sharif Razavian, H. Azizpour, J. Sullivan, and S. Carlsson. CNN features off-the-shelf: An astounding baseline for recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 806–813, 2014.
[81] D. L. Silver and K. P. Bennett. Guest editors' introduction: Special issue on inductive transfer learning. Machine Learning, 73(3):215–220, 2008.
[82] D. L. Silver, Q. Yang, and L. Li. Lifelong machine learning systems: Beyond learning algorithms. In AAAI Spring Symposium Series, 2013.
[83] R. Socher, M. Ganjoo, C. D. Manning, and A. Ng. Zero-shot learning through cross-modal transfer. In Advances in Neural Information Processing Systems, pages 935–943, 2013.
[84] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15:1929–1958, 2014.
[85] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. J. Goodfellow, and R. Fergus. Intriguing properties of neural networks. CoRR, abs/1312.6199, 2013.
[86] J. B. Tenenbaum and T. L. Griffiths. Generalization, similarity, and Bayesian inference. Behavioral and Brain Sciences, 24(4):629–640, 2001.
[87] J. B. Tenenbaum, C. Kemp, T. L. Griffiths, and N. D. Goodman. How to grow a mind: Statistics, structure, and abstraction. Science, 331(6022):1279–1285, 2011.
[88] J. B. Tenenbaum, C. Kemp, and P. Shafto. Theory-based Bayesian models of inductive learning and reasoning. In Trends in Cognitive Sciences, pages 309–318, 2006.
[89] D. G. R. Tervo, J. B. Tenenbaum, and S. J. Gershman. Toward the neural implementation of structure learning. Current Opinion in Neurobiology, 37:99–105, 2016.
[90] C. Tessler, S. Givony, T. Zahavy, D. J. Mankowitz, and S. Mannor. A deep hierarchical approach to lifelong learning in Minecraft. In AAAI, pages 1553–1561, 2017.
[91] S. Thrun and L. Pratt. Learning to Learn. Springer Science & Business Media, 2012.
[92] A. M. Turing. Computing machinery and intelligence. Mind, 59(236):433–460, 1950.
[93] X. Wang and A. Gupta. Unsupervised learning of visual representations using videos. In Proceedings of the IEEE International Conference on Computer Vision, pages 2794–2802, 2015.
[94] X. Wang, K. He, and A. Gupta. Transitive invariance for self-supervised visual representation learning. arXiv preprint arXiv:1708.02901, 2017.
[95] T. Winograd. Thinking machines: Can there be? Are we?, volume 200. University of California Press, Berkeley, 1991.
[96] J. Yang, R. Yan, and A. G. Hauptmann. Adapting SVM classifiers to data with shifted distributions. In Data Mining Workshops, 2007. ICDM Workshops 2007. Seventh IEEE International Conference on, pages 69–76. IEEE, 2007.
[97] A. R. Zamir, T. Wekel, P. Agrawal, C. Wei, J. Malik, and S. Savarese. Generic 3D representation via pose estimation and matching. In European Conference on Computer Vision, pages 535–553. Springer, 2016.
[98] A. R. Zamir, F. Xia, J. He, A. Sax, J. Malik, and S. Savarese. Gibson Env: Real-world perception for embodied agents. In 2018 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2018.
[99] C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals. Understanding deep learning requires rethinking generalization. CoRR, abs/1611.03530, 2016.
[100] R. Zhang, P. Isola, and A. A. Efros. Colorful image colorization. In European Conference on Computer Vision, pages 649–666. Springer, 2016.
[101] B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, and A. Oliva. Learning deep features for scene recognition using Places database. In Advances in Neural Information Processing Systems, pages 487–495, 2014.