Provable Meta-Learning of Linear Representations
Theorem 1 (Informal). Suppose we are given n1 total samples from t diverse and normalized tasks which are used in Algorithm 1 to learn a feature representation B̂, and n2 samples from a new (t + 1)st task which are used along with B̂ and Algorithm 2 to learn the parameters α̂ of this new (t + 1)st task. Then, the parameter B̂α̂ has the following excess prediction error on a new test point x⋆ drawn from the training data covariate distribution:

E_{x⋆}[⟨x⋆, B̂α̂ − Bαt+1⟩²] ≤ Õ( dr/n1 + r/n2 ),   (2)

with high probability over the training data.

The naive complexity of linear regression, which ignores the information from the previous t tasks, is O(d/n2). Theorem 1 suggests that "positive" transfer from the first {1, . . . , t} tasks to the final (t + 1)st task can dramatically reduce the sample complexity of learning when r ≪ d and n1/n2 ≳ r²; that is, when (1) the complexity of the shared representation is much smaller than the dimension of the underlying space and (2) the ratio of the number of samples used for feature learning to the number of samples present for a new unseen task exceeds the complexity of the shared representation. We believe that the LTL bound in Theorem 1 is the first bound, even in the linear setting, to sharply exhibit this phenomenon (see Section 1.1 for a detailed comparison to existing results). Prior work provides rates for which the leading term in (2) decays as ∼ 1/√t, not as ∼ 1/n1. We identify structural conditions on the design of the covariates and diversity of the tasks that allow our algorithms to take full advantage of all samples available when learning the shared features.
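To make the trade-off in (2) concrete, the following plain-Python sanity check (our own illustration; constants, logarithmic factors, and conditioning terms are ignored, so the numbers are only indicative) compares the meta-learning rate dr/n1 + r/n2 against the single-task baseline d/n2 for one representative setting.

# Indicative comparison of the rate in (2) with the single-task baseline d/n2
# (constants, logarithmic factors, and conditioning terms are ignored).
d, r = 100, 5                  # ambient dimension and feature dimension
t, n_per_task = 100, 50
n1 = t * n_per_task            # total samples used to learn the representation
n2 = 25                        # samples from the new (t + 1)st task

meta_rate = d * r / n1 + r / n2      # dr/n1 + r/n2, as in Theorem 1
baseline_rate = d / n2               # linear regression on the new task alone

print(round(meta_rate, 3), baseline_rate)   # 0.3 vs. 4.0: here r << d and n1/n2 = 200 >> r^2, so transfer helps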
Our primary contributions in this paper are to:

• Establish that all local minimizers of the (regularized) empirical risk induced by (1) are close to the true linear representation up to a small statistical error. This provides strong evidence that first-order algorithms, such as gradient descent (Jin et al., 2017), can efficiently recover good feature representations (see Section 3.1).

• Provide a method-of-moments estimator which can efficiently aggregate information across multiple differing tasks to estimate B—even when it may be information-theoretically impossible to learn the parameters of any given task (see Section 3.2).

• Demonstrate the benefits and pitfalls of transferring learned representations to new, unseen tasks by analyzing the bias-variance trade-offs of the linear regression estimator based on a biased feature estimate (see Section 4).

• Develop an information-theoretic lower bound for the problem of feature learning, demonstrating that the aforementioned estimator is a close-to-optimal estimator of B, up to logarithmic and conditioning/eigenvalue factors in the matrix of task parameters (see Assumption 2). To our knowledge, this is the first information-theoretic lower bound for representation learning in the multi-task setting (see Section 5).

1.1. Related Work

While there is a vast literature of papers proposing multi-task and transfer learning methods, the number of theoretical investigations is much smaller. An important early contribution is due to Baxter (2000), who studied a model where tasks with shared representations are sampled from the same underlying environment. Pontil & Maurer (2013) and Maurer et al. (2016), using tools from empirical process theory, developed a generic and powerful framework to prove generalization bounds in multi-task and learning-to-learn settings that are related to ours. Indeed, the closest guarantee to our Theorem 1 that we are aware of is Maurer et al. (2016, Theorem 5). Instantiated in our setting, Maurer et al. (2016, Theorem 5) provides an LTL guarantee showing that the excess risk of the loss function with the learned representation on a new datapoint is bounded by Õ(√(rd)/√t + √(r/n2)), with high probability. There are several principal differences between our work and results of this kind. First, we address the algorithmic component (or computational aspect) of meta-learning, while the previous theoretical literature generally assumes access to a global empirical risk minimizer (ERM). Computing the ERM in these settings requires solving a nonconvex optimization problem that is in general NP-hard. Second, in contrast to Maurer et al. (2016), we also provide guarantees for feature recovery in terms of the parameter estimation error—measured directly in a distance on the feature space.

Third, and most importantly, in Maurer et al. (2016) the leading term capturing the complexity of learning the feature representation decays only in t but not in n1 (which is typically much larger than t). Although, as they remark, the 1/√t scaling they obtain is in general unimprovable in their setting, our results leverage assumptions on the distributional similarity between the underlying covariates x and the potential diversity of tasks to improve this scaling to 1/n1. That is, our algorithms make use of all the samples in the feature learning phase. We believe that for many settings (including the linear model that is our focus) such assumptions are natural and that our rates reflect the practical efficacy of meta-learning techniques. Indeed, transfer learning is often successful even when we are presented with only a few training tasks but with each having a significant number of samples per task (e.g., n1 ≫ t).³

³ See Fig. 3 for a numerical simulation relevant to this setting.
There has also been a line of recent work providing guarantees for gradient-based meta-learning (MAML) (Finn et al., 2017). Finn et al. (2019), Khodak et al. (2019a;b), and Denevi et al. (2019) work in the framework of online convex optimization (OCO) and use a notion of (potentially data-dependent) task similarity that assumes closeness of all tasks to a single fixed point in parameter space to provide guarantees. In contrast to this work, we focus on the setting of learning a representation common to all tasks in a generative model. The task model parameters need not be close together in our setting.

In concurrent work, Du et al. (2020) obtain results similar to ours for multi-task linear regression and provide comparable guarantees for a two-layer ReLU network using a notion of training task diversity akin to ours. Their generalization bounds use a distributional assumption over meta-test tasks, while our bounds for linear regression are sharp for fixed meta-test tasks. Moreover, their focus is on purely statistical guarantees—they assume access to an ERM oracle for nonconvex optimization problems. Our focus is on providing statistical rates for efficient algorithmic procedures (i.e., the method-of-moments estimator and local minima reachable by gradient descent). Finally, we also show a (minimax) lower bound for the problem of feature recovery (i.e., recovering B).

2. Preliminaries

Throughout, we will use bold lower-case letters (e.g., x) to refer to vectors and bold upper-case letters to refer to matrices (e.g., X). We exclusively use B ∈ R^{d×r} to refer to a matrix with orthonormal columns spanning an r-dimensional feature space, and B⊥ to refer to a matrix with orthonormal columns spanning the orthogonal complement of this feature space. The norm ‖·‖ applied to a vector or matrix refers to its ℓ2 norm or spectral norm, respectively. The notation ‖·‖F refers to the Frobenius norm. ⟨x, y⟩ is the Euclidean inner product, while ⟨M, N⟩ = tr(MNᵀ) is the inner product between matrices. Generically, we will use "hatted" vectors and matrices (e.g., α̂ and B̂) to refer to (random) estimators of their underlying population quantities. We will use ≳, ≲, and ≍ to denote greater than, less than, and equal to up to a universal constant, and use Õ to denote an expression that hides polylogarithmic factors in all problem parameters. Our use of O, Ω, and Θ is otherwise standard.

Formally, an orthonormal feature matrix B is an element of an equivalence class (under right rotation) of a representative lying in Gr_{r,d}(R)—the Grassmann manifold (Edelman et al., 1998). The Grassmann manifold, which we denote as Gr_{r,d}(R), consists of the set of r-dimensional subspaces within an underlying d-dimensional space. To define distance in Gr_{r,d}(R) we use the notion of a principal angle between two subspaces p and q. If E is an orthonormal matrix whose columns form an orthonormal basis of p and F is an orthonormal matrix whose columns form an orthonormal basis of q, then a singular value decomposition of EᵀF = UDVᵀ defines the principal angles as:

D = diag(cos θ1, cos θ2, . . . , cos θk), where 0 ≤ θk ≤ . . . ≤ θ1 ≤ π/2.

The distance of interest for us will be the subspace angle distance sin θ1, and for convenience we will use the shorthand sin θ(E, F) to refer to it. With some abuse of notation we will use B to refer to both an explicit orthonormal feature matrix and the subspace in Gr_{r,d}(R) it represents.
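As a concrete reference for this notion of distance, the following numpy sketch computes sin θ(E, F) from the singular values of EᵀF, exactly as defined above; the function name is ours and this is an illustration, not code from the paper's released implementation.

import numpy as np

def subspace_angle_distance(E, F):
    # E, F: (d, k) matrices with orthonormal columns.
    # The singular values of E^T F are the cosines of the principal angles,
    # returned in descending order, so the last one is cos(theta_1).
    cosines = np.linalg.svd(E.T @ F, compute_uv=False)
    cos_theta1 = np.clip(cosines[-1], 0.0, 1.0)
    return np.sqrt(1.0 - cos_theta1 ** 2)        # sin(theta_1)

# Example: distance between two random r-dimensional subspaces of R^d.
rng = np.random.default_rng(0)
d, r = 10, 3
E, _ = np.linalg.qr(rng.standard_normal((d, r)))
F, _ = np.linalg.qr(rng.standard_normal((d, r)))
print(subspace_angle_distance(E, F))             # lies in [0, 1]; equals 0 iff the subspaces coincide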
We now detail several assumptions we use in our analysis.

Assumption 1 (Sub-Gaussian Design and Noise). The i.i.d. design vectors xi are zero mean with covariance E[xxᵀ] = Id and are Id-sub-gaussian, in the sense that E[exp(vᵀxi)] ≤ exp(‖v‖²/2) for all v. Moreover, the additive noise variables εi are i.i.d. sub-gaussian with variance parameter 1 and are independent of xi.

Throughout, we work in the setting of random design linear regression, and in this context Assumption 1 is standard. Our results do not critically rely on the identity covariance assumption, although its use simplifies several technical arguments. In the following we define the population task diversity matrix as A = (α1, . . . , αt)ᵀ ∈ R^{t×r}, ν = σr(AᵀA/t), the average condition number as κ̄ = tr(AᵀA/t)/(rν), and the worst-case condition number as κ = σ1(AᵀA/t)/ν.

Assumption 2 (Task Diversity and Normalization). The t underlying task parameters αj satisfy ‖αj‖ = Θ(1) for all j ∈ {1, . . . , t}. Moreover, we assume ν > 0.

Recovering the feature matrix B is impossible without structural conditions on A. Consider the extreme case in which α1, . . . , αt are restricted to span only the first r − 1 columns of the column space of the feature matrix B. None of the data points (xi, yi) contain any information about the rth column-feature, which can be an arbitrary vector in the complementary d − (r − 1) dimensional subspace. In this case recovering B accurately is information-theoretically impossible. The parameters ν, κ̄, and κ capture how "spread out" the tasks αj are in the column space of B. The condition ‖αj‖ = Θ(1) is also standard in the statistical literature and is equivalent to normalizing the signal-to-noise ratio (snr) to be Θ(1).⁴ In linear models, the snr is defined as the square of the ℓ2 norm of the underlying parameter divided by the variance of the additive noise.

⁴ Note that for a well-conditioned population task diversity matrix where κ̄ ≤ κ ≤ O(1), our snr normalization enforces that tr(AᵀA/t) = Θ(1) and ν ≥ Ω(1/r).
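To make the quantities in Assumption 2 concrete, here is a small numpy sketch (our own illustration; the function name is not from the paper's codebase) computing ν, κ̄, and κ from the stacked task matrix A.

import numpy as np

def task_diversity_params(A):
    # A: (t, r) matrix whose rows are the task parameters alpha_1, ..., alpha_t.
    t, r = A.shape
    C = A.T @ A / t                        # r x r matrix A^T A / t
    eigs = np.linalg.eigvalsh(C)           # ascending eigenvalues sigma_r(C) <= ... <= sigma_1(C)
    nu = eigs[0]                           # nu = sigma_r(A^T A / t)
    kappa_bar = np.trace(C) / (r * nu)     # average condition number
    kappa = eigs[-1] / nu                  # worst-case condition number
    return nu, kappa_bar, kappa

# Example: diverse tasks alpha_j ~ N(0, I_r)/sqrt(r), so that ||alpha_j|| = Theta(1).
rng = np.random.default_rng(0)
t, r = 500, 5
A = rng.standard_normal((t, r)) / np.sqrt(r)
print(task_diversity_params(A))            # nu close to 1/r; kappa_bar and kappa close to 1 for large t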
Our overall approach to meta-learning of representations consists of two phases that we term "meta-train" and "meta-test". First, in the meta-train phase (see Section 3), we provide algorithms to learn the underlying linear representation from a set of diverse tasks. Second, in the meta-test phase (see Section 4) we show how to transfer these learned features to a new, unseen task to improve the sample complexity of learning. Detailed proofs of our main results can be found in the Appendix.

3. Meta-Train: Learning Linear Features

Here we address both the algorithmic and statistical challenges of provably learning the linear feature representation B.

3.1. Local Minimizers of the Empirical Risk

The remarkable practical success of first-order methods for training nonconvex optimization problems (including meta/multi-task learning objectives) motivates us to study the optimization landscape of the empirical risk induced by the model in (1). We show in this section that all local minimizers of a regularized version of the empirical risk recover the true linear representation up to a small statistical error.

Jointly learning the population parameters B and (α1, . . . , αt)ᵀ defined by (1) is reminiscent of a matrix sensing/completion problem. We leverage this connection for our analysis, building in particular on results from Ge et al. (2017). Throughout this section we assume that we are in a uniform task sampling model—at each iteration the task t(i) for the ith datapoint is uniformly sampled from the t underlying tasks. We first recast our problem in the language of matrices, by defining the matrix we hope to recover as M⋆ = (α1, . . . , αt)ᵀBᵀ ∈ R^{t×d}. Since rank(M⋆) = r, we let X⋆D⋆(Y⋆)ᵀ = SVD(M⋆), and denote U⋆ = X⋆(D⋆)^{1/2} ∈ R^{t×r}, V⋆ = Y⋆(D⋆)^{1/2} ∈ R^{d×r}. In this notation, the responses of the regression model are written as follows:

yi = ⟨e_{t(i)} xiᵀ, M⋆⟩ + εi.   (3)

To frame recovery as an optimization problem we consider the Burer-Monteiro factorization of the parameter M = UVᵀ where U ∈ R^{t×r} and V ∈ R^{d×r}. This motivates the following objective:

min_{U ∈ R^{t×r}, V ∈ R^{d×r}} f(U, V) = (2t/n) Σ_{i=1}^{n} (yi − ⟨e_{t(i)} xiᵀ, UVᵀ⟩)² + (1/2)‖UᵀU − VᵀV‖²_F.   (4)

The second term in (4) functions as a regularization to prevent solutions which send ‖U‖F → 0 while ‖V‖F → ∞, or vice versa. If the value of this objective (4) is small we might hope that an estimate of B can be extracted from the column space of the parameter V, since the column space of V⋆ spans the same subspace as B. Informally, our main result states that all local minima of the regularized empirical risk are in the neighborhood of the optimal V⋆, and have subspaces that approximate B well.
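For concreteness, a minimal numpy sketch of the objective in (4) follows. It is our own illustration (the function and variable names are ours), written so it could be handed to a generic first-order optimizer after flattening (U, V); it is not the paper's implementation.

import numpy as np

def objective(U, V, X, y, task_idx):
    # Empirical risk in (4): U is (t, r), V is (d, r), X is (n, d), y is (n,),
    # and task_idx[i] = t(i) is the task index of the ith sample.
    n = y.shape[0]
    t = U.shape[0]
    # <e_{t(i)} x_i^T, U V^T> = u_{t(i)}^T (V^T x_i): row-wise dot product of U[task_idx] and X V.
    preds = np.sum(U[task_idx] * (X @ V), axis=1)
    data_term = (2.0 * t / n) * np.sum((y - preds) ** 2)
    balance_term = 0.5 * np.linalg.norm(U.T @ U - V.T @ V, "fro") ** 2   # regularizer from (4)
    return data_term + balance_term

After running any off-the-shelf first-order method on this objective, an estimate B̂ can be taken as an orthonormal basis for the column space of the returned V, as discussed above.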
Before stating our result we define the constraint set, which contains incoherent matrices with reasonable scales, as follows:

W = { (U, V) : max_{i∈[t]} ‖e_iᵀU‖² ≤ C0 κ̄ r √(κν)/√t,  ‖U‖² ≤ C0 √(tκν),  ‖V‖² ≤ C0 √(tκν) },

for some large constant C0. Under Assumption 2, this set contains the optimal parameters. Note that U⋆ and V⋆ satisfy the final two constraints by definition, and Lemma 16 can be used to show that Assumption 2 actually implies that U⋆ is incoherent, which satisfies the first constraint. Our main result follows.

Theorem 2. Let Assumptions 1 and 2 hold in the uniform task sampling model. If the number of samples n1 satisfies n1 ≳ polylog(n1, d, t)(κr)⁴ max{t, d}, then, with probability at least 1 − 1/poly(d), we have that given any local minimum (U, V) ∈ int(W) of the objective (4), the column space of V, spanned by the orthonormal feature matrix B̂, satisfies:

sin θ(B̂, B) ≤ O( (1/ν) √( max{t, d} r log n1 / n1 ) ).

We make several comments on this result:

• The guarantee in Theorem 2 suggests that all local minimizers of the regularized empirical risk (4) will produce a linear representation at a distance at most Õ(√(max{t, d} r / n1)) from the true underlying representation. Theorem 5 guarantees that any estimator (including the empirical risk minimizer) must incur error ≳ √(dr/n1). Therefore, in the regime t ≤ O(d), all local minimizers are statistically close-to-optimal, up to logarithmic factors and conditioning/eigenvalue factors in the task diversity matrix.

• Combined with a recent line of results showing that (noisy) gradient descent can efficiently escape strict saddle points to find local minima (Jin et al., 2017), Theorem 2 provides strong evidence that first-order methods can efficiently meta-learn linear features.⁵

The proof of Theorem 2 is technical so we only sketch the high-level ideas. The overall strategy is to analyze the Hessian of the objective (4) at a stationary point (U, V) in int(W) and to exhibit a direction ∆ of negative curvature at any such point that is far from the true representation, ruling those points out as local minima.

⁵ To formally establish computational efficiency, we need to further verify the smoothness and the strict-saddle properties of the objective function (4) (see, e.g., Jin et al. (2017)).
Algorithm 1 MoM Estimator for Learning Linear Features
  Input: {(xi, yi)}_{i=1}^{n1}
  B̂ D1 B̂ᵀ ← top-r SVD of (1/n1) Σ_{i=1}^{n1} yi² xi xiᵀ
  return B̂

Algorithm 2 Linear Regression for Learning a New Task with a Feature Estimate
  Input: B̂, {(xi, yi)}_{i=1}^{n2}
  α̂ ← ( Σ_{i=1}^{n2} B̂ᵀ xi xiᵀ B̂ )† B̂ᵀ Σ_{i=1}^{n2} xi yi
  return α̂
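The following numpy sketch mirrors the two pseudocode boxes above; it is a minimal illustration (function names are ours), not the released implementation.

import numpy as np

def mom_feature_estimator(X, y, r):
    # Algorithm 1: B_hat spans the top-r eigenspace of (1/n1) * sum_i y_i^2 x_i x_i^T.
    n1 = y.shape[0]
    M = (X * (y ** 2)[:, None]).T @ X / n1
    _, eigvecs = np.linalg.eigh(M)           # eigenvalues in ascending order
    return eigvecs[:, -r:]                   # (d, r) orthonormal feature estimate B_hat

def meta_test_regression(B_hat, X_new, y_new):
    # Algorithm 2: ordinary least squares after projecting the covariates onto B_hat.
    Z = X_new @ B_hat
    alpha_hat, *_ = np.linalg.lstsq(Z, y_new, rcond=None)
    return alpha_hat                         # the prediction for x is <x, B_hat @ alpha_hat>

The second function is exactly the least-squares problem (5) below, solved in the r-dimensional space spanned by B̂.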
improve learning. In the context of the model in (1), the approach taken in Algorithm 2 uses B̂ as a plug-in surrogate for the unknown B, and attempts to estimate αt+1 ∈ R^r. Formally, we define our estimator α̂ as follows:

α̂ = argmin_α ‖y − XB̂α‖²,   (5)

where the n2 samples (X, y) are generated from the model in (1) for the (t + 1)st task. Effectively, the feature representation B̂ performs dimension reduction on the input covariates X, allowing us to learn in a lower-dimensional space. Our focus is on understanding the generalization properties of the estimator in Algorithm 2, since (5) is an ordinary least-squares objective which can be solved analytically.

Assuming we have produced an estimate B̂ of the true feature matrix B, we can present our main result on the sample complexity of meta-learned linear regression.

Theorem 4. Suppose n2 data points, {(xi, yi)}_{i=1}^{n2}, are generated from the model in (1), where Assumption 1 holds, from a single task satisfying ‖αt+1‖₂ ≤ O(1). Then, if sin θ(B̂, B) ≤ δ and n2 ≳ r log n2, the output α̂ from Algorithm 2 satisfies

‖B̂α̂ − Bαt+1‖² ≤ Õ( δ² + r/n2 ),   (6)

with probability at least 1 − O(n2^{−100}).

Note that Bαt+1 is simply the underlying parameter in the regression model in (1). We make several remarks about this result:

• Theorem 4 decomposes the error of transfer learning into two components. The first term, Õ(δ²), arises from the bias of using an imperfect feature estimate B̂ to transfer knowledge across tasks. The second term, Õ(r/n2), arises from the variance of learning in a space of reduced dimensionality.

• Standard generalization guarantees for random design linear regression ensure that the parameter recovery error is bounded by O(d/n2) w.h.p. under the same assumptions (Hsu et al., 2012). Meta-learning of the linear representation B̂ can provide a significant reduction in the sample complexity of learning when δ² ≪ d/n2 and r ≪ d.

• Conversely, if δ² ≫ d/n2, the bounds in (6) imply that the overhead of learning the feature representation may overwhelm the potential benefits of transfer learning (with respect to the baseline of learning the (t + 1)st task in isolation). This accords with the well-documented empirical phenomenon of "negative" transfer observed in large-scale deep learning problems, where meta/transfer-learning techniques actually result in a degradation in performance on new tasks (Wang et al., 2019). For diverse tasks (i.e., κ ≤ O(1)), using Algorithm 1 to estimate B̂ suggests that ensuring δ² ≪ d/n2, where δ² = Õ(dr/(νn1)), requires n1/n2 ≳ r/ν. That is, the ratio of the number of samples used for feature learning to the number of samples used for learning the new task should exceed the complexity of the feature representation to achieve "positive" transfer.

In order to obtain the rate in Theorem 4 we use a bias-variance analysis of the estimator error B̂α̂ − Bαt+1 (and do not appeal to uniform convergence arguments). Writing α0 = αt+1 and using the definition of y, we can write the error as

B̂α̂ − Bα0 = B̂(B̂ᵀXᵀXB̂)⁻¹B̂ᵀXᵀXBα0 − Bα0 + B̂(B̂ᵀXᵀXB̂)⁻¹B̂ᵀXᵀε.

The first term contributes the bias term to (6) while the second contributes the variance term. Analyzing the fluctuations of the (mean-zero) variance term can be done by controlling its squared norm, εᵀAε, where A = XB̂(B̂ᵀXᵀXB̂)⁻²B̂ᵀXᵀ. We can bound this (random) quadratic form by first appealing to the Hanson-Wright inequality to show w.h.p. that εᵀAε ≲ tr(A) + Õ(‖A‖F + ‖A‖). The remaining randomness in A can be controlled using matrix concentration/perturbation arguments (see Lemma 17).

With access to the true feature matrix B (i.e., setting B̂ = B), the term B(BᵀXᵀXB)⁻¹BᵀXᵀXBα0 − Bα0 = 0, due to the cancellation in the empirical covariance matrices, (BᵀXᵀXB)⁻¹BᵀXᵀXB = Ir. This cancellation of the empirical covariance is essential to obtaining a tight analysis of the least-squares estimator. We cannot rely on this effect in full since B̂ ≠ B, and a naive analysis which splits the terms (B̂ᵀXᵀXB̂)⁻¹ and B̂ᵀXᵀXB can lead to a large increase in the variance in the bound. To exploit the fact that B̂ ≈ B, we project the matrix B in the leading XB term onto the column space of B̂ and its complement—which allows a partial cancellation of the empirical covariances in the subspace spanned by B̂. The remaining variance can be controlled as in the previous term (see Lemma 18).
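The decomposition above is an exact algebraic identity whenever B̂ᵀXᵀXB̂ is invertible. The following self-contained numpy sketch (ours, purely illustrative) verifies it numerically on synthetic data of the kind described by Assumption 1.

import numpy as np

rng = np.random.default_rng(0)
d, r, n2 = 30, 3, 200
B, _ = np.linalg.qr(rng.standard_normal((d, r)))                   # true feature matrix
B_hat, _ = np.linalg.qr(B + 0.1 * rng.standard_normal((d, r)))     # imperfect feature estimate
alpha0 = rng.standard_normal(r) / np.sqrt(r)                       # new-task parameter alpha_{t+1}

X = rng.standard_normal((n2, d))
eps = rng.standard_normal(n2)
y = X @ B @ alpha0 + eps

G = np.linalg.inv(B_hat.T @ X.T @ X @ B_hat)
alpha_hat = G @ B_hat.T @ X.T @ y                                  # Algorithm 2 estimate

bias_term = B_hat @ G @ B_hat.T @ X.T @ X @ B @ alpha0 - B @ alpha0
variance_term = B_hat @ G @ B_hat.T @ X.T @ eps
assert np.allclose(bias_term + variance_term, B_hat @ alpha_hat - B @ alpha0)
print(np.linalg.norm(bias_term), np.linalg.norm(variance_term))    # bias vs. variance contributions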
5. Lower Bounds for Feature Learning

To complement the upper bounds provided in the previous section, in this section we derive information-theoretic limits for feature learning in the model (1). To our knowledge, these results provide the first sample-complexity lower bounds for feature learning, with regard to subspace recovery, in the multi-task setting. While there is existing literature on (minimax)-optimal estimation of low-rank matrices (see, for example, Rohde et al. (2011)), that work focuses on the (high-dimensional) estimation of matrices whose only constraint is to be low rank. Moreover, error is measured in the additive prediction norm. In our setting, we must handle the additional difficulties arising from the fact that we are interested in (1) learning a column space (i.e., an element of Gr_{r,d}(R)) and (2) the error between such representatives is measured in the subspace angle distance. We begin by presenting our lower bound for feature recovery.

Theorem 5. Suppose a total of n data points are generated from the model in (1) satisfying Assumption 1 with xi ∼ N(0, Id), εi ∼ N(0, 1), with an equal number from each task, and that Assumption 2 holds with αj for each task normalized to ‖αj‖ = 1. Then, there are αj for r ≤ d/2 and n ≥ max{1/(8ν), r(d − r)} so that:

inf_{B̂} sup_{B ∈ Gr_{r,d}(R)} sin θ(B̂, B) ≥ Ω( max( √(1/ν) √(1/n), √(dr/n) ) ),

with probability at least 1/4, where the infimum is taken over the class of estimators that are functions of the n data points.

Again we make several comments on the result.

• The result of Theorem 5 shows that the estimator in Algorithm 1 provides a close-to-optimal estimator of the feature representation parameterized by B, up to logarithmic and conditioning factors (i.e., κ, ν)⁶ in the task diversity matrix, that is independent of the task number t. Note that under the normalization for αi, as κ → ∞ (i.e., the task matrix A becomes ill-conditioned) we have that ν → 0. So the first term in Theorem 5 establishes that task diversity is necessary for recovery of the subspace B.

• The dimension of Gr_{r,d}(R) (i.e., the number of free parameters needed to specify a feature set) is r(d − r) ≥ Ω(dr) for d/2 ≥ r; hence the second term in Theorem 5 matches the scaling that we intuit from parameter counting.

• Obtaining tight dependence of our subspace recovery bounds on conditioning factors in the task diversity matrix (i.e., κ, ν) is an important and challenging research question. We believe the gap in conditioning/eigenvalue factors between Theorem 3 and Theorem 5 on the √(dr/n) term is related to a problem that persists for classical estimators in linear regression (e.g., for the Lasso estimator in sparse linear regression). Even in this setting, a gap remains with respect to condition number/eigenvalue factors of the data design matrix X between existing upper and lower bounds (see Chen et al. (2016, Section 7), Raskutti et al. (2011, Theorem 1, Theorem 2) and Zhang et al. (2014) for example). In our setting the task diversity matrix A enters into the problem in a similar fashion to the data design matrix X in these aforementioned settings.

The dependency on the task diversity parameter 1/ν (the first term in Theorem 5) is achieved by constructing a pair of feature matrices, together with an ill-conditioned task matrix A, such that the resulting data cannot discern the direction along which the feature matrices differ. The proof strategy to capture the second term uses an f-divergence based minimax technique from Guntuboyina (2011) (restated in Lemma 20 in the Appendix), similar in spirit to the global Fano (or Yang-Barron) method.

There are two key ingredients to using Lemma 20 and obtaining a tight lower bound. First, we must exhibit a large family of distinct, well-separated feature matrices {Bi}_{i=1}^{M} (i.e., a packing at scale η). Second, we must argue that this set of feature matrices induces a family of distributions over data {(xi, yi)}_{Bi} which are statistically "close" and fundamentally difficult to distinguish amongst. This is captured by the fact that the ε-covering number, measured in the space of distributions with divergence measure Df(·, ·), is small. The standard (global) Fano method, or Yang-Barron method (see Wainwright (2019, Ch. 15)), which uses the KL divergence to measure distance in the space of measures, is known to provide rate-suboptimal lower bounds for parametric estimation problems.⁷ Our case is no exception. To circumvent this difficulty we use the framework of Guntuboyina (2011), instantiated with the f-divergence chosen as the χ²-divergence, to obtain a tight lower bound.

The argument proceeds in two steps. First, although the geometry of Gr_{r,d}(R) is complex, we can adapt results from Pajor (1998) to provide sharp upper/lower bounds on the metric entropy (or global entropy) of the Grassmann manifold (see Proposition 9). The second technical step of the argument hinges on the ability to cover the space of distributions parametrized by B in the space of measures {PB : B ∈ Gr_{r,d}(R)}—with distance measured by an appropriate f-divergence. In order to establish a covering in the space of measures parametrized by B, the key step is to bound the distance χ²(P_{B1}, P_{B2}) for two different measures over data generated from the model (1) with two different feature matrices B1 and B2 (see Lemma 21). This control can be achieved in our random design setting by exploiting the Gaussianity of the marginals over the data X and the Gaussianity of the conditionals of y | X, B, and can ultimately be expressed as a function of the angular distance between B1 and B2.

⁶ Note that in the setting where κ ≤ O(1), ν ∼ 1/r.

⁷ Even for the simple problem of Gaussian mean estimation the classical Yang-Barron method is suboptimal; see Guntuboyina (2011) for more details.
6. Simulations

We complement our theoretical analysis with a series of numerical experiments highlighting the benefits (and limits) of meta-learning.⁸ For the purposes of feature learning we compare the performance of the method-of-moments estimator in Algorithm 1 against directly optimizing the objective in (4). Additional details on our set-up are provided in Appendix G. We construct problem instances by generating Gaussian covariates and noise as xi ∼ N(0, Id), εi ∼ N(0, 1), and the tasks and features used for the first-stage feature estimation as αi ∼ (1/√r) · N(0, Ir), with B generated as a (uniform) random r-dimensional subspace of R^d. In all our experiments we generate an equal number of samples nt for each of the t tasks, so n1 = t · nt. In the second stage we generate a new, (t + 1)st task instance using the same feature matrix B as in the first stage, and otherwise generate n2 samples with the covariates, noise, and αt+1 constructed as before. Throughout this section we refer to features learned via a first-order gradient method as LF-FO and the corresponding meta-learned regression estimator as meta-LR-FO (LF-MoM and meta-LR-MoM denote the method-of-moments counterparts). We also use LR to refer to the baseline linear regression estimator on a new task, which only uses data generated from that task.

⁸ An open-source Python implementation to reproduce our experiments can be found at https://github.com/nileshtrip/MTL.
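The following numpy sketch condenses this data-generating process and the two-stage pipeline (method-of-moments features, then meta-test regression) into a few lines; it is our own illustration, not the released implementation linked in the footnote.

import numpy as np

rng = np.random.default_rng(0)
d, r, t, n_t, n2 = 100, 5, 200, 25, 25
B, _ = np.linalg.qr(rng.standard_normal((d, r)))      # random r-dimensional subspace of R^d
alphas = rng.standard_normal((t, r)) / np.sqrt(r)     # alpha_j ~ N(0, I_r) / sqrt(r)

# Meta-train data: n_t samples per task, n1 = t * n_t in total.
X = rng.standard_normal((t * n_t, d))
task = np.repeat(np.arange(t), n_t)
theta = alphas[task] @ B.T                            # row i equals (B alpha_{t(i)})^T
y = np.sum(X * theta, axis=1) + rng.standard_normal(t * n_t)

# Stage 1: method-of-moments feature estimate (Algorithm 1).
M = (X * (y ** 2)[:, None]).T @ X / len(y)
B_hat = np.linalg.eigh(M)[1][:, -r:]

# Stage 2: a new, data-scarce task (n2 < d), regression with and without the learned features.
alpha_new = rng.standard_normal(r) / np.sqrt(r)
X_new = rng.standard_normal((n2, d))
y_new = X_new @ B @ alpha_new + rng.standard_normal(n2)
alpha_hat, *_ = np.linalg.lstsq(X_new @ B_hat, y_new, rcond=None)   # Algorithm 2 (meta-LR-MoM)
beta_lr, *_ = np.linalg.lstsq(X_new, y_new, rcond=None)             # LR baseline (min-norm solution)

print(np.linalg.norm(B_hat @ alpha_hat - B @ alpha_new),            # meta-learned parameter error
      np.linalg.norm(beta_lr - B @ alpha_new))                      # baseline parameter error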
We begin by considering a challenging setting for feature learning where d = 100, r = 5, but nt = 5 for varying numbers of tasks t. As Fig. 1 demonstrates, the method-of-moments estimator is able to aggregate information across the tasks as t increases to slowly improve its feature estimate, even though nt ≪ d. The loss-based approach struggles to improve its estimate of the feature matrix B in this regime. This accords with the extra t dependence in Theorem 2 relative to Theorem 3. In this setting, we also generated a (t + 1)st test task with d ≪ n2 = 2500, to test the effect of meta-learning the linear representation on generalization in a new, unseen task against a baseline which simply performs a regression on this new task in isolation. Fig. 1 also shows that meta-learned regressions perform significantly worse than simply ignoring the first t tasks. Theorem 4 indicates that the bias from the inability to learn an accurate feature estimate of B overwhelms the benefits of transfer learning. In this regime n2 ≫ d, so the new task can be efficiently learned in isolation. We believe this simulation represents a simple instance of the empirically observed phenomenon of "negative" transfer (Wang et al., 2019).

Figure 1. Left: LF-FO vs. LF-MoM estimator with error measured in the subspace angle distance sin θ(B̂, B). Right: meta-LR-FO and meta-LR-MoM vs. LR on the new task with error measured on the new task parameter. Here d = 100, r = 5, and nt = 5 while n2 = 2500, as the number of tasks is varied.

We now turn to the more interesting use cases where meta-learning is a powerful tool. We consider a setting where d = 100, r = 5, and nt = 25 for varying numbers of tasks t. However, now we consider a new, unseen task where data is scarce: n2 = 25 < d. As Fig. 2 shows, in this regime both the method-of-moments estimator and the loss-based approach can learn a non-trivial estimate of the feature representation. The benefits of transferring this representation are also evident in the improved generalization performance seen by the meta-regression procedures on the new task. Interestingly, the loss-based approach learns an accurate feature representation B̂ with significantly fewer samples than the method-of-moments estimator, in contrast to the previous experiment. Finally, we consider an instance where d = 100, r = 5, t = 20, and n2 = 50 with varying numbers of training points nt per task. We see in Fig. 3 that meta-learning of representations provides significant value in a new task. Note that these numerical experiments show that, as the number of tasks is fixed but nt increases, the generalization ability of the meta-learned regressions significantly improves, as reflected in the bound (2).

Figure 2. Left: LF-FO vs. LF-MoM estimator with error measured in the subspace angle distance sin θ(B̂, B). Right: meta-LR-FO and meta-LR-MoM vs. LR on the new task with error measured on the new task parameter. Here d = 100, r = 5, and nt = 25 while n2 = 25, as the number of tasks is varied.

7. Conclusions

In this paper we show how a shared linear representation may be efficiently learned and transferred between multiple linear regression tasks. We provide both upper and lower bounds on the sample complexity of learning this representation and for the problem of learning-to-learn. We believe our bounds capture important qualitative phenomena observed in real meta-learning applications absent from previous theoretical treatments.
[Figure 3. Left: sin θ(B̂, B) for LF-FO and LF-MoM. Right: ℓ2 parameter error for meta-LR-FO, meta-LR-MoM, and LR, as nt varies.]

References

Baxter, J. A model of inductive bias learning. Journal of Artificial Intelligence Research, 12:149–198, 2000.

Bhatia, R. Matrix Analysis, volume 169. Springer Science & Business Media, 2013.

Candes, E. and Plan, Y. Tight oracle bounds for low-rank matrix recovery from a minimal number of noisy random measurements. arXiv preprint arXiv:1001.0339, 2010.

Chen, X., Guntuboyina, A., and Zhang, Y. On Bayes risk lower bounds. The Journal of Machine Learning Research, 17(1):7687–7744, 2016.

Denevi, G., Ciliberto, C., Grazzi, R., and Pontil, M. Learning-to-learn stochastic gradient descent with biased regularization. arXiv preprint arXiv:1903.10399, 2019.

Donahue, J., Jia, Y., Vinyals, O., Hoffman, J., Zhang, N., Tzeng, E., and Darrell, T. DeCAF: A deep convolutional activation feature for generic visual recognition. In International Conference on Machine Learning, pp. 647–655, 2014.

Du, S. S., Hu, W., Kakade, S. M., Lee, J. D., and Lei, Q. Few-shot learning via learning the representation, provably. arXiv preprint arXiv:2002.09434, 2020.

Edelman, A., Arias, T. A., and Smith, S. T. The geometry of algorithms with orthogonality constraints. SIAM Journal on Matrix Analysis and Applications, 20(2):303–353, 1998.

Khodak, M., Balcan, M.-F., and Talwalkar, A. Provable guarantees for gradient-based meta-learning. arXiv preprint arXiv:1902.10644, 2019a.

Khodak, M., Balcan, M.-F. F., and Talwalkar, A. S. Adaptive gradient-based meta-learning methods. In Advances in Neural Information Processing Systems, pp. 5915–5926, 2019b.

Liu, D. C. and Nocedal, J. On the limited memory BFGS method for large scale optimization. Mathematical Programming, 45(1-3):503–528, 1989.

Liu, X., He, P., Chen, W., and Gao, J. Multi-task deep neural networks for natural language understanding. arXiv preprint arXiv:1901.11504, 2019.

Maclaurin, D., Duvenaud, D., and Adams, R. P. Autograd: Effortless gradients in numpy. In ICML 2015 AutoML Workshop, volume 238, 2015.

Maurer, A., Pontil, M., and Romera-Paredes, B. The benefit of multitask representation learning. The Journal of Machine Learning Research, 17(1):2853–2884, 2016.

Moritz, P., Nishihara, R., Wang, S., Tumanov, A., Liaw, R., Liang, E., Elibol, M., Yang, Z., Paul, W., Jordan, M. I., et al. Ray: A distributed framework for emerging {AI} applications. In 13th {USENIX} Symposium on