
Provable Meta-Learning of Linear Representations

Nilesh Tripuraneni 1 Chi Jin 2 Michael I. Jordan 1

Abstract

Meta-learning, or learning-to-learn, seeks to design algorithms that can utilize previous experience to rapidly learn new skills or adapt to new environments. Representation learning—a key tool for performing meta-learning—learns a data representation that can transfer knowledge across multiple tasks, which is essential in regimes where data is scarce. Despite a recent surge of interest in the practice of meta-learning, the theoretical underpinnings of meta-learning algorithms are lacking, especially in the context of learning transferable representations. In this paper, we focus on the problem of multi-task linear regression—in which multiple linear regression models share a common, low-dimensional linear representation. Here, we provide provably fast, sample-efficient algorithms to address the dual challenges of (1) learning a common set of features from multiple, related tasks, and (2) transferring this knowledge to new, unseen tasks. Both are central to the general problem of meta-learning. Finally, we complement these results by providing information-theoretic lower bounds on the sample complexity of learning these linear features.

1. Introduction

The ability of a learner to transfer knowledge between tasks is crucial for robust, sample-efficient inference and prediction. One of the most well-known examples of such transfer learning has been in few-shot image classification, where the idea is to initialize neural network weights in early layers using ImageNet pre-training/features, and subsequently re-train the final layers on a new task (Donahue et al., 2014; Vinyals et al., 2016). However, the need for methods that can learn data representations that generalize to multiple, unseen tasks has also become vital in other applications, ranging from deep reinforcement learning (Baevski et al., 2019) to natural language processing (Ando & Zhang, 2005; Liu et al., 2019). Accordingly, researchers have begun to highlight the need to develop (and understand) generic algorithms for transfer (or meta) learning applicable in diverse domains (Finn et al., 2017). Surprisingly, however, despite a long line of work on transfer learning, there is limited theoretical characterization of the underlying problem. Indeed, there are few efficient algorithms for feature learning that provably generalize to new, unseen tasks. Sharp guarantees are even lacking in the linear setting.

In this paper, we study the problem of meta-learning of features in a linear model in which multiple tasks share a common set of low-dimensional features. Our aim is twofold. First, we ask: given a set of diverse samples from t different tasks, how can we efficiently (and optimally) learn a common feature representation? Second, having learned a common feature representation, how can we use this representation to improve sample efficiency in a new (t+1)st task where data may be scarce?¹

Formally, given an (unobserved) linear feature matrix B = (b_1, ..., b_r) ∈ R^{d×r} with orthonormal columns, our statistical model for data pairs (x_i, y_i) is:

    y_i = x_i^⊤ B α_{t(i)} + ε_i,    β_{t(i)} = B α_{t(i)},    (1)

where there are t (unobserved) underlying task parameters α_j for j ∈ {1, ..., t}. Here t(i) ∈ {1, ..., t} is the index of the task associated with the ith datapoint, x_i ∈ R^d is a random covariate, and ε_i is additive noise. We assume the sequence {α_{t(i)}}_{i=1}^∞ is independent of all other randomness in the problem. In this framework, the aforementioned questions reduce to recovering B from data from the first {1, ..., t} tasks, and using this feature representation to recover a better estimate of a new task parameter, β_{t+1} = B α_{t+1}, where α_{t+1} is also unobserved.

Our main result targets the problem of learning-to-learn (LTL), and shows how a feature representation B̂ learned from t diverse tasks can improve learning on an unseen (t+1)st task which shares the same underlying linear representation. We informally state this result below.²

Affiliations: 1 Department of EECS, University of California, Berkeley; 2 Department of Electrical Engineering, Princeton University. Correspondence to: Nilesh Tripuraneni <nilesh [email protected]>.

Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s).

Footnote 1: This is sometimes referred to as learning-to-learn (LTL).
Footnote 2: Theorem 1 follows immediately from combining Theorems 3 and 4; see Theorem 6 for a formal statement.
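To make the data-generating process in (1) concrete, here is a minimal NumPy sketch of the statistical model. The function name and the particular values of d, r, t, and the per-task sample count are illustrative choices (mirroring the style of the simulations in Section 6), not prescriptions from the analysis.

```python
import numpy as np

def sample_multitask_data(d=100, r=5, t=20, n_per_task=25, seed=0):
    """Simulate y_i = <x_i, B alpha_{t(i)}> + eps_i from Eq. (1).

    B is a d x r matrix with orthonormal columns (a uniformly random
    r-dimensional subspace), and each task j has its own coefficient
    vector alpha_j in R^r.
    """
    rng = np.random.default_rng(seed)
    # Random orthonormal feature matrix B.
    B, _ = np.linalg.qr(rng.standard_normal((d, r)))
    # Task parameters alpha_j, scaled so that ||B alpha_j|| = Theta(1).
    alphas = rng.standard_normal((t, r)) / np.sqrt(r)
    n = t * n_per_task
    task_idx = np.repeat(np.arange(t), n_per_task)  # t(i) for each datapoint
    X = rng.standard_normal((n, d))                 # isotropic Gaussian covariates
    eps = rng.standard_normal(n)                    # unit-variance additive noise
    y = np.einsum("ij,ij->i", X @ B, alphas[task_idx]) + eps
    return X, y, task_idx, B, alphas

# Example: 20 training tasks with 25 samples each.
X, y, task_idx, B, alphas = sample_multitask_data()
```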

Theorem 1 (Informal). Suppose we are given n_1 total samples from t diverse and normalized tasks which are used in Algorithm 1 to learn a feature representation B̂, and n_2 samples from a new (t+1)st task which are used along with B̂ and Algorithm 2 to learn the parameters α̂ of this new (t+1)st task. Then, the parameter B̂α̂ has the following excess prediction error on a new test point x_⋆ drawn from the training data covariate distribution:

    E_{x_⋆}[⟨x_⋆, B̂α̂ − Bα_{t+1}⟩²] ≤ Õ( dr²/n_1 + r/n_2 ),    (2)

with high probability over the training data.

The naive complexity of linear regression which ignores the information from the previous t tasks is O(d/n_2). Theorem 1 suggests that "positive" transfer from the first {1, ..., t} tasks to the final (t+1)st task can dramatically reduce the sample complexity of learning when r ≪ d and n_1/n_2 ≫ r²; that is, when (1) the complexity of the shared representation is much smaller than the dimension of the underlying space and (2) when the ratio of the number of samples used for feature learning to the number of samples present for a new unseen task exceeds the complexity of the shared representation. We believe that the LTL bound in Theorem 1 is the first bound, even in the linear setting, to sharply exhibit this phenomenon (see Section 1.1 for a detailed comparison to existing results). Prior work provides rates for which the leading term in (2) decays as ∼ 1/√t, not as ∼ 1/n_1. We identify structural conditions on the design of the covariates and diversity of the tasks that allow our algorithms to take full advantage of all samples available when learning the shared features. Our primary contributions in this paper are to:

• Establish that all local minimizers of the (regularized) empirical risk induced by (1) are close to the true linear representation up to a small statistical error. This provides strong evidence that first-order algorithms, such as gradient descent (Jin et al., 2017), can efficiently recover good feature representations (see Section 3.1).

• Provide a method-of-moments estimator which can efficiently aggregate information across multiple differing tasks to estimate B—even when it may be information-theoretically impossible to learn the parameters of any given task (see Section 3.2).

• Demonstrate the benefits and pitfalls of transferring learned representations to new, unseen tasks by analyzing the bias-variance trade-offs of the linear regression estimator based on a biased feature estimate (see Section 4).

• Develop an information-theoretic lower bound for the problem of feature learning, demonstrating that the aforementioned estimator is a close-to-optimal estimator of B, up to logarithmic and conditioning/eigenvalue factors in the matrix of task parameters (see Assumption 2). To our knowledge, this is the first information-theoretic lower bound for representation learning in the multi-task setting (see Section 5).

1.1. Related Work

While there is a vast literature on papers proposing multi-task and transfer learning methods, the number of theoretical investigations is much smaller. An important early contribution is due to Baxter (2000), who studied a model where tasks with shared representations are sampled from the same underlying environment. Pontil & Maurer (2013) and Maurer et al. (2016), using tools from empirical process theory, developed a generic and powerful framework to prove generalization bounds in multi-task and learning-to-learn settings that are related to ours. Indeed, the closest guarantee to that in our Theorem 1 that we are aware of is Maurer et al. (2016, Theorem 5). Instantiated in our setting, Maurer et al. (2016, Theorem 5) provides an LTL guarantee showing that the excess risk of the loss function with the learned representation on a new datapoint is bounded by Õ( r√(d/t) + √(r/n_2) ), with high probability. There are several principal differences between our work and results of this kind. First, we address the algorithmic component (or computational aspect) of meta-learning, while the previous theoretical literature generally assumes access to a global empirical risk minimizer (ERM). Computing the ERM in these settings requires solving a nonconvex optimization problem that is in general NP-hard. Second, in contrast to Maurer et al. (2016), we also provide guarantees for feature recovery in terms of the parameter estimation error—measured directly in the distance in the feature space.

Third, and most importantly, in Maurer et al. (2016), the leading term capturing the complexity of learning the feature representation decays only in t but not in n_1 (which is typically much larger than t). Although, as they remark, the 1/√t scaling they obtain is in general unimprovable in their setting, our results leverage assumptions on the distributional similarity between the underlying covariates x and the potential diversity of tasks to improve this scaling to 1/n_1. That is, our algorithms make use of all the samples in the feature learning phase. We believe that for many settings (including the linear model that is our focus) such assumptions are natural and that our rates reflect the practical efficacy of meta-learning techniques. Indeed, transfer learning is often successful even when we are presented with only a few training tasks but with each having a significant number of samples per task (e.g., n_1 ≫ t).³

Footnote 3: See Fig. 3 for a numerical simulation relevant to this setting.

There has also been a line of recent work providing guarantees for gradient-based meta-learning (MAML) (Finn et al., 2017). Finn et al. (2019), Khodak et al. (2019a;b), and Denevi et al. (2019) work in the framework of online convex optimization (OCO) and use a notion of (a potentially data-dependent) task similarity that assumes closeness of all tasks to a single fixed point in parameter space to provide guarantees. In contrast to this work, we focus on the setting of learning a representation common to all tasks in a generative model. The task model parameters need not be close together in our setting.

In concurrent work, Du et al. (2020) obtain results similar to ours for multi-task linear regression and provide comparable guarantees for a two-layer ReLU network using a notion of training task diversity akin to ours. Their generalization bounds use a distributional assumption over meta-test tasks, while our bounds for linear regression are sharp for fixed meta-test tasks. Moreover, their focus is on purely statistical guarantees—they assume access to an ERM oracle for nonconvex optimization problems. Our focus is on providing statistical rates for efficient algorithmic procedures (i.e., the method-of-moments and local minima reachable by gradient descent). Finally, we also show a (minimax) lower bound for the problem of feature recovery (i.e., recovering B).

2. Preliminaries

Throughout, we will use bold lower-case letters (e.g., x) to refer to vectors and bold upper-case letters to refer to matrices (e.g., X). We exclusively use B ∈ R^{d×r} to refer to a matrix with orthonormal columns spanning an r-dimensional feature space, and B_⊥ to refer to a matrix with orthonormal columns spanning the orthogonal subspace of this feature space. The norm ‖·‖ appearing on a vector or matrix refers to its ℓ2 norm or spectral norm respectively. The notation ‖·‖_F refers to the Frobenius norm. ⟨x, y⟩ is the Euclidean inner product, while ⟨M, N⟩ = tr(MN^⊤) is the inner product between matrices. Generically, we will use "hatted" vectors and matrices (e.g., α̂ and B̂) to refer to (random) estimators of their underlying population quantities. We will use ≳, ≲, and ≍ to denote greater than, less than, and equal to up to a universal constant, and use Õ to denote an expression that hides polylogarithmic factors in all problem parameters. Our use of O, Ω, and Θ is otherwise standard.

Formally, an orthonormal feature matrix B is an element of an equivalence class (under right rotation) of a representative lying in Gr_{r,d}(R)—the Grassmann manifold (Edelman et al., 1998). The Grassmann manifold, which we denote as Gr_{r,d}(R), consists of the set of r-dimensional subspaces within an underlying d-dimensional space. To define distance in Gr_{r,d}(R) we define the notion of a principal angle between two subspaces p and q. If E is an orthonormal matrix whose columns form an orthonormal basis of p and F is an orthonormal matrix whose columns form an orthonormal basis of q, then a singular value decomposition of E^⊤F = UDV^⊤ defines the principal angles as:

    D = diag(cos θ_1, cos θ_2, ..., cos θ_k),

where 0 ≤ θ_k ≤ ... ≤ θ_1 ≤ π/2. The distance of interest for us will be the subspace angle distance sin θ_1, and for convenience we will use the shorthand sin θ(E, F) to refer to it. With some abuse of notation we will use B to refer to both an explicit orthonormal feature matrix and the subspace in Gr_{r,d}(R) it represents. We now detail several assumptions we use in our analysis.

Assumption 1 (Sub-Gaussian Design and Noise). The i.i.d. design vectors x_i are zero mean with covariance E[xx^⊤] = I_d and are I_d-sub-gaussian, in the sense that E[exp(v^⊤x_i)] ≤ exp(‖v‖²/2) for all v. Moreover, the additive noise variables ε_i are i.i.d. sub-gaussian with variance parameter 1 and are independent of x_i.

Throughout, we work in the setting of random design linear regression, and in this context Assumption 1 is standard. Our results do not critically rely on the identity covariance assumption, although its use simplifies several technical arguments. In the following we define the population task diversity matrix as A = (α_1, ..., α_t)^⊤ ∈ R^{t×r}, set ν = σ_r(A^⊤A/t), the average condition number as κ̄ = tr(A^⊤A/t)/(rν), and the worst-case condition number as κ = σ_1(A^⊤A/t)/ν.

Assumption 2 (Task Diversity and Normalization). The t underlying task parameters α_j satisfy ‖α_j‖ = Θ(1) for all j ∈ {1, ..., t}. Moreover, we assume ν > 0.

Recovering the feature matrix B is impossible without structural conditions on A. Consider the extreme case in which α_1, ..., α_t are restricted to span only the first r−1 columns of the column space of the feature matrix B. None of the data points (x_i, y_i) contain any information about the rth column-feature, which can be any arbitrary vector in the complementary (d−r+1)-dimensional subspace. In this case recovering B accurately is information-theoretically impossible. The parameters ν, κ̄, and κ capture how "spread out" the tasks α_j are in the column space of B. The condition ‖α_j‖ = Θ(1) is also standard in the statistical literature and is equivalent to normalizing the signal-to-noise ratio (snr) to be Θ(1).⁴ In linear models, the snr is defined as the square of the ℓ2 norm of the underlying parameter divided by the variance of the additive noise.

Footnote 4: Note that for a well-conditioned population task diversity matrix where κ̄ ≤ κ ≤ O(1), our snr normalization enforces that tr(A^⊤A/t) = Θ(1) and ν ≥ Ω(1/r).
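As a concrete companion to the subspace angle distance defined above, the following is a minimal sketch (with an assumed helper name sin_theta_distance) of how sin θ(E, F) can be computed from the principal angles via an SVD of E^⊤F.

```python
import numpy as np

def sin_theta_distance(E, F):
    """Subspace angle distance sin(theta_1) between the column spaces of E and F.

    E and F are matrices with orthonormal columns; the singular values of E^T F
    are the cosines of the principal angles, and the distance is the sine of the
    largest principal angle (the one with the smallest cosine).
    """
    cosines = np.linalg.svd(E.T @ F, compute_uv=False)
    cos_theta_1 = np.clip(cosines.min(), 0.0, 1.0)  # smallest singular value
    return np.sqrt(1.0 - cos_theta_1 ** 2)
```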

Our overall approach to meta-learning of representations consists of two phases that we term "meta-train" and "meta-test". First, in the meta-train phase (see Section 3), we provide algorithms to learn the underlying linear representation from a set of diverse tasks. Second, in the meta-test phase (see Section 4) we show how to transfer these learned features to a new, unseen task to improve the sample complexity of learning. Detailed proofs of our main results can be found in the Appendix.

3. Meta-Train: Learning Linear Features

Here we address both the algorithmic and statistical challenges of provably learning the linear feature representation B.

3.1. Local Minimizers of the Empirical Risk

The remarkable, practical success of first-order methods for training nonconvex optimization problems (including meta/multi-task learning objectives) motivates us to study the optimization landscape of the empirical risk induced by the model in (1). We show in this section that all local minimizers of a regularized version of the empirical risk recover the true linear representation up to a small statistical error.

Jointly learning the population parameters B and (α_1, ..., α_t)^⊤ defined by (1) is reminiscent of a matrix sensing/completion problem. We leverage this connection for our analysis, building in particular on results from Ge et al. (2017). Throughout this section we assume that we are in a uniform task sampling model—at each iteration the task t(i) for the ith datapoint is uniformly sampled from the t underlying tasks. We first recast our problem in the language of matrices, by defining the matrix we hope to recover as M_⋆ = (α_1, ..., α_t)^⊤ B^⊤ ∈ R^{t×d}. Since rank(M_⋆) = r, we let X_⋆ D_⋆ (Y_⋆)^⊤ = SVD(M_⋆), and denote U_⋆ = X_⋆(D_⋆)^{1/2} ∈ R^{t×r}, V_⋆ = Y_⋆(D_⋆)^{1/2} ∈ R^{d×r}. In this notation, the responses of the regression model are written as follows:

    y_i = ⟨e_{t(i)} x_i^⊤, M_⋆⟩ + ε_i.    (3)

To frame recovery as an optimization problem we consider the Burer-Monteiro factorization of the parameter M = UV^⊤, where U ∈ R^{t×r} and V ∈ R^{d×r}. This motivates the following objective:

    min_{U ∈ R^{t×r}, V ∈ R^{d×r}} f(U, V) = (2t/n) Σ_{i=1}^{n} ( y_i − ⟨e_{t(i)} x_i^⊤, UV^⊤⟩ )² + (1/2) ‖U^⊤U − V^⊤V‖_F².    (4)

The second term in (4) functions as a regularization to prevent solutions which send ‖U‖_F → 0 while ‖V‖_F → ∞, or vice versa. If the value of this objective (4) is small we might hope that an estimate of B can be extracted from the column space of the parameter V, since the column space of V_⋆ spans the same subspace as B. Informally, our main result states that all local minima of the regularized empirical risk are in the neighborhood of the optimal V_⋆, and have subspaces that approximate B well. Before stating our result we define the constraint set, which contains incoherent matrices with reasonable scales, as follows:

    W = { (U, V) : max_{i∈[t]} ‖e_i^⊤U‖² ≤ C_0 κ̄ r √(κν) / √t,  ‖U‖_2 ≤ C_0 √(tκν),  ‖V‖_2 ≤ C_0 √(tκν) },

for some large constant C_0. Under Assumption 2, this set contains the optimal parameters. Note that U_⋆ and V_⋆ satisfy the final two constraints by definition, and Lemma 16 can be used to show that Assumption 2 actually implies that U_⋆ is incoherent, which satisfies the first constraint. Our main result follows.

Theorem 2. Let Assumptions 1 and 2 hold in the uniform task sampling model. If the number of samples n_1 satisfies n_1 ≳ polylog(n_1, d, t) (κr)⁴ max{t, d}, then, with probability at least 1 − 1/poly(d), we have that given any local minimum (U, V) ∈ int(W) of the objective (4), the column space of V, spanned by the orthonormal feature matrix B̂, satisfies:

    sin θ(B̂, B) ≤ O( (1/ν) √( max{t, d} r log n_1 / n_1 ) ).

We make several comments on this result:

• The guarantee in Theorem 2 suggests that all local minimizers of the regularized empirical risk (4) will produce a linear representation at a distance at most Õ( √( max{t, d} r / n_1 ) ) from the true underlying representation. Theorem 5 guarantees that any estimator (including the empirical risk minimizer) must incur error ≳ √(dr/n_1). Therefore, in the regime t ≤ O(d), all local minimizers are statistically close-to-optimal, up to logarithmic factors and conditioning/eigenvalue factors in the task diversity matrix.

• Combined with a recent line of results showing that (noisy) gradient descent can efficiently escape strict saddle points to find local minima (Jin et al., 2017), Theorem 2 provides strong evidence that first-order methods can efficiently meta-learn linear features.⁵

Footnote 5: To formally establish computational efficiency, we need to further verify the smoothness and the strict-saddle properties of the objective function (4) (see, e.g., Jin et al., 2017).
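To make the landscape result concrete, below is a rough sketch of the regularized objective (4) and a plain gradient-descent loop on it. The gradient expressions are derived by hand for this sketch (an autodiff tool could be substituted), and the step size and iteration count are arbitrary illustrative choices rather than tuned settings.

```python
import numpy as np

def objective_and_grads(U, V, X, y, task_idx, t):
    """Objective (4): (2t/n) * sum_i (y_i - <e_{t(i)} x_i^T, U V^T>)^2
    + (1/2) * ||U^T U - V^T V||_F^2, together with its gradients."""
    n = len(y)
    Z = X @ V                                    # rows are V^T x_i
    preds = np.sum(Z * U[task_idx], axis=1)      # u_{t(i)}^T V^T x_i
    resid = y - preds
    scale = 2.0 * t / n
    C = U.T @ U - V.T @ V                        # symmetric r x r matrix
    f = scale * np.sum(resid ** 2) + 0.5 * np.sum(C ** 2)
    # Gradients of the squared-loss term.
    grad_U = np.zeros_like(U)
    np.add.at(grad_U, task_idx, -2.0 * scale * resid[:, None] * Z)
    grad_V = -2.0 * scale * X.T @ (resid[:, None] * U[task_idx])
    # Gradients of the regularizer (1/2)||C||_F^2.
    grad_U += 2.0 * U @ C
    grad_V -= 2.0 * V @ C
    return f, grad_U, grad_V

def gradient_descent_features(X, y, task_idx, t, r, steps=2000, lr=1e-3, seed=0):
    """Plain gradient descent on (4); returns an orthonormal basis for col(V)."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    U = 0.1 * rng.standard_normal((t, r))
    V = 0.1 * rng.standard_normal((d, r))
    for _ in range(steps):
        _, gU, gV = objective_and_grads(U, V, X, y, task_idx, t)
        U -= lr * gU
        V -= lr * gV
    B_hat, _ = np.linalg.qr(V)                   # feature estimate from col(V)
    return B_hat
```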

The proof of Theorem 2 is technical so we only sketch the high-level ideas. The overall strategy is to analyze the Hessian of the objective (4) at a stationary point (U, V) in int(W) to exhibit a direction ∆ of negative curvature which can serve as a direction of local improvement pointing towards M_⋆ (and hence show (U, V) is not a local minimum). Implementing this idea requires surmounting several technical hurdles, including (1) establishing various concentration of measure results (e.g., RIP-like conditions) for the sensing matrices e_{t(i)} x_i^⊤ unique to our setting and (2) handling the interaction of the optimization analysis with the regularizer and noise terms. Performing this analysis establishes that under the aforementioned conditions all local minima in int(W) satisfy ‖UV^⊤ − M_⋆‖_F ≤ O( √( t max{t, d} r log n_1 / n_1 ) ) (see Theorem 8). Guaranteeing that this loss is small is not sufficient to ensure recovery of the underlying features. Transferring this guarantee in the Frobenius norm to a result on the subspace angle critically uses the task diversity assumption (see Lemma 15) to give the final result.

3.2. Method-of-Moments Estimator

Next, we present a method-of-moments algorithm to recover the feature matrix B with sharper statistical guarantees. An alternative to optimization-based approaches such as maximum likelihood estimation, the method-of-moments is among the oldest statistical techniques (Pearson, 1894) and has recently been used to estimate parameters in latent variable models (Anandkumar et al., 2012).

As we will see, the technique is well-suited to our formulation of multi-task feature learning. We present our estimator in Algorithm 1, which simply computes the top-r eigenvectors of the matrix (1/n_1) Σ_{i=1}^{n_1} y_i² x_i x_i^⊤.

Algorithm 1: MoM Estimator for Learning Linear Features
    Input: {(x_i, y_i)}_{i=1}^{n_1}
    B̂ D_1 B̂^⊤ ← top-r SVD of (1/n_1) Σ_{i=1}^{n_1} y_i² x_i x_i^⊤
    return B̂

Before presenting our result, we define the averaged empirical task matrix as Λ̄ = (1/n) Σ_{i=1}^{n} α_{t(i)} α_{t(i)}^⊤, where ν̃ = σ_r(Λ̄) and κ̃ = tr(Λ̄)/(rν̃), in analogy with Assumption 2.

Theorem 3. Suppose the n_1 data samples {(x_i, y_i)}_{i=1}^{n_1} are generated from the model in (1) and that Assumptions 1 and 2 hold, but additionally that x_i ∼ N(0, I_d). Then, if n_1 ≳ polylog(d, n_1) r d κ̃/ν̃, the output B̂ of Algorithm 1 satisfies

    sin θ(B̂, B) ≤ Õ( √( κ̃ dr / (ν̃ n_1) ) ),

with probability at least 1 − O(n_1^{−100}). Moreover, if the number of samples generated from each task are equal (i.e., Λ̄ = (1/t) A^⊤A), then the aforementioned guarantee holds with κ̃ = κ̄ and ν̃ = ν.

We first make several remarks regarding this result.

• Theorem 3 is flexible—the only dependence of the estimator on the distribution of samples across the various tasks is factored into the empirical task diversity parameters ν̃ and κ̃. Under a uniform observation model the guarantee also immediately translates into an analogous statement which holds with the population task diversity parameters ν and κ̄.

• Theorem 3 provides a non-trivial guarantee even in the setting where we only have Θ(1) samples from each task, but t = Θ̃(dr). In this setting, recovering the parameters of any given task is information-theoretically impossible. However, the method-of-moments estimator can efficiently aggregate information across the tasks to learn B.

• The estimator does rely on the moment structure implicit in the Gaussian design to extract B. However, Theorem 3 has no explicit dependence on t and is close-to-optimal in the constant-snr regime; see Theorem 5 for our lower bound.

We now provide a summary of the proof. Under oracle access to the population mean E[(1/n) Σ_i y_i² x_i x_i^⊤] = 2Γ̄ + (1 + tr(Γ̄)) I_d, where Γ̄ = (1/n) Σ_{i=1}^{n} B α_{t(i)} α_{t(i)}^⊤ B^⊤ (see Lemma 1), we can extract the features B, under the condition that ν̃ > 0, by directly applying PCA to this matrix to extract its column space. In practice, we only have access to the samples {(x_i, y_i)}_{i=1}^{n}. Algorithm 1 uses the empirical moments (1/n) Σ_i y_i² x_i x_i^⊤ in lieu of the population mean. Thus, to show the result, we argue that (1/n) Σ_{i=1}^{n} y_i² x_i x_i^⊤ = E[(1/n) Σ_{i=1}^{n} y_i² x_i x_i^⊤] + E, where ‖E‖ is a small, stochastic error (see Theorem 7). If this holds, the Davis-Kahan sin θ theorem (Bhatia, 2013) shows that PCA applied to the empirical moments provides an accurate estimate of B under perturbation by a sufficiently small E. The key technical step in this argument is to show sharp concentration (in spectral norm) of the matrix-valued noise terms contained in E, which are neither bounded (in spectral norm) nor sub-gaussian/sub-exponential-like; we refer the reader to the Appendix for further details on this argument.
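Algorithm 1 is short enough to state directly in code. The following NumPy sketch forms the empirical second-moment matrix (1/n_1) Σ_i y_i² x_i x_i^⊤ and returns its top-r eigenvectors; the function name is an assumed convenience, not part of the paper.

```python
import numpy as np

def mom_feature_estimator(X, y, r):
    """Algorithm 1: method-of-moments estimator of the feature matrix B.

    X: (n1, d) covariates and y: (n1,) responses pooled across all training
    tasks.  Returns a d x r matrix with orthonormal columns spanning the top-r
    eigenspace of (1/n1) * sum_i y_i^2 x_i x_i^T.
    """
    n1 = len(y)
    M = (X * (y ** 2)[:, None]).T @ X / n1   # (1/n1) sum_i y_i^2 x_i x_i^T
    eigvals, eigvecs = np.linalg.eigh(M)     # eigenvalues in ascending order
    return eigvecs[:, -r:]                   # top-r eigenvectors

# Example (with the earlier simulation and distance sketches):
# B_hat = mom_feature_estimator(X, y, r=5)
# err = sin_theta_distance(B_hat, B)        # the quantity bounded in Theorem 3
```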

4. Meta-Test: Transfer of Features to New Tasks

Having estimated a linear feature representation B̂ shared across related tasks, our second goal is to transfer this representation to a new, unseen task—the (t+1)st task—to improve learning. In the context of the model in (1), the approach taken in Algorithm 2 uses B̂ as a plug-in surrogate for the unknown B, and attempts to estimate α_{t+1} ∈ R^r. Formally, we define our estimator α̂ as follows:

    α̂ = argmin_α ‖y − XB̂α‖²,    (5)

where n_2 samples (X, y) are generated from the model in (1) from the (t+1)st task.

Algorithm 2: Linear Regression for Learning a New Task with a Feature Estimate
    Input: B̂, {(x_i, y_i)}_{i=1}^{n_2}
    α̂ ← ( Σ_{i=1}^{n_2} B̂^⊤ x_i x_i^⊤ B̂ )† B̂^⊤ Σ_{i=1}^{n_2} x_i y_i
    return α̂

Effectively, the feature representation B̂ performs dimension reduction on the input covariates X, allowing us to learn in a lower-dimensional space. Our focus is on understanding the generalization properties of the estimator in Algorithm 2, since (5) is an ordinary least-squares objective which can be analytically solved.

Assuming we have produced an estimate B̂ of the true feature matrix B, we can present our main result on the sample complexity of meta-learned linear regression.

Theorem 4. Suppose n_2 data points, {(x_i, y_i)}_{i=1}^{n_2}, are generated from the model in (1), where Assumption 1 holds, from a single task satisfying ‖α_{t+1}‖_2 ≤ O(1). Then, if sin θ(B̂, B) ≤ δ and n_2 ≳ r log n_2, the output α̂ from Algorithm 2 satisfies

    ‖B̂α̂ − Bα_{t+1}‖² ≤ Õ( δ² + r/n_2 ),    (6)

with probability at least 1 − O(n_2^{−100}).

Note that Bα_{t+1} is simply the underlying parameter in the regression model in (1). We make several remarks about this result:

• Theorem 4 decomposes the error of transfer learning into two components. The first term, Õ(δ²), arises from the bias of using an imperfect feature estimate B̂ to transfer knowledge across tasks. The second term, Õ(r/n_2), arises from the variance of learning in a space of reduced dimensionality.

• Standard generalization guarantees for random design linear regression ensure that the parameter recovery error is bounded by O(d/n_2) w.h.p. under the same assumptions (Hsu et al., 2012). Meta-learning of the linear representation B̂ can provide a significant reduction in the sample complexity of learning when δ² ≪ d/n_2 and r ≪ d.

• Conversely, if δ² ≫ d/n_2, the bounds in (6) imply that the overhead of learning the feature representation may overwhelm the potential benefits of transfer learning (with respect to the baseline of learning the (t+1)st task in isolation). This accords with the well-documented empirical phenomenon of "negative" transfer observed in large-scale deep learning problems, where meta/transfer-learning techniques actually result in a degradation in performance on new tasks (Wang et al., 2019). For diverse tasks (i.e. κ ≤ O(1)), using Algorithm 1 to estimate B̂ suggests that ensuring δ² ≪ d/n_2, where δ² = Õ( dr/(ν n_1) ), requires n_1/n_2 ≫ r/ν. That is, the ratio of the number of samples used for feature learning to the number of samples used for learning the new task should exceed the complexity of the feature representation to achieve "positive" transfer.

In order to obtain the rate in Theorem 4 we use a bias-variance analysis of the estimator error B̂α̂ − Bα_{t+1} (and do not appeal to uniform convergence arguments). Using the definition of y we can write the error (writing α_0 for α_{t+1}) as

    B̂α̂ − Bα_0 = B̂(B̂^⊤X^⊤XB̂)^{−1}B̂^⊤X^⊤XBα_0 − Bα_0 + B̂(B̂^⊤X^⊤XB̂)^{−1}B̂^⊤X^⊤ε.

The first term contributes the bias term to (6) while the second contributes the variance term. Analyzing the fluctuations of the (mean-zero) variance term can be done by controlling the norm of its square, ε^⊤Aε, where A = XB̂(B̂^⊤X^⊤XB̂)^{−2}B̂^⊤X^⊤. We can bound this (random) quadratic form by first appealing to the Hanson-Wright inequality to show w.h.p. that ε^⊤Aε ≲ tr(A) + Õ(‖A‖_F + ‖A‖). The remaining randomness in A can be controlled using matrix concentration/perturbation arguments (see Lemma 17).

With access to the true feature matrix B (i.e., setting B̂ = B), the first term vanishes: B(B^⊤X^⊤XB)^{−1}B^⊤X^⊤XBα_0 − Bα_0 = 0, due to the cancellation in the empirical covariance matrices, (B^⊤X^⊤XB)^{−1}B^⊤X^⊤XB = I_r. This cancellation of the empirical covariance is essential to obtaining a tight analysis of the least-squares estimator. We cannot rely on this effect in full since B̂ ≠ B. However, a naive analysis which splits these terms, (B̂^⊤X^⊤XB̂)^{−1} and B̂^⊤X^⊤XB, can lead to a large increase in the variance in the bound. To exploit the fact that B̂ ≈ B, we project the matrix B in the leading XB term onto the column space of B̂ and its complement—which allows a partial cancellation of the empirical covariances in the subspace spanned by B̂. The remaining variance can be controlled as in the previous term (see Lemma 18).
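A minimal sketch of Algorithm 2 and the least-squares objective (5): project the new-task covariates onto the learned features and solve the resulting r-dimensional regression via the pseudoinverse. The function name and the returned pair are conveniences for illustration.

```python
import numpy as np

def meta_test_regression(B_hat, X_new, y_new):
    """Algorithm 2: regress the new task on the learned features.

    Solves alpha_hat = argmin_alpha ||y - X B_hat alpha||^2 (Eq. (5)) in closed
    form, and returns both alpha_hat and the induced parameter estimate
    beta_hat = B_hat alpha_hat in the original d-dimensional space.
    """
    Z = X_new @ B_hat                            # (n2, r) reduced covariates
    alpha_hat = np.linalg.pinv(Z.T @ Z) @ (Z.T @ y_new)
    return alpha_hat, B_hat @ alpha_hat
```

The squared error of the returned beta_hat against the true new-task parameter Bα_{t+1} is exactly the quantity bounded in (6).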

5. Lower Bounds for Feature Learning

To complement the upper bounds provided in the previous section, in this section we derive information-theoretic limits for feature learning in the model (1). To our knowledge, these results provide the first sample-complexity lower bounds for feature learning, with regards to subspace recovery, in the multi-task setting. While there is existing literature on (minimax-)optimal estimation of low-rank matrices (see, for example, Rohde et al. (2011)), that work focuses on the (high-dimensional) estimation of matrices whose only constraint is to be low rank. Moreover, error is measured in the additive prediction norm. In our setting, we must handle the additional difficulties arising from the fact that we are interested in (1) learning a column space (i.e., an element of Gr_{r,d}(R)) and (2) the error between such representatives is measured in the subspace angle distance. We begin by presenting our lower bound for feature recovery.

Theorem 5. Suppose a total of n data points are generated from the model in (1) satisfying Assumption 1 with x_i ∼ N(0, I_d), ε_i ∼ N(0, 1), with an equal number from each task, and that Assumption 2 holds with α_j for each task normalized to ‖α_j‖ = 1/2. Then, there are α_j for r ≤ d/2 and n ≥ max{1/(8ν), r(d−r)} so that:

    inf_{B̂} sup_{B ∈ Gr_{r,d}(R)} sin θ(B̂, B) ≥ Ω( max{ √(1/(νn)), √(dr/n) } ),

with probability at least 1/4, where the infimum is taken over the class of estimators that are functions of the n data points.

Again we make several comments on the result.

• The result of Theorem 5 shows that the estimator in Algorithm 1 provides a close-to-optimal estimator of the feature representation parameterized by B—up to logarithmic and conditioning factors (i.e. κ, ν)⁶ in the task diversity matrix—that is independent of the task number t. Note that under the normalization for the α_i, as κ → ∞ (i.e. the task matrix A becomes ill-conditioned) we have that ν → 0. So the first term in Theorem 5 establishes that task diversity is necessary for recovery of the subspace B.

• The dimension of Gr_{r,d}(R) (i.e., the number of free parameters needed to specify a feature set) is r(d−r) ≥ Ω(dr) for d/2 ≥ r; hence the second term in Theorem 5 matches the scaling that we intuit from parameter counting.

• Obtaining tight dependence of our subspace recovery bounds on conditioning factors in the task diversity matrix (i.e. κ, ν) is an important and challenging research question. We believe the gap in conditioning/eigenvalue factors between Theorem 3 and Theorem 5 on the √(dr/n) term is related to a problem that persists for classical estimators in linear regression (i.e. for the Lasso estimator in sparse linear regression). Even in this setting, a gap remains with respect to condition number/eigenvalue factors of the data design matrix X between existing upper and lower bounds (see Chen et al. (2016, Section 7), Raskutti et al. (2011, Theorem 1, Theorem 2) and Zhang et al. (2014) for example). In our setting the task diversity matrix A enters into the problem in a similar fashion to the data design matrix X in these aforementioned settings.

The dependency on the task diversity parameter 1/ν (the first term in Theorem 5) is achieved by constructing a pair of feature matrices and an ill-conditioned task matrix A that cannot discern the direction along which they differ. The proof strategy to capture the second term uses an f-divergence-based minimax technique from Guntuboyina (2011) (restated in Lemma 20 in the Appendix), similar in spirit to the global Fano (or Yang-Barron) method.

There are two key ingredients to using Lemma 20 and obtaining a tight lower bound. First, we must exhibit a large family of distinct, well-separated feature matrices {B_i}_{i=1}^{M} (i.e., a packing at scale η). Second, we must argue this set of feature matrices induces a family of distributions over data {(x_i, y_i)}_{B_i} which are statistically "close" and fundamentally difficult to distinguish amongst. This is captured by the fact that the ε-covering number, measured in the space of distributions with divergence measure D_f(·, ·), is small. The standard (global) Fano method, or Yang-Barron method (see Wainwright (2019, Ch. 15)), which uses the KL divergence to measure distance in the space of measures, is known to provide rate-suboptimal lower bounds for parametric estimation problems.⁷ Our case is no exception. To circumvent this difficulty we use the framework of Guntuboyina (2011), instantiated with the f-divergence chosen as the χ²-divergence, to obtain a tight lower bound.

The argument proceeds in two steps. First, although the geometry of Gr_{r,d}(R) is complex, we can adapt results from Pajor (1998) to provide sharp upper/lower bounds on the metric entropy (or global entropy) of the Grassmann manifold (see Proposition 9). The second technical step of the argument hinges on the ability to cover the space of distributions parametrized by B in the space of measures {P_B : B ∈ Gr_{r,d}(R)}—with distance measured by an appropriate f-divergence. In order to establish a covering in the space of measures parametrized by B, the key step is to bound the distance χ²(P_{B_1}, P_{B_2}) for two different measures over data generated from the model (1) with two different feature matrices B_1 and B_2 (see Lemma 21). This control can be achieved in our random design setting by exploiting the Gaussianity of the marginals over the data X and the Gaussianity of the conditionals of y|X, B; the divergence can ultimately be expressed as a function of the angular distance between B_1 and B_2.

Footnote 6: Note that in the setting where κ ≤ O(1), ν ∼ 1/r.
Footnote 7: Even for the simple problem of Gaussian mean estimation the classical Yang-Barron method is suboptimal; see Guntuboyina (2011) for more details.

6. Simulations

We complement our theoretical analysis with a series of numerical experiments highlighting the benefits (and limits) of meta-learning.⁸ For the purposes of feature learning we compare the performance of the method-of-moments estimator in Algorithm 1 vs. directly optimizing the objective in (4). Additional details on our set-up are provided in Appendix G. We construct problem instances by generating Gaussian covariates and noise as x_i ∼ N(0, I_d), ε_i ∼ N(0, 1), and the tasks and features used for the first-stage feature estimation as α_i ∼ (1/√r)·N(0, I_r), with B generated as a (uniform) random r-dimensional subspace of R^d. In all our experiments we generate an equal number of samples n_t for each of the t tasks, so n_1 = t·n_t. In the second stage we generate a new, (t+1)st task instance using the same feature matrix B as in the first stage, and otherwise generate n_2 samples with the covariates, noise and α_{t+1} constructed as before. Throughout this section we refer to features learned via a first-order gradient method as LF-FO and the corresponding meta-learned regression parameter on a new task by meta-LR-FO. We use LF-MoM and meta-LR-MoM to refer to the same quantities save with the feature estimate learned via the method-of-moments estimator. We also use LR to refer to the baseline linear regression estimator on a new task which only uses data generated from that task.

We begin by considering a challenging setting for feature learning where d = 100, r = 5, but n_t = 5 for varying numbers of tasks t. As Fig. 1 demonstrates, the method-of-moments estimator is able to aggregate information across the tasks as t increases to slowly improve its feature estimate, even though n_t ≪ d. The loss-based approach struggles to improve its estimate of the feature matrix B in this regime. This accords with the extra t dependence in Theorem 2 relative to Theorem 3. In this setting, we also generated a (t+1)st test task with d ≪ n_2 = 2500, to test the effect of meta-learning the linear representation on generalization in a new, unseen task against a baseline which simply performs a regression on this new task in isolation. Fig. 1 also shows that the meta-learned regressions perform significantly worse than simply ignoring the first t tasks. Theorem 4 indicates the bias from the inability to learn an accurate feature estimate of B overwhelms the benefits of transfer learning. In this regime n_2 ≫ d, so the new task can be efficiently learned in isolation. We believe this simulation represents a simple instance of the empirically observed phenomenon of "negative" transfer (Wang et al., 2019).

[Figure 1. Left: LF-FO vs. LF-MoM estimator with error measured in the subspace angle distance sin θ(B̂, B). Right: meta-LR-FO and meta-LR-MoM vs. LR on a new task with error measured on the new task parameter. Here d = 100, r = 5, and n_t = 5, while n_2 = 2500, as the number of tasks is varied.]

We now turn to the more interesting use cases where meta-learning is a powerful tool. We consider a setting where d = 100, r = 5, and n_t = 25 for varying numbers of tasks t. However, now we consider a new, unseen task where data is scarce: n_2 = 25 < d. As Fig. 2 shows, in this regime both the method-of-moments estimator and the loss-based approach can learn a non-trivial estimate of the feature representation. The benefits of transferring this representation are also evident in the improved generalization performance seen by the meta-regression procedures on the new task. Interestingly, the loss-based approach learns an accurate feature representation B̂ with significantly fewer samples than the method-of-moments estimator, in contrast to the previous experiment.

[Figure 2. Left: LF-FO vs. LF-MoM estimator with error measured in the subspace angle distance sin θ(B̂, B). Right: meta-LR-FO and meta-LR-MoM vs. LR on a new task with error measured on the new task parameter. Here d = 100, r = 5, n_t = 25, and n_2 = 25, while the number of tasks is varied.]

Finally, we consider an instance where d = 100, r = 5, t = 20, and n_2 = 50 with varying numbers of training points n_t per task. We see in Fig. 3 that meta-learning of representations provides significant value in a new task. Note that these numerical experiments show that as the number of tasks is fixed, but n_t increases, the generalization ability of the meta-learned regressions significantly improves, as reflected in the bound (2).

[Figure 3. Left: LF-FO vs. LF-MoM estimator with error measured in the subspace angle distance sin θ(B̂, B). Right: meta-LR-FO and meta-LR-MoM vs. LR on a new task with error measured on the new task parameter. Here d = 100, r = 5, t = 20, and n_2 = 50, while the number of training points per task (n_t) is varied.]

Footnote 8: An open-source Python implementation to reproduce our experiments can be found at https://github.com/nileshtrip/MTL.

7. Conclusions

In this paper we show how a shared linear representation may be efficiently learned and transferred between multiple linear regression tasks. We provide both upper and lower bounds on the sample complexity of learning this representation and for the problem of learning-to-learn. We believe our bounds capture important qualitative phenomena observed in real meta-learning applications that are absent from previous theoretical treatments.

References

Anandkumar, A., Hsu, D., and Kakade, S. M. A method of moments for mixture models and hidden Markov models. In Conference on Learning Theory, pp. 33–1, 2012.

Ando, R. K. and Zhang, T. A framework for learning predictive structures from multiple tasks and unlabeled data. Journal of Machine Learning Research, 6(Nov):1817–1853, 2005.

Baevski, A., Edunov, S., Liu, Y., Zettlemoyer, L., and Auli, M. Cloze-driven pretraining of self-attention networks. arXiv preprint arXiv:1903.07785, 2019.

Baxter, J. A model of inductive bias learning. Journal of Artificial Intelligence Research, 12:149–198, 2000.

Bhatia, R. Matrix Analysis, volume 169. Springer Science & Business Media, 2013.

Candes, E. and Plan, Y. Tight oracle bounds for low-rank matrix recovery from a minimal number of noisy random measurements. arXiv preprint arXiv:1001.0339, 2010.

Chen, X., Guntuboyina, A., and Zhang, Y. On Bayes risk lower bounds. The Journal of Machine Learning Research, 17(1):7687–7744, 2016.

Denevi, G., Ciliberto, C., Grazzi, R., and Pontil, M. Learning-to-learn stochastic gradient descent with biased regularization. arXiv preprint arXiv:1903.10399, 2019.

Donahue, J., Jia, Y., Vinyals, O., Hoffman, J., Zhang, N., Tzeng, E., and Darrell, T. DeCAF: A deep convolutional activation feature for generic visual recognition. In International Conference on Machine Learning, pp. 647–655, 2014.

Du, S. S., Hu, W., Kakade, S. M., Lee, J. D., and Lei, Q. Few-shot learning via learning the representation, provably. arXiv preprint arXiv:2002.09434, 2020.

Edelman, A., Arias, T. A., and Smith, S. T. The geometry of algorithms with orthogonality constraints. SIAM Journal on Matrix Analysis and Applications, 20(2):303–353, 1998.

Finn, C., Abbeel, P., and Levine, S. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pp. 1126–1135. JMLR.org, 2017.

Finn, C., Rajeswaran, A., Kakade, S., and Levine, S. Online meta-learning. arXiv preprint arXiv:1902.08438, 2019.

Ge, R., Jin, C., and Zheng, Y. No spurious local minima in nonconvex low rank problems: A unified geometric analysis. arXiv preprint arXiv:1704.00708, 2017.

Guntuboyina, A. Lower bounds for the minimax risk using f-divergences, and applications. IEEE Transactions on Information Theory, 57(4):2386–2399, 2011.

Hsu, D., Kakade, S. M., and Zhang, T. Random design analysis of ridge regression. In Conference on Learning Theory, pp. 9–1, 2012.

Jin, C., Ge, R., Netrapalli, P., Kakade, S. M., and Jordan, M. I. How to escape saddle points efficiently. In Proceedings of the 34th International Conference on Machine Learning, pp. 1724–1732. JMLR.org, 2017.

Khodak, M., Balcan, M.-F., and Talwalkar, A. Provable guarantees for gradient-based meta-learning. arXiv preprint arXiv:1902.10644, 2019a.

Khodak, M., Balcan, M.-F. F., and Talwalkar, A. S. Adaptive gradient-based meta-learning methods. In Advances in Neural Information Processing Systems, pp. 5915–5926, 2019b.

Liu, D. C. and Nocedal, J. On the limited memory BFGS method for large scale optimization. Mathematical Programming, 45(1-3):503–528, 1989.

Liu, X., He, P., Chen, W., and Gao, J. Multi-task deep neural networks for natural language understanding. arXiv preprint arXiv:1901.11504, 2019.

Maclaurin, D., Duvenaud, D., and Adams, R. P. Autograd: Effortless gradients in NumPy. In ICML 2015 AutoML Workshop, volume 238, 2015.

Maurer, A., Pontil, M., and Romera-Paredes, B. The benefit of multitask representation learning. The Journal of Machine Learning Research, 17(1):2853–2884, 2016.

Moritz, P., Nishihara, R., Wang, S., Tumanov, A., Liaw, R., Liang, E., Elibol, M., Yang, Z., Paul, W., Jordan, M. I., et al. Ray: A distributed framework for emerging AI applications. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18), pp. 561–577, 2018.

Pajor, A. Metric entropy of the Grassmann manifold. Convex Geometric Analysis, 34:181–188, 1998.

Pearson, K. Contributions to the mathematical theory of evolution. Philosophical Transactions of the Royal Society of London. A, 185:71–110, 1894.

Pontil, M. and Maurer, A. Excess risk bounds for multitask learning with trace norm regularization. In Conference on Learning Theory, pp. 55–76, 2013.

Raskutti, G., Wainwright, M. J., and Yu, B. Minimax rates of estimation for high-dimensional linear regression over ℓq-balls. IEEE Transactions on Information Theory, 57(10):6976–6994, 2011.

Recht, B. A simpler approach to matrix completion. Journal of Machine Learning Research, 12(Dec):3413–3430, 2011.

Rohde, A., Tsybakov, A. B., et al. Estimation of high-dimensional low-rank matrices. The Annals of Statistics, 39(2):887–930, 2011.

Tropp, J. A. User-friendly tail bounds for sums of random matrices. Foundations of Computational Mathematics, 12(4):389–434, 2012.

Tsybakov, A. B. Introduction to Nonparametric Estimation. Springer Science & Business Media, 2008.

Vershynin, R. High-Dimensional Probability: An Introduction with Applications in Data Science, volume 47. Cambridge University Press, 2018.

Vinyals, O., Blundell, C., Lillicrap, T., Wierstra, D., et al. Matching networks for one shot learning. In Advances in Neural Information Processing Systems, pp. 3630–3638, 2016.

Wainwright, M. J. High-Dimensional Statistics: A Non-Asymptotic Viewpoint, volume 48. Cambridge University Press, 2019.

Wang, Z., Dai, Z., Póczos, B., and Carbonell, J. Characterizing and avoiding negative transfer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 11293–11302, 2019.

Zhang, Y., Wainwright, M. J., and Jordan, M. I. Lower bounds on the performance of polynomial-time algorithms for sparse linear regression. In Conference on Learning Theory, pp. 921–948, 2014.
