0% found this document useful (0 votes)

13 views10 pages

Provable Meta-Learning of Linear Representati

This paper addresses the theoretical foundations of meta-learning, specifically focusing on the efficient learning of transferable linear representations across multiple tasks. It presents algorithms for multi-task linear regression that improve sample efficiency when adapting to new tasks, highlighting the importance of shared low-dimensional features. The authors also provide information-theoretic lower bounds on sample complexity, demonstrating the effectiveness of their approach in the context of meta-learning.

Uploaded by

ranaimransa227

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

13 views10 pages

Provable Meta-Learning of Linear Representati

Uploaded by

ranaimransa227

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 10

Provable Meta-Learning of Linear Representations

Nilesh Tripuraneni 1 Chi Jin 2 Michael I. Jordan 1

Abstract ranging from deep reinforcement learning (Baevski et al.,

Meta-learning, or learning-to-learn, seeks to de- 2019) to natural language processing (Ando & Zhang, 2005;
sign algorithms that can utilize previous experi- Liu et al., 2019). Accordingly, researchers have begun to
ence to rapidly learn new skills or adapt to new highlight the need to develop (and understand) generic algo-
environments. Representation learning—a key rithms for transfer (or meta) learning applicable in diverse
tool for performing meta-learning—learns a data domains (Finn et al., 2017). Surprisingly, however, despite
representation that can transfer knowledge across a long line of work on transfer learning, there is limited the-
multiple tasks, which is essential in regimes where oretical characterization of the underlying problem. Indeed,
data is scarce. Despite a recent surge of interest there are few efficient algorithms for feature learning that
in the practice of meta-learning, the theoretical provably generalize to new, unseen tasks. Sharp guarantees
underpinnings of meta-learning algorithms are are even lacking in the linear setting.
lacking, especially in the context of learning trans- In this paper, we study the problem of meta-learning of
ferable representations. In this paper, we focus on features in a linear model in which multiple tasks share
the problem of multi-task linear regression—in a common set of low-dimensional features. Our aim is
which multiple linear regression models share a twofold. First, we ask: given a set of diverse samples from
common, low-dimensional linear representation. t different tasks how we can efficiently (and optimally)
Here, we provide provably fast, sample-efficient learn a common feature representation? Second, having
algorithms to address the dual challenges of (1) learned a common feature representation, how can we use
learning a common set of features from multiple, this representation to improve sample efficiency in a new
related tasks, and (2) transferring this knowledge (t + 1)st task where data may be scarce?1
to new, unseen tasks. Both are central to the gen-
eral problem of meta-learning. Finally, we com- Formally, given an (unobserved) linear feature matrix B =
plement these results by providing information- (b1 , . . . , br ) ∈ Rd×r with orthonormal columns, our statis-
theoretic lower bounds on the sample complexity tical model for data pairs (xi , yi ) is:
of learning these linear features.
yi = x>
i Bαt(i) + i ; βt(i) = Bαt(i) , (1)

where there are t (unobserved) underlying task parameters

1. Introduction αj for j ∈ {1, . . . , t}. Here t(i) ∈ {1, . . . , t} is the index
The ability of a learner to transfer knowledge between tasks of the task associated with the ith datapoint, xi ∈ Rd is a
is crucial for robust, sample-efficient inference and predic- random covariate, and i is additive noise. We assume the
tion. One of the most well-known examples of such transfer sequence {αt(i) }∞ i=1 is independent of all other randomness
learning has been in few-shot image classification where in the problem. In this framework, the aforementioned
the idea is to initialize neural network weights in early lay- questions reduce to recovering B from data from the first
ers using ImageNet pre-training/features, and subsequently {1, . . . , t} tasks, and using this feature representation to
re-train the final layers on a new task (Donahue et al., 2014; recover a better estimate of a new task parameter, βt+1 =
Vinyals et al., 2016). However, the need for methods that Bαt+1 , where αt+1 is also unobserved.
can learn data representations that generalize to multiple, Our main result targets the problem of learning-to-learn
unseen tasks has also become vital in other applications, (LTL), and shows how a feature representation B̂ learned
1
Department of EECS, University of California, from t diverse tasks can improve learning on an unseen
Berkeley 2 Department of Electrical Engineering, Prince- (t + 1)st task which shares the same underlying linear repre-
ton University. Correspondence to: Nilesh Tripuraneni sentation. We informally state this result below.2
<nilesh [email protected]>.
1
This is sometimes referred to as learning-to-learn (LTL).
th 2
Proceedings of the 38 International Conference on Machine Theorem 1 follows immediately from combining Theorems 3
Learning, PMLR 139, 2021. Copyright 2021 by the author(s). and 4; see Theorem 6 for a formal statement.
Provable Meta-Learning of Linear Representations

Theorem 1 (Informal). Suppose we are given n1 total sam- up to logarithmic and conditioning/eigenvalue factors in
ples from t diverse and normalized tasks which are used the matrix of task parameters (see Assumption 2). To our
in Algorithm 1 to learn a feature representation B̂, and n2 knowledge, this is the first information-theoretic lower
samples from a new (t + 1)st task which are used along with bound for representation learning in the multi-task setting
B̂ and Algorithm 2 to learn the parameters α̂ of this new (see Section 5).
(t + 1)st task. Then, the parameter B̂α̂ has the following
excess prediction error on a new test point x? drawn from
the training data covariate distribution: 1.1. Related Work
2
dr r
While there is a vast literature on papers proposing multi-
2
Ex? [hx? , B̂α̂ − Bαt+1 i ] ≤ Õ + , (2) task and transfer learning methods, the number of theoretical
n1 n2
investigations is much smaller. An important early contri-
with high probability over the training data. bution is due to Baxter (2000), who studied a model where
tasks with shared representations are sampled from the same
The naive complexity of linear regression which ignores underlying environment. Pontil & Maurer (2013) and Mau-
the information from the previous t tasks has complexity rer et al. (2016), using tools from empirical process theory,
O( nd2 ). Theorem 1 suggests that “positive” transfer from developed a generic and powerful framework to prove gener-
the first {1, . . . , t} tasks to the final (t + 1)st task can dra- alization bounds in multi-task and learning-to-learn settings
matically reduce the sample complexity of learning when that are related to ours. Indeed, the closest guarantee to that
r d and nn12 r2 ; that is, when (1) the complexity of the in our Theorem 1 that we are aware of is Maurer et al. (2016,
shared representation is much smaller than the dimension Theorem 5). Instantiated in our setting, Maurer et al. (2016,
of the underlying space and (2) when the ratio of the num- Theorem 5) provides an LTL guarantee showing that the ex-
ber of samples used for feature learning to the number of cess risk of the loss function with learned
√
representation on
q
samples present for a new unseen task exceeds the complex- a new datapoint is bounded by Õ( r√td + nr2 ), with high
ity of the shared representation. We believe that the LTL probability. There are several principal differences between
bound in Theorem 1 is the first bound, even in the linear our work and results of this kind. First, we address the
setting, to sharply exhibit this phenomenon (see Section 1.1 algorithmic component (or computational aspect) of meta-
for a detailed comparison to existing results). Prior work learning while the previous theoretical literature generally
provides rates for which the leading term in (2) decays as assumes access to a global empirical risk minimizer (ERM).
∼ √1t , not as ∼ n11 . We identify structural conditions on Computing the ERM in these settings requires solving a
the design of the covariates and diversity of the tasks that nonconvex optimization problem that is in general NP hard.
allow our algorithms to take full advantage of all samples Second, in contrast to Maurer et al. (2016), we also provide
available when learning the shared features. Our primary guarantees for feature recovery in terms of the parameter
contributions in this paper are to: estimation error—measured directly in the distance in the
feature space.
• Establish that all local minimizers of the (regularized)
empirical risk induced by (1) are close to the true linear Third, and most importantly, in Maurer et al. (2016), the
representation up to a small, statistical error. This pro- leading term capturing the complexity of learning the fea-
vides strong evidence that first-order algorithms, such as ture representation decays only in t but not in n1 (which
gradient descent (Jin et al., 2017), can efficiently recover √ much larger than t). Although, as they remark,
is typically
good feature representations (see Section 3.1). the 1/ t scaling they obtain is in general unimprovable in
their setting, our results leverage assumptions on the dis-
• Provide a method-of-moments estimator which can effi- tributional similarity between the underlying covariates x
ciently aggregate information across multiple differing and the potential diversity of tasks to improve this scaling
tasks to estimate B—even when it may be information- to 1/n1 . That is, our algorithms make benefit of all the
theoretically impossible to learn the parameters of any samples in the feature learning phase. We believe that for
given task (see Section 3.2). many settings (including the linear model that is our fo-
cus) such assumptions are natural and that our rates reflect
• Demonstrate the benefits and pitfalls of transferring
the practical efficacy of meta-learning techniques. Indeed,
learned representations to new, unseen tasks by analyzing
transfer learning is often successful even when we are pre-
the bias-variance trade-offs of the linear regression esti-
sented with only a few training tasks but with each having a
mator based on a biased, feature estimate (see Section 4).
significant number of samples per task (e.g., n1 t).3
• Develop an information-theoretic lower bound for the There has also been a line of recent work providing guaran-
problem of feature learning, demonstrating that the afore-
3
mentioned estimator is a close-to-optimal estimator of B, See Fig. 3 for a numerical simulation relevant to this setting.
Provable Meta-Learning of Linear Representations

tees for gradient-based meta-learning (MAML) (Finn et al., F is an orthonormal matrix whose columns form an or-
2017). Finn et al. (2019); Khodak et al. (2019a;b), and thonormal basis of q, then a singular value decomposition
Denevi et al. (2019) work in the framework of online con- of E> F = UDV> defines the principal angles as:
vex optimization (OCO) and use a notion of (a potentially
data-dependent) task similarity that assumes closeness of D = diag(cos θ1 , cos θ2 , . . . , cos θk ),
all tasks to a single fixed point in parameter space to pro- where 0 ≤ θk ≤ . . . ≤ θ1 ≤ π2 . The distance of interest
vide guarantees. In contrast to this work, we focus on the for us will be the subspace angle distance sin θ1 , and for
setting of learning a representation common to all tasks in a convenience we will use the shorthand sin θ(E, F) to refer
generative model. The task model parameters need not be to it. With some abuse of notation we will use B to refer to
close together in our setting. an explicit orthonormal feature matrix and the subspace in
In concurrent work, Du et al. (2020) obtain results similar to Grr,d (R) it represents. We now detail several assumptions
ours for multi-task linear regression and provide comparable we use in our analysis.
guarantees for a two-layer ReLU network using a notion Assumption 1 (Sub-Gaussian Design and Noise). The
of training task diversity akin to ours. Their generalization i.i.d. design vectors xi are zero mean with covariance
bounds use a distributional assumption over meta-test tasks, E[xx> ] = Id and are Id -sub-gaussian,
in the sense that
2
while our bounds for linear regression are sharp for fixed E[exp(v> xi )] ≤ exp kvk 2 for all v. Moreover, the addi-
meta-test tasks. Moreover, their focus is on purely statistical tive noise variables i are i.i.d. sub-gaussian with variance
guarantees—they assume access to an ERM oracle for non- parameter 1 and are independent of xi .
convex optimization problems. Our focus is on providing
statistical rates for efficient algorithmic procedures (i.e., the Throughout, we work in the setting of random design linear
method-of-moments and local minima reachable by gradient regression, and in this context Assumption 1 is standard.
descent). Finally, we also show a (minimax)-lower bound Our results do not critically rely on the identity covariance
for the problem of feature recovery (i.e., recovering B). assumption although its use simplifies several technical argu-
ments. In the following we define the population task diver-
>
2. Preliminaries sity matrix as A = (α1 , . . . , αt )> ∈ Rt×r , ν = σr ( A t A ),
>A
tr( A )
Throughout, we will use bold lower-case letters (e.g., x) the average condition number as κ̄ = rν
t
, and the
A> A
to refer to vectors and bold upper-case letters to refer to worst-case condition number as κ = σ1 ( t )/ν.
matrices (e.g., X). We exclusively use B ∈ Rd×r to re- Assumption 2 (Task Diversity and Normalization). The t
fer to a matrix with orthonormal columns spanning an r- underlying task parameters αj satisfy kαj k = Θ(1) for all
dimensional feature space, and B⊥ to refer a matrix with j ∈ {1, . . . , t}. Moreover, we assume ν > 0.
orthonormal columns spanning the orthogonal subspace of
this feature space. The norm k · k appearing on a vector or Recovering the feature matrix B is impossible without struc-
matrix refers to its `2 norm or spectral norm respectively. tural conditions on A. Consider the extreme case in which
The notation k·kF refers to a Frobenius norm. hx, yi is the α1 , . . . , αt are restricted to span only the first r−1 columns
Euclidean inner product, while hM, Ni = tr(MN> ) is the of the column space of the feature matrix B. None of the
inner product between matrices. Generically, we will use data points (xi , yi ) contain any information about the rth
“hatted” vectors and matrices (e.g., α̂ and B̂) to refer to column-feature which can be any arbitrary vector in the com-
(random) estimators of their underlying population quanti- plementary d − r − 1 subspace. In this case recovering B
ties. We will use &, ., and to denote greater than, less accurately is information-theoretically impossible. The pa-
than, and equal to up to a universal constant and use Õ to rameters ν, κ̄, and κ capture how “spread out” the tasks αj
denote an expression that hides polylogarithmic factors in are in the column space of B. The condition kαj k = Θ(1)
all problem parameters. Our use of O, Ω, and Θ is otherwise is also standard in the statistical literature and is equivalent
standard. to normalizing the signal-to-noise (snr) ratio to be Θ(1)4 .
In linear models, the snr is defined as the square of the `2
Formally, an orthonormal feature matrix B is an element of
norm of the underlying parameter divided by the variance
an equivalence class (under right rotation) of a representa-
of the additive noise.
tive lying in Grr,d (R)—the Grassmann manifold (Edelman
et al., 1998). The Grassmann manifold, which we denote Our overall approach to meta-learning of representations
as Grr,d (R), consists of the set of r-dimensional subspaces consists of two phases that we term “meta-train” and “meta-
within an underlying d-dimensional space. To define dis- test”. First, in the meta-train phase (see Section 3), we
tance in Grr,d (R) we define the notion of a principal angle 4
Note that for a well-conditioned population task diversity
between two subspaces p and q. If E is an orthonormal
matrix where κ̄ ≤ κ ≤ O(1), our snr normalization enforces that
matrix whose columns form an orthonormal basis of p and tr(A> A/t) = Θ(1) and ν ≥ Ω( r1 ).
Provable Meta-Learning of Linear Representations

provide algorithms to learn the underlying linear represen- main result states that all local minima of the regularized
tation from a set of diverse tasks. Second, in the meta-test empirical risk are in the neighborhood of the optimal V? ,
phase (see Section 4) we show how to transfer these learned and have subspaces that approximate B well. Before stat-
features to a new, unseen task to improve the sample com- ing our result we define the constraint set, which contains
plexity of learning. Detailed proofs of our main results can incoherent matrices with reasonable scales, as follows:
be found in the Appendix. √
C0 κ̄r κν
W = { (U, V) | max ke> i Uk 2
≤ √ ,
i∈[t] t
3. Meta-Train: Learning Linear Features √ √
kUk2 ≤ C0 tκν, kVk2 ≤ C0 tκν },
Here we address both the algorithmic and statistical chal-
lenges of provably learning the linear feature representation for some large constant C0 . Under Assumption 2, this set
B. contains the optimal parameters. Note that U? and V?
satisfy the final two constraints by definition and Lemma 16
3.1. Local Minimizers of the Empirical Risk can be used to show that Assumption 2 actually implies that
U? is incoherent, which satisfies the first constraint. Our
The remarkable, practical success of first-order methods main result follows.
for training nonconvex optimization problems (including
Theorem 2. Let Assumptions 1 and 2 hold in the uniform
meta/multi-task learning objectives) motivates us to study
task sampling model. If the number of samples n1 satisfies
the optimization landscape of the empirical risk induced
n1 & polylog(n1 , d, t)(κr)4 max{t, d}, then, with proba-
by the model in (1). We show in this section that all local
bility at least 1 − 1/poly(d), we have that given any local
minimizers of a regularized version of empirical risk recover
minimum (U, V) ∈ int(W) of the objective (4), the column
the true linear representation up to a small statistical error.
space of V, spanned by the orthonormal feature matrix B̂,
Jointly learning the population parameters B and satisfies:
(α1 , . . . , αt )> defined by (1) is reminiscent of a matrix  s 
sensing/completion problem. We leverage this connection 1 max{t, d}r log n1 
for our analysis, building in particular on results from Ge sin θ(B̂, B) ≤ O  √ .
ν n1
et al. (2017). Throughout this section we assume that we
are in a uniform task sampling model—at each iteration
the task t(i) for the ith datapoint is uniformly sampled We make several comments on this result:
from the t underlying tasks. We first recast our problem in
the language of matrices, by defining the matrix we hope • The guarantee in Theorem 2 suggests that all local
to recover as M? = (α1 , . . . , αt )> B> ∈ Rt×d . Since minimizers of the regularized empirical risk (4) will
rank(M? ) = r, we let X? D? (Y? )> = SVD(M? ), and produce a linear representation at a distance at most
denote U? = X? (D? )1/2 ∈ Rt×r , V? = (D? )1/2 Y? ∈
p
Õ( max{t, d}r/n1 ) from the true underlying represen-
Rd×r . In this notation, the responses of the regression model tation. Theorem 5 guarantees that any estimator (including
p
are written as follows: the empirical risk minimizer) must incur error & dr/n1 .
Therefore, in the regime t ≤ O(d), all local minimizers
yi = het(i) x>
i , M? i + i . (3)
are statistically close-to-optimal, up to logarithmic factors
To frame recovery as an optimization problem we consider and conditioning/eigenvalue factors in the task diversity
the Burer-Monteiro factorization of the parameter M = matrix.
UV> where U ∈ Rt×r and V ∈ Rd×r . This motivates the • Combined with a recent line of results showing that (noisy)
following objective: gradient descent can efficiently escape strict saddle points
2t X
n to find local minima (Jin et al., 2017), Theorem 2 provides
min f (U, V) = (yi − het(i) x> > 2
i , UV i) strong evidence that first-order methods can efficiently
U∈Rt×r ,V∈Rd×r n i=1
meta-learn linear features.5
1
+ kU> U − V> Vk2F . (4)
2 The proof of Theorem 2 is technical so we only sketch
The second term in (4) functions as a regularization to pre- the high-level ideas. The overall strategy is to analyze the
vent solutions which send kUkF → 0 while kVkF → ∞ Hessian of the objective (4) at a stationary point (U, V) in
or vice versa. If the value of this objective (4) is small we int(W) to exhibit a direction ∆ of negative curvature which
might hope that an estimate of B can be extracted from 5
To formally establish computational efficiency, we need to
the column space of the parameter V, since the column further verify the smoothness and the strict-saddle properties of
space of V? spans the same subspace as B. Informally, our the objective function (4) (see, e.g., (Jin et al., 2017)).
Provable Meta-Learning of Linear Representations

Algorithm 1 MoM Estimator for Learning Linear Features Algorithm 2 Linear Regression for Learning a New Task
Input: {(xi , yi )}ni=1
1
. with a Feature Estimate
Input: B̂, {(xi , yi )}ni=1
>
Pn1
B̂D1 B̂ ← top-r SVD of 1
n1 · i=1 yi2 xi x>
i Pn2
2
.
Pn2
> > † >
return B̂ α̂ ← ( i=1 B̂xi xi B̂ ) B̂ i=1 xi yi
return α̂

can serve as a direction of local improvement pointing to-

wards M? (and hence show (U, V) is not a local minimum). • Theorem 3 is flexible—the only dependence of the estima-
Implementing this idea requires surmounting several techni- tor on the distribution of samples across the various tasks
cal hurdles including (1) establishing various concentration is factored into the empirical task diversity parameters ν̃
of measure results (e.g., RIP-like conditions) for the sensing and κ̃. Under a uniform observation model the guarantee
matrices et(i) x> also immediately translates into an analogous statement
i unique to our setting and (2) handling the
interaction of the optimization analysis with the regularizer which holds with the population task diversity parameters
and noise terms. Performing this analysis establishes that ν and κ̄.
under the aforementioned conditions all q local minima in • Theorem 3 provides a non-trivial guarantee even in the
int(W) satisfy kUV> − M? kF ≤ O( t max{t,d}r n1
log n1
) setting where we only have Θ(1) samples from each task,
(see Theorem 8). Guaranteeing that this loss is small is but t = Θ̃(dr). In this setting, recovering the parame-
not sufficient to ensure recovery of the underlying features. ters of any given task is information-theoretically impos-
Transferring this guarantee in the Frobenius norm to a re- sible. However, the method-of-moments estimator can
sult on the subspace angle critically uses the task diversity efficiently aggregate information across the tasks to learn
assumption (see Lemma 15) to give the final result. B.
• The estimator does rely on the moment structure implicit
3.2. Method-of-Moments Estimator in the Gaussian design to extract B. However, Theorem 3
Next, we present a method-of-moments algorithm to re- has no explicit dependence on t and is close-to-optimal
cover the feature matrix B with sharper statistical guaran- in the constant-snr regime; see Theorem 5 for our lower
tees. An alternative to optimization-based approaches such bound.
as maximum likelihood estimation, the method-of-moments
is among the oldest statistical techniques (Pearson, 1894) We now provide a summary of the proof. P Under ora-
and has recently been used to estimate parameters in latent cle access to the population meanPE[ n1 i yi2 xi x> i ] =
n
variable models (Anandkumar et al., 2012). (2Γ̄ + (1 + tr(Γ̄))Id ), where Γ̄ = n1 i=1 Bαt(i) α> t(i) B
>

(see Lemma 1), we can extract the features B by directly

As we will see, the technique is well-suited to our formu- applying PCA to this matrix, under the condition that κ̃ > 0,
lation of multi-task feature learning. We present our esti- to extract its column space. In practice, we only have
mator in Algorithm 1, which simply Pncomputes the top-r access to the samples {(xP n
i , yi )}i=1 . Algorithm 1 uses
eigenvectors of the matrix (1/n1 ) i=1 1
yi2 xi x>
i . Before the empirical moments n i yi xi x>
1 2
i in lieu of the pop-
presenting our result,
Pn we define the averaged empirical task ulation mean. Thus, to show the result, we argue that
matrix as Λ̄ = n1 i=1 αt(i) α> t(i) where ν̃ = σ r ( Λ̄), and 1
Pn
y 2
xi x >
= E[ 1
Pn
y 2
x i x>
] + E where kEk
n i=1 i i n i=1 i i
κ̃ = tr(Λ̄)/(rν̃) in analogy with Assumption 2. is a small, stochastic error (see Theorem 7). If this holds,
Theorem 3. Suppose the n1 data samples {(xi , yi )}ni=1 1 the Davis-Kahan sin θ theorem (Bhatia, 2013) shows that
are generated from the model in (1) and that Assumptions 1 PCA applied to the empirical moments provides an accurate
and 2 hold, but additionally that xi ∼ N (0, Id ). Then, if estimate of B under perturbation by a sufficiently small E.
n1 & polylog(d, n1 )rdκ̃/ν̃, the output B̂ of Algorithm 1 The key technical step in this argument is to show sharp
satisfies concentration (in spectral norm) of the matrix-valued noise
! terms contained in E which are neither bounded (in spectral
norm) nor sub-gaussian/sub-exponential-like; we refer the
r
κ̃ dr
sin θ(B̂, B) ≤ Õ , reader to the Appendix for further details on this argument.
ν̃ n1

with probability at least 1 − O(n−100

1 ). Moreover, if the 4. Meta-Test: Transfer of Features to New
number of samples generated from each task are equal (i.e., Tasks
Λ̄ = 1t A> A), then the aforementioned guarantee holds
with κ̃ = κ̄ and ν̃ = ν. Having estimated a linear feature representation B̂ shared
across related tasks, our second goal is to transfer this rep-
We first make several remarks regarding this result. resentation to a new, unseen task—the (t + 1)st task—to
Provable Meta-Learning of Linear Representations

improve learning. In the context of the model in (1), the performance on new tasks (Wang et al., 2019). For
approach taken in Algorithm 2 uses B̂ as a plug-in surrogate diverse tasks (i.e. κ ≤ O(1)), using Algorithm 1 to
for the unknown B, and attempts to estimate αt+1 ∈ Rr . estimate B̂ suggests that ensuring δ 2 nd2 , where
Formally we define our estimator α as follows: dr
δ 2 = Õ( νn ), requires nn12 r/ν. That is, the ratio of
1

α̂ = arg min ky − XB̂αk2 , (5) the number of samples used for feature learning to the
α number of samples used for learning the new task should
where n2 samples (X, y) are generated from the model exceed the complexity of the feature representation to
in (1) from the (t + 1)st task. Effectively, the feature rep- achieve “positive” transfer.
resentation B̂ performs dimension reduction on the input
covariates X, allowing us to learn in a lower-dimensional In order to obtain the rate in Theorem 4 we use a bias-
space. Our focus is on understanding the generalization variance analysis of the estimator error B̂α̂ − Bαt+1 (and
properties of the estimator in Algorithm 2, since (5) is an do not appeal to uniform convergence arguments). Using
ordinary least-squares objective which can be analytically the definition of y we can write the error as,
solved.
B̂α̂ − Bα0 = B̂(B̂> X> XB̂)−1 B̂X> XBα0
Assuming we have produced an estimate B̂ of the true fea-
ture matrix B, we can present our main result on the sample − Bα0 + B̂(B̂> X> XB̂)−1 B̂> X> .
complexity of meta-learned linear regression.
The first term contributes the bias term to (6) while the
Theorem 4. Suppose n2 data points, {(xi , yi )}ni=1
2
, are second contributes the variance term. Analyzing the fluctu-
generated from the model in (1), where Assumption 1 holds, ations of the (mean-zero) variance term can be done by
from a single task satisfying kαt+1 k2 ≤ O(1). Then, if controlling the norm of its square, > A, where A =
sin θ(B̂, B) ≤ δ and n2 & r log n2 , the output α̂ from XB̂(B̂> X> XB̂)−2 B̂> X> . We can bound this (random)
Algorithm 2 satisfies quadratic form by first appealing to the Hanson-Wright in-

r
equality to show w.h.p. that > A . tr(A) + Õ(kAkF +
2 2
kB̂α̂ − Bαt+1 k ≤ Õ δ + , (6) kAk). The remaining randomness in A can be controlled
n2
using matrix concentration/perturbation arguments (see
with probability at least 1 − O(n−100
2 ). Lemma 17).
With access to the true feature matrix B̂ (i.e., setting B̂ = B)
Note that Bαt+1 is simply the underlying parameter in the
the term B̂(B> X> XB)−1 BX> XBα0 − Bα0 = 0, due
regression model in (1). We make several remarks about
to the cancellation in the empirical covariance matrices,
this result:
(B> X> XB)−1 BX> XB = Ir . This cancellation of the
empirical covariance is essential to obtaining a tight anal-
• Theorem 4 decomposes the error of transfer learning
ysis of the least-squares estimator. We cannot rely on this
into two components. The first term, Õ(δ 2 ), arises from
effect in full since B̂ 6= B. However, a naive analysis
the bias of using an imperfect feature estimate B̂ to trans- which splits these terms, (B̂> X> XB̂)−1 and B̂X> XB
fer knowledge across tasks. The second term, Õ( nr2 ), can lead to a large increase in the variance in the bound.
arises from the variance of learning in a space of reduced To exploit the fact B̂ ≈ B, we project the matrix B in
dimensionality. the leading XB term onto the column space of B̂ and its
• Standard generalization guarantees for random design complement—which allows a partial cancellation of the
linear regression ensure that the parameter recovery error empirical covariances in the subspace spanned by B̂. The
is bounded by O( nd2 ) w.h.p. under the same assumptions remaining variance can be controlled as in the previous term
(Hsu et al., 2012). Meta-learning of the linear representa- (see Lemma 18).
tion B̂ can provide a significant reduction in the sample
complexity of learning when δ 2 nd2 and r d. 5. Lower Bounds for Feature Learning
2 d
• Conversely, if δ n2the bounds in (6) imply that To complement the upper bounds provided in the previ-
the overhead of learning the feature representation may ous section, in this section we derive information-theoretic
overwhelm the potential benefits of transfer learning limits for feature learning in the model (1). To our knowl-
(with respect to baseline of learning the (t + 1)st task edge, these results provide the first sample-complexity lower
in isolation). This accords with the well-documented bounds for feature learning, with regards to subspace recov-
empirical phenomena of “negative” transfer observed in ery, in the multi-task setting. While there is existing litera-
large-scale deep learning problems where meta/transfer- ture on (minimax)-optimal estimation of low-rank matrices
learning techniques actually result in a degradation in (see, for example, Rohde et al. (2011)), that work focuses on
Provable Meta-Learning of Linear Representations

the (high-dimensional) estimation of matrices, whose only The dependency on the task diversity parameter ν1 (the first
constraint is to be low rank. Moreover, error is measured in term in Theorem 5) is achieved by constructing a pair of fea-
the additive prediction norm. In our setting, we must handle ture matrices and an ill-conditioned task matrix A that can-
the additional difficulties arising from the fact that we are not discern the direction along which they defer. The proof
interested in (1) learning a column space (i.e., an element in strategy to capture the second term uses a f -divergence
the Grr,d (R)) and (2) the error between such representatives based minimax technique from Guntuboyina (2011) (re-
is measured in the subspace angle distance. We begin by stated in Lemma 20 in the Appendix), similar in spirit to the
presenting our lower bound for feature recovery. global Fano (or Yang-Barron).
Theorem 5. Suppose a total of n data points are generated There are two key ingredients to using Lemma 20 and ob-
from the model in (1) satisfying Assumption 1 with xi ∼ taining a tight lower bound. First, we must exhibit a large
N (0, Id ), i ∼ N (0, 1), with an equal number from each family of distinct, well-separated feature matrices {Bi }M i=1
task, and that Assumption 2 holds with αj for each task (i.e., a packing at scale η). Second, we must argue this set
normalized to kαj k = 12. Then, there are αj for r ≤ d2 and of feature matrices induces a family of distributions over
1
n ≥ max 8ν , r(d − r) so that: data {(xi , yi )}Bi which are statistically “close” and funda-
r r r !! mentally difficult to distinguish amongst. This is captured
1 1 dr
inf sup sin θ(B̂, B) ≥ Ω max , , by the fact the -covering number, measured in the space
B̂ B∈Grr,d (R) ν n n of distributions with divergence measure Df (·, ·), is small.
The standard (global) Fano method, or Yang-Barron method
with probability at least 14 , where the infimum is taken over
(see Wainwright (2019, Ch. 15)), which uses the KL di-
the class of estimators that are functions of the n data points.
vergence to measure distance in the space of measures, is
known to provide rate-suboptimal lower bounds for para-
Again we make several comments on the result. metric estimation problems.7 Our case is no exception. To
circumvent this difficulty we use the framework of Gun-
• The result of Theorem 5 shows that the estimator in Al- tuboyina (2011), instantiated with the f -divergence chosen
gorithm 1 provides a close-to-optimal estimator of the as the χ2 -divergence, to obtain a tight lower bound.
feature representation parameterized by B–up to loga- The argument proceeds in two steps. First, although the
rithmic and conditioning factors (i.e. κ, ν)6 in the task geometry of Grr,d (R) is complex, we can adapt results from
diversity matrix–that is independent of the task number t. Pajor (1998) to provide sharp upper/lower bounds on the
Note that under the normalization for αi , as κ → ∞ (i.e. metric entropy (or global entropy) of the Grassmann man-
the task matrix A becomes ill-conditioned) we have that ifold (see Proposition 9). The second technical step of
ν → 0. So the first term in Theorem 5 establishes that the argument hinges on the ability to cover the space of
task diversity is necessary for recovery of the subspace B. distributions parametrized by B in the space of measures
• The dimension of Grr,d (R) (i.e., the number of free param- {PB : B ∈ Grr,d (R)}—with distance measured by an ap-
eters needed to specify a feature set) is r(d − r) ≥ Ω(dr) propriate f -divergence. In order to establish a covering
for d/2 ≥ r; hence the second term in Theorem 5 matches in the space of measures parametrized by B, the key step
the scaling that we intuit from parameter counting. is to bound the distance χ2 (PB1 , PB2 ) for two different
measures over data generated from the model (1) with two
• Obtaining tight dependence of our subspace recovery different feature matrices B1 and B2 (see Lemma 21). This
bounds on conditioning factors in the task diversity ma- control can be achieved in our random design setting by
trix (i.e. κ, ν) is an important and challenging research exploiting the Gaussianity of the marginals over data X and
question. We believe the gap between in condition- the Gaussianity of the conditionals of y|X, B, to ultimately
ing/eigenvalue
p factors between Theorem 3 and Theorem 5 be expressed as a function of the angular distance between
on the dr/n term is related to a problem that persists for B1 and B2 .
classical estimators in linear regression (i.e. for the Lasso
estimator in sparse linear regression). Even in this setting,
a gap remains with respect to condition number/eigenvalue 6. Simulations
factors of the data design matrix X, between existing up- We complement our theoretical analysis with a series of
per and lower bounds (see Chen et al. (2016, Section 7), numerical experiments highlighting the benefits (and lim-
Raskutti et al. (2011, Theorem 1, Theorem 2) and Zhang
7
et al. (2014) for example). In our setting the task diversity Even for the simple problems of Gaussian mean estimation
matrix A enters into the problem in a similar fashion to the classical Yang-Barron method is suboptimal; see Guntuboyina
(2011) for more details.
the data design matrix X in these aforementioned settings.
6
Note in the setting that κ ≤ O(1), ν ∼ r1 .
Provable Meta-Learning of Linear Representations

its) of meta-learning8 . For the purposes of feature learning that meta-learned regressions perform significantly worse
we compare the performance of the method-of-moments than simply ignoring first t tasks. Theorem 4 indicates the
estimator in Algorithm 1 vs. directly optimizing the objec- bias from the inability to learn an accurate feature estimate
tive in (4). Additional details on our set-up are provided of B overwhelms the benefits of transfer learning. In this
in Appendix G. We construct problem instances by gen- regime n2 d so the new task can be efficiently learned in
erating Gaussian covariates and noise as xi ∼ N (0, Id ), isolation. We believe this simulation represents a simple in-
i ∼ N (0, 1), and the tasks and features used for the first- stance of the empirically observed phenomena of “negative”
stage feature estimation as αi ∼ √1r · N (0, Ir ), with B transfer (Wang et al., 2019).
generated as a (uniform) random r-dimensional subspace We now turn to the more interesting use cases where meta-
of Rd . In all our experiments we generate an equal number learning is a powerful tool. We consider a setting where
of samples nt for each of the t tasks, so n1 = t · nt . In d = 100, r = 5, and nt = 25 for varying numbers of
the second stage we generate a new, (t + 1)st task instance tasks t. However, now we consider a new, unseen task
using the same feature estimate B used in the first stage and where data is scarce: n2 = 25 < d. As Fig. 2 shows, in
otherwise generate n2 samples, with the covariates, noise
and αt+1 constructed as before. Throughout this section we 1.0
LF-MoM 1.4
LR

`2 Parameter Error
refer to features learned via a first-order gradient method 0.8 LF-FO 1.2 meta-LR-MoM
as LF-FO and the corresponding meta-learned regression 1.0 meta-LR-FO

sin θ
0.6

parameter on a new task by meta-LR-FO. We use LF-MoM 0.4

0.8

and meta-LR-MoM to refer to the same quantities save with 0.2

0.6

0.4

the feature estimate learned via the method-of-moments 0.0

0 1000 2000 3000 4000 5000 6000 0 1000 2000 3000 4000 5000 6000

estimator. We also use LR to refer to the baseline linear Number of Tasks Number of Tasks
regression estimator on a new task which only uses data
generated from that task. Figure 2. Left: LF-FO vs. LF-MoM estimator with error measured
in the subspace angle distance sin θ(B̂, B). Right: meta-LR-FO
We begin by considering a challenging setting for feature and meta-LR-MoM vs. LR on new task with error measured on
learning where d = 100, r = 5, but nt = 5 for varying new task parameter. Here d = 100, r = 5, nt = 25 while
numbers of tasks t. As Fig. 1 demonstrates, the method-of- n2 = 25 while the number of tasks is varied.

1.00
1.4 this regime both the method-of-moments estimator and the
LR loss-based approach can learn a non-trivial estimate of the
`2 Parameter Error

1.2
0.95
meta-LR-MoM
0.90 1.0
meta-LR-FO feature representation. The benefits of transferring this rep-
sin θ

0.85
0.8
0.80
0.6
resentation are also evident in the improved generalization
0.75

0.70
LF-MoM 0.4
performance seen by the meta-regression procedures on the
LF-FO
0.65 0.2 new task. Interestingly, the loss-based approach learns an
0 1000 2000 3000 4000 5000 6000 0 1000 2000 3000 4000 5000 6000

Number of Tasks Number of Tasks accurate feature representation B̂ with significantly fewer
samples then the method-of-moments estimator, in contrast
Figure 1. Left: LF-FO vs. LF-MoM estimator with error measured to the previous experiment. Finally, we consider an instance
in the subspace angle distance sin θ(B̂, B). Right: meta-LR-FO where d = 100, r = 5, t = 20, and n2 = 50 with varying
and meta-LR-MoM vs. LR on new task with error measured on numbers of training points nt per task. We see in Fig. 3
new task parameter. Here d = 100, r = 5, and nt = 5 while
that meta-learning of representations provides significant
n2 = 2500 as the number of tasks is varied.
value in a new task. Note that these numerical experiments
show that as the number of tasks is fixed, but nt increases,
moments estimator is able to aggregate information across the generalization ability of the meta-learned regressions
the tasks as t increases to slowly improve its feature estimate, significantly improves as reflected in the bound (2).
even though nt d. The loss-based approach struggles to
improve its estimate of the feature matrix B in this regime.
This accords with the extra t dependence in Theorem 2 rel- 7. Conclusions
ative to Theorem 3. In this setting, we also generated a In this paper we show how a shared linear representation
(t + 1)st test task with d n2 = 2500, to test the effect of may be efficiently learned and transferred between mul-
meta-learning the linear representation on generalization in tiple linear regression tasks. We provide both upper and
a new, unseen task against a baseline which simply performs lower bounds on the sample complexity of learning this
a regression on this new task in isolation. Fig. 1 also shows representation and for the problem of learning-to-learn. We
8
An open-source Python implementation to reproduce believe our bounds capture important qualitative phenom-
our experiments can be found at https://github.com/ ena observed in real meta-learning applications absent from
nileshtrip/MTL. previous theoretical treatments.
Provable Meta-Learning of Linear Representations

1.0 1.4
LR Edelman, A., Arias, T. A., and Smith, S. T. The geometry

`2 Parameter Error
1.2
0.8

1.0
meta-LR-MoM of algorithms with orthogonality constraints. SIAM jour-
meta-LR-FO
LF-MoM nal on Matrix Analysis and Applications, 20(2):303–353,
sin θ

0.6
0.8
LF-FO
0.4 0.6 1998.
0.4
0.2

0 1000 2000 3000 4000 5000 6000

0.2
0 1000 2000 3000 4000 5000 6000
Finn, C., Abbeel, P., and Levine, S. Model-agnostic meta-
Number of Training Points (per Task) Number of Training Points (per Task) learning for fast adaptation of deep networks. In Proceed-
ings of the 34th International Conference on Machine
Figure 3. Left: LF-FO vs. LF-MoM estimator with error measured
Learning-Volume 70, pp. 1126–1135. JMLR. org, 2017.
in the subspace angle distance sin θ(B̂, B). Right: meta-LR-FO
and meta-LR-MoM vs. LR on new task with error measured on Finn, C., Rajeswaran, A., Kakade, S., and Levine, S. Online
new task parameter. Here d = 100, r = 5, t = 20, and n2 = 50 meta-learning. arXiv preprint arXiv:1902.08438, 2019.
while the number of training points per task (nt ) is varied.
Ge, R., Jin, C., and Zheng, Y. No spurious local minima
in nonconvex low rank problems: A unified geometric
References analysis. arXiv preprint arXiv:1704.00708, 2017.
Anandkumar, A., Hsu, D., and Kakade, S. M. A method of Guntuboyina, A. Lower bounds for the minimax risk using
moments for mixture models and hidden Markov models. f -divergences, and applications. IEEE Transactions on
In Conference on Learning Theory, pp. 33–1, 2012. Information Theory, 57(4):2386–2399, 2011.
Ando, R. K. and Zhang, T. A framework for learning pre- Hsu, D., Kakade, S. M., and Zhang, T. Random design
dictive structures from multiple tasks and unlabeled data. analysis of ridge regression. In Conference on learning
Journal of Machine Learning Research, 6(Nov):1817– theory, pp. 9–1, 2012.
1853, 2005.
Jin, C., Ge, R., Netrapalli, P., Kakade, S. M., and Jordan,
Baevski, A., Edunov, S., Liu, Y., Zettlemoyer, L., and Auli, M. I. How to escape saddle points efficiently. In Proceed-
M. Cloze-driven pretraining of self-attention networks. ings of the 34th International Conference on Machine
arXiv preprint arXiv:1903.07785, 2019. Learning, pp. 1724–1732. JMLR. org, 2017.

Baxter, J. A model of inductive bias learning. Journal of Khodak, M., Balcan, M.-F., and Talwalkar, A. Provable guar-
artificial intelligence research, 12:149–198, 2000. antees for gradient-based meta-learning. arXiv preprint
arXiv:1902.10644, 2019a.
Bhatia, R. Matrix Analysis, volume 169. Springer Science Khodak, M., Balcan, M.-F. F., and Talwalkar, A. S. Adaptive
& Business Media, 2013. gradient-based meta-learning methods. In Advances in
Neural Information Processing Systems, pp. 5915–5926,
Candes, E. and Plan, Y. Tight oracle bounds for low-rank
2019b.
matrix recovery from a minimal number of noisy random
measurements. arXiv preprint arXiv:1001.0339, 2010. Liu, D. C. and Nocedal, J. On the limited memory BFGS
method for large scale optimization. Mathematical pro-
Chen, X., Guntuboyina, A., and Zhang, Y. On bayes risk gramming, 45(1-3):503–528, 1989.
lower bounds. The Journal of Machine Learning Re-
search, 17(1):7687–7744, 2016. Liu, X., He, P., Chen, W., and Gao, J. Multi-task deep neu-
ral networks for natural language understanding. arXiv
Denevi, G., Ciliberto, C., Grazzi, R., and Pontil, M. preprint arXiv:1901.11504, 2019.
Learning-to-learn stochastic gradient descent with biased
Maclaurin, D., Duvenaud, D., and Adams, R. P. Autograd:
regularization. arXiv preprint arXiv:1903.10399, 2019.
Effortless gradients in numpy. In ICML 2015 AutoML
Donahue, J., Jia, Y., Vinyals, O., Hoffman, J., Zhang, N., Workshop, volume 238, 2015.
Tzeng, E., and Darrell, T. Decaf: A deep convolutional Maurer, A., Pontil, M., and Romera-Paredes, B. The ben-
activation feature for generic visual recognition. In Inter- efit of multitask representation learning. The Journal of
national conference on machine learning, pp. 647–655, Machine Learning Research, 17(1):2853–2884, 2016.
2014.
Moritz, P., Nishihara, R., Wang, S., Tumanov, A., Liaw,
Du, S. S., Hu, W., Kakade, S. M., Lee, J. D., and Lei, R., Liang, E., Elibol, M., Yang, Z., Paul, W., Jordan,
Q. Few-shot learning via learning the representation, M. I., et al. Ray: A distributed framework for emerging
provably. arXiv preprint arXiv:2002.09434, 2020. {AI} applications. In 13th {USENIX} Symposium on
Provable Meta-Learning of Linear Representations

Operating Systems Design and Implementation ({OSDI}

18), pp. 561–577, 2018.
Pajor, A. Metric entropy of the Grassmann manifold. Con-
vex Geometric Analysis, 34:181–188, 1998.
Pearson, K. Contributions to the mathematical theory of evo-
lution. Philosophical Transactions of the Royal Society
of London. A, 185:71–110, 1894.
Pontil, M. and Maurer, A. Excess risk bounds for multitask
learning with trace norm regularization. In Conference
on Learning Theory, pp. 55–76, 2013.
Raskutti, G., Wainwright, M. J., and Yu, B. Minimax rates
of estimation for high-dimensional linear regression over
`q -balls. IEEE transactions on information theory, 57
(10):6976–6994, 2011.
Recht, B. A simpler approach to matrix completion. Jour-
nal of Machine Learning Research, 12(Dec):3413–3430,
2011.
Rohde, A., Tsybakov, A. B., et al. Estimation of high-
dimensional low-rank matrices. The Annals of Statistics,
39(2):887–930, 2011.

Tropp, J. A. User-friendly tail bounds for sums of random

matrices. Foundations of computational mathematics, 12
(4):389–434, 2012.
Tsybakov, A. B. Introduction to Nonparametric Estimation.
Springer Science & Business Media, 2008.

Vershynin, R. High-Dimensional Probability: An Intro-

duction with Applications in Data Science, volume 47.
Cambridge University Press, 2018.
Vinyals, O., Blundell, C., Lillicrap, T., Wierstra, D., et al.
Matching networks for one shot learning. In Advances in
neural information processing systems, pp. 3630–3638,
2016.
Wainwright, M. J. High-Dimensional Statistics: A Non-
Asymptotic Viewpoint, volume 48. Cambridge University
Press, 2019.

Wang, Z., Dai, Z., Póczos, B., and Carbonell, J. Charac-

terizing and avoiding negative transfer. In Proceedings
of the IEEE Conference on Computer Vision and Pattern
Recognition, pp. 11293–11302, 2019.
Zhang, Y., Wainwright, M. J., and Jordan, M. I. Lower
bounds on the performance of polynomial-time algo-
rithms for sparse linear regression. In Conference on
Learning Theory, pp. 921–948, 2014.

Signal Using Reinforcement
No ratings yet
Signal Using Reinforcement
28 pages
Introduction To Multitasking Notes Unit-5
No ratings yet
Introduction To Multitasking Notes Unit-5
23 pages
Transfer Learning Through Embedding Spaces (Z-Lib - Io)
No ratings yet
Transfer Learning Through Embedding Spaces (Z-Lib - Io)
223 pages
Model-Free Representation Learning and Exploration in Low-Rank Mdps
No ratings yet
Model-Free Representation Learning and Exploration in Low-Rank Mdps
76 pages
Automated Relational Meta-Learning
No ratings yet
Automated Relational Meta-Learning
19 pages
Few-Shot Learning Via Learning The Representation, Provably
No ratings yet
Few-Shot Learning Via Learning The Representation, Provably
41 pages
Meta-Learning With Temporal Convolutions
No ratings yet
Meta-Learning With Temporal Convolutions
14 pages
Meta-Learning & Transfer Learning
No ratings yet
Meta-Learning & Transfer Learning
56 pages
Transformers As Decision Makers Provable In-Context Reinforcement Learning Via Supervised Pretraining
No ratings yet
Transformers As Decision Makers Provable In-Context Reinforcement Learning Via Supervised Pretraining
61 pages
Unified Transfer Learning Models in High-Dimensional Linear Regression
No ratings yet
Unified Transfer Learning Models in High-Dimensional Linear Regression
28 pages
Multitask Learning
No ratings yet
Multitask Learning
35 pages
Data Augmentation For Meta-Learning
No ratings yet
Data Augmentation For Meta-Learning
10 pages
Thijs Van Der Laan s3986721 Bachelors Thesis
No ratings yet
Thijs Van Der Laan s3986721 Bachelors Thesis
42 pages
Meta Learning With
No ratings yet
Meta Learning With
15 pages
Multimodality in Meta-Learning - A Comprehensive Survey
No ratings yet
Multimodality in Meta-Learning - A Comprehensive Survey
21 pages
NeurIPS 2023 Context Shift Reduction For Offline Meta Reinforcement Learning Paper Conference
No ratings yet
NeurIPS 2023 Context Shift Reduction For Offline Meta Reinforcement Learning Paper Conference
20 pages
2018 - A Brief Review On Multi-Task Learning - Thung - Wee - Multimedia Tools and Applications
No ratings yet
2018 - A Brief Review On Multi-Task Learning - Thung - Wee - Multimedia Tools and Applications
21 pages
Supervised Pretraining Can Learn In-Context Reinforcement Learning
No ratings yet
Supervised Pretraining Can Learn In-Context Reinforcement Learning
27 pages
Sensors 23 00583 v2
No ratings yet
Sensors 23 00583 v2
17 pages
M L S H: ETA Earning Hared Ierarchies
No ratings yet
M L S H: ETA Earning Hared Ierarchies
11 pages
Information Dropout Learning Optimal Representations Through Noisy Computation
No ratings yet
Information Dropout Learning Optimal Representations Through Noisy Computation
9 pages
2018 - Learning To Multitask - Zhang Et Al - Curran Associates, Inc.
No ratings yet
2018 - Learning To Multitask - Zhang Et Al - Curran Associates, Inc.
12 pages
2018 - Multi-Task Learning As Multi-Objective Optimization - Sener - Koltun - Advances in Neural Information Processing Systems
No ratings yet
2018 - Multi-Task Learning As Multi-Objective Optimization - Sener - Koltun - Advances in Neural Information Processing Systems
12 pages
Optimistic Linear Support and Successor Features As A Basis For Optimal Policy Transfer (Alegre, 2022)
No ratings yet
Optimistic Linear Support and Successor Features As A Basis For Optimal Policy Transfer (Alegre, 2022)
20 pages
2024 - A Survey On Kernel-Based Multi-Task Learning - Neurocomputing
No ratings yet
2024 - A Survey On Kernel-Based Multi-Task Learning - Neurocomputing
12 pages
Robust Task Representations For Offline Meta-Reinforcement Learning Via Contrastive Learning
No ratings yet
Robust Task Representations For Offline Meta-Reinforcement Learning Via Contrastive Learning
13 pages
Alexander G 2019
No ratings yet
Alexander G 2019
22 pages
22 Self Supervised Representation
No ratings yet
22 Self Supervised Representation
15 pages
GNN Foundations Frontiers and Applications Chapter1
No ratings yet
GNN Foundations Frontiers and Applications Chapter1
13 pages
1 s2.0 S0893608021003919 Main
No ratings yet
1 s2.0 S0893608021003919 Main
11 pages
A Survey On Multi-Task Learning
No ratings yet
A Survey On Multi-Task Learning
24 pages
Representation Meta Learning
No ratings yet
Representation Meta Learning
9 pages
Meta-Learning With Versatile Loss Geometries - For Fast Adaptation Using Mirror Descent
No ratings yet
Meta-Learning With Versatile Loss Geometries - For Fast Adaptation Using Mirror Descent
7 pages
Multi Task Learning (MTL)
No ratings yet
Multi Task Learning (MTL)
15 pages
Learning Functional Transduction: Mathieu Chalvidal, Thomas Serre, Rufin Vanrullen
No ratings yet
Learning Functional Transduction: Mathieu Chalvidal, Thomas Serre, Rufin Vanrullen
25 pages
Recurrent Hypernetworks Are Surprisingly Strong in Meta-RL
No ratings yet
Recurrent Hypernetworks Are Surprisingly Strong in Meta-RL
18 pages
I - E - E R - L: N Context Xploration Xploitation For E Inforcement Earning
No ratings yet
I - E - E R - L: N Context Xploration Xploitation For E Inforcement Earning
16 pages
Unsupervised Meta Learning
No ratings yet
Unsupervised Meta Learning
12 pages
MRCL
No ratings yet
MRCL
15 pages
Linear Programming Word Problems Formulation Using
No ratings yet
Linear Programming Word Problems Formulation Using
10 pages
03-NDL-Midterm Scheme of Evaluation
No ratings yet
03-NDL-Midterm Scheme of Evaluation
7 pages
Representation Learning
No ratings yet
Representation Learning
6 pages
Meta-Learning in Neural Networks A Survey
No ratings yet
Meta-Learning in Neural Networks A Survey
21 pages
2010 07140 PDF
No ratings yet
2010 07140 PDF
34 pages
ML - Meta-Pretraining Then Meta-Learning For Few-Shot Text Classification
No ratings yet
ML - Meta-Pretraining Then Meta-Learning For Few-Shot Text Classification
3 pages
Convex MTFL
No ratings yet
Convex MTFL
39 pages
Meta-Learning in Neural Networks A Survey
No ratings yet
Meta-Learning in Neural Networks A Survey
20 pages
Online Meta-Learning: y 0. An Algorithm That Understands The Underlying Struc
No ratings yet
Online Meta-Learning: y 0. An Algorithm That Understands The Underlying Struc
19 pages
A Survey On Multi-Task Learning: Yu Zhang and Qiang Yang
No ratings yet
A Survey On Multi-Task Learning: Yu Zhang and Qiang Yang
20 pages
Learning Potential Functions and Their R PDF
No ratings yet
Learning Potential Functions and Their R PDF
45 pages
SSRN Id4355794
No ratings yet
SSRN Id4355794
11 pages
Da FX 2020 Proceedings
No ratings yet
Da FX 2020 Proceedings
345 pages
6709 One Shot Imitation Learning
No ratings yet
6709 One Shot Imitation Learning
12 pages
Entropy 12 01975
No ratings yet
Entropy 12 01975
70 pages
Meta Learning Via Learned Loss: Sarah Bechtle Artem Molchanov Yevgen Chebotar
No ratings yet
Meta Learning Via Learned Loss: Sarah Bechtle Artem Molchanov Yevgen Chebotar
9 pages
Turing Machine
No ratings yet
Turing Machine
99 pages
Pentina Curriculum Learning of 2015 CVPR Paper PDF
No ratings yet
Pentina Curriculum Learning of 2015 CVPR Paper PDF
9 pages
Physics
No ratings yet
Physics
42 pages
Gate-Controlled Quantum Dots Based On Two-Dimensional Materials
No ratings yet
Gate-Controlled Quantum Dots Based On Two-Dimensional Materials
37 pages
Paper Repro-Deep Metalearning Using "MAML" and "Reptile" - by Adrien Lucas Ecoffet - Towards Data Science
No ratings yet
Paper Repro-Deep Metalearning Using "MAML" and "Reptile" - by Adrien Lucas Ecoffet - Towards Data Science
11 pages
Top-Down Fabrication of Bulk-Insulating Topological Insulator Nanowires For Quantum Devices
No ratings yet
Top-Down Fabrication of Bulk-Insulating Topological Insulator Nanowires For Quantum Devices
32 pages
Meta-Learning in Distributed Data Mining Systems
No ratings yet
Meta-Learning in Distributed Data Mining Systems
38 pages
Quantum Nanophotonics in Two-Dimensional Materials
No ratings yet
Quantum Nanophotonics in Two-Dimensional Materials
27 pages
Simulating 2D Lattice Gauge Theories On A Qudit Quantum Computer
No ratings yet
Simulating 2D Lattice Gauge Theories On A Qudit Quantum Computer
23 pages
Superconducting Phase Qubits
No ratings yet
Superconducting Phase Qubits
23 pages
Making Scalable Meta Learning Practical
No ratings yet
Making Scalable Meta Learning Practical
20 pages
Meta Learning For Semi Supervised Few Shot
No ratings yet
Meta Learning For Semi Supervised Few Shot
15 pages
RSC Advances
No ratings yet
RSC Advances
14 pages
Impurity in A Bose-Einstein Condensate - Study of The Attractive
No ratings yet
Impurity in A Bose-Einstein Condensate - Study of The Attractive
13 pages
One-Shot Learning With Memory-Augmented Neural Networks
No ratings yet
One-Shot Learning With Memory-Augmented Neural Networks
13 pages
Clock-Work Trade-Off Relation For Coherence in Quantum Thermodynamics
No ratings yet
Clock-Work Trade-Off Relation For Coherence in Quantum Thermodynamics
16 pages
High-Mobility Free-Standing InSb Nanoflags Grown On InP Nanowire Stems For Quantum Devices
No ratings yet
High-Mobility Free-Standing InSb Nanoflags Grown On InP Nanowire Stems For Quantum Devices
9 pages
Meta-Data - Characterization of Input Features For Meta-Learning
No ratings yet
Meta-Data - Characterization of Input Features For Meta-Learning
12 pages
HM Removal by Iron NP
No ratings yet
HM Removal by Iron NP
11 pages
Federated Meta-Learning With Fast Convergence and
No ratings yet
Federated Meta-Learning With Fast Convergence and
14 pages
Topological Quantum Devices - A Review
No ratings yet
Topological Quantum Devices - A Review
56 pages
10.1351 Pac200476040801
No ratings yet
10.1351 Pac200476040801
13 pages
Meta-Learning Assisted Robust Control of Universal Quantum Gates With Uncertainties
No ratings yet
Meta-Learning Assisted Robust Control of Universal Quantum Gates With Uncertainties
10 pages
Quantum Information Processing For A Coherent Superposition State Via A Mixed Entangled Coherent Channel
No ratings yet
Quantum Information Processing For A Coherent Superposition State Via A Mixed Entangled Coherent Channel
7 pages
Vessel Design Calculation
No ratings yet
Vessel Design Calculation
22 pages
Operational Interpretation of Quantum Fisher Information in Quantum Thermodynamics
No ratings yet
Operational Interpretation of Quantum Fisher Information in Quantum Thermodynamics
7 pages
Study of Residential Land Use Transport Interaction For Madurai Lpa
No ratings yet
Study of Residential Land Use Transport Interaction For Madurai Lpa
78 pages
9 First-Order Circuits Noted
No ratings yet
9 First-Order Circuits Noted
67 pages
Parallel Computing and Monte Carlo Algorithms
No ratings yet
Parallel Computing and Monte Carlo Algorithms
27 pages
Quantum Coherent Tunable Coupling of Superconducting Qubits: Reports
No ratings yet
Quantum Coherent Tunable Coupling of Superconducting Qubits: Reports
5 pages
Quantum Sensing With Erasure Qubits
No ratings yet
Quantum Sensing With Erasure Qubits
6 pages
Non-Hermitian Quantum Thermodynamics
No ratings yet
Non-Hermitian Quantum Thermodynamics
8 pages
Nano-Alumina Modified With 2,4-Dinitrophenylhydrazine
No ratings yet
Nano-Alumina Modified With 2,4-Dinitrophenylhydrazine
9 pages
Markov Chain Monte Carlo Without Likelihoods
No ratings yet
Markov Chain Monte Carlo Without Likelihoods
5 pages
OULD - bammOUNE Preparation Tp1
No ratings yet
OULD - bammOUNE Preparation Tp1
13 pages
Frame Design Using Web-Tapered Members: Problem
No ratings yet
Frame Design Using Web-Tapered Members: Problem
27 pages
Part 1 Functions Equations and Their Graphs
No ratings yet
Part 1 Functions Equations and Their Graphs
30 pages
Quarter 2 Module 5
50% (2)
Quarter 2 Module 5
4 pages
Grade 4 Mathematics Term 4 Mock Exam: Place Value
0% (1)
Grade 4 Mathematics Term 4 Mock Exam: Place Value
4 pages
EViews Guide
100% (1)
EViews Guide
14 pages
Osborne (2008) CH 22 Testing The Assumptions of Analysis of Variance
No ratings yet
Osborne (2008) CH 22 Testing The Assumptions of Analysis of Variance
29 pages
Presentation Regression
No ratings yet
Presentation Regression
12 pages
Submitted in Partial Fulfilment For The Award of Degree of
No ratings yet
Submitted in Partial Fulfilment For The Award of Degree of
13 pages
Raw Cashew Moisture Tester: Operating Manual
No ratings yet
Raw Cashew Moisture Tester: Operating Manual
24 pages
Customer Churn Analysis - Jupyter Notebook
No ratings yet
Customer Churn Analysis - Jupyter Notebook
10 pages
E2001 Circuit Analysis: Academic Year 2020-2021
No ratings yet
E2001 Circuit Analysis: Academic Year 2020-2021
15 pages
DPP-1 2D Projectile Motion Op
No ratings yet
DPP-1 2D Projectile Motion Op
2 pages
NCERT Grade 09 Mathematics Introduction-To-Euclids-Geometry
No ratings yet
NCERT Grade 09 Mathematics Introduction-To-Euclids-Geometry
8 pages
Lec10 3
No ratings yet
Lec10 3
3 pages
GOVT 702: Advanced Political Analysis Georgetown University
No ratings yet
GOVT 702: Advanced Political Analysis Georgetown University
5 pages
ST 16 2-5 (-4)
No ratings yet
ST 16 2-5 (-4)
9 pages
Mathematics in The Modern World Answer Key
No ratings yet
Mathematics in The Modern World Answer Key
2 pages
Matrix Multiplication1
No ratings yet
Matrix Multiplication1
10 pages
Test of Homogeneity Based On Geometric Mean of Variances
No ratings yet
Test of Homogeneity Based On Geometric Mean of Variances
11 pages
Social Capital and Fear of Crime: A Test of Organizational Participation Effect in Nigeria
No ratings yet
Social Capital and Fear of Crime: A Test of Organizational Participation Effect in Nigeria
11 pages
Bài Tập Về Nhà Buổi 1: YÊU CẦU: Viết mô hình hồi quy mẫu và tính R, RSS, Fqs của các bài sau
No ratings yet
Bài Tập Về Nhà Buổi 1: YÊU CẦU: Viết mô hình hồi quy mẫu và tính R, RSS, Fqs của các bài sau
2 pages
Antminer S19 Pro 2
No ratings yet
Antminer S19 Pro 2
4 pages
Badmephisto's Speedcubing Guide First 2 Layers: Arranged by Andy Klise
No ratings yet
Badmephisto's Speedcubing Guide First 2 Layers: Arranged by Andy Klise
3 pages
Assume That One Third of All Used Cars Are Lemons If
No ratings yet
Assume That One Third of All Used Cars Are Lemons If
2 pages
Optimization Theory with Applications
From Everand
Optimization Theory with Applications
Donald A. Pierre
4/5 (4)
Random Optimization: Fundamentals and Applications
From Everand
Random Optimization: Fundamentals and Applications
Fouad Sabry
No ratings yet
Mathematical Optimization: Fundamentals and Applications
From Everand
Mathematical Optimization: Fundamentals and Applications
Fouad Sabry
No ratings yet
K Nearest Neighbor Algorithm: Fundamentals and Applications
From Everand
K Nearest Neighbor Algorithm: Fundamentals and Applications
Fouad Sabry
No ratings yet

Provable Meta-Learning of Linear Representati

Uploaded by

Provable Meta-Learning of Linear Representati

Uploaded by

Provable Meta-Learning of Linear Representations

Nilesh Tripuraneni 1 Chi Jin 2 Michael I. Jordan 1

Abstract ranging from deep reinforcement learning (Baevski et al.,

where there are t (unobserved) underlying task parameters

can serve as a direction of local improvement pointing to-

(see Lemma 1), we can extract the features B by directly

with probability at least 1 − O(n−100

parameter on a new task by meta-LR-FO. We use LF-MoM 0.4

and meta-LR-MoM to refer to the same quantities save with 0.2

the feature estimate learned via the method-of-moments 0.0

0 1000 2000 3000 4000 5000 6000

Operating Systems Design and Implementation ({OSDI}

Tropp, J. A. User-friendly tail bounds for sums of random

Vershynin, R. High-Dimensional Probability: An Intro-

Wang, Z., Dai, Z., Póczos, B., and Carbonell, J. Charac-

You might also like