A Notion of Task Relatedness Yielding Provable Multiple Task Learning Guarantees

Shai Ben-David and Reba Schuller Borbely
We provide a formal framework for a notion of task relatedness that captures a sub-domain of the wide scope of issues in which one may apply a multiple task learning approach. Our notion of task similarity is relevant to a variety of real-life multitask learning scenarios and allows the formal derivation of generalization bounds that are strictly stronger than the previously known bounds for both the learning-to-learn and the multitask learning scenarios. We give precise conditions under which our bounds guarantee generalization on the basis of smaller sample sizes than the standard single-task approach.^3
1.1 Introduction
Most of the work in machine learning focuses on learning tasks that are encountered separately, one task at a time. While great success has been achieved in this type of framework, it is clear that it neglects certain fundamental aspects of human and animal learning. Human beings face each new learning task equipped with knowledge gained from previous similar learning tasks.
^3 A preliminary version of this paper appears in the proceedings of COLT'03 [BDS03].
The only theoretical analysis of multitask learning that we are aware of is the work of Baxter [Bax00] and the recent work of Ben-David et al. [BDGS02]. The main question we are interested in is when multitask learning provides an advantage over the single-task approach. To address this question, we introduce a concrete notion of what it means for tasks to be "related," and evaluate multi- versus single-task learning for tasks related in this manner. Our notion of relatedness between tasks is inspired by [BDGS02], which deals with the problem of integrating disparate databases. We extend the main generalization error result from [BDGS02] to the multitask learning setting, strengthen it, and analyze the extent to which it offers an improvement over single-task learning.
The main technical tool that we use is the generalized VC-dimension of
Baxter [Bax00]. Baxter applies his version of VC-dimension to bound the
average error of a set of predictors over a class of tasks in terms of the average
empirical error of these predictors. In contrast with Baxter’s analysis, we view
multitask learning as having one ’focus of interest’ task that one wishes to
learn and view the extra related tasks as just an aid towards learning the
main task. In this context, bounds on the average error over all tasks are not
good enough. We show that when one is dealing with tasks that are related in
the sense that we define, the Baxter generalization bound can be strengthened
to hold for the error of each single task.
We should point out the distinction between the problem considered herein and the co-training approach of [BM98]. Co-training makes use of extra “tasks” to compensate for having only a small amount of labeled data. However, in co-training, the extra tasks are assumed to be different “views” of the same sample, whereas our extra tasks are independent samples from different distributions. Thus, despite its relevance to multitask learning, previous work on co-training cannot be directly applied to the problem at hand.
In our learning scenario, we assume the data (both the training samples and the test examples) are generated by some probability distributions, {P_i : i ≤ n}, that are (pairwise) F-related. We assume that the learner knows the set of indices of the distributions, {1, . . . , n}, and the family of functions F, but knows neither the data-generating distributions nor which specific function f relates any given pair of distributions. As input, the learner gets samples, {S_i : i ≤ n}, each S_i drawn i.i.d. from P_i. Consequently, the advantage that a learner can derive for a specific task from access to a sample drawn from some other F-related task depends on the richness of the family of transformations F: the larger this set gets, the looser the notion of F-relatedness is.
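To make this setup concrete, the following is a minimal sketch (ours, not code from the paper) of how pairwise F-related data might be generated when F is taken to be a hypothetical family of translations of the plane; all function names and parameter values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical target concept for task 1: label 1 iff the point falls inside a
# fixed axis-aligned rectangle. The rectangle's *shape* is what all tasks share.
RECT_LOW = np.array([0.0, 0.0])
RECT_HIGH = np.array([2.0, 1.0])

def label(points, low, high):
    """Label 1 iff a point lies inside the axis-aligned rectangle [low, high]."""
    return np.all((points >= low) & (points <= high), axis=1).astype(int)

def sample_task(shift, m):
    """Draw an i.i.d. sample S_i of size m from P_i = f_i[P_1], where f_i is a
    translation by `shift`; the learner sees (x, y) but neither P_i nor f_i."""
    x = rng.uniform(-1.0, 3.0, size=(m, 2)) + shift
    y = label(x, RECT_LOW + shift, RECT_HIGH + shift)
    return x, y

# n = 4 pairwise F-related tasks; task 1 (shift 0) is the one we care about.
shifts = np.vstack([[0.0, 0.0], rng.uniform(-5.0, 5.0, size=(3, 2))])
samples = [sample_task(s, m=50) for s in shifts]
```

Task 1's shift is fixed to zero here only for readability; the learner is never told any of the shifts.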
In this section, we analyze multiple task learning for F-related tasks. Our main idea is to separate the information contained in the training data into information that is invariant under transformations from F and data that is F-sensitive. We utilize the training samples from the extra tasks to learn the F-invariant aspects of the predictor. For example, if our domain is the two-dimensional Euclidean space, and F is a family of isometries of the plane, then geometric shapes are F-invariant and can be deduced from translated images of the original data distributions, while their location in the plane is F-sensitive. We formalize this basic idea in a way that allows precise quantification of the potential benefits to be drawn from such multi-task training data. We derive sample complexity bounds that demonstrate the merits of this algorithmic approach.
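As a toy illustration of this split (again ours, under the translation family and the `samples` format assumed in the previous sketch), the rectangle's side lengths are F-invariant and can be estimated from every task's sample, while its position is F-sensitive and must be read off the target task's own data:

```python
import numpy as np

def side_length_estimate(sample):
    """F-invariant part: the coordinate-wise spread of the positively labeled
    points lower-bounds the rectangle's side lengths, wherever the rectangle sits.
    (Assumes the sample contains at least one positive point.)"""
    x, y = sample
    pos = x[y == 1]
    return pos.max(axis=0) - pos.min(axis=0)

def locate_target_rectangle(samples):
    """Pool the invariant estimate across tasks, then place it using only the
    target task's (task 1's) sample, which alone carries the F-sensitive location."""
    sides = np.max([side_length_estimate(s) for s in samples], axis=0)
    x1, y1 = samples[0]
    low = x1[y1 == 1].min(axis=0)   # location: target task only
    return low, low + sides         # estimated rectangle [low, low + sides]
```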
We formalize our notion of F-relevant information through an appropriate partitioning of the learner's set of potential label predictors. Given a hypothesis space, H, we create a family, H, of hypothesis spaces consisting of sets of hypotheses in H which are equivalent up to transformations in F. We assume that F forms a group under function composition and that H is closed under the action of F (that is, for every h ∈ H and f ∈ F, the composition h ◦ f is also in H).
It suffices to show that for every h ∈ H there exist h', h'' ∈ H such that $Er^{P_2}(h') \le Er^{P_1}(h)$ and $Er^{P_1}(h'') \le Er^{P_2}(h)$.
Since P_1 and P_2 are F-related, and F is a group (so each f ∈ F has its inverse there), there exist f, f' ∈ F such that P_1 = f[P_2] and P_2 = f'[P_1]. Since H is closed under the action of F, both h' = h ◦ f and h'' = h ◦ f' are members of H. Applying Lemma 1, we get $Er^{P_2}(h') = Er^{P_1}(h)$ and $Er^{P_1}(h'') = Er^{P_2}(h)$, so we are done. ⊓⊔
Note that this theorem only bounds the average generalization error over
the different tasks.
We are now ready to state and prove one of our main results, which gives
an upper bound on the sample complexity of finding a ∼F -equivalence class
which is near-optimal for each of the tasks. This is significant, since the goal
of multitask learning is to use extra tasks to improve performance on one
particular task.
Proof. Observe that Lemma 2 implies that for any h ∈ H and any 1 ≤ j ≤ n,
$$ Er^{P_j}([h]_{\sim_F}) \;=\; \inf_{h_1,\dots,h_n \in [h]_{\sim_F}} \frac{1}{n}\sum_{i=1}^{n} Er^{P_i}(h_i). $$
Proof. Let $h^\sharp$ be the best P_1 label predictor in H; that is, $h^\sharp = \arg\min_{h \in H} Er^{P_1}(h)$. Let $[h^*]$ be the equivalence class picked in the first stage of the MT-ERM paradigm, i.e., $[h^*]$ is a minimizer of $\inf_{h_1,\dots,h_n \in [h]} \sum_{i=1}^{n} \hat{Er}^{S_i}(h_i)$ over all $[h] \in H/\!\sim_F$.
By the choice of $h^*$,
$$ \inf_{h_1,\dots,h_n \in [h^*]} \sum_{i=1}^{n} \hat{Er}^{S_i}(h_i) \;\le\; \inf_{h_1,\dots,h_n \in [h^\sharp]} \sum_{i=1}^{n} \hat{Er}^{S_i}(h_i), $$
and also,
$$ Er^{P_1}([h^*]) \;\le\; \inf_{h_1,\dots,h_n \in [h^*]} \frac{1}{n}\sum_{i=1}^{n} \hat{Er}^{S_i}(h_i) + \epsilon_1. $$
Combining these three inequalities, we get that with probability greater than (1 − δ/2),
$$ Er^{P_1}([h^*]) \;\le\; Er^{P_1}([h^\sharp]) + 2\epsilon_1. $$
Finally, the second stage of the MT-ERM algorithm is just a standard ERM algorithm, yielding, with probability greater than (1 − δ/2), a hypothesis $h^\diamond \in [h^*]$ whose P_1 error is within $2\epsilon_2$ of the best hypothesis there, namely of $Er^{P_1}([h^*])$.
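The two-stage procedure used in this proof can be summarized by the following minimal sketch (ours, not pseudocode from the paper), under the simplifying assumption that H/∼F is given as a finite list of finite equivalence classes of hypotheses; all names are illustrative.

```python
def emp_error(h, sample):
    """Empirical error Er_hat^S(h) of hypothesis h on a labeled sample S = (xs, ys)."""
    xs, ys = sample
    return sum(int(h(x) != y) for x, y in zip(xs, ys)) / len(xs)

def mt_erm(classes, samples):
    """Two-stage MT-ERM.

    classes -- finite list of candidate equivalence classes (stand-in for H/~F),
               each a finite list of hypotheses (callables x -> {0, 1}).
    samples -- labeled samples S_1, ..., S_n, one per task; S_1 is the target task.
    """
    # Stage 1: choose [h*] minimizing inf_{h_1,...,h_n in [h]} sum_i Er_hat^{S_i}(h_i).
    # Since each h_i can be chosen separately, the inner infimum splits per task.
    def joint_empirical_error(cls):
        return sum(min(emp_error(h, S) for h in cls) for S in samples)

    best_class = min(classes, key=joint_empirical_error)

    # Stage 2: ordinary ERM inside [h*], using only the target task's sample S_1.
    return min(best_class, key=lambda h: emp_error(h, samples[0]))
```

Stage 1 is where the extra tasks pay off: they are used only to identify the ∼F-equivalence class, while Stage 2 is plain single-task ERM over the (typically much smaller) class that was selected.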
Recall that the common (single-task) ERM paradigm requires a sample of size
$$ |S_1| \;\ge\; \frac{64}{\epsilon^{2}}\Big[\mathrm{VC\text{-}dim}(H)\,\log(12/\epsilon) + \log(4/\delta)\Big] \qquad (1.3) $$
to find a hypothesis $h^\circ$ that with probability greater than (1 − δ) has
$$ Er^{P_1}(h^\circ) \;\le\; \inf_{h \in H} Er^{P_1}(h) + 2\epsilon. $$
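For concreteness, here is a small helper (ours) that simply evaluates the right-hand side of bound (1.3), assuming natural logarithms; it is only a plug-in calculator for the displayed formula.

```python
import math

def single_task_sample_bound(vc_dim, epsilon, delta):
    """Sample size sufficient for standard single-task ERM according to (1.3):
    |S_1| >= (64 / eps^2) * (VC-dim(H) * log(12 / eps) + log(4 / delta))."""
    return math.ceil((64.0 / epsilon ** 2)
                     * (vc_dim * math.log(12.0 / epsilon) + math.log(4.0 / delta)))

# Example: axis-aligned rectangles in R^5 have VC-dim(H) = 2 * 5 = 10.
print(single_task_sample_bound(vc_dim=10, epsilon=0.1, delta=0.05))
```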
It follows that the extent to which the samples from the extra P_i's help depends on the gap between the parameters d_max and d_H on the one hand and VC-dim(H) on the other. Next we examine the values of these parameters for the specific case of learning axis-aligned rectangles in ℝ^d. Finally, in section 1.6 we analyze these parameters for general classes H and F.
Proof. Suppose [h] shatters a set U. (I.e., for any V ⊆ U, there exists h' ∈ [h] such that for all x ∈ U, h'(x) = 1 ⟺ x ∈ V. We say that such an h' obtains the subset V of U.)
Then, in order to obtain the complements of each of the singleton subsets of U, each point x ∈ U must have some coordinate k_x in which its value is either the greatest or the least among the k_x-th coordinates of all points in U. For a given point p ∈ ℝ^d, let p(k) denote its k-th coordinate.
Assume |U| > d + d/2. Then there must exist at least d + 1 points p ∈ U for which k_p is unique, i.e., for every other coordinate k, there exist points y, z ∈ U such that y(k) > p(k) and z(k) < p(k). And since we are in ℝ^d, there exist two such points, p and q, such that k_p = k_q and both k_p and k_q are unique. Call this coordinate k.
What we have now is two points p, q ∈ U such that, say, p(k) > x(k) > q(k) for all x ∈ U − {p, q}, and for every k' ≠ k there exist points y, z ∈ U such that y(k') > p(k'), q(k') and z(k') < p(k'), q(k').
We proceed to show that no h' ∈ [h] obtains the subset U − {p, q}.
Since [h] must obtain the subset consisting of U itself, the length of the side in coordinate j of any h' ∈ [h] must be at least max_{x,y∈U} |x(j) − y(j)|. Without loss of generality, let us say that h itself obtains U. Then any subset of U obtained by any h' ∈ [h] consists of those points in U that remain after removing axis-parallel slices of h on up to d of its faces, with no two opposing faces sliced.
However, the only slices that can remove p and q without removing any other points of U are the two opposing slices in coordinate k, so, indeed, the subset U − {p, q} cannot be obtained by any h' ∈ [h]. ⊓⊔
Claim. For the purpose of learning the rectangle side lengths, VC-dimension considerations provide better accuracy guarantees for n shifted samples, each of size m, than for a single sample of size n((8/11)m − c), where c is a constant depending on the desired accuracy and the Euclidean dimension. Furthermore, c is small enough so that each sample may be smaller than that needed to obtain the same guarantees for a single data set of size m.
Previously, [BDGS02] considered the PAC setting, that is, the setting in
which the learner is guaranteed that there exists a hypothesis h ∈ H that
achieves zero error under the data generating distribution (in our case, P1 ).
For that setting, they showed that for the H and F of the example above,
n shifted samples each of size m provide better accuracy guarantees than a
Notation:
Let |h| denote the cardinality of the support of h, i.e., |h| denotes |{x ∈ X : h(x) = 1}|. Also, for a function h and a vector x = (x_1, . . . , x_n), let h(x) = (h(x_1), . . . , h(x_n)).
Theorem 4. If there exists M such that |h| ≤ M for all h ∈ H, then there exists n_0 such that for all n ≥ n_0,
$$ d_H(n) = \max_{h \in H} \mathrm{VC\text{-}dim}([h]_{\sim_F}). $$
Note that for each i, there exists S_i ⊆ h_0 such that x_i is some permutation of {f_i^{-1}(z) : z ∈ S_i}.
Say |h_0| = K. Then if $n > \binom{K}{m}(2^m - 1)$, there exist S ⊆ h_0 and indices i_1, . . . , i_{2^m} such that S_{i_j} = S for 1 ≤ j ≤ 2^m. Let σ_1, . . . , σ_{2^m} be the corresponding permutations.
Finally, letting v_1, . . . , v_{2^m} be an enumeration of all vectors of length m over {0, 1}, letting N be any n × m matrix over {0, 1} whose i_j-th row is σ_j(v_j), and letting h^* and f_1', . . . , f_n' be such that
$$ \begin{pmatrix} h^* \circ f_1'(x_1) \\ \vdots \\ h^* \circ f_n'(x_n) \end{pmatrix} = N, $$
So we see that for any class of hypotheses bounded in size (i.e., in the size of their support), for sufficiently large n, d_H(n) attains its minimum possible value of d_max. However, many natural hypothesis spaces consist of hypotheses that are not only of unbounded, but of infinite, size. In the following theorem, we show that this conjecture also holds for a natural hypothesis space consisting of infinite hypotheses.
Theorem 5. Let X , H, and F be the rectangles with shifts as in section 1.5,
and let H denote H/ ∼F as usual. For n > d,
Ben-David et al. [BDGS02] provide the following further results on d_H(n).
Theorem 6. If F is finite and $n/\log(n) \ge \mathrm{VC\text{-}dim}(H)$, then
$$ d_H(n) \le 2\log(|F|). $$
Note that this result leads us to scenarios under which d_H(n) is arbitrarily smaller than VC-dim(H). Indeed, as long as F is finite, no matter how complex H is, d_H(n) remains bounded by 2 log(|F|). Furthermore, in practice, the requirement that F be finite is not an unreasonable one, since real-world problems come with bounded domains and real-world computations have limited numerical accuracy.
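As a rough numeric illustration (ours, assuming natural logarithms and an invented example of such a finite F), suppose F consists of all pixel shifts of a 1000 × 1000 grid; then, whenever the condition of Theorem 6 holds, d_H(n) stays below roughly 28 no matter how large VC-dim(H) is:

```python
import math

# Hypothetical finite transformation family: all pixel shifts of a 1000 x 1000 grid.
F_size = 1000 * 1000
theorem6_bound = 2 * math.log(F_size)   # d_H(n) <= 2 log|F| (natural log assumed)
print(round(theorem6_bound, 1))         # ~27.6, independent of VC-dim(H)
```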
Furthermore, [BDGS02] provides the following generalization of this result.
Theorem 7. If ∼F is of finite index^6, k, and $n \ge \frac{\log k}{4b \log b}$, then
$$ d_H(n) \;\le\; \frac{\log k}{n} + 4b\log b, $$
where
$$ b = \max\Big( \max_{H' \in H/\sim_F} \mathrm{VC\text{-}dim}(H'),\; 3 \Big). $$
This shows that even if F is infinite, dH (n) cannot grow arbitrarily with
increasing complexity of H.
^6 The index of an equivalence relation is the number of equivalence classes into which it partitions its domain.