
A notion of task relatedness yielding provable multiple-task learning guarantees

Article in Machine Learning · December 2008
DOI: 10.1007/s10994-007-5043-5 · Source: DBLP

Authors: Shai Ben-David (University of Waterloo) and R. Schuller Borbely (CyberPoint International)


1
A Notion of Task Relatedness Yielding
Provable Multiple-Task Learning Guarantees

Shai Ben-David¹ and Reba Schuller Borbely²

¹ David R. Cheriton School of Computer Science,
University of Waterloo,
Waterloo, ON, N2L 1G3, Canada
[email protected]
² reba [email protected]

Summary. The approach of learning multiple “related” tasks simultaneously has proven quite successful in practice; however, theoretical justification for this success has remained elusive. The starting point for previous work on multiple task learning has been that the tasks to be learned jointly are somehow “algorithmically related”, in the sense that the results of applying a specific learning algorithm to these tasks are assumed to be similar. We offer an alternative approach, defining relatedness of tasks on the basis of similarity between the example generating distributions that underlie these tasks.

We provide a formal framework for this notion of task relatedness, which captures a sub-domain of the wide scope of issues in which one may apply a multiple task learning approach. Our notion of task similarity is relevant to a variety of real life multitask learning scenarios and allows the formal derivation of generalization bounds that are strictly stronger than the previously known bounds for both the learning-to-learn and the multitask learning scenarios. We give precise conditions under which our bounds guarantee generalization on the basis of smaller sample sizes than the standard single-task approach.³

1.1 Introduction

Most of the work in machine learning focuses on learning tasks that are encountered separately, one task at a time. While great success has been achieved in this type of framework, it is clear that it neglects certain fundamental aspects of human and animal learning. Human beings face each new learning task equipped with knowledge gained from previous similar learning tasks.

³ A preliminary version of this paper appears in the proceedings of COLT’03 [BDS03].

Furthermore, human learning frequently involves approaching several learning tasks simultaneously; in particular, humans take advantage of the opportunity to compare and contrast similar categories in learning to classify entities into those categories.
It is natural to attempt to apply these observations to machine learning: what kind of advantage is there in setting a learner to work on several tasks sequentially or simultaneously? Intuitively, there should certainly be some advantage, especially if the tasks are closely related in some way. And, indeed, much experimental work [Bax95, IE96, Thr96, Hes98, Car97] has validated this intuition. However, thus far, there has been relatively little progress on theoretical justification for these results.
Relatedness of tasks is key to the multitask learning (MTL) approach.
Obviously, one cannot expect that information gathered through the learning
of a set of tasks will be relevant to the learning of another task that has
nothing in common with the already learned tasks.
Previous work on MTL, or Learning to Learn, treated the notion of relatedness using a ‘functional’ approach. For example, consider one of the more systematic theoretical analyses of a simultaneous learning model to date, Baxter’s Learning to Learn work, e.g., [Bax00]. In Baxter’s work the similarity between jointly learned tasks is manifested solely through a model selection criterion. Namely, the advantage of learning tasks together relies on the assumption that the tasks share a common optimal inductive bias, reflected by a common optimal (or near-optimal) hypothesis class.
We try to determine under what circumstances one can expect different tasks to be related in a ‘learning useful’ way. We focus on the sample generating distributions underlying the learning tasks, and define task relatedness as an explicit relationship between these distributions. Our notion seems to capture a sub-domain of the realm of applications to which multi-task learning may be relevant.
Not surprisingly, by limiting the discussion to problems that can be modeled by our data generating mechanism we leave many potential MTL scenarios outside the scope of our discussion. However, there are several interesting problems that can be treated within our framework. For these problems we can reap the benefits of having a mathematical notion of relatedness and prove sample size upper bounds for MTL learning that are better than any previously proven bounds.
The rest of the paper is organized as follows: Section 1.2 formally introduces multiple task learning and describes our notion of task relatedness. Section 1.3 presents our learning paradigm for F-related tasks, and we state and prove our generalization error bounds in Section 1.4. In Section 1.6, we analyze the generalized VC-dimension parameter on which these bounds depend, and we compare the bounds for multiple task learning to the analogous bounds for the single task approach. That is, we examine when the error bounds for learning a given task can be improved by allowing the learner to access samples generated by different but related tasks.

1.1.1 Previous Work

The only theoretical analyses of multitask learning that we are aware of are the work of Baxter [Bax00] and the recent work of Ben-David et al. [BDGS02].
The main question that we are interested in is when multitask learning provides an advantage over the single task approach. To address this question, we introduce a concrete notion of what it means for tasks to be “related,” and evaluate multi- versus single-task learning for tasks related in this manner.
Our notion of relatedness between tasks is inspired by [BDGS02] which deals
with the problem of integrating disparate databases. We extend the main
generalization error result from [BDGS02] to the multitask learning setting,
strengthen it, and analyze the extent to which it offers an improvement over
single task learning.
The main technical tool that we use is the generalized VC-dimension of
Baxter [Bax00]. Baxter applies his version of VC-dimension to bound the
average error of a set of predictors over a class of tasks in terms of the average
empirical error of these predictors. In contrast with Baxter’s analysis, we view
multitask learning as having one ’focus of interest’ task that one wishes to
learn and view the extra related tasks as just an aid towards learning the
main task. In this context, bounds on the average error over all tasks are not
good enough. We show that when one is dealing with tasks that are related in
the sense that we define, the Baxter generalization bound can be strengthened
to hold for the error of each single task.
We should point out the distinction between the problem considered herein
and the co-training approach of [BM98]. Co-training makes use of extra
“tasks” to compensate for having only a small amount of labeled data. However, in co-training, the extra tasks are assumed to be different “views” of the same sample, whereas our extra tasks are independent samples from different
distributions. Thus, despite its relevance to multitask learning, previous work
on co-training cannot be directly applied to the problem at hand.

1.2 A Data Generation Model for Related Tasks

Formally, the typical (single-task) classification learning problem is modeled as follows: Given a domain X and a random sample S drawn from some unknown distribution P on X × {0, 1}, find a hypothesis h : X → {0, 1} such that for randomly drawn (x, b), with high probability h(x) = b. This problem is sometimes referred to as “statistical regression”.
The multiple task learning problem is the analogous problem for multiple distributions. However, the focus is on the potential advantage to each learning task from the data available for the other tasks. Given a domain X and unknown distributions P0, . . . , Pn over X × {0, 1}, a learner is presented with a sequence of random samples S0, . . . , Sn drawn from these Pi’s respectively, and has to come up with a hypothesis h : X → {0, 1} such that, for (x, b) drawn randomly

from P0, h(x) = b with high probability. What we focus on is the extent to which the samples Si, for i ≠ 0, can be utilized to help find a good hypothesis for predicting the labels of P0.
As we have mentioned previously, it is intuitive that the benefit of having access to samples from multiple tasks depends on the “relatedness” between the different tasks. While there has been empirical success with sets of tasks related in various ways, thus far, no formal definition of “relatedness” has yielded any theoretical results to that effect.

1.2.1 Our Notion of Relatedness Between Learning Tasks

We define a data generation mechanism which serves to determine our notion of related tasks.
The basic ingredient in our definition is a set F of transformations f : X → X. We say that tasks are F-related if, for some fixed probability distribution over X × {0, 1}, the data in each of these tasks is generated by applying some f ∈ F to that distribution. The next definition formalizes this notion.

Definition 1. For a measure space (X, A), where X denotes a domain set and A is a σ-algebra of its subsets, we discuss probability distributions, P, over X × {0, 1} for which the P-measurable sets are the σ-algebra generated by the sets of the form A × B, for A ∈ A and B ⊆ {0, 1}.
• For a function, f : X → X, let f[A] be {A ⊆ X : f⁻¹(A) ∈ A}, and let f[P] be the probability distribution over X × {0, 1} defined by having f[P] assign to a set T ⊆ X × {0, 1} the probability f[P](T) = P({(f(x), b) | (x, b) ∈ T}).
Let F be a set of transformations f : X → X, and let P1, P2 be probability distributions over X × {0, 1}.
• We say that P1, P2 are F-related if there exists some f ∈ F such that P1 = f[P2] or P2 = f[P1].
• We say that two samples are F-related if they are samples from F-related distributions.

In our learning scenario, we assume the data (both the training samples and the test examples) are generated by some probability distributions, {Pi : i ≤ n}, that are (pairwise) F-related. We assume that the learner knows the set of indices of the distributions, {1, . . . , n}, and the family of functions F, but knows neither the data-generating distributions nor which specific function f relates any given pair of distributions. As input, the learner gets samples, {Si : i ≤ n}, each Si drawn i.i.d. from Pi. Consequently, the advantage that a learner can derive for a specific task from access to a sample drawn from some other F-related task depends on the richness of the family of transformations F. The larger this set gets, the looser the notion of F-relatedness becomes.
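To make this data generation model concrete, here is a small, hypothetical simulation: a single base distribution P over R × {0, 1} and a family F of translations, with each task drawing its sample from a shifted copy of P. The labeling interval, the shift values, and the sample sizes are all invented for illustration; since F here is a group of bijections, the choice of shift convention does not affect F-relatedness.

```python
import random

# A toy instance of the model in Definition 1 (illustrative choices only):
# X = R, F = the group of translations f_v(x) = x + v, and a base
# distribution P over X x {0,1} that labels x by membership in [0, 2].

def sample_base(rng):
    """Draw one example (x, b) from the base distribution P:
    x uniform on [-1, 3], labeled 1 iff x lies in [0, 2]."""
    x = rng.uniform(-1.0, 3.0)
    return x, (1 if 0.0 <= x <= 2.0 else 0)

def sample_f_related_tasks(shifts, m, seed=0):
    """One i.i.d. sample of size m per task; task i's distribution is a
    shifted copy of P, so all the tasks are pairwise F-related."""
    rng = random.Random(seed)
    tasks = []
    for v in shifts:
        # Shifting each drawn example by v realizes the shifted distribution:
        # positions move, labels travel with them.
        tasks.append([(x + v, b) for x, b in (sample_base(rng) for _ in range(m))])
    return tasks

tasks = sample_f_related_tasks(shifts=[0.0, 5.0, -2.5], m=1000)
```

Each task's positive region is a translated copy of [0, 2]; this is exactly the kind of invariant (interval length) versus sensitive (interval position) split that the learning paradigm of the next section exploits.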

Clearly there are many examples of potential applications of simultaneous learning that do not fit into this model of relatedness. However, there are various interesting examples where this notion seems to provide a satisfactory mathematical model of the similarity between the tasks in a set of related learning problems.
Our framework applies to scenarios in which the learner’s prior knowledge includes knowledge of some family F of transformations, such that all the tasks to which this MTL learning approach will be applied are mutually F-related. One domain in which such F-relatedness prior knowledge may be applied is in situations where many different sensors collect data for the same classification problem. For example, consider a set of cameras located in the lobby of some high security building. Assume that they are all used to automatically detect unauthorized visitors, based on the images they record. Clearly, each of these cameras has its own bias, due to a different height, lighting conditions, angle, etc. While it may be difficult to determine the exact bias of each camera, it may be feasible to define mathematically a set of image transformations F such that the data distributions of images collected by all of these cameras are F-related. Another area in which such a notion of similarity is applicable is that of database integration. Suppose there are several databases available, each of which obtains its information from the same data pool, yet represents its information with a different database schema. For the purpose of classification prediction, our results in the next section eliminate the need for the difficult undertaking of database integration, treating each database as one task in a multiple task learning problem.

1.3 Learning F-Related Tasks

In this section, we analyze multiple task learning for F-related tasks. Our main idea is to separate the information contained in the training data into information that is invariant under transformations from F and information that is F-sensitive. We utilize the training samples from the extra tasks to learn the F-invariant aspects of the predictor. For example, if our domain is the two dimensional Euclidean space, and F is a family of isometries of the plane, then geometric shapes are F-invariant and can be deduced from translated images of the original data distributions, while their location in the plane is F-sensitive. We formalize this basic idea in a way that allows precise quantification of the potential benefits to be drawn from such multi-task training data. We derive sample complexity bounds that demonstrate the merits of this algorithmic approach.
We formalize our notion of F-relevant information through an appropriate partitioning of the learner’s set of potential label predictors. Given a hypothesis space, H, we create a family, H, of hypothesis spaces consisting of sets of hypotheses in H which are equivalent up to transformations in F. We assume that F forms a group under function composition and that H is closed under

the action of F. As is standard, we will write [h]∼F, or simply [h], to denote the equivalence class of h under ∼F.

Definition 2. Let F be a set of transformations over a domain set X, and let H be a hypothesis space over that domain.
• We say that F acts as a group over H if
  1. H is closed under transformations from F. Namely, for every f ∈ F and every h ∈ H, h ◦ f ∈ H, and
  2. F forms a group under function composition. Namely, F is closed under transformation composition and inverses (for every f, g ∈ F, the inverse transformation, f⁻¹, and the composition, f ◦ g, are also members of F).
• When F acts as a group over H, we define the equivalence relation ∼F on H by:
  h1 ∼F h2 iff there exists f ∈ F such that h2 = h1 ◦ f.

We shall consider the family of hypothesis spaces H = {[h] : h ∈ H}, the family of all equivalence classes of H under ∼F (equivalently, H = H/∼F).
Our learning paradigm consists of two stages. In the first stage, the learner considers all of the sample sets and uses them to learn the aspects of the task that are invariant under F. In our setting, this means finding a ∼F equivalence class, [h], that is best suited for our prediction. In the second stage, the learner considers only the training sample that comes from the distribution of the target task (say, P1) to figure out which specific predictor h′ ∈ [h] to choose as its final hypothesis. The benefit from the extra tasks’ examples is therefore realized through the reduction of the hypothesis search space, from the original H to the subset [h]. The smaller F is, the smaller each [h] will be (since [h] = {h ◦ f : f ∈ F}), and so the larger is the benefit of multitasking.
To make this outline more concrete, let us consider again the two dimensional Euclidean space as our domain, X, and let F be a family of isometries of the plane. Let H be the class of all axis-aligned rectangles. In this case, by viewing examples generated by distributions Pi that are F-related to some target P1, one can learn about the length and width of the best rectangle predictor, but not about its location in the plane. More formally, in this case, for every rectangle, h, its ∼F-equivalence class is the set of all rectangles that are isomorphic to h. It is not hard to see that the VC-dimension of such a class is 3, which is lower than the VC-dimension of the class of all axis-aligned rectangles in the plane (which is 4). In Section 1.5 we analyze the VC-dimension of such classes in arbitrary Euclidean dimensions and observe a similar reduction (from 2d to ⌊3d/2⌋). This reduction in the complexity of the hypothesis class is where we gain from having samples from extra tasks (i.e., extra F-related distributions).

1.4 Generalization Error Bounds

Following standard notation, we denote the true error and empirical error, respectively, of a hypothesis as follows. For a distribution P,

Er^P(h) = P({(x, b) ∈ X × {0, 1} : h(x) ≠ b}).

And for a sample, S, of points in X × {0, 1},

Êr_S(h) = |{(x, b) ∈ S : h(x) ≠ b}| / |S|.
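Both error measures translate directly into code. The following minimal sketch uses a finite-support representation of P (a dict of atoms to masses), which is my own choice for illustration:

```python
from fractions import Fraction

def true_error(P, h):
    """Er^P(h) for a distribution P with finite support, given as a
    dict mapping atoms (x, b) to their probability mass."""
    return sum(p for (x, b), p in P.items() if h(x) != b)

def empirical_error(S, h):
    """Empirical error of h on a labeled sample S = [(x, b), ...]."""
    return sum(1 for x, b in S if h(x) != b) / len(S)

h = lambda x: 1 if x >= 0 else 0        # a threshold predictor
S = [(-1, 0), (2, 1), (3, 0), (-2, 1)]  # h errs on (3, 0) and (-2, 1)
print(empirical_error(S, h))            # -> 0.5
```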
Lemma 1. For any hypothesis, h : X → {0, 1}, any probability distribution, P, over X × {0, 1}, and any f : X → X,

Er^{f[P]}(h ◦ f) = Er^P(h). (1.1)

This is an immediate consequence of the definition of the image, f[P], of a distribution P and of the error, Er^P(h).
Using this fact, we can deduce that the equivalence classes of H perform equally well on the different tasks, in the following sense.

Definition 3. For any hypothesis space, H, define

Er^P(H) = inf_{h∈H} Er^P(h).

Thus, we judge the performance of a hypothesis space on a given task by the performance of the best hypothesis in the space on that task.

Lemma 2. Let P1, P2 be F-related distributions, where F is a group under function composition. If H is closed under the action of F then Er^{P1}(H) = Er^{P2}(H).

Proof. We need to show that

inf_{h∈H} Er^{P1}(h) = inf_{h∈H} Er^{P2}(h).

It suffices to show that for every h ∈ H there exist h′, h″ ∈ H such that Er^{P2}(h′) ≤ Er^{P1}(h) and Er^{P1}(h″) ≤ Er^{P2}(h).
Since P1, P2 are F-related, and F is a group (so each f ∈ F has its inverse there), there exist f, f′ ∈ F such that P1 = f[P2] and P2 = f′[P1]. Since H is closed under the action of F, both h′ = h ◦ f′ and h″ = h ◦ f are members of H. Applying Lemma 1, we get Er^{P2}(h′) = Er^{P1}(h) and Er^{P1}(h″) = Er^{P2}(h), so we are done. □
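Lemmas 1 and 2 can be sanity-checked numerically on a finite domain. The sketch below (all concrete choices are mine) takes X = Z₅ and F = the group of cyclic shifts, and implements f[P] literally as in Definition 1: for atomic P and bijective f, f[P] puts at the atom (y, b) the mass that P puts at (f(y), b).

```python
from fractions import Fraction

N = 5  # X = {0, ..., 4}, with cyclic shifts as the transformation group F

def shift(v):
    return lambda x: (x + v) % N

def image_dist(f, P):
    """f[P] as in Definition 1 (for atomic P and bijective f):
    f[P] assigns to the atom (y, b) the mass P((f(y), b))."""
    return {(y, b): P.get((f(y), b), Fraction(0))
            for y in range(N) for b in (0, 1)}

def err(P, h):
    """Er^P(h) for a finite-support distribution P."""
    return sum(p for (x, b), p in P.items() if h(x) != b)

# A base distribution: uniform atoms, positives at {0, 1, 4}.
P = {(0, 1): Fraction(1, 5), (1, 1): Fraction(1, 5), (2, 0): Fraction(1, 5),
     (3, 0): Fraction(1, 5), (4, 1): Fraction(1, 5)}

h = lambda x: 1 if x in (0, 1, 2) else 0
f = shift(2)

# Lemma 1: Er^{f[P]}(h o f) = Er^P(h).
assert err(image_dist(f, P), lambda x: h(f(x))) == err(P, h)

# Lemma 2: the class [h] = {h o shift(v) : v} achieves the same optimal
# error under P and under the F-related distribution f[P].
best = lambda Q: min(err(Q, lambda x, v=v: h(shift(v)(x))) for v in range(N))
assert best(P) == best(image_dist(f, P))
```

Here h itself has error 2/5 under P, but a shifted member of [h] attains error 0 under both P and f[P], matching Lemma 2.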

Before we continue, we require some background.



1.4.1 Background from Baxter [Bax00]

Baxter [Bax00] discusses the following problem. Given a set of tasks and a set of hypothesis spaces, choose the hypothesis space which performs best on the set of tasks. He provides a bound on the generalization error for this problem in terms of a generalized VC-dimension parameter. In particular, he bounds the rate of convergence of the average (over all the tasks) of the empirical errors to the true error.
Baxter’s generalization error bound depends on the following notion of generalized VC-dimension for families of hypothesis spaces.
Notation: For a function g : Y → Z and y = (y1, . . . , yn) ∈ Y^n, g(y) will denote (g(y1), . . . , g(yn)) ∈ Z^n.
Definition 4. 1. Given an n × m matrix of domain points, for every hypothesis class H ∈ H we consider the collection of all the {0, 1} matrices that can be generated by applying n hypotheses from H to the n rows of the matrix (respectively). Formally, denoting for each i ≤ n, xi = (x_{i,1}, x_{i,2}, . . . , x_{i,m}),

H_{n,m}(x1, . . . , xn) = { M : M is the n × m {0, 1} matrix whose i-th row is (h_i(x_{i,1}), h_i(x_{i,2}), . . . , h_i(x_{i,m})), for some h1, . . . , hn ∈ H }.

2. For a family H of hypothesis spaces, we take the union, over all classes H ∈ H, of these sets of {0, 1} matrices, and count how many matrices are in that union. Finally, we take the maximum of that number over all possible choices of the underlying n × m matrix of domain points. Namely,

Π_H(n, m) = max_{x1,...,xn ∈ X^m} | ∪_{H∈H} H_{n,m}(x1, . . . , xn) |.

Definition 5. dH(n) = max{m : Π_H(n, m) = 2^{nm}}.

The following statements follow directly from the above definitions:

Proposition 1. For every family of classes, H, and for every n,
1. sup{VC-dim(H) : H ∈ H} ≤ dH(n) ≤ VC-dim(∪{H : H ∈ H}).
2. In particular, if H consists of just one class, H = {H}, then, for every n, dH(n) = VC-dim(H).
3. dH(n + 1) ≤ dH(n).
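For intuition, dH(n) can be computed by brute force on a tiny example; every concrete choice below (the domain Z₄, the cyclic-shift group, and a hypothesis space consisting of singleton indicators and adjacent-pair indicators, which splits into exactly two F-equivalence classes) is mine. By hand one can check that here dmax = VC-dim(H) = 2, so the sandwich in Proposition 1 forces dH(n) = 2 for every n, which the exhaustive computation confirms for n = 1, 2.

```python
from itertools import product

N = 4  # domain X = {0, 1, 2, 3}; F = cyclic shifts, a group acting on X

def indicator(support):
    s = frozenset(support)
    return lambda x: 1 if x in s else 0

# H = indicators of singletons and of adjacent pairs (mod 4); F-equivalence
# splits H into exactly two classes, each an orbit under cyclic shifts.
singletons = [indicator({a}) for a in range(N)]
pairs = [indicator({a, (a + 1) % N}) for a in range(N)]
family = [singletons, pairs]  # the family H = {[h] : h in H}

def Pi(family, n, m):
    """Baxter's growth function Pi_H(n, m): maximize, over n x m matrices of
    domain points, the number of realizable {0,1} pattern matrices."""
    best = 0
    for rows in product(product(range(N), repeat=m), repeat=n):
        patterns = set()
        for H in family:  # patterns realizable within a single class
            for hs in product(H, repeat=n):
                patterns.add(tuple(tuple(h(x) for x in row)
                                   for h, row in zip(hs, rows)))
        best = max(best, len(patterns))
    return best

def d_H(family, n, m_max=3):
    """d_H(n) = max{m : Pi_H(n, m) = 2^{nm}} (searched up to m_max)."""
    return max((m for m in range(1, m_max + 1)
                if Pi(family, n, m) == 2 ** (n * m)), default=0)

print(d_H(family, 1), d_H(family, 2))  # -> 2 2
```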
We can now state the relevant result from [Bax00] on multitask learning, which appears as Corollary 13 in [Bax00].⁴

⁴ Note that although [Bax00] only states that (1/n) Σ_{i=1}^n Er^{Pi}(h_i) ≤ (1/n) Σ_{i=1}^n Êr_{Si}(h_i) + ε, it is clear from the proofs in [Bax00] that this stronger form holds.

Theorem 1. Let H be any permissible boolean hypothesis space family⁵, and let S1, . . . , Sn be a sequence of random samples from distributions P1, . . . , Pn (respectively) on X × {0, 1}. If the number of examples in each sample Si satisfies

|Si| ≥ (88/ε²) [ 2 dH(n) log(22/ε) + (1/n) log(4/δ) ],

then with probability at least 1 − δ (over the choice of S1, . . . , Sn), for any H ∈ H and h1, . . . , hn ∈ H,

| (1/n) Σ_{i=1}^n Er^{Pi}(h_i) − (1/n) Σ_{i=1}^n Êr_{Si}(h_i) | ≤ ε.

Note that this theorem only bounds the average generalization error over
the different tasks.
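To get a feel for the bound, the required per-task sample size can be evaluated numerically. The sketch below is my own reading of Theorem 1 (natural logarithm assumed, since the base is not specified here); the d values 4 and 3 echo the plane-rectangles example of Section 1.3, where VC-dim(H) = 4 while each equivalence class has VC-dimension 3.

```python
from math import log, ceil

def theorem1_sample_size(d, n, eps, delta):
    """Per-task sample size from Theorem 1:
    (88/eps^2) * (2*d*log(22/eps) + (1/n)*log(4/delta)),
    with d standing for d_H(n). Natural log is an assumption."""
    return ceil((88 / eps ** 2) * (2 * d * log(22 / eps) + log(4 / delta) / n))

eps, delta = 0.1, 0.05
m_single = theorem1_sample_size(4, 1, eps, delta)   # one task, d_H(1) = 4
m_multi = theorem1_sample_size(3, 10, eps, delta)   # ten tasks, d_H(10) = 3
assert m_multi < m_single  # smaller d_H(n) and larger n shrink the bound
```

Both the leading dH(n) term and the (1/n) log(4/δ) term decrease as more related tasks are added, which is the quantitative content of the comparison carried out in Sections 1.5 and 1.6.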

1.4.2 Bounding the Generalization Error for Each Task

We are now ready to state and prove one of our main results, which gives
an upper bound on the sample complexity of finding a ∼F -equivalence class
which is near-optimal for each of the tasks. This is significant, since the goal
of multitask learning is to use extra tasks to improve performance on one
particular task.

Theorem 2. Let F be a family of domain transformations of some domain set X and let H be a family of binary valued functions on that domain such that F acts as a group over H. For any h ∈ H, let [h] denote the equivalence class of h under the relation ∼F (or, equivalently, the trajectory of h under the transformations of F) and let H = {[h] : h ∈ H}. Let P1, . . . , Pn be a set of F-related probability distributions over X × {0, 1} and let S1, . . . , Sn be a sequence of random samples, each generated i.i.d. from the corresponding distribution Pi.
Then, if the number of examples in each sample Si satisfies

|Si| ≥ (88/ε²) [ 2 dH(n) log(22/ε) + (1/n) log(4/δ) ], (1.2)

then with probability at least 1 − δ, for every h ∈ H,

| Er^{P1}([h]) − inf_{h1,...,hn ∈ [h]} (1/n) Σ_{i=1}^n Êr_{Si}(h_i) | ≤ ε.

⁵ Permissibility, introduced in [BEHW89], is a “weak measure-theoretic condition satisfied by almost all ‘real-world’ hypothesis space families” that is required for the VC type uniform convergence bounds to hold. Throughout this paper we shall assume that all our classes are permissible.

Proof. Observe that Lemma 2 implies that for any h ∈ H and any 1 ≤ j ≤ n,

Er^{Pj}([h]∼F) = inf_{h1,...,hn ∈ [h]∼F} (1/n) Σ_{i=1}^n Er^{Pi}(h_i).

The result now follows from Theorem 1. □

By combining the standard generalization error result for single task learning with Theorem 2, we now have a sample complexity bound for our full, two-stage, learning paradigm.
Definition 6 (MT-ERM Paradigm). Given classes F and H as above, and a sequence of labeled sample sets, S1, . . . , Sn, the MT-ERM (Multi-Task Empirical Risk Minimization) paradigm for F and H works in two steps as follows:
1. Pick h∗ ∈ H that minimizes inf_{h1,...,hn ∈ [h]} Σ_{i=1}^n Êr_{Si}(h_i) over all [h] ∈ H/∼F.
2. Pick h⋄ ∈ [h∗] that minimizes Êr_{S1}(h′) over all h′ ∈ [h∗], and output h⋄ as the learner’s hypothesis.
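As an illustration, here is a minimal, hypothetical instantiation of the MT-ERM paradigm in one dimension: H is the class of interval indicators [a, a+b] and F is the group of shifts, so an equivalence class [h] is determined by the F-invariant length b, and the offset a is the F-sensitive part fitted on S1 alone. The candidate grids and toy samples are invented for the sketch; a real implementation would minimize over the actual classes rather than a grid.

```python
# Stage-1 / stage-2 sketch of Definition 6 for intervals under shifts:
# an equivalence class [h] <-> a length b; a member of [h] <-> an offset a.

def interval_err(S, a, b):
    """Empirical error on S of the indicator of [a, a + b]."""
    return sum(1 for x, lbl in S if (a <= x <= a + b) != (lbl == 1)) / len(S)

def best_offset(S, b, offsets):
    return min(offsets, key=lambda a: interval_err(S, a, b))

def mt_erm(samples, lengths, offsets):
    # Step 1: pick the class [h*] (a length) minimizing the summed,
    # per-task-optimized empirical error over all the samples.
    b_star = min(lengths, key=lambda b: sum(
        interval_err(S, best_offset(S, b, offsets), b) for S in samples))
    # Step 2: within [h*], fit the F-sensitive offset using only S1.
    return best_offset(samples[0], b_star, offsets), b_star

# Two F-related toy tasks: positives in [0, 2] for task 1, in [5, 7] for task 2.
S1 = [(k / 10, 1 if 0 <= k / 10 <= 2 else 0) for k in range(-10, 31)]
S2 = [(k / 10 + 5, 1 if 0 <= k / 10 <= 2 else 0) for k in range(-10, 31)]
a, b = mt_erm([S1, S2], lengths=[1.0, 2.0, 3.0],
              offsets=[i / 2 for i in range(-4, 17)])
print(b, a)  # the shared length 2.0 is recovered; the offset 0.0 fits S1
```

Both samples vote on the invariant length in step 1, while only the target sample determines the location in step 2, mirroring the reduction of the search space from H to [h∗].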

Theorem 3. Let (P1, . . . , Pn), (S1, . . . , Sn), F and H be as in the previous theorem. Let dmax = max_{h∈H} VC-dim([h]∼F). Let h⋄ be the output of an (F, H)-MT-ERM algorithm. Then, for every ε1, ε2, δ > 0, if

|S1| ≥ (64/ε1²) [ 2 dmax log(12/ε1) + log(8/δ) ]

and, for all i > 1,

|Si| ≥ (88/ε2²) [ 2 dH(n) log(22/ε2) + (1/n) log(8/δ) ],

then, with probability greater than 1 − δ,

Er^{P1}(h⋄) ≤ inf_{h∈H} Er^{P1}(h) + 2(ε1 + ε2).

Proof. Let h♯ be the best P1 label predictor in H; that is, h♯ = argmin_{h∈H} Er^{P1}(h). Let [h∗] be the equivalence class picked in the first stage of the MT-ERM paradigm, i.e., [h∗] is a minimizer of inf_{h1,...,hn ∈ [h]} Σ_{i=1}^n Êr_{Si}(h_i) over all [h] ∈ H/∼F.
By the choice of h∗,

inf_{h1,...,hn ∈ [h∗]} (1/n) Σ_{i=1}^n Êr_{Si}(h_i) ≤ inf_{h1,...,hn ∈ [h♯]} (1/n) Σ_{i=1}^n Êr_{Si}(h_i).

By Theorem 2, with probability greater than 1 − δ/2,

inf_{h1,...,hn ∈ [h♯]} (1/n) Σ_{i=1}^n Êr_{Si}(h_i) ≤ Er^{P1}([h♯]) + ε2,

and also,

Er^{P1}([h∗]) ≤ inf_{h1,...,hn ∈ [h∗]} (1/n) Σ_{i=1}^n Êr_{Si}(h_i) + ε2.

Combining these three inequalities, we get that with probability greater than 1 − δ/2,

Er^{P1}([h∗]) ≤ Er^{P1}([h♯]) + 2ε2.

Finally, the second stage of the MT-ERM algorithm is just a standard ERM algorithm, yielding, with probability greater than 1 − δ/2, a hypothesis h⋄ ∈ [h∗] whose P1 error is within 2ε1 of the best hypothesis there, namely of Er^{P1}([h∗]). Since Er^{P1}([h♯]) = inf_{h∈H} Er^{P1}(h), combining the two stages gives Er^{P1}(h⋄) ≤ inf_{h∈H} Er^{P1}(h) + 2(ε1 + ε2). □
Recall that the common (single task) ERM paradigm requires a sample of size

|S1| ≥ (64/ε²) [ 2 VC-dim(H) log(12/ε) + log(4/δ) ] (1.3)

to find a hypothesis h° that with probability greater than 1 − δ has

Er^{P1}(h°) ≤ inf_{h∈H} Er^{P1}(h) + 2ε.

It follows that the extent to which the samples from the extra Pi’s help depends on the gaps between the parameters dmax, dH, and VC-dim(H). Next we examine the values of these parameters for the specific case of learning axis-aligned rectangles in R^d. Finally, in Section 1.6 we analyze these parameters for general classes H and F.

1.5 Analysis of Axis-Aligned Rectangles under Euclidean Shifts

Let X = R^d, and let H be the set of characteristic functions of axis-aligned rectangles, i.e., functions that map to 1 all points within some fixed rectangle [a1, a1 + b1] × . . . × [ad, ad + bd] and map to 0 all other points. Let F be the set of Euclidean shifts, i.e., functions of the form f(x1, . . . , xd) = (x1 + v1, . . . , xd + vd), where v1, . . . , vd ∈ R. As above, we let H denote H/∼F. Note that indeed in this case F acts as a group over H.
Claim. For d > 1 and n > d, dH(n) ≤ d + ⌊d/2⌋.

Proof. We will see in Theorem 5 below that for H as above and n > d, dH(n) = max_{[h]∈H} VC-dim([h]).
So, it suffices to show that for any [h] ∈ H, VC-dim([h]) ≤ d + ⌊d/2⌋. We prove this as the following lemma.

Lemma 3. Let r be an axis-aligned rectangle in R^d, and let F(r) be the class of all Euclidean shifts of r. Then VC-dim(F(r)) ≤ d + ⌊d/2⌋.

Proof. Suppose [h] shatters a set U. (I.e., for any V ⊆ U, there exists h′ ∈ [h] such that for all x ∈ U, h′(x) = 1 ⟺ x ∈ V. We say that such an h′ obtains subset V of U.)
Then, in order to obtain the complements of each of the singleton subsets of U, each point x ∈ U must have some coordinate kx in which its value is either the greatest or the least among the kx-th coordinates of all points in U.
For a given point, p ∈ R^d, let p(k) denote its k-th coordinate.
Assume |U| > d + ⌊d/2⌋. Then, there must exist at least d + 1 points p ∈ U for which kp is unique, i.e., for every other coordinate k ≠ kp, there exist points y, z ∈ U such that y(k) > p(k) and z(k) < p(k). And since we are in R^d, there exist two such points, p and q, with kp = kq. Call this coordinate k.
Now, what we have is points p, q ∈ U such that (say) p(k) > x(k) and q(k) < x(k) for all x ∈ U − {p, q}, and for every k′ ≠ k there exist points y, z ∈ U such that y(k′) > p(k′), q(k′) and z(k′) < p(k′), q(k′).
We proceed to show that no h′ ∈ [h] obtains the subset U − {p, q}.
Since [h] must obtain the subset consisting of U itself, the side length of h in each coordinate j must be at least max_{x,y∈U} |x(j) − y(j)|. Without loss of generality, let us say that h obtains U. Then, any subset of U obtained by any h′ ∈ [h] consists of those points in U that remain after removing axis-parallel slices of h on up to d of its faces, with no two opposing faces sliced.
However, the only slices that can remove p and q without removing any other points from U are the two opposing slices in coordinate k, so, indeed, the subset U − {p, q} cannot be obtained by any h′ ∈ [h]. □
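Lemma 3's bound (d + ⌊d/2⌋ = 3 for d = 2) can be illustrated by a brute-force search over a grid of shifts of a fixed unit square. This is a hedged demo rather than a proof, since only gridded shifts and hand-picked point sets are examined. The three-point set below is shattered, while for the four-point set the subset consisting of the two x-extreme points is unobtainable, exactly as in the proof: removing the two points that are extreme only in the y-coordinate would require slicing both opposing y-faces.

```python
def realized(points, a, side=1.0):
    """Pattern cut out of `points` by the square [a1, a1+side] x [a2, a2+side]."""
    return tuple(1 if (a[0] <= x <= a[0] + side and a[1] <= y <= a[1] + side)
                 else 0 for x, y in points)

def patterns(points, grid):
    """All label patterns realizable by grid-shifted copies of the unit square."""
    return {realized(points, (ax, ay)) for ax in grid for ay in grid}

grid = [i / 20 for i in range(-60, 41)]  # shifts -3.0 .. 2.0 in steps of 0.05

# A 3-point set: all 2^3 = 8 subsets are obtained, so VC-dim >= 3.
three = [(0.0, 0.0), (0.4, 0.2), (0.2, 0.4)]
assert len(patterns(three, grid)) == 8

# A 4-point set built as in the proof of Lemma 3: p = (0.4, 0.8) and
# q = (0.5, 0.0) are extreme only in the y-coordinate; the subset with just
# the two x-extreme points (0.0, 0.4) and (0.9, 0.5) cannot be cut out.
four = [(0.4, 0.8), (0.5, 0.0), (0.0, 0.4), (0.9, 0.5)]
assert (0, 0, 1, 1) not in patterns(four, grid)
```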

Note that the VC-dimension of the class of axis-aligned rectangles in R^d is 2d. Comparing equation 1.2 to the corresponding standard VC-dimension generalization error bound [VC71] (shown in equation 1.3), we have the following.

Claim. For the purpose of learning the rectangle side lengths, VC-dimension considerations provide better accuracy guarantees for n shifted samples, each of size m, than for a single sample of size n((8/11)m − c), where c is a constant depending on the desired accuracy and the Euclidean dimension. Furthermore, c is small enough so that each sample may be smaller than that needed to obtain the same guarantees for a single data set of size m.

Previously, [BDGS02] considered the PAC setting, that is, the setting in
which the learner is guaranteed that there exists a hypothesis h ∈ H that
achieves zero error under the data generating distribution (in our case, P1 ).
For that setting, they showed that for the H and F of the example above,
n shifted samples each of size m provide better accuracy guarantees than a

single sample of size n(m − c′), where c′ is a constant depending on the desired accuracy and the Euclidean dimension. Our analysis here provides nearly as strong a result for the more realistic ‘agnostic’ setting, where the assumption of the existence of a zero error h is waived.

1.6 Analysis of dH (n)


In this section we investigate the parameters dH that, along with dmax , de-
termines the sample complexity (or, equivalently, the generalization error
bounds) of multi-task learning in our setting of F- related learning tasks
derived in Theorem 3.
By Proposition 1, dH (n + 1) ≤ dH (n) for any n. Thus, we see from eq.
1.2 that once we have committed ourselves to the multitask approach, extra
tasks can only be beneficial.
Note that since our collection, H, is made up of the equivalence classes [h]∼F
(formed by the functions of F) over an initial hypothesis space, H, the union
of all these classes, ∪{H : H ∈ H}, equals H. Therefore, by Proposition 1,
dmax ≤ dH(n) ≤ VC-dim(H) (where, as above, dmax = max_{h∈H} VC-dim([h])).
Thus, the best we can hope for is dH(n) = dmax. We conjecture that for any
H of finite VC-dimension and any F, this lower bound is attained for all
sufficiently large n. The following two theorems support this conjecture.
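The gap between dmax and VC-dim(H) can be observed concretely by brute force. The following sketch is our own illustration, not from the paper: the toy instance (domain Z_8, H all subsets of size at most 4, F the 8 cyclic shifts, so [h]∼F is the shift-orbit of h) and all function names are ours.

```python
from itertools import combinations

def vc_dim(hyps, domain):
    """Largest d such that some d-point subset of `domain` is shattered
    by `hyps` (each hypothesis is a frozenset of its positive points)."""
    best = 0
    for r in range(1, len(domain) + 1):
        shattered = any(
            len({tuple(x in h for x in S) for h in hyps}) == 2 ** r
            for S in combinations(domain, r)
        )
        if not shattered:   # shattering is monotone in r, so we can stop
            break
        best = r
    return best

N = 8
domain = list(range(N))
# H: all subsets of Z_8 of size at most 4; F: the 8 cyclic shifts.
H = [frozenset(c) for r in range(5) for c in combinations(domain, r)]

def orbit(h):
    """The equivalence class [h]_F of h under the cyclic shifts."""
    return {frozenset((x + s) % N for x in h) for s in range(N)}

vc_H = vc_dim(H, domain)
d_max = max(vc_dim(list(orbit(h)), domain) for h in H)
print(vc_H, d_max)  # prints: 4 3
```

Every orbit here has at most 8 members, so no 4-point set (with its 16 labelings) can be shattered by a single class, forcing dmax ≤ 3 < 4 = VC-dim(H); the orbit of {0, 1, 2, 4} already shatters {0, 1, 2}, so the sandwich inequality is strict in this instance.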

Notation:

Let |h| denote the cardinality of the support of h, i.e., |h| denotes |{x ∈
X : h(x) = 1}|. Also, for a function h and a vector x̄ = (x1, . . . , xn), let
h(x̄) = (h(x1), . . . , h(xn)).

Theorem 4. If there exists M such that |h| ≤ M for all h ∈ H, then there
exists n0 such that for all n ≥ n0,

dH(n) = max_{h∈H} VC-dim([h]∼F).

(Recall that H denotes H/∼F.)


Proof. Assume dH(n) ≥ m, and let x̄1, . . . , x̄n be such that

|{ (h ◦ f1(x̄1), . . . , h ◦ fn(x̄n)) : f1, . . . , fn ∈ F, h ∈ H }| = 2^{nm}.

Consider h0 ∈ H and f1, . . . , fn such that

h0 ◦ fi(x̄i) = (1, . . . , 1) for every 1 ≤ i ≤ n.

Note that for each i, there exists Si ⊆ h0 such that x̄i is some permutation
of {fi^{-1}(z) : z ∈ Si}.

Say |h0| = K. Then if n > ((K choose m) − 1)·2^m, there exist S ⊆ h0 and
i1, . . . , i_{2^m} such that S_{i_j} = S for 1 ≤ j ≤ 2^m. Let σ1, . . . , σ_{2^m} be the
corresponding permutations.

Finally, letting v1, . . . , v_{2^m} be an enumeration of all vectors of length m
over {0, 1}, letting N be an n × m matrix over {0, 1} whose i_j-th row is σ_j(v_j),
and letting h∗ and f1′, . . . , fn′ be such that

(h∗ ◦ f1′(x̄1), . . . , h∗ ◦ fn′(x̄n)) = N, read row by row,

we see that [h∗]∼F shatters S, so m ≤ VC-dim([h∗]∼F).

To eliminate the dependence on |h0| = K, we set n0 = ((M choose ⌊M/2⌋) − 1)·2^M,
noting that n0 ≥ ((K choose m) − 1)·2^m for all K, m ≤ M. ⊓⊔

So, we see that for any class of hypotheses bounded in size (i.e., in the size of
their support), for sufficiently large n, dH(n) attains its minimum possible
value of dmax. However, many natural hypothesis spaces consist of hypotheses
whose support is not merely unbounded, but infinite. In the following theorem,
we show that the conjecture also holds for a natural hypothesis space consisting
of infinite hypotheses.
Theorem 5. Let X, H, and F be the rectangles with shifts as in Section 1.5,
and let H denote H/∼F as usual. For n > d,

dH(n) = max_{h∈H} VC-dim([h]∼F).

Proof. Let x̄1, . . . , x̄n ∈ (R^d)^m be such that

|{ (h ◦ f1(x̄1), . . . , h ◦ fn(x̄n)) : f1, . . . , fn ∈ F, h ∈ H }| = 2^{nm}.

For y ∈ R^d and 1 ≤ k ≤ d, we will denote by y(k) the kth coordinate of y.

For k = 1, . . . , d, let wk = max{|y(k) − z(k)| : y, z ∈ x̄i for some
1 ≤ i ≤ n}. Then w1, . . . , wd is the sequence of minimal possible side lengths
for any rectangle h such that

h ◦ fi(x̄i) = (1, . . . , 1) for every 1 ≤ i ≤ n.

Without loss of generality, let us assume that x̄1, . . . , x̄n are ordered such
that these maxima are attained within x̄1, . . . , x̄d.

Now, for any binary sequence b = (b1, . . . , bm), there exists some hb ∈ H
such that

hb ◦ fi(x̄i) = (1, . . . , 1) for 1 ≤ i ≤ n − 1, and hb ◦ fn(x̄n) = (b1, . . . , bm).

Clearly, no such hb can have its kth side length less than wk. Furthermore,
there is no advantage in having any kth side length greater than wk. Thus,
we see that if h = [0, w1] × . . . × [0, wd], then [h]∼F shatters x̄n, so
VC-dim([h]∼F) ≥ m. ⊓⊔
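The shattering step at the heart of this proof is easy to verify mechanically in small cases. The sketch below is our illustration, not the paper's construction; all function names are ours. It checks whether the translates of a fixed box with side lengths w (a class of the form [h]∼F under shifts) shatter a given point set, searching only over "critical" translations whose boundaries pass near a point.

```python
from itertools import product

def contains(t, w, p):
    """Is point p inside the translate [t1, t1+w1] x ... x [td, td+wd]?"""
    return all(t[k] <= p[k] <= t[k] + w[k] for k in range(len(p)))

def shattered_by_translates(points, w, eps=1e-9):
    """Brute-force check that translates of a fixed box with side lengths w
    shatter `points`; only 'critical' translations need to be examined."""
    d = len(w)
    axes = []
    for k in range(d):
        vals = set()
        for p in points:
            # box boundaries just at / just past each point coordinate
            vals.update((p[k], p[k] - w[k], p[k] + eps, p[k] - w[k] - eps))
        axes.append(sorted(vals))
    translations = list(product(*axes))
    return all(
        any(all(contains(t, w, p) == bool(lab)
                for p, lab in zip(points, labeling))
            for t in translations)
        for labeling in product((0, 1), repeat=len(points))
    )

# Two points whose coordinate gaps stay below the side lengths are shattered...
print(shattered_by_translates([(0.0, 0.0), (0.5, 0.3)], (1.0, 1.0)))  # True
# ...but points farther apart than a side length are not
print(shattered_by_translates([(0.0, 0.0), (2.0, 2.0)], (1.0, 1.0)))  # False
```

This mirrors the role of the wk in the proof: shattering by translates of a single box is possible exactly when every coordinate spread of the point set fits inside the corresponding side length.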
So, we see that it is not uncommon for dH(n) to attain its minimum
possible value, dmax. As that value can be significantly less than VC-dim(H),
it is reasonable to expect that, in many cases, the MT-ERM bound of Theorem
3 is significantly less than the standard ERM bound (Equation 1.3). Thus our
bounds can guarantee generalization on the basis of a smaller sample size than
standard VC-dimension considerations allow for the single-task approach.

Ben-David et al. [BDGS02] provide the following further results on dH(n).

Theorem 6. If F is finite and n/log(n) ≥ VC-dim(H), then

dH(n) ≤ 2 log(|F|).
Note that this result leads us to scenarios under which dH (n) is arbitrarily
smaller than VC-dim(H). Indeed, as long as F is finite, no matter how com-
plex H is, dH (n) remains bounded by 2 log(|F|). Furthermore, in practice, the
requirement that F be finite is not an unreasonable one, since real world prob-
lems come with bounded domains and real world computations have limited
numerical accuracy.
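Numerically, the contrast is stark. A minimal sketch (our illustration: F_size is a hypothetical example value, and we assume log base 2, which the statement leaves implicit):

```python
import math

# Once n/log(n) >= VC-dim(H), Theorem 6 caps d_H(n) at 2*log|F|,
# no matter how rich H is.
F_size = 256                      # a hypothetical finite transformation family
cap = 2 * math.log2(F_size)       # = 16: the multi-task capacity ceiling

for vc_H in (10, 100, 10_000):
    effective = min(vc_H, cap)    # the better of the two dimension bounds
    print(f"VC-dim(H) = {vc_H:>6}: effective capacity <= {effective}")
```

However complex H becomes, the effective capacity never exceeds 2 log(|F|) = 16 here, which is the sense in which dH(n) can be arbitrarily smaller than VC-dim(H).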
Furthermore, [BDGS02] provides the following generalization of this result.

Theorem 7. If ∼F is of finite index⁶, k, and n ≥ log(k)/(4b log b), then

dH(n) ≤ log(k)/n + 4b log b,

where

b = max( max_{H∈H/∼F} VC-dim(H), 3 ).

This shows that even if F is infinite, dH(n) cannot grow arbitrarily with
increasing complexity of H.

⁶ The index of an equivalence relation is the number of equivalence classes into
which it partitions its domain.
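As a quick sanity check of the scaling, the bound can be evaluated directly. This is a sketch under our reading of the theorem (function name is ours; log base 2 is assumed, since the statement leaves the base implicit):

```python
import math

def thm7_bound(k, class_vcs, n):
    """Theorem 7 upper bound on d_H(n): log(k)/n + 4*b*log(b), where
    b = max(largest VC-dim of an equivalence class, 3)."""
    b = max(max(class_vcs), 3)
    if n < math.log2(k) / (4 * b * math.log2(b)):
        raise ValueError("n below the theorem's threshold")
    return math.log2(k) / n + 4 * b * math.log2(b)

# Even for a huge index k = 2^20, the bound is dominated by the
# n-independent 4*b*log(b) term once n is moderate.
print(thm7_bound(k=2**20, class_vcs=[3, 5], n=10))
```

The log(k)/n term washes out as n grows, so the bound is essentially governed by the complexity b of the individual equivalence classes, not by the complexity of H as a whole.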

1.7 Conclusions and Future Work

We have presented a useful notion of relatedness between tasks for multiple
task learning. This notion of relatedness provides a natural model for a variety
of real world learning scenarios. We have derived generalization error bounds
for learning of multiple tasks related in this manner. These bounds depend on a
generalized VC-dimension parameter, which can be significantly less than the
ordinary VC-dimension, thus improving on the usual bounds for the single task
approach. We have provided analysis of this parameter and its relationship to
the usual VC-dimension, and we have given precise conditions under which
our multitask approach provides generalization guarantees based on smaller
sample size than the single task approach.
This work is a significant step towards the goal of a full theory of multiple
task learning. With the restriction to a special type of relatedness of tasks,
we have been able to obtain sample size bounds which are significantly better
than previously proven bounds for the learning to learn scenario.
Hopefully, this work will stimulate future work in several directions. There
is room for a more thorough understanding of the conditions under which
multi-task learning is advantageous over the single task approach in our sce-
nario; in particular, a greater understanding of the generalized VC-dimension
parameter would provide such insight. It would also be fruitful to relax the
requirements on the set of transformations through which the tasks are re-
lated, allowing these transformations to be arbitrary rather than bijections,
and perhaps even allowing the actual transformations between the tasks to be
merely approximated by the set of known transformations. Finally, the quest
for further applicable notions of relatedness between tasks remains the key to
a thorough understanding of multiple task learning.
We believe that this work provides convincing evidence that a theoretical
understanding of multiple task learning and its advantage over the single task
approach is a promising research endeavor worth pursuing.
References

[Bax95] Jonathan Baxter. Learning Internal Representations. In COLT:
Proceedings of the Workshop on Computational Learning Theory,
Morgan Kaufmann Publishers, 1995.
[Bax00] Jonathan Baxter. A Model of Inductive Bias Learning. Journal
of Artificial Intelligence Research, 12:149–198, 2000.
[BDGS02] Shai Ben-David, Johannes Gehrke, and Reba Schuller. A The-
oretical Framework for Learning from a Pool of Disparate Data
Sources. In Proceedings of the Eighth ACM SIGKDD Interna-
tional Conference on Knowledge Discovery and Data Mining,
2002.
[BDS03] Shai Ben-David and Reba Schuller. Exploiting Task Relatedness
for Multiple Task Learning. In Proceedings of the Sixteenth
Annual Conference on Learning Theory (COLT), 2003.
[BEHW89] Anselm Blumer, Andrzej Ehrenfeucht, David Haussler, and Man-
fred K. Warmuth. Learnability and the Vapnik-Chervonenkis Di-
mension. Journal of the Association for Computing Machinery,
36(4):929–965, 1989.
[BM98] Avrim Blum and Tom Mitchell. Combining labeled and unlabeled
data with co-training. In COLT: Proceedings of the Workshop on
Computational Learning Theory, Morgan Kaufmann Publishers,
1998.
[Car97] Rich Caruana. Multitask Learning. Machine Learning, 28(1):41–
75, 1997.
[Hes98] Tom Heskes. Solving a Huge Number of Similar Tasks: A Com-
bination of Multi-Task Learning and a Hierarchical Bayesian Ap-
proach. In International Conference on Machine Learning, pages
233–241, 1998.
[IE96] N. Intrator and S. Edelman. How to Make a Low-Dimensional
Representation Suitable for Diverse Tasks. Connection Science,
8, 1996.


[KV97] Michael J. Kearns and Umesh V. Vazirani. An Introduction to
Computational Learning Theory. MIT Press, Cambridge, Massa-
chusetts, 1997.
[Thr96] S. Thrun. Is learning the n-th thing any easier than learning the
first? In D. Touretzky and M. Mozer, editors, Advances in Neural
Information Processing Systems (NIPS), pages 640–646, 1996.
[VC71] V. Vapnik and A. Chervonenkis. On the Uniform Convergence
of Relative Frequencies of Events to Their Probabilities. Theory
of Probability and Its Applications, 16(2):264–280, 1971.
