General Model of Learning From Examples

The document outlines a general model of learning from examples that involves three actors: the environment, the oracle/supervisor, and the learner. It describes the learning task as finding a hypothesis function that best approximates the desired responses from the supervisor by minimizing loss. Several inductive principles are proposed for prescribing the optimal hypothesis based on a training sample, including minimizing empirical risk, selecting the most probable hypothesis given the data, and compressing information in the data. The document discusses analyzing learning performance by making assumptions about the environment, supervisor, hypothesis space, and other factors in the learning problem.


1 General Model of Learning from Examples

A problem of learning is defined by the following components:

1. A set of three actors:


• The environment: it is assumed to be stationary, and it generates data x_i drawn independently from an identical distribution (an i.i.d. sample) according to a distribution D_X on the data space X.
• The oracle (also called supervisor, teacher, or Nature), which, for each x_i, returns a desired answer or label u_i according to an unknown conditional probability distribution F(u | x).
• The learner or learning machine (LM) A, able to implement a function (not necessarily deterministic) belonging to a space of functions H, such that the output produced by the LM satisfies y_i = h(x_i) for h ∈ H.

2. The learning task: the LM seeks in the space H a function h that approximates as well as possible the desired response of the supervisor. In the case of induction, the distance between the hypothesis function h and the response of the supervisor is defined by the mean loss over the possible situations in Z = X × U. Thus, for each input x_i and supervisor response u_i, one measures the loss or cost l(u_i, h(x_i)), evaluating the cost of having taken the decision y_i = h(x_i) when the desired answer was u_i (we will suppose, without loss of generality, that the loss is positive or null). The mean cost, or real risk, is then:

    R_real(h) = ∫_Z l(u, h(x)) dF(x, u)

It is a statistical measure that depends on the functional dependence F(x, u) between the inputs x and the desired outputs u. This dependence can be expressed by a joint probability density defined on X × U, which is unknown. In other words, the problem is to find a hypothesis h close to f in the sense of the loss function, particularly in the frequently encountered regions of the space X. As these regions are not known a priori, the training sample must be used to estimate them, and the problem of induction is thus to seek to minimize the unknown real risk starting from the observation of the training sample S.
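Since F(x, u) is unknown in practice, the real risk can only be evaluated exactly in a simulated setting where the joint distribution is chosen by us. The following sketch estimates R_real(h) by Monte Carlo sampling; the particular distribution, oracle, hypothesis and loss below are illustrative assumptions, not taken from the text:

```python
import random

def real_risk_mc(h, oracle, sample_x, loss, n=10_000, seed=0):
    """Monte Carlo estimate of the real risk: average loss over
    (x, u) pairs drawn from the (here simulated) joint distribution."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        x = sample_x(rng)       # environment draws x ~ D_X
        u = oracle(x, rng)      # supervisor returns the desired label u
        total += loss(u, h(x))  # cost of deciding y = h(x) when u was desired
    return total / n

# Toy setting: D_X uniform on [-1, 1], deterministic oracle u = sign(x),
# hypothesis h thresholds at 0.2 instead of 0, 0/1 loss.
sample_x = lambda rng: rng.uniform(-1, 1)
oracle = lambda x, rng: 1 if x >= 0 else -1
h = lambda x: 1 if x >= 0.2 else -1
loss = lambda u, y: 0.0 if u == y else 1.0

risk = real_risk_mc(h, oracle, sample_x, loss)
```

In this toy setting h disagrees with the oracle exactly on [0, 0.2), whose mass under the uniform distribution is 0.1, so the estimate should be close to that value.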

3. Finally, an inductive principle that prescribes what properties the sought function h must satisfy, according both to the notion of proximity evoked above and to the observed training sample S = {(x_1, u_1), ..., (x_m, u_m)}, with the aim of minimizing the real risk.
The inductive principle dictates what the best hypothesis must satisfy given the training sample, the loss function and, possibly, other criteria. It is an ideal objective. It should be distinguished from the learning method (or algorithm), which describes an effective realization of the inductive principle. For a given inductive principle, there are many learning methods, which result from different choices for solving the computational problems that lie beyond the scope of the inductive principle. For example, the inductive principle may prescribe choosing the simplest hypothesis compatible with the training sample. The learning method must then specify how to actually search for this hypothesis, or for a suboptimal one if necessary, while satisfying certain feasibility constraints such as limited computational resources. Thus, for example, the learning method may seek the optimum defined by the inductive principle by a gradient method, suboptimal but easily controllable.
The definition given above is very general: in particular, it does not depend on the selected loss function. It has the merit of distinguishing the principal ingredients of a learning problem, which are often mixed together in descriptions of practical systems.

1.1.1 The Theory of Inductive Inference
The inductive principle prescribes which hypothesis one should choose, based on the observation of a training sample, in order to minimize the real risk. However, there is no unique or ideal inductive principle. How does one extract from the data a regularity that has a chance of remaining relevant in the future? A certain number of "reasonable" answers have been proposed. We describe the principal ones qualitatively here, before re-examining them more formally in this and subsequent chapters.
The choice of the hypothesis minimizing the empirical risk (Empirical Risk Minimization, or the ERM principle). The empirical risk is the average loss measured on the training sample S:

    R_emp(h) = (1/m) Σ_{i=1}^{m} l(u_i, h(x_i))

The idea underlying this principle is that the hypothesis that best fits the data, assuming the data are representative, is a hypothesis that correctly describes the world in general.
The ERM principle has, often implicitly, been the principle used in artificial intelligence from the beginning, in connectionism as well as in symbolic learning systems. What could indeed be more natural than to consider that a regularity observed in the known data will still be verified by the phenomenon that produced these data? It is, for example, the guiding principle of the perceptron algorithm as well as of the ARCH system. In both cases, one seeks a hypothesis consistent with the examples, i.e., one with null empirical risk. It is possible to refine the empirical risk minimization principle by choosing, among the optimal hypotheses, either one of the most specific or one of the most general.
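The ERM principle can be sketched directly on a small finite hypothesis space. In the toy example below (the grid of threshold classifiers, the sample and the 0/1 loss are illustrative assumptions), the learner simply keeps the hypothesis with the smallest empirical risk:

```python
def empirical_risk(h, S, loss):
    """Average loss of hypothesis h on the training sample S."""
    return sum(loss(u, h(x)) for x, u in S) / len(S)

def make_threshold(t):
    """Threshold classifier h_t(x) = +1 if x >= t, else -1."""
    return lambda x: 1 if x >= t else -1

# Finite hypothesis space: thresholds on a grid (a toy assumption).
H = {t / 10: make_threshold(t / 10) for t in range(-10, 11)}
zero_one = lambda u, y: 0.0 if u == y else 1.0
S = [(-0.9, -1), (-0.4, -1), (-0.1, -1), (0.1, 1), (0.5, 1), (0.8, 1)]

# ERM principle: keep the hypothesis with minimal empirical risk.
best_t = min(H, key=lambda t: empirical_risk(H[t], S, zero_one))
best_risk = empirical_risk(H[best_t], S, zero_one)
```

Here the sample is separable, so ERM finds a hypothesis consistent with the examples, i.e., with null empirical risk.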

The choice of the most probable hypothesis given the training sample. This is the Bayesian decision principle. The idea here is that it is possible to define a probability distribution on the hypothesis space, and that the knowledge available prior to learning can be expressed in particular in the form of an a priori probability distribution on the hypotheses. The training sample is then regarded as information that modifies the probability distribution on H (see Figure 1). One can then either choose the most probable a posteriori hypothesis (the Maximum A Posteriori (MAP) principle), or adopt a composite hypothesis resulting from averaging the hypotheses weighted by their a posteriori probability (the true Bayesian approach).

Figure 1: The hypothesis space H is assumed to be endowed with an a priori probability density. Learning consists in modifying this density according to the training examples.
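On a finite hypothesis space both variants can be sketched directly: the posterior P(h | S) ∝ P(S | h) P(h) is computed for each hypothesis, the MAP principle picks its mode, and the true Bayesian approach averages the hypotheses' answers under it. The three hypotheses, the uniform prior and the label-noise rate eps below are illustrative assumptions:

```python
import math

def posterior(H, prior, S, eps=0.1):
    """P(h | S) ∝ P(S | h) P(h), assuming the supervisor flips each
    label independently with probability eps (an assumption)."""
    logp = {name: math.log(prior[name]) for name in H}
    for x, u in S:
        for name, h in H.items():
            logp[name] += math.log(1 - eps) if h(x) == u else math.log(eps)
    z = sum(math.exp(v) for v in logp.values())
    return {name: math.exp(v) / z for name, v in logp.items()}

# A tiny hypothesis space and a uniform prior (illustrative assumptions).
H = {
    "always+": lambda x: 1,
    "sign":    lambda x: 1 if x >= 0 else -1,
    "always-": lambda x: -1,
}
prior = {name: 1 / 3 for name in H}
S = [(-0.5, -1), (-0.2, -1), (0.3, 1), (0.7, 1)]

post = posterior(H, prior, S)
map_h = max(post, key=post.get)  # MAP: the most probable a posteriori hypothesis

def bayes_predict(x):
    """True Bayesian approach: average the hypotheses' answers,
    weighted by their a posteriori probability."""
    vote = sum(p * H[name](x) for name, p in post.items())
    return 1 if vote >= 0 else -1
```

Since "sign" explains all four labels, its posterior dominates and both the MAP choice and the averaged prediction agree with it.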

The choice of a hypothesis that compresses as well as possible the information contained in the training sample. We will call this precept the information compression principle. The idea is to eliminate the redundancies present in the data in order to extract the underlying regularities that allow an economical description of the world. It is implied that the regularities discovered in the data are valid beyond the data and apply to the whole world.
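The compression idea can be made concrete with a two-part code: the cost of a hypothesis is the number of bits needed to describe it plus the bits needed to identify its exceptions on the sample. All the bit costs below are illustrative assumptions, not a real coding scheme:

```python
import math

def two_part_length(h_bits, S, h):
    """MDL-style description length: bits for the hypothesis itself,
    plus bits identifying which examples it gets wrong (its exceptions)."""
    errors = sum(1 for x, u in S if h(x) != u)
    exc_bits = math.log2(math.comb(len(S), errors)) if errors else 0.0
    return h_bits + exc_bits

S = [(-0.9, -1), (-0.6, -1), (-0.3, -1), (-0.05, 1),
     (0.1, 1), (0.4, 1), (0.7, 1), (0.9, 1)]

threshold = lambda x: 1 if x >= 0 else -1  # simple: one error on S
table = dict(S)
memorizer = lambda x: table[x]             # exact on S, but costly to describe

simple_len = two_part_length(8, S, threshold)           # 8 bits for one threshold (assumed)
exact_len = two_part_length(8 * len(S), S, memorizer)   # 8 bits per stored pair (assumed)
```

Under these assumed costs, the slightly imperfect threshold describes the sample more economically than the error-free lookup table, which is exactly why the compression principle prefers it.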
The question is whether these intuitively tempting ideas actually make effective learning possible. More precisely, we would like to obtain answers to a certain number of "naive questions":
• does applying the selected inductive principle indeed minimize the real risk?
• which conditions must hold for that? Moreover, must these conditions bear on the training sample, on the target functions, on the supervisor, or on the hypothesis space?
• how does the generalization performance depend on the information contained in the training sample, on its size, etc.?
• what maximum performance is possible for a given learning problem?
• which is the best LM for a given learning problem?
Answering these questions involves choices that depend partly on the type of inductive principle used. This is why we gave a brief description of them above.

1.1.2 How to Analyze Learning?
We have described learning, at least inductive learning, as an optimization problem: seeking the best hypothesis in the sense of minimizing the mean risk based on a training sample. We now want to study under which conditions the resolution of such a problem is possible. We also want tools for judging the performance of an inductive principle or of a learning algorithm. This analysis requires additional assumptions, which correspond to choices about what is expected from the LM.
Thus, a learning problem depends on the environment, which generates data x_i according to a certain unknown distribution D_X, on the supervisor, which chooses a target function f, and on the selected loss function l. The performance of the LM (which depends on the selected inductive principle and on the learning algorithm implementing it) will be evaluated according to the choices of each of these parameters. When we seek to determine the expected performance of the LM, we must therefore discuss the source of these parameters. There are in particular three possibilities:

1. It is supposed that nothing is known a priori about the environment, therefore neither about the distribution of the learning data nor about the target dependence, but one wants to guard against the worst possible situations, as if the environment and the supervisor were adversaries. One then seeks to characterize the learning performance in the worst possible situations, which is generally expressed as bounds on the risk. This is the worst-case analysis. One also speaks of the minimax analysis framework, by reference to game theory. The advantage of this point of view is that the possible performance guarantees are independent of the environment (the real risk being bounded whatever the distribution of the events) and of the supervisor or Nature (i.e., whatever the target function). On the other hand, the conditions identified to obtain such guarantees are so strong that they are often very far removed from real learning situations.
2. One may, on the contrary, want to measure an average performance. In this case, it must be supposed that there is a distribution D_X on the learning data, but also a distribution D_F on the possible target functions. The resulting analysis is the average-case analysis. One also speaks of the Bayesian framework. This analysis allows, in theory, a finer characterization of the performance, at the price, however, of having to make a priori assumptions on the spaces X and F. Unfortunately, it is often analytically very difficult to obtain conditions guaranteeing successful learning, and it is generally necessary to use approximation methods, which remove part of the interest of such an approach.
3. Finally, one could seek to characterize the most favorable case, when environment and supervisor are benevolent and want to help the LM. But it is difficult to determine the border between benevolence, that of a teacher for example, and collusion, in which the supervisor would act as an accomplice, encoding the target function in a code known to the learner; this would no longer be learning, but an illicit transmission. This is why this type of analysis, though interesting, does not yet have a well-established framework.

1.1.3 Validity Conditions for the ERM Principle
In this section, we concentrate on the analysis of the ERM inductive principle, which prescribes choosing the hypothesis minimizing the empirical risk measured on the training sample. It is indeed the most widely employed rule, and its analysis leads to very general conceptual principles. The ERM principle was initially the subject of a worst-case analysis, which we describe here. An average-case analysis, drawing on ideas from statistical physics, has also been the object of many very interesting works. It is, however, technically considerably more difficult.
Let us recall that learning consists in seeking a hypothesis h that minimizes the mean loss. Formally, it is a question of finding an optimal hypothesis h* minimizing the real risk:

    h* = ArgMin_{h ∈ H} R_real(h)

The problem is that one does not know the real risk attached to each hypothesis h. The natural idea is thus to select the hypothesis h in H that behaves well on the learning data S: this is the ERM inductive principle. We will denote by ĥ_S this optimal hypothesis for the empirical risk measured on the sample S:

    ĥ_S = ArgMin_{h ∈ H} R_emp(h)

This inductive principle will be relevant only if the empirical risk is correlated with the real risk. Its analysis must thus study the correlation between the two risks, and more particularly the correlation between the real risk incurred by the hypothesis selected using the ERM principle, R_real(ĥ_S), and the optimal real risk R_real(h*).
This correlation involves two aspects:
1. The difference (necessarily positive or null) between the real risk of the hypothesis ĥ_S selected using the training sample S and the real risk of the optimal hypothesis h*: R_real(ĥ_S) − R_real(h*).
2. The probability that this difference is higher than a given bound ε. Indeed, since the empirical risk depends on the training sample, the correlation between the measured empirical risk and the real risk depends on the representativeness of this sample. This is also why, when the difference R_real(ĥ_S) − R_real(h*) is studied, it is necessary to take into account the probability of the training sample given a certain target function. One cannot be a good learner in all situations, but only in the reasonable ones (representative training samples), which are the most probable.

Thus, let us return to the question of the correlation between the empirical risk and the real risk. The ERM principle is a valid inductive principle if the real risk computed with the hypothesis ĥ_S that minimizes the empirical risk is guaranteed to be close to the optimal real risk obtained with the optimal hypothesis h*. This closeness must hold in the large majority of the situations that can occur, i.e., for the majority of the learning samples drawn at random according to the distribution D_X.
More formally, one seeks under which conditions it would be possible to ensure:

    ∀ 0 ≤ ε, δ ≤ 1:  P( R_real(ĥ_S) − R_real(h*) ≥ ε ) < δ        (1)
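Inequality (1) can be probed by simulation: repeatedly draw a training sample, run ERM, and measure how often the excess real risk exceeds ε. The setting below is an illustrative assumption (uniform inputs, target sign(x), threshold hypotheses, for which R_real(h_t) = |t|/2 and R_real(h*) = 0):

```python
import random

def erm_threshold(S, grid):
    """Threshold in the grid with the fewest training errors (ERM)."""
    def errors(t):
        return sum(1 for x, u in S if (1 if x >= t else -1) != u)
    return min(grid, key=errors)

def excess_risk_prob(m, eps, trials=300, seed=0):
    """Estimate P( R_real(h_S) - R_real(h*) >= eps ) over random samples.
    Here x ~ Uniform[-1, 1] and u = sign(x), so R_real(h_t) = |t| / 2
    and the optimum h* = h_0 has null real risk."""
    rng = random.Random(seed)
    grid = [t / 50 for t in range(-50, 51)]
    bad = 0
    for _ in range(trials):
        S = [(x, 1 if x >= 0 else -1)
             for x in (rng.uniform(-1, 1) for _ in range(m))]
        t_hat = erm_threshold(S, grid)
        if abs(t_hat) / 2 >= eps:
            bad += 1
    return bad / trials

p_large_m = excess_risk_prob(m=100, eps=0.05)  # large sample: rarely exceeds eps
p_small_m = excess_risk_prob(m=5, eps=0.05)    # tiny sample: often exceeds eps
```

With a large sample the excess risk rarely exceeds ε, while with a tiny, unrepresentative sample it frequently does, which is exactly the dependence on m that (1) tries to control.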

Figure 2

It is clear that the correlation between the empirical risk and the real risk depends on the training sample S and, since this sample is drawn randomly, on its size m as well. That suggests a natural application of the law of large numbers, according to which, under very general conditions, the empirical average of a random variable (here R_emp(h)) converges towards its expectation (here R_real(h)) as the size m of the sample grows.
The law of large numbers encourages trying to ensure inequality (1) by letting the size of the training set S grow towards ∞, and asking from which size m of a training sample drawn randomly (according to an unspecified distribution D_X) inequality (1) is guaranteed:

    ∀ 0 ≤ ε, δ ≤ 1:  ∃ m such that  P( R_real(ĥ_Sm) − R_real(h*) ≥ ε ) < δ

Figure 3 illustrates the desired convergence of the empirical risk towards the real risk.
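The law of large numbers can be observed directly for a single fixed hypothesis: its empirical risk approaches its real risk as m grows. The setting below is an illustrative assumption (uniform inputs, target sign(x), a fixed threshold at 0.3, whose real risk is exactly 0.15):

```python
import random

def empirical_risk_of(h, m, rng):
    """Empirical 0/1 risk of a fixed hypothesis h on an i.i.d. sample of
    size m, with x ~ Uniform[-1, 1] and desired label u = sign(x)."""
    errors = 0
    for _ in range(m):
        x = rng.uniform(-1, 1)
        u = 1 if x >= 0 else -1
        errors += (h(x) != u)
    return errors / m

h = lambda x: 1 if x >= 0.3 else -1  # fixed BEFORE seeing any data
R_REAL = 0.15                        # exact: mass of [0, 0.3) under Uniform[-1, 1]

rng = random.Random(1)
gaps = [abs(empirical_risk_of(h, m, rng) - R_REAL) for m in (10, 100, 10_000)]
```

The gap between the empirical and the real risk shrinks as the sample grows; the crucial caveat discussed next is that this only holds for a hypothesis fixed in advance, not for one chosen using the sample itself.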

14
Figure 3

Definition 1.1 (Consistency of the ERM principle)
The ERM principle is said to be consistent if the unknown real risk R_real(ĥ_S) and the empirical risk R_emp(ĥ_S) converge towards the same limit R_real(h*) when the size m of the sample tends towards ∞ (see Figure 4).

Figure 4: Consistency of the ERM principle.

Unfortunately, the law of large numbers is not sufficient for our study. Indeed, what this law affirms is that the empirical risk of a given, fixed hypothesis h converges towards its real risk. However, what we seek is different. We want to ensure that the hypothesis ĥ_Sm chosen in H that minimizes the empirical risk on the sample S has an associated real risk that converges towards the optimal real risk obtained for the optimal hypothesis h*, which is independent of S. It must be clearly seen that in this case the training sample does not play only the role of a test set, but also the role of a set used for choosing the hypothesis. One cannot therefore take, without precaution, the performance measured on the learning sample as representative of the real performance.

Indeed, one can build hypothesis spaces H such that it is always possible to find a hypothesis with null empirical risk, without that indicating good general performance. It suffices to imagine a hypothesis that agrees with all the learning data and draws the labels of unseen data at random. This is why it is necessary to generalize the law of large numbers. This generalization is easy in the case of a finite space of hypothesis functions. It was obtained only recently by Vapnik and Chervonenkis (1971, 1989), within the framework of induction, for the case of spaces of infinite size.
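The pathological hypothesis described above is easy to construct: it memorizes the training sample exactly (null empirical risk) and answers at random everywhere else, so its real risk stays at chance level. The simulation setting below is an illustrative assumption:

```python
import random

rng = random.Random(0)
sign = lambda x: 1 if x >= 0 else -1

# Training sample: 50 i.i.d. points, labels from the target sign(x).
train = [(x, sign(x)) for x in (rng.uniform(-1, 1) for _ in range(50))]
memory = dict(train)

def memorizer(x):
    """Agrees with every training example; guesses at random elsewhere."""
    return memory[x] if x in memory else rng.choice([-1, 1])

# Null empirical risk by construction...
emp_risk = sum(memorizer(x) != u for x, u in train) / len(train)

# ...but chance-level real risk, estimated on fresh data.
test_pts = [(x, sign(x)) for x in (rng.uniform(-1, 1) for _ in range(4000))]
real_risk_est = sum(memorizer(x) != u for x, u in test_pts) / len(test_pts)
```

The empirical risk is exactly zero while the estimated real risk sits near 0.5: agreement with the training sample alone guarantees nothing about generalization, which is why the law of large numbers must be generalized to hold uniformly over the hypothesis space.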

