4. Machine Learning Basics (C)
The central challenge in machine learning is that we must perform well on new,
previously unseen inputs—not just those on which our model was trained. The
ability to perform well on previously unobserved inputs is called generalization.
When training a machine learning model, we have access to a training set, we can
compute some error measure on the training set called the training error, and we
reduce this training error. So far, what we have described is simply an
optimization problem. What separates machine learning from optimization is that
we want the generalization error, also called the test error, to be low as well.
The generalization error is defined as the expected value of the error on a new
input. Here the expectation is taken across different possible inputs, drawn from
the distribution of inputs we expect the system to encounter in practice.
In our linear regression example, we trained the model by minimizing the training
error,
MSE_train = (1/m^(train)) ‖X^(train) w − y^(train)‖²,
but what we actually care about is the test error,
MSE_test = (1/m^(test)) ‖X^(test) w − y^(test)‖².
How can
we affect performance on the test set when we get to observe only the training set?
The field of statistical learning theory provides some answers. If the training and the test
set are collected arbitrarily, there is indeed little we can do. If we are allowed to make
some assumptions about how the training and test set are collected, then we can make
some progress.
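As a concrete sketch of this setup (all data and names here are synthetic and illustrative, not from the text): we draw a training set and a test set independently from the same data generating distribution, fit linear regression in closed form on the training set only, and then measure both errors.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative data generating distribution: y = 2x + 1 plus Gaussian noise.
# Train and test sets are drawn i.i.d. from this same distribution.
def sample_dataset(m):
    x = rng.uniform(-1, 1, size=m)
    y = 2 * x + 1 + 0.1 * rng.normal(size=m)
    X = np.column_stack([np.ones(m), x])  # bias column + feature
    return X, y

X_train, y_train = sample_dataset(100)
X_test, y_test = sample_dataset(100)

# Fit by minimizing MSE_train: the normal equations, solved via the
# Moore-Penrose pseudoinverse.
w = np.linalg.pinv(X_train) @ y_train

mse_train = np.mean((X_train @ w - y_train) ** 2)
mse_test = np.mean((X_test @ w - y_test) ** 2)
```

Because both sets come from the same distribution, the test error here stays close to the training error; the interesting cases in the rest of this section are the ones where it does not.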
The train and test data are generated by a probability distribution over datasets called
the data generating process. We typically make a set of assumptions known
collectively as the i.i.d. assumptions.
These assumptions are that the examples in each dataset are independent from each
other, and that the train set and test set are identically distributed, drawn from the
same probability distribution as each other. This assumption allows us to describe the
data generating process with a probability distribution over a single example. The
same distribution is then used to generate every train example and every test example.
We call that shared underlying distribution the data generating distribution, denoted
pdata. This probabilistic framework and the i.i.d. assumptions allow us to
mathematically study the relationship between training error and test error.
When we use a machine learning algorithm, we do not fix the parameters ahead of
time, then sample both datasets. We sample the training set, then use it to choose
the parameters to reduce training set error, then sample the test set. Under this
process, the expected test error is greater than or equal to the expected value of
training error.
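This sampling process can be simulated directly (a hedged sketch with made-up synthetic data): sample a training set, choose parameters to reduce training error, then sample a fresh test set, and repeat many times. Averaged over repetitions, the test error comes out at least as large as the training error.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative data generating distribution: y = x plus Gaussian noise.
def sample_dataset(m):
    x = rng.uniform(-1, 1, size=m)
    y = x + 0.5 * rng.normal(size=m)
    X = np.column_stack([np.ones(m), x])
    return X, y

train_errs, test_errs = [], []
for _ in range(500):
    # Sample the training set, choose parameters to reduce training
    # error, then sample the test set.
    X_tr, y_tr = sample_dataset(10)
    w = np.linalg.pinv(X_tr) @ y_tr
    X_te, y_te = sample_dataset(10)
    train_errs.append(np.mean((X_tr @ w - y_tr) ** 2))
    test_errs.append(np.mean((X_te @ w - y_te) ** 2))

avg_train = np.mean(train_errs)
avg_test = np.mean(test_errs)
```

The gap appears because the parameters were tuned to the particular noise in each training sample, an advantage the test sample does not share.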
The factors determining how well a machine learning algorithm will perform are its
ability to:
1. Make the training error small.
2. Make the gap between training and test error small.
Underfitting occurs when the model is not able to obtain a sufficiently low error
value on the training set.
Overfitting occurs when the gap between the training error and test error is
too large.
We can control whether a model is more likely to overfit or underfit by altering its
capacity.
Models with low capacity may struggle to fit the training set.
Models with high capacity can overfit by memorizing properties of the training set
that do not serve them well on the test set.
One way to control the capacity of a learning algorithm is by choosing its hypothesis
space, the set of functions that the learning algorithm is allowed to select as being the
solution.
For example:
The linear regression algorithm has the set of all linear functions of its input as its
hypothesis space.
A polynomial of degree 1 gives us the linear regression model with which we are
already familiar, with prediction
yˆ = b + wx.
By introducing x² as another feature provided to the linear regression model, we can
learn a model that is quadratic as a function of x:
yˆ = b + w₁x + w₂x².
Though this model implements a quadratic function of its input, the output is still a
linear function of the parameters, so we can still use the normal equations to train
the model in closed form. We can continue to add more powers of x as additional
features, for example to obtain a polynomial of degree 9:
yˆ = b + Σᵢ₌₁⁹ wᵢxⁱ.
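The feature-expansion trick above can be sketched as follows (synthetic noiseless data, illustrative constants): build columns of powers of x, then solve the ordinary least-squares problem exactly as in plain linear regression, since the model is still linear in its parameters.

```python
import numpy as np

rng = np.random.default_rng(2)

def poly_features(x, degree):
    # Columns [1, x, x^2, ..., x^degree]: powers of x as extra features.
    return np.vander(x, degree + 1, increasing=True)

x = rng.uniform(-1, 1, size=50)
y = 1.0 - 2.0 * x + 3.0 * x ** 2   # noiseless quadratic target

# The output is linear in the parameters (b, w1, w2), so the normal
# equations still give the fit in closed form.
X = poly_features(x, degree=2)
w = np.linalg.pinv(X) @ y
```

With noiseless quadratic data, the recovered parameters match the generating coefficients (1, −2, 3) essentially exactly.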
Machine learning algorithms will generally perform best when their capacity is
appropriate for the true complexity of the task they need to perform and the amount
of training data they are provided with.
Models with insufficient capacity are unable to solve complex tasks. Models with
high capacity can solve complex tasks, but when their capacity is higher than
needed to solve the present task they may overfit.
Figure: We fit three models to this example training set. The training data was
generated synthetically by randomly sampling x values and choosing y
deterministically by evaluating a quadratic function.
(Left) A linear function fit to the data suffers from underfitting: it cannot
capture the curvature that is present in the data.
(Center) A quadratic function fit to the data generalizes well to unseen
points. It does not suffer from a significant amount of overfitting or
underfitting.
(Right) A polynomial of degree 9 fit to the data suffers from overfitting. Here
we used the Moore-Penrose pseudoinverse to solve the underdetermined
normal equations.
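The three panels can be reproduced qualitatively in a few lines (a sketch with illustrative synthetic data, not the figure's actual numbers): sample a small noisy quadratic dataset, fit polynomials of degree 1, 2, and 9 via the pseudoinverse, and compare training and test error.

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic version of the figure's setup: sample x, evaluate a quadratic
# function of x, and add noise. All constants are illustrative.
def make_data(m):
    x = rng.uniform(-1, 1, size=m)
    y = x ** 2 - 0.5 * x + 0.3 * rng.normal(size=m)
    return x, y

x_tr, y_tr = make_data(10)
x_te, y_te = make_data(200)

results = {}
for degree in (1, 2, 9):
    X_tr = np.vander(x_tr, degree + 1, increasing=True)
    # The Moore-Penrose pseudoinverse handles the degree-9 case, where
    # the normal equations are badly conditioned / underdetermined.
    w = np.linalg.pinv(X_tr) @ y_tr
    X_te = np.vander(x_te, degree + 1, increasing=True)
    results[degree] = (np.mean((X_tr @ w - y_tr) ** 2),   # train MSE
                      np.mean((X_te @ w - y_te) ** 2))    # test MSE
```

With 10 training points, the degree-9 model has as many parameters as examples: it drives training error to essentially zero while its test error exceeds the quadratic model's, the signature of overfitting.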
Figure: Typical relationship between capacity and error.
Training and test error behave differently. At the left end of the graph,
training error and generalization error are both high. This is the
underfitting regime.
As we increase capacity, training error decreases, but the gap between
training and generalization error increases.
Eventually, the size of this gap outweighs the decrease in training error, and
we enter the overfitting regime, where capacity is too large, above the
optimal capacity.
To logically infer a rule describing every member of a set one must have
information about every member of that set.
Machine learning promises to find rules that are probably correct about
most members of the set they concern.
The no free lunch theorem for machine learning states that, averaged over all
possible data generating distributions, every classification algorithm has the same
error rate when classifying previously unobserved points.
Figure : The effect of the training dataset size on the train and test error, as well as
on the optimal model capacity.
For each size, we generated 40 different training sets in order to plot error
bars showing 95 percent confidence intervals.
(Top) The MSE on the training and test set for two different models:
o The training error increases as the size of the training set increases. This is
because larger datasets are harder to fit.
o Simultaneously the test error decreases because fewer incorrect hypotheses
are consistent with the training data.
A model with degree chosen to minimize the test error:
o The quadratic model does not have enough capacity to solve the task, so its
test error asymptotes to a high value.
o The test error at optimal capacity asymptotes to the Bayes error.
o The training error can fall below the Bayes error due to the ability of the
training algorithm to memorize specific instances of the training set.
o As the training size increases to infinity, the training error of any fixed-
capacity model (here, the quadratic model) must rise to at least the Bayes
error.
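These trends can be checked with a small simulation (a hedged sketch; the data generating process, noise level, and trial counts are all made up): fit a quadratic model to training sets of different sizes and average the errors over repeated draws.

```python
import numpy as np

rng = np.random.default_rng(4)

# Illustrative quadratic data generating process with Gaussian noise.
def make_data(m):
    x = rng.uniform(-1, 1, size=m)
    y = x ** 2 + 0.3 * rng.normal(size=m)
    return x, y

def fit_quadratic_mse(m_train, trials=40):
    # Average train/test MSE of a quadratic fit over repeated draws
    # of a size-m_train training set.
    tr, te = [], []
    x_test, y_test = make_data(2000)
    X_te = np.vander(x_test, 3, increasing=True)
    for _ in range(trials):
        x_tr, y_tr = make_data(m_train)
        X_tr = np.vander(x_tr, 3, increasing=True)
        w = np.linalg.pinv(X_tr) @ y_tr
        tr.append(np.mean((X_tr @ w - y_tr) ** 2))
        te.append(np.mean((X_te @ w - y_test) ** 2))
    return np.mean(tr), np.mean(te)

small = fit_quadratic_mse(5)     # tiny training set
large = fit_quadratic_mse(500)   # large training set
```

The averages reproduce both effects described above: training error rises with training set size (larger datasets are harder to fit exactly), while test error falls toward the noise floor.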
Regularization
The no free lunch theorem implies that we must design our machine
learning algorithms to perform well on a specific task.
The behavior of our algorithm is strongly affected not just by how large
we make the set of functions allowed in its hypothesis space, but by the
specific identity of those functions.
Linear functions can be very useful for problems where the relationship
between inputs and outputs truly is close to linear, but they are less useful
for problems that behave in a very nonlinear fashion.
For example,
o we can modify the training criterion for linear regression to include weight
decay, minimizing J(w) = MSE_train + λwᵀw,
o where λ is a value chosen ahead of time that controls the strength of our
preference for smaller weights.
o This gives us solutions that have a smaller slope, or that put weight on fewer of
the features.
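A minimal sketch of weight decay in code (λ, the data, and the helper name are all illustrative): minimizing J(w) = MSE_train + λwᵀw in the linear case has the closed-form solution w = (XᵀX + mλI)⁻¹ Xᵀy, obtained by setting the gradient of J to zero.

```python
import numpy as np

rng = np.random.default_rng(5)

# Noisy quadratic data, deliberately few points so that an unregularized
# degree-9 fit memorizes the noise.
x = rng.uniform(-1, 1, size=10)
y = x ** 2 + 0.3 * rng.normal(size=10)
X = np.vander(x, 10, increasing=True)  # degree-9 polynomial features
m = len(y)

def ridge_fit(X, y, lam):
    # Minimizes J(w) = MSE_train + lam * w^T w; setting the gradient to
    # zero gives w = (X^T X + m * lam * I)^(-1) X^T y.
    n = X.shape[1]
    return np.linalg.solve(X.T @ X + m * lam * np.eye(n), X.T @ y)

w_no_decay = np.linalg.pinv(X) @ y   # lam = 0: fits the noise exactly
w_decay = ridge_fit(X, y, lam=0.1)   # prefers smaller weights
```

The regularized solution has a smaller parameter norm than the unregularized one, which is exactly the preference the λwᵀw term expresses.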