ML Study
1. Problems for which existing solutions require a lot of hand-tuning or long lists of
rules: one Machine Learning algorithm can often simplify code and perform better.
2. Complex problems for which there is no good solution at all using a traditional
approach: the best Machine Learning techniques can find a solution.
3. Fluctuating environments: a Machine Learning system can adapt to new data.
4. Getting insights about complex problems and large amounts of data.
Machine Learning systems can be classified according to the amount and type of
supervision they get during training. There are four major categories:
Supervised learning, Unsupervised learning, Semisupervised learning,
and Reinforcement Learning.
A typical supervised learning task is to predict a target numeric value, such as the price of a
car, given a set of features (mileage, age, brand, etc.) called predictors. This sort of
task is called regression.
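A minimal sketch of a regression task, using scikit-learn's LinearRegression. The feature values and prices below are made up purely for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Each row is one car's predictors: [mileage in thousands of km, age in years]
X = np.array([[120, 8], [30, 2], [60, 4], [90, 6], [15, 1]])
# Target numeric value: price in thousands of dollars
y = np.array([5.0, 18.0, 12.0, 8.0, 21.0])

model = LinearRegression()
model.fit(X, y)                     # learn from the labeled examples
price = model.predict([[45, 3]])   # predict the price of an unseen car
```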
Unsupervised Learning:
Here, the training data is unlabeled.
Semisupervised Learning:
Some algorithms can deal with partially labeled training data, usually a lot of
unlabeled data and a little bit of labeled data. This is called semisupervised learning.
Some photo-hosting services, such as Google Photos, are good examples of
this. Once you upload all your family photos to the service, it automatically
recognizes that the same person A shows up in photos 1, 5, and 11, while another
person B shows up in photos 2, 5, and 7. This is the unsupervised part of the
algorithm (clustering). Now all the system needs is for you to tell it who these
people are. Just one label per person, and it is able to name everyone in every photo,
which is useful for searching photos.
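The cluster-then-label idea can be sketched as follows, assuming each photo has already been reduced to a small feature vector (the vectors below are made up):

```python
import numpy as np
from sklearn.cluster import KMeans

# Six photos, each represented by a 2-D feature vector (hypothetical values)
photos = np.array([[0.10, 0.20], [0.15, 0.22], [0.90, 0.80],
                   [0.92, 0.85], [0.12, 0.18], [0.88, 0.79]])

# Unsupervised step: group similar photos without any labels
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42).fit(photos)

# Supervised step: the user supplies just one label per cluster
names = {kmeans.labels_[0]: "person A", kmeans.labels_[2]: "person B"}
labels = [names[c] for c in kmeans.labels_]   # every photo is now named
```

With just two user-provided labels, all six photos end up named.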
Reinforcement Learning:
Reinforcement Learning is a very different beast.
The learning system, called an agent in this context, can observe the
environment, select and perform actions, and get rewards in return (or penalties in
the form of negative rewards).
It must then learn by itself what is the best strategy, called a policy, to get the
most reward over time. A policy defines what action the agent should choose when
it is in a given situation.
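A toy sketch of an agent learning a policy by trial and error, assuming a two-armed bandit environment where the second action pays a higher average reward (all values are made up). The policy here is epsilon-greedy: mostly pick the action currently believed best, occasionally explore:

```python
import random

random.seed(0)
rewards = {0: 0.2, 1: 0.8}     # hidden average reward of each action
estimates = [0.0, 0.0]         # the agent's running reward estimates
counts = [0, 0]

for step in range(1000):
    # Policy: exploit the best-looking action 90% of the time, explore 10%
    if random.random() < 0.1:
        action = random.randrange(2)
    else:
        action = max((0, 1), key=lambda a: estimates[a])
    # Environment returns a reward for the chosen action
    reward = 1 if random.random() < rewards[action] else 0
    counts[action] += 1
    estimates[action] += (reward - estimates[action]) / counts[action]

best_action = max((0, 1), key=lambda a: estimates[a])
```

Over time the agent's estimates converge and its policy settles on the action with the higher expected reward.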
If you want a batch learning system to know about new data (such as a new type of
spam), you need to train a new version of the system from scratch on the full dataset
(not just the new data, but also the old data), then stop the old system and replace it
with the new one.
This solution is simple and often works fine, but training using the full set of data
can take many hours, so you would typically train a new system only every 24 hours
or even just weekly. If your system needs to adapt to rapidly changing data (e.g., to
predict stock prices), then you need a more reactive solution.
Also, training on the full set of data requires a lot of computing resources (CPU,
memory space, disk space, disk I/O, network I/O, etc.). If you have a lot of data and
you automate your system to train from scratch every day, it will end up costing you
a lot of money. If the amount of data is huge, it may even be impossible to use a
batch learning algorithm.
Finally, if your system needs to be able to learn autonomously and it has limited
resources (e.g., a smartphone application or a rover on Mars), then carrying around
large amounts of training data and taking up a lot of resources to train for hours
every day is a showstopper.
Fortunately, a better option in all these cases is to use algorithms that are capable of
learning incrementally.
Online learning algorithms can also be used to train systems on huge datasets that
cannot fit in one machine’s main memory (this is called out-of-core learning). The
algorithm loads part of the data, runs a training step on that data, and repeats the
process until it has run on all of the data.
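A sketch of this incremental loop using scikit-learn's SGDClassifier, whose partial_fit method updates the model one chunk at a time. The chunks here are synthetic; in a real out-of-core setting each chunk would be loaded from disk:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(42)
model = SGDClassifier(random_state=42)

for _ in range(20):  # pretend each chunk was loaded from disk
    X_chunk = rng.normal(size=(100, 2))
    y_chunk = (X_chunk[:, 0] + X_chunk[:, 1] > 0).astype(int)
    # partial_fit updates the model in place with just this chunk
    model.partial_fit(X_chunk, y_chunk, classes=[0, 1])

X_test = rng.normal(size=(200, 2))
y_test = (X_test[:, 0] + X_test[:, 1] > 0).astype(int)
accuracy = model.score(X_test, y_test)
```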
This whole process is usually done offline (i.e., not on the live system), so
online learning can be a confusing name. Think of it as incremental learning.
One important parameter of online learning systems is how fast they should adapt
to changing data: this is called the learning rate. If you set a high learning rate, then
your system will rapidly adapt to new data, but it will also tend to quickly forget the
old data (you don’t want a spam filter to flag only the latest kinds of spam it was
shown). Conversely, if you set a low learning rate, the system will have more inertia;
that is, it will learn more slowly, but it will also be less sensitive to noise in the new
data or to sequences of nonrepresentative data points.
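The learning-rate trade-off can be illustrated with a pure-Python toy: an estimate updated as estimate += rate * (observation - estimate), fed a data stream whose true value jumps halfway through:

```python
stream = [1.0] * 50 + [5.0] * 50   # the "true" value jumps from 1 to 5

def track(rate):
    estimate = 0.0
    for x in stream:
        # Move the estimate toward each new observation by a fraction `rate`
        estimate += rate * (x - estimate)
    return estimate

fast = track(0.5)    # high learning rate: adapts quickly to the new value
slow = track(0.01)   # low learning rate: more inertia, lags behind the jump
```

The high-rate tracker ends very close to 5, while the low-rate tracker is still well below it after the same number of observations.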
A big challenge with online learning is that if bad data is fed to the system, the
system’s performance will gradually decline. If we are talking about a live system,
your clients will notice.
One more way to categorize Machine Learning systems is by how they
generalize. Most Machine Learning tasks are about making predictions. This
means that given a number of training examples, the system needs to be able to
generalize to examples it has never seen before. Having a good performance
measure on the training data is good, but insufficient; the true goal is to perform
well on new instances.
One approach is instance-based learning: the system learns the examples by heart,
then generalizes to new cases using a similarity measure.
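A minimal instance-based learner is k-nearest neighbors: "training" just stores the examples, and a new point is classified by the majority class among its most similar stored neighbors (Euclidean distance here). The toy points are made up:

```python
from sklearn.neighbors import KNeighborsClassifier

X_train = [[0, 0], [0, 1], [5, 5], [6, 5]]
y_train = [0, 0, 1, 1]

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)           # "learning" is just storing the examples

prediction = knn.predict([[5, 6]])  # the 3 nearest neighbors vote
```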
Overfitting happens when the model is too complex relative to the amount and
noisiness of the training data. The possible solutions are:
• To simplify the model by selecting one with fewer parameters (e.g., a linear
model rather than a high-degree polynomial model), by reducing the number of
attributes in the training data, or by constraining the model
• To reduce the noise in the training data (e.g., fix data errors and remove
outliers)
Constraining a model to make it simpler and reduce the risk of overfitting is called
regularization.
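One regularized model is Ridge regression: ordinary linear regression plus a penalty on the size of the weights, with the hyperparameter alpha controlling how strongly the model is constrained. A sketch on synthetic data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))
y = X[:, 0] + 0.1 * rng.normal(size=20)   # only feature 0 matters, plus noise

plain = LinearRegression().fit(X, y)      # unconstrained fit
ridge = Ridge(alpha=10.0).fit(X, y)       # penalizes large coefficients

# The regularized model's weight vector is pulled toward zero
shrunk = np.linalg.norm(ridge.coef_) < np.linalg.norm(plain.coef_)
```

Larger alpha values constrain the model more, trading a little training-set fit for less risk of overfitting.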
A better option is to split your data into two sets: the training set and the test set.
As these names imply, you train your model using the training set, and you test it
using the test set. The error rate on new cases is called the generalization error (or
out-of-sample error), and by evaluating your model on the test set, you get an
estimate of this error. This value tells you how well your model will perform on
instances it has never seen before.
If the training error is low (i.e., your model makes few mistakes on the training set)
but the generalization error is high, it means that your model is overfitting the
training data.
It is common to use 80% of the data for training and hold out 20% for testing.
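The 80/20 holdout split can be done with scikit-learn's train_test_split; the data below is synthetic:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(50, 2)   # 50 instances, 2 features each
y = np.arange(50) % 2               # toy labels

# Hold out 20% of the instances for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
```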
So evaluating a model is simple enough: just use a test set. Now suppose you are
hesitating between two models (say a linear model and a polynomial model): how
can you decide? One option is to train both and compare how well they generalize
using the test set.
Now suppose that the linear model generalizes better, but you want to apply some
regularization to avoid overfitting. The question is: how do you choose the value of
the regularization hyperparameter? One option is to train 100 different models
using 100 different values for this hyperparameter. Suppose you find the best
hyperparameter value that produces a model with the lowest generalization error,
say just 5% error.
So you launch this model into production, but unfortunately it does not perform as
well as expected and produces 15% errors. What just happened?
The problem is that you measured the generalization error multiple times on the
test set, and you adapted the model and hyperparameters to produce the best model
for that set. This means that the model is unlikely to perform as well on new data.
A common solution to this problem is to have a second holdout set called the
validation set. You train multiple models with various hyperparameters using
the training set, you select the model and hyperparameters that perform best on the
validation set, and when you’re happy with your model you run a single final test
against the test set to get an estimate of the generalization error.
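The train/validation/test workflow can be sketched as follows, using Ridge regression's alpha as the hyperparameter being tuned (the candidate values and synthetic data are illustrative):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 10))
y = X @ rng.normal(size=10) + 0.5 * rng.normal(size=300)

# First split off the test set, then carve a validation set out of the rest
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1)
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.25, random_state=1)

best_alpha, best_score = None, -np.inf
for alpha in [0.01, 0.1, 1.0, 10.0, 100.0]:
    # Select hyperparameters on the validation set, never on the test set
    score = Ridge(alpha=alpha).fit(X_train, y_train).score(X_val, y_val)
    if score > best_score:
        best_alpha, best_score = alpha, score

# A single final evaluation on the untouched test set
final_score = Ridge(alpha=best_alpha).fit(X_train, y_train).score(X_test, y_test)
```

Because the test set played no part in model selection, final_score is an unbiased estimate of the generalization error.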
To avoid “wasting” too much training data in validation sets, a common technique is
to use cross-validation: the training set is split into complementary subsets, and
each model is trained against a different combination of these subsets and validated
against the remaining parts. Once the model type and hyperparameters have been
selected, a final model is trained using these hyperparameters on the full training
set, and the generalization error is measured on the test set.
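A sketch of k-fold cross-validation with scikit-learn's cross_val_score: the training set is split into 5 folds, and the model is trained and validated 5 times, each time validating on a different fold (the data is synthetic):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 4))
y = X[:, 0] - 2 * X[:, 1] + 0.1 * rng.normal(size=100)

# One R^2 score per fold; their mean estimates validation performance
scores = cross_val_score(Ridge(alpha=1.0), X, y, cv=5)
mean_score = scores.mean()
```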