10: Advice For Applying Machine Learning: Deciding What To Try Next
So, say you've implemented regularized linear regression to predict housing prices
Trained it
But, when you test on new data you find it makes unacceptably large errors in its
predictions
:-(
What should you try next?
There are many things you can do:
Get more training data
Sometimes more data doesn't help
Often it does, although you should do some preliminary testing first to make
sure more data will actually make a difference (discussed later)
Try a smaller set of features
Carefully select a small subset
You can do this by hand, or use some dimensionality reduction
technique (e.g. PCA - we'll get to this later)
Try getting additional features
Sometimes this isn't helpful
LOOK at the data
Can be very time consuming
Adding polynomial features
You're grasping at straws, aren't you...
Building your own, new, better features based on your knowledge of the
problem
Can be risky if you accidentally overfit your data by creating new
features which are inherently specific/relevant to your training data
Try decreasing or increasing λ
Change how important the regularization term is in your calculations
These changes can become MAJOR projects/headaches (6 months +)
Sadly, the most common method for choosing one of these avenues is to go by
gut feeling (i.e. more or less randomly)
Many times you see people spend huge amounts of time only to discover that the
avenue is fruitless
No apples, pears, or any other fruit. Nada.
There are some simple techniques which can let you rule out half the things on the
list
Save you a lot of time!
Machine learning diagnostics
Tests you can run to see what is/what isn't working for an algorithm
See what you can change to improve an algorithm's performance
These can take time to implement and understand (a week or so)
But, they can also save you spending months going down an avenue which
will never work
Evaluating a hypothesis
When we fit parameters to training data, we try to minimize the error
We might think a low error is good - doesn't necessarily mean a good parameter set
Could, in fact, be indicative of overfitting
This means your model will fail to generalize
How do you tell if a hypothesis is overfitting?
Could plot hθ(x)
But with lots of features may be impossible to plot
Standard way to evaluate a hypothesis is
Split data into two portions
1st portion is training set
2nd portion is test set
Typical split might be 70:30 (training:test)
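For linear regression the test set error is just the (unregularized) average squared error on the held-out portion; in the course's notation, with m_test test examples:
J_{test}(\theta) = \frac{1}{2 m_{test}} \sum_{i=1}^{m_{test}} \left( h_\theta(x_{test}^{(i)}) - y_{test}^{(i)} \right)^2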
So, for model selection (e.g. choosing among candidate models such as different polynomial degrees), also hold out a cross validation set alongside the training and test sets
Minimize the cost function for each of the candidate models on the training set, as before
Test these hypotheses on the cross validation set to generate the cross validation error
Pick the hypothesis with the lowest cross validation error
e.g. pick θ(5)
Finally
Estimate generalization error of model using the test set
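A minimal sketch of that selection loop in Python/NumPy. The synthetic data, the 60/20/20 split sizes, and the use of np.polyfit for the candidate polynomial models are illustrative assumptions, not anything from the lecture:

import numpy as np

# Hypothetical 1-D regression data; x is drawn at random, so taking
# consecutive slices acts like a shuffled split (roughly 60/20/20).
rng = np.random.default_rng(0)
x = rng.uniform(0, 3, 100)
y = np.sin(x) + 0.3 * rng.standard_normal(100)
x_tr, y_tr = x[:60], y[:60]      # training set
x_cv, y_cv = x[60:80], y[60:80]  # cross validation set
x_te, y_te = x[80:], y[80:]      # test set

def avg_sq_error(coeffs, x, y):
    # half the average squared error of the fitted polynomial on (x, y)
    return np.mean((np.polyval(coeffs, x) - y) ** 2) / 2

# One candidate model per polynomial degree, each fit on the training set only
degrees = list(range(1, 7))
fits = [np.polyfit(x_tr, y_tr, d) for d in degrees]

# Pick the hypothesis with the lowest cross validation error
cv_errors = [avg_sq_error(c, x_cv, y_cv) for c in fits]
best = int(np.argmin(cv_errors))
print("chosen degree:", degrees[best])

# Finally, estimate the generalization error on the untouched test set
print("estimated generalization error:", avg_sq_error(fits[best], x_te, y_te))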
Final note
In machine learning as practiced today, many people will select the model using the
test set and then also check the model is OK for generalization using the test error
(which, as we've said, is bad because it gives a biased, overly optimistic analysis)
With a MASSIVE test set this is maybe OK
But it's considered much better practice to keep a separate cross validation set for model selection and reserve the test set for the final generalization estimate
Regularization and bias/variance
Consider fitting a high order polynomial with regularization, where the regularization term is used to keep the parameter values small, i.e. minimizing
J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 + \frac{\lambda}{2m} \sum_{j=1}^{n} \theta_j^2
Consider three cases
λ = large
All the θj values are heavily penalized
So most parameters end up close to zero
So the hypothesis ends up roughly a flat line, hθ(x) ≈ θ0 (θ0 itself isn't penalized)
So high bias -> underfitting the data
λ = intermediate
Only an intermediate value gives a reasonable fit
λ = small (e.g. λ = 0)
The regularization term contributes essentially nothing
So high variance -> overfitting (with minimal regularization the penalty doesn't do
what it's meant to, and the high order polynomial fits the noise in the training data)
Define cross validation error and test set errors as before (i.e. without regularization
term)
So each is just half the average squared error on its respective set
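Written out (J_test is defined analogously on the test set):
J_{train}(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2
J_{cv}(\theta) = \frac{1}{2 m_{cv}} \sum_{i=1}^{m_{cv}} \left( h_\theta(x_{cv}^{(i)}) - y_{cv}^{(i)} \right)^2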
Choosing λ
Have a set or range of values to use
Often increment by factors of 2, so
model(1): λ = 0
model(2): λ = 0.01
model(3): λ = 0.02
model(4): λ = 0.04
model(5): λ = 0.08
...
model(p): λ = 10
This gives a number of models which have different λ
With these models
Take each one (the pth model)
Minimize the cost function
This will generate some parameter vector
Call this θ(p)
So now we have a set of parameter vectors corresponding to models with
different λ values
Take all of the hypotheses and use the cross validation set to validate them
Measure average squared error on cross validation set
Pick the model which gives the lowest error
Say we pick θ(5)
Finally, take the one we've selected (θ(5)) and estimate its generalization error with the test set (see the sketch below)
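A minimal sketch of this λ-selection loop in Python/NumPy. The synthetic data and the fourth-order polynomial features are illustrative assumptions; the fit uses the regularized normal equation from earlier in the course, with the intercept θ0 left unpenalized:

import numpy as np

# Hypothetical data: design matrices with an intercept column plus polynomial
# features, split into training, cross validation and test portions.
rng = np.random.default_rng(1)
def make(n):
    x = rng.uniform(0, 2, n)
    X = np.column_stack([np.ones(n), x, x**2, x**3, x**4])
    y = np.sin(x) + 0.3 * rng.standard_normal(n)
    return X, y
X_tr, y_tr = make(60)
X_cv, y_cv = make(20)
X_te, y_te = make(20)

def fit_regularized(X, y, lam):
    # Normal equation for regularized linear regression;
    # theta_0 (the intercept) is not penalized.
    L = np.eye(X.shape[1])
    L[0, 0] = 0.0
    return np.linalg.solve(X.T @ X + lam * L, X.T @ y)

def cost(theta, X, y):
    # Unregularized half average squared error (used for Jtrain, Jcv, Jtest)
    return np.mean((X @ theta - y) ** 2) / 2

# Grid of lambdas as in the notes: 0, then 0.01 doubling up to roughly 10
lambdas = [0] + [0.01 * 2**k for k in range(11)]

# Minimize the cost function once per lambda -> one theta(p) per model
thetas = [fit_regularized(X_tr, y_tr, lam) for lam in lambdas]
train_errors = [cost(t, X_tr, y_tr) for t in thetas]  # kept for the plot below
cv_errors = [cost(t, X_cv, y_cv) for t in thetas]

# Pick the model with the lowest cross validation error, then report test error
best = int(np.argmin(cv_errors))
print("chosen lambda:", lambdas[best])
print("test error:", cost(thetas[best], X_te, y_te))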
Bias/variance as a function of λ
Plot λ vs.
Jtrain
When λ is small Jtrain is small (the regularization term basically vanishes, so we can fit the training data closely)
When λ is large Jtrain is large, corresponding to high bias (underfitting)
Jcv
When λ is small we see high variance (too small a value means we overfit the data), so Jcv is high
When λ is large we end up underfitting (high bias), so the cross validation error is also high
Jcv is lowest for some intermediate λ
Such a plot can help show whether you're picking a good value for λ
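A small matplotlib sketch of that plot, continuing from the λ-selection code above (so it assumes the lambdas, train_errors and cv_errors lists computed there):

import matplotlib.pyplot as plt

# One training error and one cross validation error per candidate lambda,
# taken from the lambda-selection sketch above.
plt.plot(lambdas, train_errors, marker="o", label="J_train")
plt.plot(lambdas, cv_errors, marker="o", label="J_cv")
plt.xlabel("lambda")
plt.ylabel("average squared error")
plt.legend()
plt.show()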
Learning curves
A learning curve is often useful to plot for algorithmic sanity checking or improving
performance
What is a learning curve?
Plot Jtrain (average squared error on training set) or Jcv (average squared error on
cross validation set)
Plot against m (number of training examples)
m for your data set is of course fixed
So artificially reduce m and recalculate the errors with the smaller training set
sizes
Jtrain
Error on smaller sample sizes is smaller (as less variance to accommodate)
So as m grows error grows
Jcv
Error on cross validation set
When you have a tiny training set you generalize badly
But as the training set grows your hypothesis generalizes better
So cv error will decrease as m increases
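A sketch of computing such a curve in Python/NumPy (the synthetic data and the plain least squares fit are illustrative assumptions):

import numpy as np

# Hypothetical data, already split into training and cross validation portions.
rng = np.random.default_rng(2)
def make(n):
    x = rng.uniform(0, 2, n)
    X = np.column_stack([np.ones(n), x, x**2])  # intercept + two features
    y = np.sin(x) + 0.3 * rng.standard_normal(n)
    return X, y
X_tr, y_tr = make(60)
X_cv, y_cv = make(20)

def cost(theta, X, y):
    # half the average squared error
    return np.mean((X @ theta - y) ** 2) / 2

# Artificially reduce m: fit on only the first m training examples,
# then measure Jtrain on those m examples and Jcv on the full CV set.
for m in range(3, len(y_tr) + 1, 5):
    theta, *_ = np.linalg.lstsq(X_tr[:m], y_tr[:m], rcond=None)
    print(m, cost(theta, X_tr[:m], y_tr[:m]), cost(theta, X_cv, y_cv))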
Relating this back to the original list of fixes:
Try adding additional features --> fixes high bias (because the hypothesis is too
simple; extra features make it more expressive)