07 Regularization
To recap: if we have too many features, the learned hypothesis may drive the cost function on the training set to (almost) exactly zero
But this tries too hard to fit the training set
Fails to provide a general solution - unable to generalize (apply to new examples)
Addressing overfitting
Regularization
Small values for the parameters correspond to a simpler hypothesis (you effectively get rid of some of the terms)
A simpler hypothesis is less prone to overfitting
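As an illustration, in the lecture's polynomial example (a fourth-order hypothesis) we could add heavy penalties on the high-order parameters to the minimization, e.g.

\min_\theta \; \frac{1}{2m} \sum_{i=1}^{m} \big(h_\theta(x^{(i)}) - y^{(i)}\big)^2 + 1000\,\theta_3^2 + 1000\,\theta_4^2

The only way to keep this cost small is to drive θ3 and θ4 towards zero, which effectively reduces the hypothesis to a quadratic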
Another example
Have 100 features x1, x2, ..., x100
Unlike the polynomial example, we don't know which of these are the high-order terms
How do we pick which parameters to shrink?
With regularization, we take the cost function and modify it to shrink all the parameters
Add a term at the end
This regularization term shrinks every parameter
By convention you don't penalize θ0 - minimization is from θ1 onwards
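Written out for linear regression (m training examples, n features), the modified cost function from the lecture is:

J(\theta) = \frac{1}{2m} \left[ \sum_{i=1}^{m} \big(h_\theta(x^{(i)}) - y^{(i)}\big)^2 + \lambda \sum_{j=1}^{n} \theta_j^2 \right]

λ is the regularization parameter; it controls the trade-off between fitting the training set well and keeping the parameters small. The second sum starts at j = 1, which is why θ0 is left unpenalized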
Previously, gradient descent would repeatedly update the parameters θj, where j = 0, 1, 2, ..., n, simultaneously
With regularization the update for θ0 is unchanged, but the updates for θ1 onwards gain an extra (λ/m)θj term, shown below
The term (1 − αλ/m) that appears when that update is regrouped is a number slightly less than 1, so each iteration multiplies θj by a value just under 1 before applying the usual gradient step
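Written out (repeat until convergence), the regularized updates from the lecture are:

\theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^{m} \big(h_\theta(x^{(i)}) - y^{(i)}\big) x_0^{(i)}

\theta_j := \theta_j - \alpha \left[ \frac{1}{m} \sum_{i=1}^{m} \big(h_\theta(x^{(i)}) - y^{(i)}\big) x_j^{(i)} + \frac{\lambda}{m} \theta_j \right] \qquad (j = 1, 2, \dots, n)

Regrouping the θj update gives \theta_j := \theta_j \big(1 - \alpha \tfrac{\lambda}{m}\big) - \alpha \frac{1}{m} \sum_{i=1}^{m} \big(h_\theta(x^{(i)}) - y^{(i)}\big) x_j^{(i)}, which makes the shrinking factor explicit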
We saw earlier that logistic regression can be prone to overfitting with lots of features
The logistic regression cost function is as follows:
To modify it we have to add an extra term
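With that extra term added, the regularized logistic regression cost function is:

J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \Big[ y^{(i)} \log h_\theta(x^{(i)}) + \big(1 - y^{(i)}\big) \log\big(1 - h_\theta(x^{(i)})\big) \Big] + \frac{\lambda}{2m} \sum_{j=1}^{n} \theta_j^2

As before, the regularization sum runs from j = 1 to n, so θ0 is not penalized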
Again, to modify the algorithm we simply need to modify the update rule for θ1 onwards
Looks cosmetically the same as linear regression, except obviously the hypothesis is very
different
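For θ1 onwards the update takes exactly the same form as the regularized linear regression update above; the difference is hidden in the hypothesis, which here is the sigmoid h_\theta(x) = \frac{1}{1 + e^{-\theta^T x}} rather than h_\theta(x) = \theta^T x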
Use fminunc
Pass it a @costFunction argument (a handle to the cost function you write)
Minimizes in an optimized manner using the cost function
The costFunction you write must return two values:
jVal - code to compute J(θ), now including the regularization term
gradient - code to compute the partial derivative of J(θ) with respect to each θj, with the appropriate extra regularization term added for θ1 onwards
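A minimal Octave sketch of the kind of costFunction you would hand to fminunc for regularized logistic regression; the argument names X, y and lambda and the surrounding setup are illustrative assumptions, not code taken from the lecture:

% costFunction.m (the function would normally live in its own file)
function [jVal, gradient] = costFunction(theta, X, y, lambda)
  % X: m x (n+1) design matrix (first column all ones), y: m x 1 labels, lambda: regularization parameter
  m = length(y);
  h = 1 ./ (1 + exp(-X * theta));                 % sigmoid hypothesis
  % jVal: regularized J(theta); theta(1) (i.e. theta_0) is not penalized
  jVal = (1/m) * sum(-y .* log(h) - (1 - y) .* log(1 - h)) ...
         + (lambda / (2*m)) * sum(theta(2:end) .^ 2);
  % gradient: partial derivatives of J(theta) with respect to each theta_j,
  % with the extra (lambda/m)*theta_j term added for j >= 1
  gradient = (1/m) * (X' * (h - y));
  gradient(2:end) = gradient(2:end) + (lambda / m) * theta(2:end);
end

% Usage sketch: wrap costFunction so fminunc sees a function of theta only
options = optimset('GradObj', 'on', 'MaxIter', 400);
initialTheta = zeros(size(X, 2), 1);
[optTheta, functionVal] = fminunc(@(t) costFunction(t, X, y, lambda), initialTheta, options);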