Regularization
• Generalizing regression
• Overfitting
• Cross-validation
• L2 and L1 regularization for linear estimators
• A Bayesian interpretation of regularization
• Bias-variance trade-off
[Figure: polynomial fits of degree M = 3 (left) and M = 9 (right) to the training data; t vs. x]
• The higher the degree of the polynomial M , the more degrees of freedom,
and the more capacity to “overfit” the training data
• Typical overfitting means that error on the training data is very low, but
error on new instances is high
• Assume that the data is drawn from some fixed, unknown probability
distribution
• Every hypothesis has a “true” error J∗(h), which is the expected error
when data is drawn from the distribution.
• Because we do not have all the data, we measure the error on the training
set JD (h)
• Suppose we compare hypotheses h1 and h2 on the training set, and
JD (h1) < JD (h2)
• If h2 is “truly” better, i.e. J ∗(h2) < J ∗(h1), our algorithm is overfitting.
• We need theoretical and empirical methods to guard against it!
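As an empirical illustration of these definitions, the sketch below (hypothetical NumPy code; the noisy sinusoid, sample sizes, and noise level are assumptions) estimates J_D(h) on a small training set and approximates the true error J*(h) with a large held-out sample:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    # Noisy samples of sin(2*pi*x), standing in for the unknown distribution
    x = rng.uniform(0, 1, n)
    return x, np.sin(2 * np.pi * x) + rng.normal(0, 0.2, n)

def rmse(w, x, t):
    return np.sqrt(np.mean((np.polyval(w, x) - t) ** 2))

x_train, t_train = make_data(10)     # J_D(h): error on the training set
x_test, t_test = make_data(1000)     # approximates the true error J*(h)

for M in (3, 9):
    w = np.polyfit(x_train, t_train, M)   # least-squares polynomial of degree M
    print(f"M={M}: train RMSE={rmse(w, x_train, t_train):.3f}, "
          f"test RMSE={rmse(w, x_test, t_test):.3f}")
```

With only 10 training points, the degree-9 polynomial can interpolate them, so its training error is near zero while its held-out error is much larger.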
[Figure: root-mean-square error E_RMS on the training and test sets as a function of the polynomial degree M]
• The training error decreases with the degree of the polynomial M , i.e.
the complexity of the hypothesis
• The testing error, measured on independent data, decreases at first, then
starts increasing
• Cross-validation helps us:
– Find a good hypothesis class (M in our case), using a validation set
of data
– Report unbiased results, using a test set, untouched during either
parameter training or validation
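A minimal sketch of this protocol (hypothetical NumPy code; the split sizes and candidate degrees are arbitrary choices): fit each candidate M on the training set, pick M on the validation set, and report error on the untouched test set.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 1, 60)
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, 60)

# Three disjoint roles for the data
x_tr, t_tr = x[:30], t[:30]       # training set: fits the weights
x_va, t_va = x[30:45], t[30:45]   # validation set: selects M
x_te, t_te = x[45:], t[45:]       # test set: untouched until the final report

def rmse(w, xs, ts):
    return np.sqrt(np.mean((np.polyval(w, xs) - ts) ** 2))

fits = {M: np.polyfit(x_tr, t_tr, M) for M in range(10)}
best_M = min(fits, key=lambda M: rmse(fits[M], x_va, t_va))
print("chosen M:", best_M, "test RMSE:", rmse(fits[best_M], x_te, t_te))
```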
[Figure: several fitted predictors, one per data split; y vs. x]
• Note that at this point we do not have one predictor, but several!
• Several methods can then be used to come up with just one predictor
(more on this later)
• A squared penalty on the weights would make the math work nicely in
our case:
J_D(w) = (1/2) (Φw − y)^T (Φw − y) + (λ/2) w^T w
• This is also known as L2 regularization, or weight decay in neural
networks
• By re-grouping terms, we get:
J_D(w) = (1/2) (w^T (Φ^T Φ + λI) w − w^T Φ^T y − y^T Φ w + y^T y)

Setting the gradient with respect to w to zero gives the closed-form solution:

w = (Φ^T Φ + λI)^{-1} Φ^T y

That is,

arg min_w (1/2) (Φw − y)^T (Φw − y) + (λ/2) w^T w = (Φ^T Φ + λI)^{-1} Φ^T y
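The closed-form solution amounts to a single linear solve. A sketch with NumPy (the polynomial features and the value of λ are illustrative):

```python
import numpy as np

def ridge_fit(Phi, y, lam):
    """Closed-form solution w = (Phi^T Phi + lam*I)^{-1} Phi^T y."""
    d = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(d), Phi.T @ y)

# Degree-9 polynomial features on 10 noisy points (illustrative setup)
rng = np.random.default_rng(2)
x = rng.uniform(0, 1, 10)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, 10)
Phi = np.vander(x, 10, increasing=True)

w_unreg = ridge_fit(Phi, y, 0.0)     # exact interpolation: large weights
w_reg = ridge_fit(Phi, y, 1e-3)      # "weight decay": much smaller weights
print(np.linalg.norm(w_unreg), np.linalg.norm(w_reg))
```

Even a tiny λ dramatically shrinks the weight vector, which is why L2 regularization is also called weight decay.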
min_w f(w) such that g(w) = 0

[Figure: at a constrained optimum x_A on the constraint surface g(x) = 0, the gradients ∇f(x) and ∇g(x) are parallel]
min_w f(w) such that g(w) ≥ 0

[Figure: the optimum may lie on the boundary g(x) = 0 (point x_A, where ∇f(x) and ∇g(x) are parallel) or in the interior g(x) > 0 (point x_B)]

At the solution, the following (Karush-Kuhn-Tucker) conditions must hold:

λ ≥ 0
g(x) ≥ 0
λ g(x) = 0
For L2 regularization, the equivalent constrained problem is:

min_w (1/2) (Φw − y)^T (Φw − y) such that w^T w ≤ η

[Figure: circular constraint region w^T w ≤ η in the (w1, w2) plane, with the constrained optimum w∗]

w∗ = (Φ^T Φ + λI)^{-1} Φ^T y

For L1 regularization with two weights, the constraint |w1| + |w2| ≤ η can be written as four linear constraints:

w1 + w2 ≤ η
w1 − w2 ≤ η
−w1 + w2 ≤ η
−w1 − w2 ≤ η
• Solving this program directly can be done for problems with a small
number of inputs
[Figure: diamond-shaped L1 constraint region in the (w1, w2) plane; the constrained optimum w∗ tends to fall on a corner, where some weights are exactly zero]
• If there are irrelevant input features, Lasso is likely to make their weights
0, while L2 is likely to just make all weights small
• Lasso is biased towards providing sparse solutions in general
• Lasso optimization is computationally more expensive than L2
• More efficient solution methods have to be used for large numbers of
inputs (e.g. least-angle regression, 2003).
• L1 methods of various types are very popular
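To illustrate the sparsity claim, here is a minimal coordinate-descent sketch for the lasso objective (1/2)‖Φw − y‖² + λ‖w‖₁, using the standard soft-thresholding update (the data, λ, and iteration count are made up for the example):

```python
import numpy as np

def soft_threshold(rho, lam):
    # Shrinks rho toward 0; gives exact zeros when |rho| <= lam
    return np.sign(rho) * max(abs(rho) - lam, 0.0)

def lasso_cd(Phi, y, lam, n_sweeps=200):
    """Coordinate descent for (1/2)||Phi w - y||^2 + lam * ||w||_1."""
    n, d = Phi.shape
    w = np.zeros(d)
    col_sq = (Phi ** 2).sum(axis=0)
    for _ in range(n_sweeps):
        for j in range(d):
            # Residual with feature j's current contribution removed
            r_j = y - Phi @ w + Phi[:, j] * w[j]
            rho = Phi[:, j] @ r_j
            w[j] = soft_threshold(rho, lam) / col_sq[j]
    return w

# Synthetic data: only the first two features matter
rng = np.random.default_rng(3)
X = rng.normal(size=(100, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(0, 0.1, 100)
w = lasso_cd(X, y, lam=5.0)
print(np.round(w, 3))
```

On this synthetic problem the three irrelevant features are driven to exactly zero, whereas an L2 penalty would merely shrink them.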
[Figure: two panels of coefficient profiles for the prostate cancer data as the amount of regularization varies; features: lcavol, lweight, svi, pgg45, lbph, gleason, age, lcp; y-axis: Coefficients]
[Figure: four panels of fitted curves on the running example; t vs. x]
• Uncertainty estimates, i.e. how sure we are of the value of the function
• These can be used to guide active learning: ask about inputs for which
the uncertainty in the value of the function is very high
• In the limit, Bayesian and maximum likelihood learning converge to the
same answer
• In the short term, one needs a good prior to get good estimates of the
parameters
• Sometimes the prior is overwhelmed by the data likelihood too early.
• Using the Bayesian approach does NOT eliminate the need to do cross-
validation in general
• More on this later...
The expected value of a discrete random variable X is:

E[X] = Σ_{i=1}^{n} x_i P(x_i)
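As a trivial check of the definition, for a fair six-sided die (hypothetical example):

```python
import numpy as np

# Fair die: x_i = 1..6, P(x_i) = 1/6; E[X] = 21/6 = 3.5
x = np.arange(1, 7)
p = np.full(6, 1 / 6)
print(float(np.sum(x * p)))
```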
• Simple algebra:
E_P[(y − h(x))^2 | x] = E_P[(h(x))^2 − 2 y h(x) + y^2 | x]
= E_P[(h(x))^2 | x] + E_P[y^2 | x] − 2 E_P[y | x] E_P[h(x) | x]

(the cross term factors because h(x), which depends only on the training set, is independent of the new observation y)
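A quick numerical check of this identity (the numbers are made up; y and h(x) are drawn independently at a fixed x, which is what lets the cross term factor):

```python
import numpy as np

rng = np.random.default_rng(4)
# At a fixed input x, y and h(x) are independent random quantities:
# y varies with observation noise, h(x) varies with the training set.
y = 1.0 + rng.normal(0, 0.5, 1_000_000)   # E[y|x] = 1.0
h = 0.8 + rng.normal(0, 0.3, 1_000_000)   # E[h(x)|x] = 0.8

lhs = np.mean((y - h) ** 2)
rhs = np.mean(h ** 2) + np.mean(y ** 2) - 2 * np.mean(y) * np.mean(h)
print(lhs, rhs)  # both close to (1.0 - 0.8)^2 + 0.5^2 + 0.3^2 = 0.38
```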
[Figure: (bias)^2, variance, their sum, and test error as functions of ln λ]
• The bias-variance sum approximates well the test error over a set of 1000
points
• x-axis measures the hypothesis complexity (decreasing left-to-right)
• Simple hypotheses usually have high bias (bias will be high at many
points, so it will likely be high for many possible input distributions)
• Complex hypotheses have high variance: the hypothesis is very dependent
on the data set on which it was trained.
• Typically, bias comes from not having good hypotheses in the considered
class
• Variance results from the hypothesis class containing “too many”
hypotheses
• MLE estimation is typically unbiased, but has high variance
• Bayesian estimation is biased, but typically has lower variance
• Hence, we are faced with a trade-off: choose a more expressive class
of hypotheses, which will generate higher variance, or a less expressive
class, which will generate higher bias
• Making the trade-off has to depend on the amount of data available to
fit the parameters (data usually mitigates the variance problem)
[Figure: fits with N = 15 (left) and N = 100 (right) training points; t vs. x]
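The effect of the training-set size can be checked numerically: refit a degree-9 polynomial on many independently resampled training sets and measure how much the resulting predictions vary (a sketch; sample sizes and repetition counts are arbitrary assumptions):

```python
import numpy as np

rng = np.random.default_rng(5)
x_grid = np.linspace(0, 1, 50)

def fit_once(N, M=9):
    # Fresh training set of size N, same degree-M hypothesis class
    x = rng.uniform(0, 1, N)
    t = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, N)
    return np.polyval(np.polyfit(x, t, M), x_grid)

for N in (15, 100):
    preds = np.array([fit_once(N) for _ in range(200)])
    print(f"N={N}: mean prediction variance = {preds.var(axis=0).mean():.3f}")
```

The same hypothesis class shows far lower prediction variance with N = 100 than with N = 15, which is the sense in which more data mitigates the variance problem.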